feat(llm-access-kiro): spread new sessions + band latency for fair account scheduling#25
Conversation
…count scheduling Two scheduling fairness fixes in the Kiro account selector: 1. Latency banding. selection_ordered_kiro_routes compared exact smoothed latency scores, so any account measured marginally faster than the global average permanently out-sorted accounts that merely lacked samples (which fall back to the global average). The last_started_at round-robin tier almost never fired and the fastest account won everything. route_score_ms is replaced by route_score_band, which quantizes the score into bands of 15% of the global average; accounts in the same band tie and fall through to round-robin. Missing samples are no longer a penalty. 2. New-session spread. A new session (has a session id but no existing affinity) followed pure latency and stacked onto the same fastest account. KiroSessionAffinity now exposes account_session_counts(); when dispatching a new session, the gateway selects the account with the fewest bound sessions first (least-connections), with latency band as the tiebreaker. Existing affinity stays sticky; stateless requests keep the band-only ordering. Wired into both the main and websearch dispatch paths. Self-balancing: counts diverge as sessions bind, so subsequent new sessions flow to the laggards. account_session_counts is a once-per-request snapshot (stable across failover retries) computed by iterating the affinity LRU (iter does not perturb recency), skipping TTL-expired entries. Tests: per-account session tally incl. expired-entry skip; new-session spread to least-bound account; same-band accounts round-robin instead of starving the no-sample one. All 295 llm-access tests pass; clippy -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces latency-score bands and session-balancing to prevent starving accounts with fewer samples and to distribute new sessions more evenly. Specifically, it quantizes latency scores into bands using a new BAND_FRACTION constant and adds session-affinity tracking to count active sessions per account, prioritizing accounts with fewer bound sessions during route selection. Feedback on these changes highlights a performance and lock contention hazard in account_session_counts_at, where string allocations are performed on every iteration while holding a mutex lock; using a borrowed lookup can optimize this path.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| let mut counts: HashMap<String, usize> = HashMap::new(); | ||
| for (_, entry) in entries.iter() { | ||
| if now.saturating_duration_since(entry.updated_at) > self.ttl { | ||
| continue; | ||
| } | ||
| *counts.entry(entry.account_name.to_string()).or_insert(0) += 1; | ||
| } |
There was a problem hiding this comment.
Performance & Lock Contention Hazard
Calling entry.account_name.to_string() on every single non-expired entry in the LRU cache (which can hold up to 65,536 entries) allocates a new String on every iteration, even if the account is already present in the counts map.
This is highly inefficient and holds the global entries mutex lock for much longer than necessary, causing severe lock contention on the hot path of every new session request.
By using counts.get_mut(entry.account_name.as_ref()), we can query the map using a borrowed &str without allocating. We only allocate a String when inserting a new account name for the first time, which reduces the number of allocations from O(N) to O(A) (where A is the number of unique accounts, typically very small).
let mut counts: HashMap<String, usize> = HashMap::new();
for (_, entry) in entries.iter() {
if now.saturating_duration_since(entry.updated_at) > self.ttl {
continue;
}
if let Some(count) = counts.get_mut(entry.account_name.as_ref()) {
*count += 1;
} else {
counts.insert(entry.account_name.to_string(), 1);
}
}References
- Avoid allocating temporary collections (such as
Vec<String>) and cloning elements solely for hashing. Instead, hash borrowed segments directly by chaining iterators or updating the hasher incrementally, especially on performance-critical hot paths.
There was a problem hiding this comment.
Good catch — adopted in ac98e5e. account_session_counts_at now borrows the name for the lookup (get_mut(entry.account_name.as_ref())) and only allocates a String when first inserting an account, so allocations drop from O(entries) to O(distinct accounts) and the affinity mutex is held for much less time on the new-session hot path.
I also tightened the same hot path further: the per-account scan is now skipped entirely when a key has a single candidate account (route_strategy=fixed) via a routes.len() > 1 guard at both dispatch sites, since there is nothing to spread there. Behavior is unchanged.
…le-account keys Address PR review on account_session_counts: - account_session_counts_at allocated a String for every non-expired LRU entry (up to 65,536) while holding the affinity mutex, even for already-counted accounts. Borrow the name for the lookup and only allocate when first inserting an account, dropping allocations from O(entries) to O(distinct accounts) and shortening the lock hold on the new-session hot path. - Skip the per-account scan entirely when a key has a single candidate account (route_strategy=fixed): there is nothing to spread, so the O(N) lock-held walk is pure overhead. Guarded with routes.len() > 1 at both dispatch paths. No behavior change: single-route ordering is invariant, and the count map is identical regardless of insertion path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… fold Replace the imperative for-loop + pre-declared map with a filter/fold chain: TTL-expired entries are dropped by `filter`, and the `HashMap` is `fold`'s accumulator threaded to the return value (no separate `let mut` / trailing return). Allocation profile is unchanged — still a borrowed `get_mut` lookup that only allocates a key for first-seen accounts (O(distinct accounts)) — because std `HashMap::entry` requires an owned key and the borrowed-key entry API (`raw_entry_mut`/`entry_ref`) is not on stable. Threading the map by value through `fold` is a register-sized header move, so codegen matches the loop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5138d99f9b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| session_count: session_counts | ||
| .map(|counts| counts.get(&route.account_name).copied().unwrap_or(0)), |
There was a problem hiding this comment.
Aggregate session counts by routing identity
When a route list contains multiple account names for the same upstream Kiro user, this new-session key counts only the individual account_name, while the scheduler, cooldowns, and quota exhaustion operate on route.routing_identity (the next line already reads last-started by identity, and quota failover groups aliases by identity). In a set like alpha/beta sharing one routing identity plus gamma on another, existing sessions on alpha do not raise beta's count, so new sessions can keep being steered to the same upstream account through its alias instead of the actually less-bound identity. Aggregate the affinity counts over the candidate routes' routing identities before applying the fewest-sessions sort key.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Agreed, real bug — fixed in da51954. selection_ordered_kiro_routes now folds the per-account affinity counts onto the candidate routes' routing_identity before applying the fewest-sessions key, so aliases of one upstream account share a single load figure — matching how the scheduler, cooldowns, last-started, and quota failover already group. O(routes), keys borrow from routes (zero-alloc). Added kiro_selection_spreads_new_session_by_routing_identity_not_account_name: with alpha/beta on one identity (alpha=3 sessions) and gamma on another, a new session now goes to gamma instead of beta (which would win on per-account counts as the high-credit alias of the loaded identity).
| let band_width = (snapshot.global_avg_first_token_ms * BAND_FRACTION).max(1.0); | ||
| Some((score / band_width).floor() as i64) |
There was a problem hiding this comment.
Anchor latency bands around the neutral score
Because the band index is floor(score / width) with buckets anchored at zero, the neutral no-sample score (global_avg) often sits close to a bucket boundary, so accounts that are still within the configured 15% tolerance can land in a lower band and continue to win before round-robin. For example, with a 500 ms global average the width is 75 ms: a no-sample route is band 6, while a sampled 449 ms route (only ~10% faster) is band 5 and will still permanently sort ahead until its measured latency crosses an arbitrary boundary. If the intent is to treat near-global/no-sample accounts as equal, compute the band relative to the global baseline or otherwise center the neutral score inside the same tolerance bucket.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in da51954. Bands are now centered on the neutral baseline: round((score - global_avg) / (2 * tolerance)) with tolerance = global_avg * BAND_FRACTION, so band 0 is global_avg ± tolerance and the no-sample score sits dead-center instead of on a bucket boundary. Your 500ms/449ms example now shares band 0. Reworked the same-band test to a 430ms route score (~14% under the 500ms average): the old floor-from-zero scheme bucketed it at band 5 vs neutral's band 6 (permanent win), the centered scheme keeps both in band 0 so round-robin takes over. BAND_FRACTION is now documented as the tolerance radius.
| self.entries | ||
| .lock() | ||
| .expect("kiro session affinity mutex") | ||
| .iter() | ||
| .filter(|(_, entry)| now.saturating_duration_since(entry.updated_at) <= self.ttl) | ||
| .fold(HashMap::new(), |mut counts, (_, entry)| { |
There was a problem hiding this comment.
Avoid scanning affinity cache on every new session
For every brand-new session with multiple routes, this walks the entire session-affinity LRU while holding the synchronous mutex; with the configured limit raised toward LLM_ACCESS_KIRO_SESSION_AFFINITY_MAX_ENTRIES (up to 65,536 entries), a burst of new Kiro conversations can spend request-handler time serially scanning old sessions before route selection. This is avoidable by maintaining per-account counters or a smaller per-key/route snapshot instead of rebuilding counts from all cached entries on each request.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Considered, but keeping the point-in-time scan for now — I think the cost is bounded and cold-path:
- It runs only on the first request of a brand-new session: subsequent turns in that conversation hit affinity and take the sticky fast path, which never scans. And it's already skipped entirely for single-account keys (
routes.len() > 1guard). - At the default cap (4096) that's a tens-of-µs walk under the lock, once per new conversation, with allocation already cut to O(distinct accounts).
A maintained per-account/identity counter would have to stay consistent across LRU's implicit capacity eviction and lazy TTL expiry (entries are only reclaimed on lookup), which is materially more error-prone than rebuilding a fold each time. The scan only becomes a hot-path concern if someone raises LLM_ACCESS_KIRO_SESSION_AFFINITY_MAX_ENTRIES toward 65,536 — at which point I'd revisit with a maintained counter backed by profiling. Flagging this as a conscious trade-off rather than an oversight.
…atency bands Address two P2 findings from Codex review on the account scheduler: 1. Aggregate new-session spread by routing identity (not account name). The scheduler, cooldowns, last-started, and quota failover all group by `routing_identity`, but the fewest-sessions key counted per `account_name`. When several account names alias one upstream Kiro user, sessions bound to one alias did not deter new sessions from another alias of the same upstream account. selection_ordered_kiro_routes now folds the affinity counts onto the candidate routes' routing identities before applying the key (O(routes), keys borrow from routes). 2. Center latency bands on the neutral baseline. Bands were floor(score/width) anchored at zero, so the no-sample score (global_avg) sat near a bucket boundary: a sampled route only ~10% faster could land a band lower and permanently out-sort the neutral score before round-robin. Bands are now round((score - global_avg) / (2*tolerance)), so band 0 is global_avg ± (global_avg * BAND_FRACTION) and near-average / no-sample accounts share it. BAND_FRACTION is now the tolerance radius. Tests: new kiro_selection_spreads_new_session_by_routing_identity_not_account_name (gamma's less-bound identity wins over a high-credit alias of the loaded identity); same-band test reworked to a 430ms route score (~14% under average) that the old floor scheme bucketed away from neutral but the centered scheme keeps in band 0. 296 llm-access tests pass; clippy -D warnings clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
Two fairness defects in the Kiro account selector (
selection_ordered_kiro_routes):No-sample accounts get starved. Selection compared exact smoothed latency scores as the second key. An account without latency samples falls back to the global average; any account measured even marginally faster than the average permanently out-sorts it. The
last_started_atround-robin tier almost never fired, so the single fastest account won every request. "No sample" is not "high latency" — but the old ordering treated it that way.New sessions all stack on one account. A new session (has a session id but no prior affinity) followed pure latency and was pinned to the same fastest account. Per-session affinity itself was correct (same session → same account for 6h), but additional new sessions never spread — they all landed on that one account too.
Both stem from the same root: latency was too dominant, producing greedy winner-take-all.
Changes
1. Latency banding (
kiro_latency.rs)route_score_mswithroute_score_band: quantize the smoothed score into bands of15%of the global average (BAND_FRACTION). Accounts in the same band tie and fall through to the round-robin / balance tiers. A small latency edge no longer pins traffic.PreparedKiroLatencySnapshot::route_score(shared); the three guards (latency_routing_enabled, snapshot staleness, finite global avg) are preserved so legacy-order fallbacks are unchanged.2. New-session spread (
kiro_session_affinity.rs,route_selection.rs,kiro_dispatch.rs)KiroSessionAffinity::account_session_counts()tallies non-expired sessions per account by iterating the LRU (iterdoes not perturb recency), skipping TTL-expired entries.selection_ordered_kiro_routes/select_kiro_route_with_account_permittake an optionalsession_counts. When present (new session only), fewest bound sessions is the primary balance key, with latency band as tiebreaker.Resulting sort order
proxy health → fewest bound sessions → latency band → last_started → credits → nameproxy health → latency band → last_started → credits → name(unchanged shape, exact-score → band is the only diff)Existing affinity stays sticky; the preferred-account fast path is untouched.
Notes
session_countsis a point-in-time snapshot; it does not react to sessions bound by concurrent in-flight requests between retries — acceptable for fairness.Tests
account_session_counts_tallies_live_entries_per_account+..._skips_expired_entrieskiro_selection_spreads_new_session_to_least_bound_account(slower account with fewer sessions wins for a new session)kiro_selection_treats_same_band_accounts_as_equal_for_round_robin(no-sample account not starved by marginal edge)cargo test -p llm-access→ 295 passed.cargo clippy -p llm-access -- -D warnings→ clean.rustfmton changed files only.🤖 Generated with Claude Code