fix: prevent KV cache corruption on SWA/ISWA models + hot-path perf #2180
Open

avion23 wants to merge 2 commits into abetlen:main
added 2 commits on April 13, 2026
… cache corruption

- `set_batch()`: numpy bulk writes replace per-token Python loop
- `_create_completion`: incremental `token_to_piece()` accumulation replaces O(n²) re-detokenization per generated token
- `_create_completion`: in-place `logit_bias` instead of full vocab copy
- `_create_completion`: `np.argpartition` for top-k logprobs (O(V) vs O(V log V))
- `reset()`: call `llama_memory_clear()` for proper KV cache state reset
- `generate()`: bypass prefix-match for recurrent/SWA models
- `generate()`: fix `tokens[:-1]` off-by-one in prefix matching
- `eval()`: remove unconditional `kv_cache_seq_rm`, simplify logits assignment
- `token_to_piece()`: return correct byte length via actual write count
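The incremental detokenization item can be illustrated with a stand-in `token_to_piece()` (in the real library it maps a token id to its UTF-8 bytes; this is a minimal sketch of the O(n²) vs O(n) accumulation patterns, not the PR's actual code):

```python
def token_to_piece(token: int) -> bytes:
    """Hypothetical stand-in: map a token id to its byte piece."""
    return f"<{token}>".encode()

tokens = [1, 2, 3, 4]  # generated completion tokens, one per step

# Before: re-detokenize the whole completion after every new token.
# Each step is O(n), so generating n tokens is O(n²) overall.
text_quadratic = b""
for i in range(len(tokens)):
    text_quadratic = b"".join(token_to_piece(t) for t in tokens[: i + 1])

# After: append only the newest token's piece -> O(n) overall.
text_linear = b""
for t in tokens:
    text_linear += token_to_piece(t)

assert text_quadratic == text_linear
```

Both loops produce the same bytes; only the amount of redundant work per generated token changes.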
Problem
Gemma-4 and any model with interleaved sliding window attention (ISWA) crashes on the second call to `create_chat_completion`.

Root cause
ISWA KV caches store position tracking in global maps (`g_iswa_pos_max`/`g_iswa_pos_min`) that are cleared by `llama_memory_clear()` but not by `llama_memory_seq_rm()`. The `generate()` method detects a prefix match between consecutive prompts (the shared BOS token), calls `kv_cache_seq_rm()` to remove the divergent tail, sees it return `True`, and skips the full reset. The stale position maps cause batch allocator inconsistency → `llama_decode` returns -1.

Additionally, `reset()` was a no-op on KV cache state (it only reset `n_tokens`), so calling `llm.reset()` between prompts had no effect.

A separate `tokens[:-1]` off-by-one in prefix matching silently broke prefix detection for all models.

Fix
- `reset()` calls `llama_memory_clear()` unconditionally — proper KV cache state reset
- `generate()` sets `longest_prefix = 0` for recurrent/SWA models when the prefix doesn't cover all cached tokens, falling through to a full reset
- `tokens[:-1]` → `tokens` in prefix matching
- `eval()` simplified: removed dead code paths, direct 2D logits assignment

Performance
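The `tokens[:-1]` fix can be illustrated with a minimal prefix-match sketch (names here are hypothetical, not the library's actual code): slicing the new prompt drops its last token from the comparison, so even an identical repeated prompt is detected as matching one token short.

```python
def longest_common_prefix(cached, prompt):
    """Count how many leading token ids two sequences share."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

cached = [1, 10, 20, 30]   # token ids already evaluated into the KV cache
prompt = [1, 10, 20, 30]   # the same prompt submitted again

buggy = longest_common_prefix(cached, prompt[:-1])  # slice hides the last token -> 3
fixed = longest_common_prefix(cached, prompt)       # full comparison -> 4
```

With the slice, the matcher always reports at least one divergent tail token and triggers unnecessary cache surgery; comparing the full sequences recognizes the complete match.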
- `set_batch()`: per-token Python loop → numpy bulk writes
- `_create_completion` detokenization: `detokenize(completion_tokens)` per generated token → incremental `token_to_piece()` accumulation
- `_create_completion` `logit_bias`: `np.copy(scores)` every token → in-place modification
- `_create_completion` top-k logprobs: `sorted()` on 128K vocab, O(V log V) → `np.argpartition`, O(V)
- `eval()` logits: direct 2D assignment
- `token_to_piece()`: `create_string_buffer(32)` with trailing garbage → actual write count

Benchmarks (TinyLlama-1.1B, M1 Pro, 10-50 runs)
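The top-k logprobs change can be sketched in pure NumPy (a generic illustration of the technique, not the PR's exact code): `np.argpartition` isolates the k largest logits in O(V), and only those k entries get fully sorted afterwards.

```python
import numpy as np

def top_k_logprobs(logprobs: np.ndarray, k: int):
    """Indices and values of the k largest entries, sorted descending.

    argpartition does an O(V) partial partition; the final argsort
    touches only k elements, instead of sorting the whole vocabulary.
    """
    idx = np.argpartition(-logprobs, k - 1)[:k]  # unordered top-k, O(V)
    idx = idx[np.argsort(-logprobs[idx])]        # order just those k
    return idx, logprobs[idx]

rng = np.random.default_rng(0)
scores = rng.standard_normal(128_000)  # vocab-sized logit array
ids, vals = top_k_logprobs(scores, 5)
```

For a 128K-entry vocabulary this avoids a full O(V log V) sort per generated token while returning the same top-k result.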
Testing
Functional verification: ISWA crash fix, reset(), prefix matching, logit_bias, set_batch, eval(), token_to_piece, argpartition correctness — all pass.