Fix/server prompt cache no consume on load#24143
Open
alainnothere wants to merge 2 commits into
Conversation
server_prompt_cache::load() moved the matched entry out of `states` and freed its serialized KV bytes on every successful match. This is wrong for an alternating multi-conversation workload.

When two conversations share a long common prefix (e.g. a shared system prompt), a request for conversation B can match conversation A's cached entry with f_keep >= 0.25 but < 1.0. load() restored A's KV into the slot and erased A's entry. update_slots() then often determines the partial match cannot be reused (divergence past the checkpoint window on SWA / hybrid / recurrent models, or an unsupported partial seq_rm) and resets to a full re-prefill of B. By that point A's entry is already gone, so when conversation A is next requested it also pays a full re-prefill. With clients that alternate between conversations this makes every switch a cold prefill.

Fix: restore the cached state into the live context without consuming the entry. Copy tokens + checkpoints into the slot's prompt and leave the cache entry in `states`. The serialized bytes are no longer cleared on the matched entry, so a later request for the same conversation can match it again. Entries remain bounded by update()'s existing size/token eviction; only the on-match deletion is removed.

prompt.data is intentionally left empty after restore: the KV bytes are now live in ctx_tgt/ctx_dft, and prompt_save() asserts prompt.data.size() == 0 before serializing a fresh save from the live context. server_tokens has a deleted copy constructor, so clone() is used for the token copy.
Regression test for the consume-on-load fix. With a single slot and three conversations, conversation A lives only in the prompt cache while a different conversation C — which shares A's prefix — is loaded. Before the fix, loading C erased A's cache entry; when A returns it can only reuse the A/C shared prefix and re-prefills the rest. After the fix A's entry survives and is reused in full. Verified against a dense model: the test passes with the fix (A fully reused) and fails without it (only the shared prefix reused). Adds a slot_prompt_similarity knob to the test server harness so the test can force slot reuse through the prompt cache (save + load) rather than the in-place LCP-similarity path, which would otherwise bypass load().
Hi @alainnothere, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Author
I disclosed the use of AI in the PR text.
Overview
Hi, I'm using llama.cpp to run Qwen3.6 with several windows of my LLM CLI open, and I ran into an issue where the whole conversation had to be reloaded each time I switched between CLI sessions. I traced it to server_prompt_cache::load().
I added a regression test (tools/server/tests/unit/test_prompt_cache_reuse.py): three conversations, one slot, where conversation A lives only in the cache while a prefix-sharing conversation C is loaded. It fails on master (A's entry is consumed, so A only reuses the shared prefix on return) and passes with the fix (A's full entry survives and is reused).
Verified on a dense model — cache reuse on A's return goes from the shared-prefix length to the full prompt length.
I have been running the proposed fix for about 15 days now with no issues.
On the technical side, server_prompt_cache::load() moved the matched entry out of `states` and freed its serialized KV bytes on every successful match. This is wrong for an alternating multi-conversation workload: when more than one conversation shares a long common prefix (for example, in my case, my CLI system prompt), a request for conversation B can match conversation A's cached entry with f_keep >= 0.25 but < 1.0. load() restored A's KV into the slot and erased A's entry. Then update_slots() often determines the partial match cannot be reused (divergence past the checkpoint window on SWA / hybrid / recurrent models, or an unsupported partial seq_rm) and resets to a full re-prefill of B. By that point A's entry is already gone, so when conversation A is next requested it also pays a full re-prefill. With clients that alternate between conversations this makes every switch a cold prefill.
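As a rough illustration of the partial-match condition described above, the sketch below computes a keep ratio for two conversations that share a system prompt. It assumes that f_keep is the common-prefix length divided by the cached entry's token count; the PR text only states the 0.25 and 1.0 bounds, so the formula, the helper names, and the token values are all illustrative.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def f_keep(cached_tokens, new_tokens):
    """Fraction of the cached entry covered by the shared prefix.

    Assumed definition, for illustration only.
    """
    if not cached_tokens:
        return 0.0
    return common_prefix_len(cached_tokens, new_tokens) / len(cached_tokens)


# Hypothetical token ids: a 100-token system prompt shared by both conversations.
system = list(range(100))
conv_a = system + [1001] * 60   # conversation A, 160 tokens
conv_b = system + [2002] * 40   # conversation B, 140 tokens

ratio = f_keep(conv_a, conv_b)            # 100 / 160 = 0.625
is_partial_match = 0.25 <= ratio < 1.0    # B matches A's entry, but only partially
```

With these numbers B's request is accepted as a match for A's entry even though the last 60 tokens of A are useless to B, which is the situation where consuming A's entry hurts.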
Fix: restore the cached state into the live context without consuming the entry. Copy tokens + checkpoints into the slot's prompt and leave the cache entry in `states`. The serialized bytes are no longer cleared on the matched entry, so a later request for the same conversation can match it again. Entries remain bounded by update()'s existing size/token eviction; only the on-match deletion is removed.

prompt.data is intentionally left empty after restore: the KV bytes are now live in ctx_tgt/ctx_dft, and prompt_save() asserts prompt.data.size() == 0 before serializing a fresh save from the live context. server_tokens has a deleted copy constructor, so clone() is used for the token copy.
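The difference between consuming and keeping the matched entry can be sketched with a toy single-slot model. This is not the llama.cpp implementation: `ToyPromptCache`, `run`, the save-then-load ordering, and the token values are hypothetical, eviction is not modelled, and neither is the update_slots() fallback to a full re-prefill. It only tracks how many tokens of each request could be reused from the restored state.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPromptCache:
    """Toy stand-in for the prompt cache; `consume` selects old vs. new behavior."""

    def __init__(self, consume):
        self.consume = consume
        self.states = []  # each entry is a list of token ids

    def save(self, tokens):
        if tokens and tokens not in self.states:
            self.states.append(list(tokens))

    def load(self, new_tokens, min_keep=0.25):
        """Return a copy of the best matching entry's tokens, or None."""
        best, best_len = None, 0
        for entry in self.states:
            n = common_prefix_len(entry, new_tokens)
            if n / len(entry) >= min_keep and n > best_len:
                best, best_len = entry, n
        if best is None:
            return None
        if self.consume:
            self.states.remove(best)  # old behavior: a match erases the entry
        return list(best)             # a copy is returned; with consume=False the entry stays


def run(consume, requests):
    """Serve requests on one slot; return the reusable prefix length per request."""
    cache = ToyPromptCache(consume)
    slot = []  # tokens currently live in the single slot
    reused = []
    for tokens in requests:
        cache.save(slot)               # park the outgoing conversation
        restored = cache.load(tokens)  # try to restore a cached conversation
        reused.append(common_prefix_len(restored or [], tokens))
        slot = list(tokens)            # after processing, the slot holds this request
    return reused


system = list(range(100))               # shared 100-token prefix
conv_a = system + [1001] * 60           # A shares the prefix
conv_b = list(range(9000, 9050))        # B is unrelated
conv_c = system + [3003] * 40           # C shares the prefix with A
requests = [conv_a, conv_b, conv_c, conv_a]

print(run(consume=True, requests=requests))   # [0, 0, 100, 100]
print(run(consume=False, requests=requests))  # [0, 0, 100, 160]
```

In this model the last request (A returning) reuses only the 100-token shared prefix when entries are consumed, and all 160 tokens when the entry is kept, which mirrors the before/after behavior that the regression test checks.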
Requirements