
Fix/server prompt cache no consume on load#24143

Open
alainnothere wants to merge 2 commits into ggml-org:master from alainnothere:fix/server-prompt-cache-no-consume-on-load

Conversation

@alainnothere

Overview

Hi, I'm using llama.cpp to run Qwen3.6 with several windows of my LLM CLI tool open, and I ran into an issue where the whole conversation had to be reloaded every time I switched between CLI sessions. I traced it to server_prompt_cache::load().

I added a regression test (tools/server/tests/unit/test_prompt_cache_reuse.py): three conversations, one slot, where conversation A lives only in the cache while a prefix-sharing conversation C is loaded. It fails on master (A's entry is consumed, so A only reuses the shared prefix on return) and passes with the fix (A's full entry survives and is reused).
Verified on a dense model: cache reuse on A's return goes from the shared-prefix length to the full prompt length.
I have been running the proposed fix for about 15 days now with no issues.

On the technical side, server_prompt_cache::load() moved the matched entry out of states and freed its serialized KV bytes on every successful match. This is wrong for a workload that alternates between multiple conversations. When the conversations share a long common prefix (for example, in my case, my CLI's system prompt), a request for conversation B can match conversation A's cached entry with f_keep >= 0.25 but < 1.0. load() restored A's KV into the slot and erased A's entry.
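
To make the matching condition concrete, here is a minimal Python sketch of how a request for B could partially match A's entry. It assumes f_keep is the fraction of the cached entry's tokens covered by the common prefix with the new request; the helper names and token values are illustrative and not taken from the server code.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def f_keep(cached_tokens, new_tokens):
    """Fraction of the cached entry the new request would keep (assumed definition)."""
    if not cached_tokens:
        return 0.0
    return common_prefix_len(cached_tokens, new_tokens) / len(cached_tokens)


# Hypothetical token ids: a 6-token shared system prompt, then diverging turns.
system = [1, 2, 3, 4, 5, 6]
conv_a = system + [10, 11, 12, 13]      # cached entry for conversation A
conv_b = system + [20, 21]              # incoming request for conversation B

ratio = f_keep(conv_a, conv_b)          # 6 shared tokens out of A's 10
is_partial_match = 0.25 <= ratio < 1.0  # B matches A's entry, but not fully
```

With a long system prompt and short turns, the ratio lands in the partial-match range described above, so B's request selects A's entry even though the two conversations diverge.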

Then update_slots() often determines the partial match cannot be reused (divergence past the checkpoint window on SWA / hybrid / recurrent models, or an unsupported partial seq_rm) and resets to a full re-prefill of B. By that point A's entry is already gone, so when conversation A is next requested it also pays a full re-prefill. With clients that alternate between conversations this makes every switch a cold prefill.

Fix: restore the cached state into the live context without consuming the entry. Copy tokens + checkpoints into the slot's prompt and leave the cache entry in states. The serialized bytes are no longer cleared on the matched entry, so a later request for the same conversation can match it again. Entries remain bounded by update()'s existing size/token eviction; only the on-match deletion is removed.

prompt.data is intentionally left empty after restore: the KV bytes are now live in ctx_tgt/ctx_dft, and prompt_save() asserts prompt.data.size() == 0 before serializing a fresh save from the live context. server_tokens has a deleted copy constructor, so clone() is used for the token copy.
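
The difference between the old and the new behavior can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not the server code: `ToyPromptCache`, `Entry`, and the `consume_on_load` switch are hypothetical names, the selection rule is simplified to "highest f_keep of at least 0.25", and restoring the KV bytes into the live context is represented only by a comment.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Entry:
    tokens: list
    data: bytes = b""                      # stand-in for serialized KV bytes
    checkpoints: list = field(default_factory=list)


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPromptCache:
    def __init__(self, consume_on_load):
        self.states = []
        self.consume_on_load = consume_on_load  # True models the old behavior

    def load(self, slot_prompt, new_tokens):
        """Restore the best matching entry into slot_prompt; return True on a match."""
        best, best_keep = None, 0.0
        for entry in self.states:
            keep = common_prefix_len(entry.tokens, new_tokens) / len(entry.tokens)
            if keep >= 0.25 and keep > best_keep:
                best, best_keep = entry, keep
        if best is None:
            return False
        # The entry's KV bytes would be restored into the live context here.
        slot_prompt.tokens = copy.deepcopy(best.tokens)        # plays the role of clone()
        slot_prompt.checkpoints = copy.deepcopy(best.checkpoints)
        slot_prompt.data = b""             # left empty: the state is live in the context
        if self.consume_on_load:
            self.states = [e for e in self.states if e is not best]
        return True


system = list(range(100))                  # hypothetical shared system prompt
conv_a = system + [1000, 1001, 1002]
conv_b = system + [2000]


def entries_after_b_request(consume_on_load):
    cache = ToyPromptCache(consume_on_load)
    cache.states.append(Entry(tokens=list(conv_a), data=b"kv-a"))
    slot = Entry(tokens=[])
    cache.load(slot, conv_b)               # B partially matches A's entry
    return cache.states, slot


old_states, _ = entries_after_b_request(consume_on_load=True)
new_states, slot = entries_after_b_request(consume_on_load=False)
```

In this model `old_states` ends up empty, while `new_states` still holds A's entry with its bytes intact, and the slot's own `data` stays empty so a later save can serialize from the live context.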

Requirements

server_prompt_cache::load() moved the matched entry out of `states` and
freed its serialized KV bytes on every successful match. This is wrong
for an alternating multi-conversation workload.

When two conversations share a long common prefix (e.g. a shared system
prompt), a request for conversation B can match conversation A's cached
entry with f_keep >= 0.25 but < 1.0. load() restored A's KV into the
slot and erased A's entry. update_slots() then often determines the
partial match cannot be reused (divergence past the checkpoint window
on SWA / hybrid / recurrent models, or an unsupported partial seq_rm)
and resets to a full re-prefill of B. By that point A's entry is already
gone, so when conversation A is next requested it also pays a full
re-prefill. With clients that alternate between conversations this makes
every switch a cold prefill.

Fix: restore the cached state into the live context without consuming
the entry. Copy tokens + checkpoints into the slot's prompt and leave
the cache entry in `states`. The serialized bytes are no longer cleared
on the matched entry, so a later request for the same conversation can
match it again. Entries remain bounded by update()'s existing
size/token eviction; only the on-match deletion is removed.

prompt.data is intentionally left empty after restore: the KV bytes are
now live in ctx_tgt/ctx_dft, and prompt_save() asserts
prompt.data.size() == 0 before serializing a fresh save from the live
context. server_tokens has a deleted copy constructor, so clone() is
used for the token copy.

Regression test for the consume-on-load fix. With a single slot and three
conversations, conversation A lives only in the prompt cache while a
different conversation C — which shares A's prefix — is loaded. Before the
fix, loading C erased A's cache entry; when A returns it can only reuse the
A/C shared prefix and re-prefills the rest. After the fix A's entry
survives and is reused in full.

Verified against a dense model: the test passes with the fix (A fully
reused) and fails without it (only the shared prefix reused).

Adds a slot_prompt_similarity knob to the test server harness so the test
can force slot reuse through the prompt cache (save + load) rather than the
in-place LCP-similarity path, which would otherwise bypass load().
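
Why the knob matters can be sketched in Python. The rule below is an assumption made for illustration (a slot is reused in place when the common prefix covers more than `slot_prompt_similarity` of the request, and 0.0 disables the check); the function name and token values are hypothetical, and the value the test actually passes is not stated here.

```python
def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reuses_slot_in_place(slot_tokens, request_tokens, slot_prompt_similarity):
    """True if the slot would be picked by similarity, skipping cache save + load (assumed rule)."""
    if slot_prompt_similarity <= 0.0 or not request_tokens:
        return False
    similarity = common_prefix_len(slot_tokens, request_tokens) / len(request_tokens)
    return similarity > slot_prompt_similarity


system = list(range(8))       # hypothetical shared prefix
conv_c = system + [30, 31]    # conversation currently held by the slot
conv_a = system + [10, 11]    # conversation A returning

in_place_with_threshold = reuses_slot_in_place(conv_c, conv_a, 0.5)
in_place_disabled = reuses_slot_in_place(conv_c, conv_a, 0.0)
```

Under this assumed rule, a prefix-sharing request is served in place when the threshold is low enough, so the test needs control over the threshold to make sure the request goes through load().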
@alainnothere alainnothere requested a review from a team as a code owner June 4, 2026 19:55
@github-actions github-actions Bot added examples python python script changes server labels Jun 4, 2026
@ggml-gh-bot

ggml-gh-bot Bot commented Jun 4, 2026

Hi @alainnothere, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@alainnothere
Author

> Hi @alainnothere, thanks for your contribution!
>
> Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
>
> * **AI-generated content**: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.
>
> Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

I disclosed the use of AI in the PR text.
