Fix/server prompt cache no consume on load by alainnothere · Pull Request #24143 · ggml-org/llama.cpp

alainnothere · 2026-06-04T19:55:03Z

Overview

Hi, I'm using llama.cpp to run Qwen3.6 with multiple windows of my LLM CLI open. I ran into an issue where the whole conversation had to be reloaded each time I switched between CLI sessions, and I traced it to server_prompt_cache::load().

I added a regression test (tools/server/tests/unit/test_prompt_cache_reuse.py): three conversations, one slot, where conversation A lives only in the cache while a prefix-sharing conversation C is loaded. It fails on master (A's entry is consumed, so A only reuses the shared prefix on return) and passes with the fix (A's full entry survives and is reused).
Verified on a dense model — cache reuse on A's return goes from the shared-prefix length to the full prompt length.
I have been running the proposed fix for something like 15 days now with no issues.

On the technical side, server_prompt_cache::load() moved the matched entry out of states and freed its serialized KV bytes on every successful match. This is wrong for an alternating multi-conversation workload. When more than one conversation shares a long common prefix (for example, in my case, my CLI's system prompt), a request for conversation B can match conversation A's cached entry with f_keep >= 0.25 but < 1.0. load() restored A's KV into the slot and erased A's entry.
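To make the partial-match condition concrete, here is a small sketch. It assumes f_keep is the fraction of the cached entry's tokens covered by the common prefix with the new request; the helper names and token values are made up for illustration and are not llama.cpp code.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def f_keep(cached_tokens, new_tokens):
    """Assumed definition: share of the cached entry that the new request keeps."""
    if not cached_tokens:
        return 0.0
    return common_prefix_len(cached_tokens, new_tokens) / len(cached_tokens)


F_KEEP_MIN = 0.25  # threshold quoted in the description above

system_prompt = list(range(100))        # 100 shared "tokens"
conv_a = system_prompt + [1001] * 60    # A: shared prefix + 60 tokens of its own
conv_b = system_prompt + [2002] * 80    # B: same prefix, different continuation

ratio = f_keep(conv_a, conv_b)          # 100 / 160
is_partial_match = F_KEEP_MIN <= ratio < 1.0
print(ratio, is_partial_match)
```

With these numbers B keeps 62.5% of A's entry, so B's request is allowed to match A's entry even though the two conversations diverge right after the system prompt.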

Then update_slots() often determines the partial match cannot be reused (divergence past the checkpoint window on SWA / hybrid / recurrent models, or an unsupported partial seq_rm) and resets to a full re-prefill of B. By that point A's entry is already gone, so when conversation A is next requested it also pays a full re-prefill. With clients that alternate between conversations this makes every switch a cold prefill.
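The failure mode can be reproduced in a toy model. The sketch below is not the server code: it assumes a single slot, saves the outgoing conversation to the cache on every switch, treats any partial match as unusable (the pessimistic SWA / hybrid / recurrent case), and does not model eviction. `simulate_consuming` and its parameters are hypothetical names.

```python
def lcp(a, b):
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def simulate_consuming(requests, f_keep_min=0.25):
    """Toy single-slot server whose cache load consumes the matched entry.

    Returns the number of tokens prefilled for each request.
    """
    slot = []        # tokens currently live in the slot
    cache = []       # saved prompts (token lists); eviction is not modeled
    prefilled = []
    for req in requests:
        if slot:
            cache.append(list(slot))       # save the outgoing conversation
        # pick the cached entry with the highest f_keep above the threshold
        best = None
        for entry in cache:
            keep = lcp(entry, req) / len(entry)
            if keep >= f_keep_min and (best is None or keep > best[0]):
                best = (keep, entry)
        reused = 0
        if best is not None:
            keep, entry = best
            cache.remove(entry)            # old behavior: the entry is consumed
            if keep == 1.0:
                reused = len(entry)        # full match: everything is reusable
            # otherwise the partial match is rejected and nothing is reused
        prefilled.append(len(req) - reused)
        slot = list(req)
    return prefilled


shared = list(range(100))                  # shared system prompt
conv_a = shared + [1] * 50
conv_b = shared + [2] * 50
costs = simulate_consuming([conv_a, conv_b, conv_a, conv_b])
print(costs)
```

In this model every request prefills all 150 tokens: each switch matches the other conversation's entry, consumes it, and then throws the restored state away.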

Fix: restore the cached state into the live context without consuming the entry. Copy tokens + checkpoints into the slot's prompt and leave the cache entry in states. The serialized bytes are no longer cleared on the matched entry, so a later request for the same conversation can match it again. Entries remain bounded by update()'s existing size/token eviction; only the on-match deletion is removed.
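A minimal sketch of the non-consuming load, again as a toy model rather than the actual patch. `CacheEntry`, `SlotPrompt` and this `load()` are simplified stand-ins; the real function also restores the serialized bytes into the live context and applies further matching conditions that are omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class CacheEntry:
    tokens: list
    checkpoints: list = field(default_factory=list)
    data: bytes = b""      # stands in for the serialized KV bytes


@dataclass
class SlotPrompt:
    tokens: list = field(default_factory=list)
    checkpoints: list = field(default_factory=list)
    data: bytes = b""


def lcp(a, b):
    """Length of the longest common prefix of two token lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def load(states, slot_prompt, new_tokens, f_keep_min=0.25):
    """Restore the best-matching entry into the slot without consuming it."""
    best, best_keep = None, 0.0
    for entry in states:
        if not entry.tokens:
            continue
        keep = lcp(entry.tokens, new_tokens) / len(entry.tokens)
        if keep >= f_keep_min and keep > best_keep:
            best, best_keep = entry, keep
    if best is None:
        return False
    # (the real code would restore best.data into the live context here)
    slot_prompt.tokens = list(best.tokens)            # copy, do not move
    slot_prompt.checkpoints = list(best.checkpoints)
    slot_prompt.data = b""                            # bytes are live, not held by the slot
    return True                                       # the entry stays in `states`


shared = list(range(100))
entry_a = CacheEntry(tokens=shared + [1] * 50, data=b"kv-bytes-of-A")
states = [entry_a]
slot = SlotPrompt()

conv_c = shared + [3] * 50
loaded_for_c = load(states, slot, conv_c)             # partial match against A
a_returns = entry_a.tokens + [9] * 10                 # A comes back with a new turn
loaded_for_a = load(states, slot, a_returns)          # A's entry is still there
```

After the partial match for C, A's entry and its bytes are still in `states`, so A's return finds a full match instead of only the shared prefix.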

prompt.data is intentionally left empty after restore: the KV bytes are now live in ctx_tgt/ctx_dft, and prompt_save() asserts prompt.data.size() == 0 before serializing a fresh save from the live context. server_tokens has a deleted copy constructor, so clone() is used for the token copy.
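The invariant on prompt.data can be illustrated with a toy model. This `prompt_save()` is a hypothetical Python stand-in that mirrors only the assertion described above, not the real function's signature or behavior.

```python
class SlotPrompt:
    def __init__(self):
        self.tokens = []
        self.data = b""    # serialized state; empty while the state is live


def prompt_save(slot_prompt, live_context_bytes):
    """Serialize a fresh snapshot; refuses to run if stale bytes are attached."""
    assert len(slot_prompt.data) == 0, "slot prompt already carries serialized bytes"
    return {"tokens": list(slot_prompt.tokens), "data": bytes(live_context_bytes)}


# Restore as in the fix: tokens are copied, data is left empty.
slot = SlotPrompt()
slot.tokens = [1, 2, 3]
snapshot = prompt_save(slot, b"live-kv")          # fine: data is empty

# If a restore had also copied the entry's bytes onto the slot, the next save would trip.
bad_slot = SlotPrompt()
bad_slot.tokens = [1, 2, 3]
bad_slot.data = b"stale-kv"
try:
    prompt_save(bad_slot, b"live-kv")
    save_failed = False
except AssertionError:
    save_failed = True
```

This is why the restore copies only tokens and checkpoints: the bytes stay with the cache entry, and the slot serializes fresh from the live context when it is next saved.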

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes

server_prompt_cache::load() moved the matched entry out of `states` and freed its serialized KV bytes on every successful match. This is wrong for an alternating multi-conversation workload. When two conversations share a long common prefix (e.g. a shared system prompt), a request for conversation B can match conversation A's cached entry with f_keep >= 0.25 but < 1.0. load() restored A's KV into the slot and erased A's entry.

update_slots() then often determines the partial match cannot be reused (divergence past the checkpoint window on SWA / hybrid / recurrent models, or an unsupported partial seq_rm) and resets to a full re-prefill of B. By that point A's entry is already gone, so when conversation A is next requested it also pays a full re-prefill. With clients that alternate between conversations this makes every switch a cold prefill.

Fix: restore the cached state into the live context without consuming the entry. Copy tokens + checkpoints into the slot's prompt and leave the cache entry in `states`. The serialized bytes are no longer cleared on the matched entry, so a later request for the same conversation can match it again. Entries remain bounded by update()'s existing size/token eviction; only the on-match deletion is removed.

prompt.data is intentionally left empty after restore: the KV bytes are now live in ctx_tgt/ctx_dft, and prompt_save() asserts prompt.data.size() == 0 before serializing a fresh save from the live context. server_tokens has a deleted copy constructor, so clone() is used for the token copy.

Regression test for the consume-on-load fix. With a single slot and three conversations, conversation A lives only in the prompt cache while a different conversation C, which shares A's prefix, is loaded. Before the fix, loading C erased A's cache entry; when A returns it can only reuse the A/C shared prefix and re-prefills the rest. After the fix A's entry survives and is reused in full.

Verified against a dense model: the test passes with the fix (A fully reused) and fails without it (only the shared prefix reused).

Adds a slot_prompt_similarity knob to the test server harness so the test can force slot reuse through the prompt cache (save + load) rather than the in-place LCP-similarity path, which would otherwise bypass load().

ggml-gh-bot · 2026-06-04T19:59:24Z

Hi @alainnothere, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

alainnothere · 2026-06-04T20:18:49Z

I disclosed the use of AI in the PR text.

alainnothere added 2 commits June 4, 2026 13:57

alainnothere requested a review from a team as a code owner June 4, 2026 19:55

github-actions Bot added examples python python script changes server labels Jun 4, 2026

alainnothere wants to merge 2 commits into ggml-org:master from alainnothere:fix/server-prompt-cache-no-consume-on-load
