[PoC] server: support requantizing kv cache by wadealexc · Pull Request #24134 · ggml-org/llama.cpp

wadealexc · 2026-06-04T17:20:35Z

Overview

This PR:

Adds an HTTP endpoint POST /cache/requantize, which accepts new values for ctk and ctv to be applied to the server's current kvcache.
Adds a LLAMA_API function, llama_requantize_memory, which reads the state from the existing kvcache, tears it down, creates a new one with the provided ctk/ctv, and restores the saved state into it (quantizing/dequantizing as appropriate).
Allows slots to be restored even if they use different ctk/ctv (included here because the conversion path I'm using is now on the slot restore path)
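
To make the read / tear down / recreate / restore flow concrete, here is a toy sketch in Python. It is not the code in this PR: ToyCache, requantize, and the "q8" type are hypothetical stand-ins, and the block-wise 8-bit scheme is only modelled loosely on ggml's q8_0 (one scale per 32 values).

```python
BLOCK = 32  # values per quantization block, mirroring ggml's q8_0 block size


def quantize_q8(values):
    """Block-wise 8-bit quantization: one float scale plus BLOCK integer codes."""
    assert len(values) % BLOCK == 0
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks


def dequantize_q8(blocks):
    """Expand (scale, codes) blocks back into a flat list of floats."""
    return [scale * code for scale, codes in blocks for code in codes]


class ToyCache:
    """Hypothetical stand-in for a KV cache that stores one tensor in one type."""

    def __init__(self, type_name):
        self.type_name = type_name  # "f32" or "q8"
        self.data = None

    def write(self, floats):
        self.data = quantize_q8(floats) if self.type_name == "q8" else list(floats)

    def read(self):
        return dequantize_q8(self.data) if self.type_name == "q8" else list(self.data)


def requantize(cache, new_type):
    state = cache.read()        # 1. read the state out of the existing cache
    fresh = ToyCache(new_type)  # 2. build a new cache with the requested type
    fresh.write(state)          # 3. restore the state, converting on the way in
    return fresh                # the old cache is freed once the caller drops it


values = [i / 10 - 1.6 for i in range(BLOCK)]
cache = ToyCache("f32")
cache.write(values)
cache = requantize(cache, "q8")  # step down in precision without recomputing values
```

The point of the sketch is that the conversion lives on the restore path, which is why slot restore across different ctk/ctv comes along for free in this design.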

Note: This PR is incomplete. I'm submitting it because it's working for the default llama_kv_cache, and I am hoping for feedback before I spend time getting it working for the other kvcache implementations! (Namely, whether my approach seems directionally correct, and whether this is something llama.cpp actually wants.)

Edit: after reviewing the other kvcache implementations, I realized the work I did on llama_kv_cache is already sufficient to support all other architectures (except recurrent, which doesn't support quantization). I still need to support attention rotation and draft models, but I've now tested this on Qwen3 and Qwen3.5, and it works for both.

Additional information

Motivation

I want to be able to quantize my kvcache on the fly, so that my inference setup can start at high precision and step down as context fills, rather than sitting at low precision for the entire session.

Currently, the only way to achieve this is to unload+reload the entire model, which is slow and also requires re-processing the existing prompt. By exposing a /cache/requantize endpoint, I can unload+reload just the kvcache and avoid having to do prompt processing a second time.
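
For illustration, a client call to the new endpoint might be built like this. This is a sketch under assumptions: the JSON body shape (keys "ctk" and "ctv"), the base URL, and the helper name build_requantize_request are not confirmed by the PR text, which only says the endpoint accepts new values for ctk and ctv.

```python
import json
import urllib.request


def build_requantize_request(base_url, ctk, ctv):
    """Build (but do not send) a POST /cache/requantize request.

    Assumption: the server reads a JSON object with "ctk" and "ctv" keys.
    """
    body = json.dumps({"ctk": ctk, "ctv": ctv}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/cache/requantize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local server address; adjust to wherever llama-server is listening.
req = build_requantize_request("http://localhost:8080", "q8_0", "q8_0")
# urllib.request.urlopen(req) would send it to a running server.
```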

I think my approach could be improved: it might be nice to be able to query current kvcache size before using this method, as well as estimate savings from quantizing (like a dry-run option?). I also think it'd be cool to have the kvcache quantize automatically at certain context thresholds (maybe via a CLI flag). Anyway, I didn't want to get ahead of myself 😄
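
A threshold-based step-down could be as simple as the sketch below. The thresholds, the function name pick_cache_type, and the choice of f16 / q8_0 / q4_0 as the ladder are all made up for illustration; nothing like this exists in the PR yet.

```python
# Hypothetical ladder: (maximum fill fraction, cache type to use up to that fill).
DEFAULT_STEPS = ((0.50, "f16"), (0.80, "q8_0"), (1.00, "q4_0"))


def pick_cache_type(n_used, n_ctx, steps=DEFAULT_STEPS):
    """Return the cache type for the current context fill level."""
    fill = n_used / n_ctx
    for limit, type_name in steps:
        if fill <= limit:
            return type_name
    return steps[-1][1]  # over-full: stay at the lowest precision
```

An inference harness could call this after each request and hit /cache/requantize only when the returned type differs from the current one.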

Still Missing

This PR is incomplete. I'm still missing:

Support for Hadamard rotation (going from f16 -> !f16 only works with rotation disabled)
Draft model handling
Support for mtmd

I'd be happy to implement these (or make any other requested changes), but I wanted to make sure there's interest first before I go spend another week on it! Thank you 🙌

Future Work

GET /cache/usage: get a breakdown of memory usage at the current quantization
"dry-run" option for /cache/requantize: estimate how much memory would be freed by quantizing to the input k/v types
CLI flag (--dynamic-cache): auto-quantize kvcache as needed when inference exceeds device capacity
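
As a rough idea of what a dry-run estimate could compute, here is a back-of-the-envelope sketch. The per-element sizes follow ggml's block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The model shape is hypothetical, and the formula assumes a plain attention cache with one K and one V tensor per layer, ignoring padding and any other buffers.

```python
from fractions import Fraction

# Average bytes per stored element for a few ggml types.
BYTES_PER_ELEM = {
    "f32": Fraction(4),
    "f16": Fraction(2),
    "q8_0": Fraction(34, 32),  # 2-byte scale + 32 one-byte codes per block
    "q4_0": Fraction(18, 32),  # 2-byte scale + 32 four-bit codes per block
}


def kv_bytes(n_layer, n_ctx, n_head_kv, head_dim, ctk, ctv):
    """Estimated K+V size in bytes (head_dim assumed to be a multiple of 32)."""
    n_elem = n_layer * n_ctx * n_head_kv * head_dim  # elements in K, and again in V
    return int(n_elem * (BYTES_PER_ELEM[ctk] + BYTES_PER_ELEM[ctv]))


def dry_run(shape, old_types, new_types):
    """Bytes that would be freed by going from old (ctk, ctv) to new (ctk, ctv)."""
    return kv_bytes(*shape, *old_types) - kv_bytes(*shape, *new_types)


# Hypothetical shape: 32 layers, 8192 context, 8 KV heads, head dimension 128.
shape = (32, 8192, 8, 128)
freed = dry_run(shape, ("f16", "f16"), ("q8_0", "q8_0"))
print(freed)  # 503316480 bytes, i.e. 480 MiB freed out of a 1 GiB f16 cache
```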

Requirements

I have read and agree with the contributing guidelines: Yes!
AI usage disclosure: Yes, I used Qwen3.5-27B and Opus 4.7 to:
- Help me research kvcache architecture and server internals
- Help write the spec and implementation
- Help test the implementation

I reviewed, refactored, and edited all the code in this PR and accept full responsibility.

jukofyork · 2026-06-04T21:36:15Z

From the Reddit post:

A CLI flag like --fit that enables dynamic kvcache quantization without needing to call an API endpoint from your inference harness. This would give you as much context as you can fit on your device, but when you approach the limits of your device, it quantizes your kvcache automatically.

I can't find it now, but I recently read either a paper (or possibly a blog post by a cloud provider) where they found that quantisation (of model weights though, IIRC) was a lot less damaging for (batched) prompt processing than it was for (autoregressive) token generation, and they concluded it could be worthwhile serving a different quant for each (hence why it might have been a blog post by a cloud provider).

This makes you wonder if the same is true for KV-cache quantisation and some kind of "multistage cache" could be helpful?

wadealexc · 2026-06-05T13:11:13Z

This makes you wonder if the same is true for KV-cache quantisation and some kind of "multistage cache" could be helpful?

Interesting; I don't know about batched processing, but I do think there's some room for experimentation with maybe selectively quantizing portions of the cache? Like maybe you want earlier parts of a session to be more quantized, with more precision aimed at newer tokens?

wadealexc added 3 commits June 4, 2026 12:22

feat(llama-server): when restoring from slot, automatically quantize …

f00ba48

…or dequantize as needed - also expose current kvcache type name via GET /props

feat(llama-server): add POST /requantize_kvcache endpoint

0d33c07

refactor: clean up implementation

477f2a3

- remove unreachable v_trans branch

wadealexc requested review from a team and ggerganov as code owners June 4, 2026 17:20

github-actions bot added the examples and server labels Jun 4, 2026

fix: change architecture check to allow all but recurrent

ac75b53

- refac: rename endpoint to /cache/requantize

wadealexc changed the title ~~[PoC] server: support requantizing kv cache~~ server: support requantizing kv cache Jun 5, 2026

wadealexc changed the title ~~server: support requantizing kv cache~~ [PoC] server: support requantizing kv cache Jun 5, 2026

wadealexc wants to merge 4 commits into ggml-org:master from wadealexc:poc-requantize-kvcache

2 participants
