
[PoC] server: support requantizing kv cache#24134

Open
wadealexc wants to merge 4 commits into
ggml-org:masterfrom
wadealexc:poc-requantize-kvcache

@wadealexc wadealexc commented Jun 4, 2026

Overview

This PR:

  • Adds an HTTP endpoint POST /cache/requantize, which accepts new values for ctk and ctv to be applied to the server's current kvcache.
  • Adds a LLAMA_API function, llama_requantize_memory, which reads the state out of the existing kvcache, tears that cache down, creates a new one with the provided ctk/ctv, and restores the saved state into it (quantizing/dequantizing as appropriate).
  • Allows slots to be restored even if they use different ctk/ctv (included here because the conversion path I'm using is now on the slot restore path)
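
The read, tear down, recreate, restore sequence described above can be illustrated with a toy model. This is a minimal sketch, not the code in this PR: `ToyCache`, `requantize`, and the simplified 8-bit block format (32 values sharing one scale, loosely modelled on ggml's Q8_0) are illustrative assumptions, and nothing here calls into llama.cpp.

```python
from dataclasses import dataclass

BLOCK = 32  # values per quantization block (assumption, mirrors ggml's Q8_0 block size)


def quantize_q8(values):
    """Quantize floats into (scale, int list) blocks; len(values) must be a multiple of BLOCK."""
    if len(values) % BLOCK:
        raise ValueError("length must be a multiple of the block size")
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        amax = max(abs(v) for v in chunk)
        scale = amax / 127.0 if amax > 0 else 0.0
        q = [round(v / scale) if scale else 0 for v in chunk]
        blocks.append((scale, q))
    return blocks


def dequantize_q8(blocks):
    """Expand (scale, int list) blocks back into a flat list of floats."""
    return [scale * q for scale, qs in blocks for q in qs]


@dataclass
class ToyCache:
    type_name: str       # "f32" or "q8" in this toy model
    data: object = None  # list of floats for "f32", list of blocks for "q8"

    def store(self, values):
        self.data = list(values) if self.type_name == "f32" else quantize_q8(values)

    def load(self):
        return list(self.data) if self.type_name == "f32" else dequantize_q8(self.data)


def requantize(cache, new_type):
    """Hypothetical analogue of the flow: save state, build a new cache, restore into it."""
    saved = cache.load()            # 1. read the state out as plain floats
    new_cache = ToyCache(new_type)  # 2. create a cache with the new type
    new_cache.store(saved)          # 3. restore, converting on the way in
    return new_cache                # the old cache is freed once the caller drops it


if __name__ == "__main__":
    values = [((i * 37) % 64 - 32) / 8.0 for i in range(64)]
    cache = ToyCache("f32")
    cache.store(values)
    cache = requantize(cache, "q8")
    err = max(abs(a - b) for a, b in zip(values, cache.load()))
    print(f"type={cache.type_name} max_abs_error={err:.5f}")
```

The round trip is lossy in one direction only: stepping down to the 8-bit format introduces a per-block rounding error of at most half a scale step, and converting back up does not recover the lost precision.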

Note: This PR is incomplete. I'm submitting it because it's working for the default llama_kv_cache, and I am hoping for feedback before I spend time getting it working for the other kvcache implementations! (Namely, whether my approach seems directionally correct, and whether this is something llama.cpp actually wants.)

Edit: after reviewing the other kvcache implementations, I realized the work I did on llama_kv_cache is already sufficient to support all other architectures (except recurrent, which doesn't support quantization). I still need to support attention rotation and draft models, but I've now tested this on Qwen3 and Qwen3.5, and it works for both.

Additional information

Motivation

I want to be able to quantize my kvcache on the fly, so that my inference setup can start at high precision and step down as context fills, rather than sitting at low precision for the entire session.

Currently, the only way to achieve this is to unload+reload the entire model, which is slow and also requires re-processing the existing prompt. By exposing a /cache/requantize endpoint, I can unload+reload just the kvcache and avoid having to do prompt processing a second time.

I think my approach could be improved: it might be nice to be able to query current kvcache size before using this method, as well as estimate savings from quantizing (like a dry-run option?). I also think it'd be cool to have the kvcache quantize automatically at certain context thresholds (maybe via a CLI flag). Anyway, I didn't want to get ahead of myself 😄
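
The idea of stepping down at context thresholds could look roughly like the policy below. This is only a sketch under assumptions: the `LADDER` thresholds and the function name `pick_cache_type` are hypothetical and not part of this PR, and the type names are the existing cache types f16, q8_0 and q4_0.

```python
# Hypothetical step-down ladder: (context fill fraction at which to switch, cache type).
# Entries must be sorted by ascending threshold.
LADDER = [(0.0, "f16"), (0.5, "q8_0"), (0.8, "q4_0")]


def pick_cache_type(n_used, n_ctx, ladder=LADDER):
    """Return the cache type for the current fill level of the context window."""
    if n_ctx <= 0:
        raise ValueError("n_ctx must be positive")
    fill = n_used / n_ctx
    chosen = ladder[0][1]
    for threshold, type_name in ladder:
        if fill >= threshold:
            chosen = type_name  # keep the last rung whose threshold has been reached
    return chosen


if __name__ == "__main__":
    for used in (0, 4095, 4096, 7000):
        print(used, pick_cache_type(used, 8192))
```

A harness could call such a function after each request and only hit /cache/requantize when the returned type differs from the current one.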

Still Missing

This PR is incomplete. I'm still missing:

  • Support for Hadamard rotation (going from f16 -> !f16 only works with rotation disabled)
  • Draft model handling
  • Support for mtmd

I'd be happy to implement these (or make any other requested changes), but I wanted to make sure there's interest first before I go spend another week on it! Thank you 🙌

Future Work

  • GET /cache/usage: get a breakdown of kvcache memory usage at the current quantization
  • "dry-run" option for /cache/requantize: estimate how much memory would be freed by quantizing to the input k/v types
  • CLI flag (--dynamic-cache): auto-quantize kvcache as needed when inference exceeds device capacity
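
As a rough illustration of what a dry-run estimate could compute, the sketch below sizes a kvcache from its dimensions and the block layout of each type. It makes simplifying assumptions: every layer holds a K and a V tensor of identical shape, with no padding and no sliding-window or recurrent layers. The function names and the example model dimensions are hypothetical; the block sizes are those of ggml's f32, f16, q8_0 and q4_0 types.

```python
# (bytes per block, elements per block) for a few ggml tensor types.
GGML_TYPE_LAYOUT = {
    "f32": (4, 1),
    "f16": (2, 1),
    "q8_0": (34, 32),  # 32 int8 values plus one f16 scale
    "q4_0": (18, 32),  # 32 4-bit values plus one f16 scale
}


def tensor_bytes(n_elements, type_name):
    """Bytes needed to store n_elements in the given type."""
    block_bytes, block_elems = GGML_TYPE_LAYOUT[type_name]
    if n_elements % block_elems:
        raise ValueError("element count must be a multiple of the block size")
    return n_elements // block_elems * block_bytes


def kv_cache_bytes(n_layer, n_ctx, n_head_kv, head_dim, type_k, type_v):
    """Total K + V bytes, assuming identical K and V shapes in every layer."""
    per_side = n_layer * n_ctx * n_head_kv * head_dim
    return tensor_bytes(per_side, type_k) + tensor_bytes(per_side, type_v)


def dry_run_savings(dims, old_types, new_types):
    """Bytes freed by moving from old (ctk, ctv) to new (ctk, ctv); negative if it grows."""
    return kv_cache_bytes(*dims, *old_types) - kv_cache_bytes(*dims, *new_types)


if __name__ == "__main__":
    dims = (32, 8192, 8, 128)  # hypothetical: n_layer, n_ctx, n_head_kv, head_dim
    saved = dry_run_savings(dims, ("f16", "f16"), ("q8_0", "q8_0"))
    print(f"{saved / 2**20:.1f} MiB freed")  # prints "480.0 MiB freed"
```

For these example dimensions the f16 cache takes 1 GiB and the q8_0 cache 544 MiB, so stepping down once frees a little under half of the cache memory.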

Requirements

  • I have read and agree with the contributing guidelines: Yes!
  • AI usage disclosure: Yes, I used Qwen3.5-27B and Opus 4.7 to:
    • Help me research kvcache architecture and server internals
    • Help write the spec and implementation
    • Help test the implementation

I reviewed, refactored, and edited all the code in this PR and accept full responsibility.

wadealexc added 3 commits June 4, 2026 12:22
…or dequantize as needed

- also expose current kvcache type name via GET /props
- remove unreachable v_trans branch
@jukofyork
Collaborator

From the Reddit post:

> A CLI flag like --fit that enables dynamic kvcache quantization without needing to call an API endpoint from your inference harness. This would give you as much context as you can fit on your device, but when you approach the limits of your device, it quantizes your kvcache automatically.

I can't find it now, but I recently read either a paper (or possibly a blog post by a cloud provider) where they found that quantisation (of model weights though, IIRC) was a lot less damaging for (batched) prompt processing than it was for (autoregressive) token generation, and they concluded it could be worthwhile serving a different quant for each (hence why it might have been a blog post by a cloud provider).

This makes you wonder if the same is true for KV-cache quantisation and some kind of "multistage cache" could be helpful?

@wadealexc
Author

> This makes you wonder if the same is true for KV-cache quantisation and some kind of "multistage cache" could be helpful?

Interesting; I don't know about batched processing, but I do think there's some room for experimentation with maybe selectively quantizing portions of the cache? Like maybe you want earlier parts of a session to be more quantized, with more precision aimed at newer tokens?

- refac: rename endpoint to /cache/requantize
@wadealexc wadealexc changed the title [PoC] server: support requantizing kv cache server: support requantizing kv cache Jun 5, 2026
@wadealexc wadealexc changed the title server: support requantizing kv cache [PoC] server: support requantizing kv cache Jun 5, 2026
2 participants