[PoC] server: support requantizing kv cache #24134
…or dequantize as needed - also expose current kvcache type name via GET /props
- remove unreachable v_trans branch
From the Reddit post:
I can't find it now, but I recently read either a paper (or possibly a blog post by a cloud provider) where they found that quantisation (of model weights though, IIRC) was a lot less damaging for (batched) prompt processing than it was for (autoregressive) token generation, and they concluded it could be worthwhile serving a different quant for each (hence why it might have been a blog post by a cloud provider). This makes you wonder if the same is true for KV-cache quantisation and whether some kind of "multistage cache" could be helpful?
Interesting; I don't know about batched processing, but I do think there's some room for experimentation with maybe selectively quantizing portions of the cache? Like maybe you want earlier parts of a session to be more quantized, with more precision aimed at newer tokens?
- refac: rename endpoint to /cache/requantize
Overview
This PR adds:
- `POST /cache/requantize`, which accepts new values for `ctk` and `ctv` to be applied to the server's current kvcache.
- `llama_requantize_memory`, which reads the state from the existing kvcache, then tears it down, creates a new one with the provided ctk/ctv, and restores the old kvcache (quantizing/dequantizing as appropriate).
Note: This PR is incomplete. I'm submitting it because it's working for the default `llama_kv_cache`, and I am hoping for feedback before I spend time getting it working for the other kvcache implementations! (Namely, whether my approach seems directionally correct, and whether this is something llama.cpp actually wants.)
Edit: after reviewing the other kvcache implementations, I realized the work I already did on `llama_kv_cache` is already sufficient to support all other architectures (except recurrent, which doesn't support quantization). I still need to support attention rotation and draft models, but I've tested this on Qwen3 and Qwen3.5 now, and it works for both.
Additional information
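As an illustration of the read / tear down / recreate / restore flow described in the Overview, here is a minimal toy sketch in Python. It is not the PR's implementation: `ToyCache`, `CODECS`, and `requantize` are hypothetical names, the "q8" codec only imitates the general shape of ggml's `q8_0` (blocks of 32 values with one scale each, but with a Python float scale rather than fp16), and passing through a plain float list is an assumption made for readability.

```python
from dataclasses import dataclass

BLOCK = 32  # elements per quantization block (same block size as ggml's q8_0)


def quantize_q8(values):
    """Toy q8_0-style quantization: one float scale plus integer codes per block."""
    blocks = []
    for i in range(0, len(values), BLOCK):
        chunk = values[i:i + BLOCK]
        amax = max((abs(v) for v in chunk), default=0.0)
        scale = amax / 127.0 if amax > 0 else 1.0
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks


def dequantize_q8(blocks):
    """Inverse of quantize_q8, up to rounding error."""
    return [scale * code for scale, codes in blocks for code in codes]


# Hypothetical codec table: type name -> (encode, decode).
CODECS = {
    "f32": (lambda values: list(values), lambda storage: list(storage)),
    "q8": (quantize_q8, dequantize_q8),
}


@dataclass
class ToyCache:
    type_name: str
    storage: object

    @classmethod
    def from_floats(cls, type_name, values):
        encode, _ = CODECS[type_name]
        return cls(type_name, encode(values))

    def to_floats(self):
        _, decode = CODECS[self.type_name]
        return decode(self.storage)


def requantize(cache, new_type):
    """Rebuild a cache in a new type from the state held by the old one."""
    state = cache.to_floats()  # 1. read the state out of the existing cache
    # 2. A real server would free the old buffers here; the toy has nothing to free.
    new_cache = ToyCache.from_floats(new_type, state)  # 3. create, 4. restore
    return new_cache
```

Going from "f32" to "q8" and back loses at most half a quantization step per value, which is the kind of precision cost a step-down trades for memory.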
Motivation
I want to be able to quantize my kvcache on the fly, so that my inference setup can start at high precision and step down as context fills, rather than sitting at low precision for the entire session.
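A client-side version of "start at high precision and step down as context fills" might look like the sketch below. The thresholds in `STEPS` are made up for illustration, and the JSON field names `ctk` and `ctv` are an assumption based on the parameter names in the Overview; the request and response format of `POST /cache/requantize` is not specified here.

```python
import json
import urllib.request

# Hypothetical policy: (minimum context fill fraction, cache type name),
# sorted by ascending threshold.
STEPS = [(0.0, "f16"), (0.5, "q8_0"), (0.8, "q4_0")]


def pick_cache_type(n_used, n_ctx, steps=STEPS):
    """Return the type of the highest threshold that the current fill has reached."""
    fill = n_used / n_ctx
    chosen = steps[0][1]
    for threshold, type_name in steps:
        if fill >= threshold:
            chosen = type_name
    return chosen


def build_requantize_request(base_url, ctk, ctv):
    """Build (but do not send) a POST to /cache/requantize.

    Assumption: the endpoint takes a JSON body with "ctk" and "ctv" keys.
    """
    body = json.dumps({"ctk": ctk, "ctv": ctv}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/cache/requantize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request could then be sent with `urllib.request.urlopen(req)` whenever `pick_cache_type` returns a type different from the one currently in use.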
Currently, the only way to achieve this is to unload+reload the entire model, which is slow and also requires re-processing the existing prompt. By exposing a `/cache/requantize` endpoint, I can unload+reload just the kvcache and avoid having to do prompt processing a second time.
I think my approach could be improved: it might be nice to be able to query current kvcache size before using this method, as well as estimate savings from quantizing (like a dry-run option?). I also think it'd be cool to have the kvcache quantize automatically at certain context thresholds (maybe via a CLI flag). Anyway, I didn't want to get ahead of myself 😄
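To give a rough idea of what a savings estimate could compute, here is a back-of-the-envelope sketch. The block sizes are the ggml layouts for these types (`q8_0`: 32 int8 values plus an fp16 scale, 34 bytes; `q4_0`: 32 4-bit values plus an fp16 scale, 18 bytes). Everything else is an assumption for illustration: the function names are hypothetical, every layer is assumed to hold a full-width K and V tensor for all `n_ctx` cells, and padding or per-architecture differences are ignored.

```python
# (bytes per block, elements per block) for a few ggml types.
TYPE_LAYOUT = {
    "f32": (4, 1),
    "f16": (2, 1),
    "q8_0": (34, 32),  # 32 int8 values + one fp16 scale
    "q4_0": (18, 32),  # 32 4-bit values + one fp16 scale
}


def tensor_bytes(n_elements, type_name):
    """Bytes needed to store n_elements values in the given type."""
    block_bytes, block_elems = TYPE_LAYOUT[type_name]
    if n_elements % block_elems:
        raise ValueError("element count must be a multiple of the block size")
    return n_elements // block_elems * block_bytes


def kv_cache_bytes(n_layer, n_ctx, n_embd_k, n_embd_v, ctk, ctv):
    """Simplified total: one K and one V tensor per layer, all cells allocated."""
    k = tensor_bytes(n_layer * n_ctx * n_embd_k, ctk)
    v = tensor_bytes(n_layer * n_ctx * n_embd_v, ctv)
    return k + v


def estimate_savings(shape, old_types, new_types):
    """Bytes freed by moving from old (ctk, ctv) to new (ctk, ctv)."""
    before = kv_cache_bytes(**shape, ctk=old_types[0], ctv=old_types[1])
    after = kv_cache_bytes(**shape, ctk=new_types[0], ctv=new_types[1])
    return before - after


# Made-up model shape, for illustration only.
shape = {"n_layer": 32, "n_ctx": 4096, "n_embd_k": 1024, "n_embd_v": 1024}
saved = estimate_savings(shape, ("f16", "f16"), ("q8_0", "q8_0"))
print(saved)  # 251658240 bytes, i.e. 240 MiB
```

For this made-up shape, stepping from f16 to `q8_0` on both K and V frees a little under half of the 512 MiB cache.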
Still Missing
This PR is incomplete. I'm still missing:
I'd be happy to implement these (or make any other requested changes), but I wanted to make sure there's interest first before I go spend another week on it! Thank you 🙌
Future Work
- `GET /cache/usage`: get a breakdown of memory usage at the current quantization
- `/cache/requantize`: estimate how much memory would be freed by quantizing to the input k/v types
- `--dynamic-cache`: auto-quantize kvcache as needed when inference exceeds device capacity
Requirements
I reviewed, refactored, and edited all the code in this PR and accept full responsibility.