Summary
We're sharing research findings on KV cache compression that may be useful to the Anthropic team. This is a philanthropic contribution — no ask, just sharing what we found.
Key Findings
1. Hybrid attention models achieve lossless KV cache compression at 12x
Tested across 4 models (GPT-2, Qwen2.5-7B, Qwen3.5-4B, Qwen3.5-9B):
| Model | Architecture | Combined 12x BLEU | PPL increase |
|---|---|---|---|
| GPT-2 124M | Standard | 0.464 | +16% |
| Qwen2.5-7B | Standard + GQA | 0.063 | >100x |
| Qwen3.5-4B | Hybrid + GQA | 1.000 | +3.3% |
| Qwen3.5-9B | Hybrid + GQA | 1.000 | +1.2% |
Hybrid architectures (mixing full attention with linear/SSM layers) tolerate aggressive KV quantization because the non-attention layers absorb and correct quantization errors.
2. Per-head entropy-adaptive bit allocation improves quality +8%
The optimal bit allocation per attention head follows:
b_h = b_avg + 0.25 * (H_avg - H_2(h))
Where b_avg is the average bit budget, H_avg is the mean entropy across heads, and H_2(h) is the order-2 Rényi (collision) entropy of head h's attention distribution. Low-entropy (focused) heads get more bits, high-entropy (diffuse) heads get fewer.
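The allocation rule above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name and the toy entropy values are made up, while the formula and the 0.25 coefficient come from the text:

```python
import numpy as np

def allocate_bits(head_entropies, b_avg=4.0, alpha=0.25):
    """Entropy-adaptive per-head bit allocation: b_h = b_avg + alpha * (H_avg - H_2(h)).

    head_entropies: order-2 Rényi entropy per attention head.
    alpha=0.25 is the coefficient given in the text.
    """
    H = np.asarray(head_entropies, dtype=float)
    H_avg = H.mean()
    # Heads below the mean entropy (focused) get extra bits; diffuse heads give them up.
    return b_avg + alpha * (H_avg - H)

# Toy example: three heads with entropies 1.0, 3.0, 5.0 (mean 3.0)
bits = allocate_bits([1.0, 3.0, 5.0], b_avg=4.0)
# -> [4.5, 4.0, 3.5]; the mean allocation stays at b_avg
```

Note that because the correction term is zero-mean across heads, the total bit budget is preserved exactly.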
3. The Pareto insight: 2% of heads cause most quality loss
The bottom 2% of heads by entropy (attention sinks) dominate the quantization error budget. Skipping quantization on just these heads provides more benefit than optimal bit redistribution across all others.
4. Production validated
A/B benchmarked on RTX 3060: q4_0 KV cache runs within 2.6% of f16 speed. 33K token conversations at 36.5 tok/s. 128K context on 12GB GPU (was 32K without compression).
Resources
Why share this
These techniques could help reduce serving costs and extend context windows for any large-scale LLM deployment. The math is clean, the experiments are reproducible, and the production deployment is trivial (two server flags for the quantization half). Sharing openly in case any of this is useful.
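The serving stack is not named here, but the q4_0/f16 cache types are GGUF-style names, which suggests llama.cpp. If so, the "two server flags" are plausibly the KV cache type options (this is an assumption; the model path and context size below are placeholders):

```shell
# Assumption: llama.cpp's llama-server. Quantizing the KV cache to q4_0
# is two flags; older builds also require -fa for a quantized V cache.
llama-server -m model.gguf \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  -c 131072   # 128K context
```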