
# KV-Cache Quantization for Long-Context LLMs

**Thesis:** Long-Context-Effizienz durch KV-Cache-Quantisierung bei Large Language Models (Long-Context Efficiency through KV-Cache Quantization in Large Language Models)
**Author:** Lennart Behr (M.Sc. Media Informatics, BHT Berlin)
**Supervisor:** Prof. Dr. Edlich
**Hardware:** NVIDIA RTX 5090 (32 GB VRAM)


## Overview

Profiling code and measurement data for the master's thesis on KV-cache quantization. We systematically evaluate INT8, INT4, and INT2 quantization of the Key-Value (KV) cache across four LLMs with different Grouped Query Attention (GQA) architectures, measuring memory, latency, perplexity, and long-context retrieval (Needle-in-a-Haystack).
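
To see why the KV cache dominates long-context memory, here is a back-of-the-envelope estimate using Mistral-7B-v0.1's published configuration (32 layers, 8 KV heads under 4:1 GQA, head dimension 128). Real quantized sizes are somewhat larger because scales and zero-points add overhead:

```python
# Rough KV-cache size estimate for Mistral-7B-v0.1
# (32 layers, 8 KV heads via 4:1 GQA, head_dim 128), batch size 1.
layers, kv_heads, head_dim = 32, 8, 128
seq_len = 32_768

def kv_cache_bytes(bits_per_value: int) -> float:
    # Factor 2 covers keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_value / 8

print(f"FP16: {kv_cache_bytes(16) / 2**30:.1f} GiB")  # 4.0 GiB
print(f"INT4: {kv_cache_bytes(4) / 2**30:.1f} GiB")   # 1.0 GiB (before scale/zero-point overhead)
```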

## Key Findings (WisPro Phase)

The GQA ratio is not a reliable predictor of quantization tolerance. Yi-1.5-9B (8:1) tolerates INT2 with only +15% PPL degradation, while Qwen2-7B (7:1) fails catastrophically at INT4.


## Models Tested

| Model | Parameters | GQA Ratio | INT4 Tolerance | INT2 Tolerance |
|---|---|---|---|---|
| Mistral-7B-v0.1 | 7.2B | 4:1 | ✅ Excellent (-1.0% PPL) | ✅ Good (+2.2% PPL) |
| Qwen3-8B | 8.2B | 4:1 | ✅ Good (+7.0% PPL) | ⚠️ Degraded (+96% PPL) |
| Qwen2-7B | 7.6B | 7:1 | ❌ Fails | ❌ Fails |
| Yi-1.5-9B | 8.8B | 8:1 | ✅ Excellent (-0.3% PPL) | ✅ Usable (+15% PPL) |

## Repository Structure

```
.
├── scripts/
│   ├── profiler_suite.py           # Main CLI: profiling + benchmarks (PPL, Needle)
│   ├── benchmarks/
│   │   ├── needle_haystack.py      # Needle-in-a-Haystack (RULER-style echo-prompt)
│   │   └── perplexity.py           # WikiText-2 / PG-19 sliding-window PPL
│   ├── core/
│   │   ├── model_loader.py         # HF model loading with KV-quant config
│   │   ├── kv_cache.py             # Cache patching, size measurement, timings
│   │   ├── vram_profiler.py        # CUDA peak memory tracking
│   │   └── metrics.py              # Prefill latency, decode throughput
│   ├── aggregate_results.py        # Combine JSON results
│   ├── analyze_delta_ppl.py        # PPL degradation analysis
│   └── generate_*.py               # Table/figure generation
├── results/
│   ├── raw/                        # JSON measurement files
│   │   └── long_context/           # MA phase: 4k–32k context runs
│   ├── figures/                    # Generated plots (PDF)
│   └── tables/                     # LaTeX tables
├── requirements.txt
├── environment.yml
└── README.md
```

## Quick Start

### 1. Set Up the Environment

```bash
conda env create -f environment.yml
conda activate kv-quant
# or
pip install -r requirements.txt
```

### 2. Run Profiling

```bash
# Full profiling + benchmarks for one model
python scripts/profiler_suite.py \
  --model mistralai/Mistral-7B-v0.1 \
  --attn-backend sdpa \
  --kv-quant none int8-hqq int4-hqq int2-hqq int2-hqq-kivi \
  --context-lengths 4096 8192 16384 32768 \
  --benchmarks ppl needle \
  --output-dir results/raw/long_context/ \
  --seed 42

# Minimal: profile one model, FP16 only
python scripts/profiler_suite.py --model gpt2 --context-lengths 128 256
```

**Output:** one JSON file per (backend × kv-quant) combination, containing:

- Measurements per context length (prefill, decode, VRAM, KV size)
- Perplexity (reference + quantized + delta)
- Needle-in-a-Haystack trials (per depth × context length)
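
For a quick look at a result file, the standard library is enough. The path and key names below are illustrative, not the exact schema; inspect a real run for the actual fields:

```python
import json

# Hypothetical path following the naming pattern in results/raw/long_context/.
path = "results/raw/long_context/mistral_7b_v0.1_sdpa_int4-hqq_20260502_152932.json"

with open(path) as f:
    result = json.load(f)

# Top-level sections (names illustrative): per-context measurements,
# perplexity summary, needle trials.
print(sorted(result.keys()))
```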

### 3. Patch Benchmarks (re-run without re-profiling)

If a benchmark needs to be re-run (e.g. after fixing the needle prompt), patch existing result files without repeating expensive profiling:

```bash
# Re-run needle on all Mistral results
python scripts/profiler_suite.py \
  --model mistralai/Mistral-7B-v0.1 \
  --patch results/raw/long_context/mistral_7b_*.json \
  --benchmarks needle

# Re-run PPL on a specific file, no backup
python scripts/profiler_suite.py \
  --model mistralai/Mistral-7B-v0.1 \
  --patch results/raw/long_context/mistral_7b_v0.1_sdpa_fp16_20260502_152932.json \
  --benchmarks ppl --no-backup
```

### 4. Analyze Results

```bash
python scripts/aggregate_results.py
python scripts/analyze_delta_ppl.py
```

## Quantization Backend

We use HQQ (Half-Quadratic Quantization) via Hugging Face Transformers:

| Config | Bits | Group Size | Axis | Mode |
|---|---|---|---|---|
| `int8-hqq` | 8 | 64 | 0 | Symmetric |
| `int4-hqq` | 4 | 64 | 0 | Symmetric |
| `int2-hqq` | 2 | 16 | 0 | Symmetric |
| `int2-hqq-kivi` | 2 | 16 | Keys: 0, Values: 1 | Asymmetric (KIVI) |

**Residual length:** 128 tokens (the last 128 KV entries remain in FP16).
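
For orientation, enabling an HQQ-quantized KV cache through Transformers' `generate()` looks roughly like this. This is a minimal sketch of the public `cache_implementation="quantized"` API in recent Transformers versions, mirroring the `int4-hqq` config above; the repository's actual wiring lives in `scripts/core/model_loader.py` and may differ in detail.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)

# int4-hqq: 4 bits, group size 64, quantization along axis 0,
# last 128 KV entries kept in FP16 (residual_length).
out = model.generate(
    **inputs,
    max_new_tokens=16,
    cache_implementation="quantized",
    cache_config={
        "backend": "HQQ",
        "nbits": 4,
        "q_group_size": 64,
        "axis_key": 0,
        "axis_value": 0,
        "residual_length": 128,
    },
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```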


## Needle-in-a-Haystack Benchmark

Tests whether KV-cache quantization destroys long-range retrieval ability.

- **Format:** Completion/echo-prompt (RULER-style, base-model friendly)
- **Needle:** "The special magic number for this experiment is 7492."
- **Prompt:** "The special magic number for this experiment is" (the model completes)
- **Scoring:** Case-insensitive string match ("7492" in output)
- **Reference:** Hsieh et al. (2024), "RULER", COLM 2024
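
Stripped of haystack construction and context sweeps, the benchmark logic reduces to a few lines. This is a simplified sketch with hypothetical helper names; the full implementation in `scripts/benchmarks/needle_haystack.py` also handles token budgets per context length and repeated trials:

```python
NEEDLE = "The special magic number for this experiment is 7492."
ECHO_PROMPT = "The special magic number for this experiment is"

def build_prompt(haystack: str, depth: float) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)
    and append the echo prompt the model must complete."""
    pos = int(len(haystack) * depth)
    return f"{haystack[:pos]} {NEEDLE} {haystack[pos:]}\n{ECHO_PROMPT}"

def passed(model_output: str) -> bool:
    # Case-insensitive string match on the magic number.
    return "7492" in model_output.lower()
```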

## Practical Recommendations

- **INT8:** Universally safe (<1% PPL degradation, no retrieval loss)
- **INT4:** Requires model-specific validation
- **INT2:** Only for robust models (Mistral-7B, Yi-1.5-9B)

## Citation

```bibtex
@mastersthesis{behr2026kvcache,
  author = {Behr, Lennart},
  title = {Long-Context-Effizienz durch KV-Cache-Quantisierung bei Large Language Models},
  school = {Berliner Hochschule für Technik},
  year = {2026},
  type = {Master's Thesis}
}
```

## License

MIT License. See individual model licenses for usage restrictions.
