SpectralQuant

3% Is All You Need: Breaking TurboQuant's Compression Limit via Spectral Structure

Canonical paper artifacts (NeurIPS 2026 submission):

Main paper: SpectralQuant_main.pdf (also at paper_neurips2026/spectralquant_neurips2026_main.pdf, 13 pages)

Supplement: SpectralQuant_supplement.pdf (also at paper_neurips2026/spectralquant_neurips2026_supplement.pdf, 8 pages)

LaTeX sources and figures: paper_neurips2026/ (main.tex, supplement.tex, neurips_2026.sty, refs_anon.bib, figures/).

Submission audit and round-by-round revision notes: paper_neurips2026/README_submission_audit.md.

Earlier manuscripts under paper_output/, paper_output_v2/, and paper_output_consolidated/ are retained for traceability only and are not the current paper.

About this repository

This is the canonical, private, full SpectralQuant repository: niashwin/spectralquant-full. It consolidates the original public release (Dynamis-Labs/spectralquant) with the expanded paper-valid evidence layer developed afterwards. The current paper is the NeurIPS 2026 submission under paper_neurips2026/ (root-level convenience copies: SpectralQuant_main.pdf and SpectralQuant_supplement.pdf). The earlier consolidated technical report under paper_output_consolidated/ and the original NeurIPS-format manuscript under paper_output/ are retained as historical artifacts only.

This repository was renamed from niashwin/spectralquant-v2 to niashwin/spectralquant-full on 2026-05-01. The pre-rename name niashwin/spectralquant-v2 is preserved verbatim in:

archived JSON repo fields under results/v3/modal/ (frozen experiment artifacts);
the Modal volume name spectralquant-v2-results;
historical filesystem paths (paper_output_v2/, docs/spectralquant_v2_technical_spec.md, experiments/sqv2_replay.py);
the JSON method key spectralquant_v2 and evidence-catalog identifiers (V1-*, V2-SPEC-*, RUN-*).

These historical labels are intentional traceability anchors and are documented in docs/consolidated_spectralquant_inventory.md and docs/claims_discipline.md. All public-facing references to the consolidation repository should use niashwin/spectralquant-full.

Overview

SpectralQuant is a KV cache compression method for large language model inference. It improves on TurboQuant (Zandieh et al., ICLR 2026) by exploiting a universal structural property: across six models in four architecture families, KV cache key vectors concentrate signal in only 3–4% of the head dimension.

By identifying these signal dimensions through a one-time 15-second calibration and dropping QJL error correction on the remaining 96–97% of dimensions, which carry mostly noise, SpectralQuant improves reconstruction quality and compression ratio at the same time.
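The signal/noise split described above can be sketched from a covariance spectrum alone. This is a minimal illustration, not the repository's calibration code: it assumes the effective dimension is the participation ratio of the eigenvalues (one of the two d_eff methods named under results/deff_sweep/), and the function names and the synthetic spectrum are hypothetical.

```python
def participation_ratio(eigenvalues):
    """Effective dimension (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigenvalues)
    squares = sum(v * v for v in eigenvalues)
    if squares == 0:
        raise ValueError("spectrum must contain a non-zero eigenvalue")
    return total * total / squares


def split_signal_noise(eigenvalues):
    """Return (signal_idx, noise_idx), ranking dimensions by eigenvalue."""
    d_eff = max(1, round(participation_ratio(eigenvalues)))
    order = sorted(range(len(eigenvalues)), key=lambda i: eigenvalues[i], reverse=True)
    return sorted(order[:d_eff]), sorted(order[d_eff:])


# Synthetic spectrum for a 128-dim head (assumption, not measured data):
# 4 dominant directions and 124 weak ones.
spectrum = [10.0] * 4 + [0.01] * 124
signal, noise = split_signal_noise(spectrum)
print(len(signal), len(noise))  # 4 124
```

With this toy spectrum, 4 of 128 dimensions (about 3%) are treated as signal; only those would receive error correction under the scheme described above.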

Headline Results

Metric	SpectralQuant	TurboQuant	Improvement
Cosine similarity (Qwen 2.5-14B)	0.9485	0.9226	+2.59 pp
Compression ratio	5.95×	5.02×	+18.6%
Latency (512 tokens)	0.257 ms/step	0.566 ms/step	2.2× faster
Perplexity (Qwen 7B, 1024 tok)	7.51	7.51	Compression-neutral
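For readers checking the Improvement column, the sketch below shows how a percentage-point gain and a speedup factor follow from the two measured columns. The helper names are hypothetical; the inputs are the rounded figures from the table.

```python
def pp_gain(new, old):
    """Absolute gain in percentage points for metrics on a 0-1 scale."""
    return (new - old) * 100


def speedup(baseline_ms, new_ms):
    """How many times faster the new latency is than the baseline."""
    return baseline_ms / new_ms


print(f"{pp_gain(0.9485, 0.9226):+.2f} pp")          # +2.59 pp
print(f"{speedup(0.566, 0.257):.1f}x faster")        # 2.2x faster
```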

Key Findings

Universal low-rank structure. d_eff/head_dim ≈ 3–4% across Qwen (1.5B, 7B, 14B), Llama 3.1-8B, Mistral 7B, and Gemma 2-9B. The ratio is roughly constant across head dimensions, model sizes, and architecture families.
Statistically significant. 10-seed CI on Qwen 2.5-1.5B: SQ mean=0.8635 ± 0.0024 vs TQ mean=0.8409 ± 0.0046, Wilcoxon p=0.031.
Faster at all tested sequence lengths. SQ is faster than TQ at 512, 1024, and 2048 tokens, so calibration-aware compression carries no latency penalty.
KV spectral asymmetry. Keys: d_eff ≈ 4. Values: d_eff ≈ 40–55 (10–15× larger). This explains why low-rank compression fails for values while SQ succeeds.
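The link between effective dimension and low-rank failure in the last finding can be made concrete. For a vector x and an orthogonal projection P, cos(x, Px) = ||Px|| / ||x||, so for a vector whose squared PCA coordinates match the eigenvalues, the cosine similarity after rank-r projection is the square root of the retained variance fraction. The sketch below applies this to two synthetic spectra; the spectra and the function name are assumptions for illustration, not measured data.

```python
import math


def lowrank_cosine(eigenvalues, rank):
    """sqrt(sum of top-r eigenvalues / sum of all eigenvalues).

    This equals cos(x, Px) for a vector x whose squared coordinates in the
    PCA basis equal the eigenvalues, with P the rank-r PCA projection.
    """
    top = sorted(eigenvalues, reverse=True)[:rank]
    return math.sqrt(sum(top) / sum(eigenvalues))


# Synthetic 128-dim spectra: key-like is concentrated in 4 directions,
# value-like spreads its variance over 48 directions.
key_like = [10.0] * 4 + [0.01] * 124
value_like = [1.0] * 48 + [0.01] * 80

print("key-like,   r=4:", round(lowrank_cosine(key_like, 4), 3))
print("value-like, r=4:", round(lowrank_cosine(value_like, 4), 3))
```

A rank-4 projection keeps almost all of the key-like vector but only a small part of the value-like one, which is the qualitative pattern the finding describes.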

Quick Start

git clone https://github.com/dynamis-labs/spectralquant.git
cd spectralquant
pip install -e ".[dev]"

# Clone TurboQuant baseline
mkdir -p baseline
git clone https://github.com/DevTechJr/turboquant_cutile.git baseline/turboquant_cutile

# Run main experiment (quick mode)
PYTHONPATH=src python experiments/run_memory_efficiency.py --quick

Full Reproduction

# Core experiments
PYTHONPATH=src python experiments/neurips_models_asymmetry.py  # Mistral + Gemma + KV asymmetry
PYTHONPATH=src python experiments/neurips_seeds_latency.py     # 10-seed CI + latency crossover
PYTHONPATH=src python experiments/neurips_llama_full.py        # LongBench on Llama (requires HF_TOKEN)
PYTHONPATH=src python experiments/lowrank_cossim_sweep.py      # Low-rank sweep
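Because one of the scripts above needs HF_TOKEN, a small dry-run planner can show which commands would run in the current environment. This is a hypothetical helper, not part of the repository; it only builds command lists and executes nothing.

```python
import os

# Script paths come from the commands above; the flag marks the one that
# the README says requires HF_TOKEN.
SCRIPTS = [
    ("experiments/neurips_models_asymmetry.py", False),
    ("experiments/neurips_seeds_latency.py", False),
    ("experiments/neurips_llama_full.py", True),
    ("experiments/lowrank_cossim_sweep.py", False),
]


def plan_runs(env=None):
    """Return (runnable, skipped) command lists without executing anything."""
    env = os.environ if env is None else env
    runnable, skipped = [], []
    for script, needs_token in SCRIPTS:
        cmd = ["python", script]
        if needs_token and not env.get("HF_TOKEN"):
            skipped.append(cmd)
        else:
            runnable.append(cmd)
    return runnable, skipped


runnable, skipped = plan_runs()
for cmd in runnable:
    print("would run: PYTHONPATH=src", " ".join(cmd))
for cmd in skipped:
    print("skipping (HF_TOKEN not set):", " ".join(cmd))
```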

Requirements

Python ≥ 3.10
PyTorch ≥ 2.2.0
CUDA GPU (experiments ran on NVIDIA B200)
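A quick way to check these requirements before launching an experiment is sketched below. The function name is hypothetical; it relies only on sys.version_info, torch.__version__, and torch.cuda.is_available(), and it treats a missing PyTorch install as a reported problem rather than an error.

```python
import sys


def check_requirements():
    """Return a list of human-readable problems; empty if everything looks fine."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    # torch.__version__ looks like "2.2.0" or "2.2.0+cu121".
    major, minor = (int(p) for p in torch.__version__.split("+")[0].split(".")[:2])
    if (major, minor) < (2, 2):
        problems.append(f"PyTorch >= 2.2.0 is required, found {torch.__version__}")
    if not torch.cuda.is_available():
        problems.append("no CUDA GPU detected")
    return problems


for problem in check_requirements():
    print("WARNING:", problem)
```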

Random Seeds

All experiments use seed 42 by default. The 10-seed CI test uses the seeds 42, 123, 7, 2024, 31415, 99, 1337, 8675309, 271828, and 314159.
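A multi-seed loop over the seeds listed above might look like the sketch below. The set_seed and run_once helpers are hypothetical stand-ins (the repository's own seeding lives in src/spectralquant/utils.py and may differ); NumPy and PyTorch are seeded only if they are installed.

```python
import random

SEEDS = [42, 123, 7, 2024, 31415, 99, 1337, 8675309, 271828, 314159]


def set_seed(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


def run_once(seed):
    """Placeholder for one experiment run; returns a seed-dependent number."""
    set_seed(seed)
    return random.random()


results = {seed: run_once(seed) for seed in SEEDS}
print(len(results), "runs completed")
```

Re-running with the same seed reproduces the same placeholder value, which is the property a real experiment function would need for the 10-seed comparison to be repeatable.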

Paper Claims → Code → Data

Every number in the paper traces to a script and a result file in this repository.

Paper Section	Claim	Script	Result File
Abstract	SQ 0.9485 vs TQ 0.9226 on 14B (+2.59 pp)	`run_memory_efficiency.py`	`results/memory_efficiency/all_models.json`
Abstract	5.95× vs 5.02× compression	Analytical (bit accounting)	Same
Abstract	PPL=9.51 (Qwen 1.5B)	`run_v3_ppl_niah_v2.py`	`results/v3/v3_perplexity_v2.json`
Abstract	PPL=7.51 (Qwen 7B)	`neurips_seeds_latency.py`	`results/neurips/neurips_qwen7b_ppl.json`
Abstract	NIAH 10/10 (Llama)	`run_v3_ppl_niah_v2.py`	`results/v3/v3_niah_llama_v2.json`
Table 1	d_eff/head_dim ≈ 3–4% (6 models)	`neurips_models_asymmetry.py`	`results/neurips/neurips_*.json`
Table 3	Main results (4 models)	`run_memory_efficiency.py`	`results/memory_efficiency/all_models.json`
§Stats	Wilcoxon p=0.031, 10-seed CI	`neurips_seeds_latency.py`	`results/neurips/neurips_10seed.json`
§Cross-arch	Llama +1.74 pp, Mistral +1.21 pp, Gemma +0.72 pp	`neurips_models_asymmetry.py`	`results/neurips/neurips_*.json` + `results/v3/v3_crossarch.json`
§Dist shift	+2.1 to +3.6 pp across domains	`run_v3_deff_distshift_latency.py`	`results/v3/v3_distribution_shift.json`
§Latency	SQ faster at all seq lengths	`neurips_seeds_latency.py`	`results/neurips/neurips_latency_crossover.json`
§KV asymmetry	d_eff_keys≈4, d_eff_vals≈40–55	`neurips_models_asymmetry.py`	`results/neurips/neurips_kv_asymmetry.json`
§Low-rank	Values fail at r=4 (CosSim=0.15)	`lowrank_cossim_sweep.py`	`results/lowrank/lowrank_cossim_sweep.json`
§Calibration	CV=3.9%	`run_calibration_stability.py`	`results/calibration_stability/stability.json`
Ablation	Config G = 0.8741	`run_final_experiments.py`	`results/final/final_experiments.json`
§LongBench	Preliminary n=5	`neurips_llama_full.py`	`results/v3/v3_longbench.json`
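To confirm that the result files behind a claim are present in a checkout, a small audit along the following lines can help. The function name is hypothetical, and the list covers only three of the paths from the table; extend it as needed.

```python
import json
from pathlib import Path

# A few result files named in the table above (relative to the repo root).
RESULT_FILES = [
    "results/memory_efficiency/all_models.json",
    "results/neurips/neurips_10seed.json",
    "results/lowrank/lowrank_cossim_sweep.json",
]


def audit_results(root, paths=RESULT_FILES):
    """Map each path to 'ok', 'missing', or 'invalid JSON'."""
    status = {}
    for rel in paths:
        path = Path(root) / rel
        if not path.is_file():
            status[rel] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            status[rel] = "ok"
        except json.JSONDecodeError:
            status[rel] = "invalid JSON"
    return status


for rel, state in audit_results(".").items():
    print(f"{state:>12}  {rel}")
```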

Repository Structure

spectralquant/
├── src/spectralquant/           Core library (9 modules)
│   ├── calibration.py           Eigenspectral calibration (PCA, d_eff, κ)
│   ├── spectral_rotation.py     Spectral rotation vs random rotation baseline
│   ├── nonuniform_quantization.py  Lloyd-Max with per-regime codebooks
│   ├── selective_qjl.py         QJL correction on signal dims only
│   ├── engine.py                SpectralQuantEngine (subclasses TurboQuantEngine)
│   ├── spectralquant.py         Full standalone pipeline
│   ├── metrics.py               Cosine similarity, MSE, compression ratio
│   └── utils.py                 Seeds, model config, data loading
│
├── experiments/                 21 experiment scripts (see table above)
│
├── results/                     Raw experimental data (44 JSON files)
│   ├── memory_efficiency/       Main results: 4 models × TQ vs SQ
│   ├── neurips/                 10-seed CI, Gemma, Mistral, KV asymmetry, latency
│   ├── v3/                      Cross-arch, perplexity, NIAH, LongBench, d_eff
│   ├── final/                   Ablation table (Config F)
│   ├── calibration_stability/   Calibration stability (CV=3.9%)
│   ├── lowrank/                 Low-rank projection sweep (r=2..64)
│   ├── eigenspectral/           Phase 1 calibration (d_eff per layer, summary stats)
│   ├── baseline_reproduction/   Phase 0 baseline reproduction targets
│   ├── comparison/              Head-to-head TQ vs SQ with per-head statistics
│   ├── comprehensive/           Multi-model sweep across d_eff methods
│   ├── aggressive/              Aggressive compression variant metrics
│   ├── deff_sweep/              d_eff method comparison (participation ratio vs cumvar)
│   ├── kernel/                  Kernel benchmark timing
│   ├── seqlen_sweep/            Sequence length sweep (128–2048 tokens)
│   └── unnormalized/            Normalized vs unnormalized quantization
│
├── paper_neurips2026/           Canonical NeurIPS 2026 submission (current paper)
│   ├── main.tex / supplement.tex                      LaTeX sources
│   ├── neurips_2026.sty                               NeurIPS 2026 style file
│   ├── refs_anon.bib                                  Anonymized bibliography
│   ├── spectralquant_neurips2026_main.pdf             Compiled main PDF (13 pp)
│   ├── spectralquant_neurips2026_supplement.pdf      Compiled supplement PDF (8 pp)
│   ├── README_submission_audit.md                     Submission/review audit
│   └── figures/                                       Publication figures
│
├── SpectralQuant_main.pdf       Root-level copy of the canonical main PDF
├── SpectralQuant_supplement.pdf Root-level copy of the canonical supplement PDF
│
├── paper_output/                Earlier manuscript (retained for traceability only)
│   ├── spectralquant.tex        LaTeX source
│   ├── spectralquant_refs.bib   Bibliography
│   ├── spectralquant.pdf        Compiled PDF
│   ├── generate_figures.py      Figure generation script
│   └── figures/                 Publication figures (PDF + PNG)
│
├── tests/                       Test suite (5 files)
├── configs/                     Experiment configs (default + quick)
├── scripts/                     Setup and runner scripts
├── pyproject.toml               Package metadata
├── Makefile                     Build targets
└── LICENSE                      MIT
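The listing above describes nonuniform_quantization.py as Lloyd-Max with per-regime codebooks. For readers unfamiliar with the technique, here is a minimal 1-D Lloyd-Max sketch fitted to empirical samples. It is a generic textbook version written for this README, not the module's actual code, and the function names are hypothetical.

```python
import random


def lloyd_max(samples, num_levels, iterations=50):
    """Fit a 1-D Lloyd-Max codebook (reconstruction levels) to samples."""
    data = sorted(samples)
    lo, hi = data[0], data[-1]
    # Start from evenly spaced levels over the data range.
    levels = [lo + (hi - lo) * (i + 0.5) / num_levels for i in range(num_levels)]
    for _ in range(iterations):
        # Decision thresholds sit at midpoints between neighbouring levels.
        thresholds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        cells = [[] for _ in levels]
        for x in data:
            idx = sum(x > t for t in thresholds)  # index of the cell holding x
            cells[idx].append(x)
        # Each level moves to the centroid of its cell; empty cells keep theirs.
        levels = [sum(c) / len(c) if c else lvl for c, lvl in zip(cells, levels)]
    return levels


def quantize(x, levels):
    """Map x to the nearest reconstruction level."""
    return min(levels, key=lambda lvl: abs(x - lvl))


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]
codebook = lloyd_max(samples, num_levels=4)  # 4 levels = 2 bits per coordinate
print([round(lvl, 2) for lvl in codebook])
```

Under the per-regime reading, one would fit one such codebook to coordinates from the signal dimensions and another to the noise dimensions; that interpretation is an assumption based on the module description.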

Experiment Scripts

Script	Description	Output
`neurips_models_asymmetry.py`	Mistral 7B + Gemma 2-9B + KV asymmetry (5 models)	`results/neurips/neurips_mistral.json`, `neurips_gemma.json`, `neurips_kv_asymmetry.json`
`neurips_seeds_latency.py`	10-seed CI + latency crossover + Qwen 7B PPL	`results/neurips/neurips_10seed.json`, `neurips_latency_crossover.json`, `neurips_qwen7b_ppl.json`
`neurips_llama_full.py`	LongBench (n=5, 6 subtasks) + NIAH on Llama 3.1-8B	`results/v3/v3_longbench.json`, `v3_niah_llama_v2.json`
`lowrank_cossim_sweep.py`	Low-rank SVD projection sweep (r=2..64)	`results/lowrank/lowrank_cossim_sweep.json`
`run_memory_efficiency.py`	Main results: 4 models × 9 configs	`results/memory_efficiency/all_models.json`
`run_v3_perplexity_crossarch.py`	Cross-architecture + 5-seed CI	`results/v3/v3_crossarch.json`
`run_v3_ppl_niah_v2.py`	Perplexity + NIAH (Llama)	`results/v3/v3_perplexity_v2.json`, `v3_niah_llama_v2.json`
`run_v3_deff_distshift_latency.py`	d_eff sweep + distribution shift + latency	`results/v3/v3_distribution_shift.json`, `v3_deff_sweep.json`
`run_final_experiments.py`	Config F ablation	`results/final/final_experiments.json`
`run_calibration_stability.py`	Calibration stability (CV=3.9%)	`results/calibration_stability/stability.json`
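The calibration-stability figure in the last row is a coefficient of variation (CV). As a reminder of how a CV is computed, the sketch below uses the sample standard deviation divided by the mean. Which quantity the script measures and which standard-deviation convention it uses are not stated here, so the d_eff interpretation, the sample-stdev choice, and the input values are all assumptions.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    mean = statistics.fmean(values)
    return statistics.stdev(values) / mean * 100


# Hypothetical d_eff estimates from repeated calibration runs (not measured data).
d_eff_runs = [4.1, 3.9, 4.0, 4.2, 3.8]
print(f"CV = {coefficient_of_variation(d_eff_runs):.1f}%")
```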

Attribution

TurboQuant — Zandieh, Daliri, Hadian, and Mirrokni (Google Research / Google DeepMind / NYU). Paper: arXiv:2504.19874, ICLR 2026. We use the community implementation by Anirudh Bharadwaj Vangara: DevTechJr/turboquant_cutile.

The Price of Meaning — Barman, Starenky, Bodnar, Narasimhan, and Gopinath (Sentra). Paper: arXiv:2603.27116. The eigenspectral analysis in SpectralQuant builds on the observation from this work that semantic memory systems exhibit universal low-rank structure in their representations.

Citation

@article{gopinath2026spectralquant,
  title={3\% Is All You Need: Breaking {TurboQuant}'s Compression Limit
         via Spectral Structure},
  author={Gopinath, Ashwin},
  year={2026},
  note={Sentra; MIT Department of Mechanical Engineering}
}

License

MIT License. See LICENSE.
