feat(batch): first-class e8p codebook in batch_quantization #17

Merged. cnygaard merged 1 commit into main from feat/batch-quant-e8p on Jun 22, 2026.


@cnygaard (Owner)

Summary

Makes the e8p codebook a first-class option in scripts/batch_quantization.py. Previously the tool only forwarded --codebook-size (a knob for the shell codebook), so e8p was reachable only through the raw extra_args escape hatch.

Changes

  • QuantJob.codebook field: "e8_shell" (default) | "e8_relaxed" | "e8p". It is forwarded as --codebook in _shared_flags only when non-default, so the commands generated for existing shell-codebook jobs stay byte-identical.
  • Validation (__post_init__): e8p is the fixed-grid RVQ recipe → uniform integer bpw 2/3/4 only — rejects mixed precision and codebook_size, and rejects unknown codebook names, all with clear messages.
  • output_name tags non-default codebooks: …-GLQ-<n>bpw-e8p (and -relaxed).
  • Docstring + JOBS example updated with an e8p entry.
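
A minimal sketch of how the field, the validation, the flag forwarding and the name tagging described above could fit together. It is not the code from scripts/batch_quantization.py: the `model`, `bpw` and `mixed_precision` attributes, the error messages, and the exact split into four guards are assumptions made for illustration. Only the names QuantJob, codebook, codebook_size, _shared_flags, output_name and the --codebook / --codebook-size / --bpw flags come from this PR.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_CODEBOOKS = ("e8_shell", "e8_relaxed", "e8p")
DEFAULT_CODEBOOK = "e8_shell"
# Suffix appended to the output name for non-default codebooks.
NAME_TAGS = {"e8_relaxed": "-relaxed", "e8p": "-e8p"}


@dataclass
class QuantJob:
    model: str                            # assumed: model name or path
    bpw: float                            # assumed: bits per weight
    codebook: str = DEFAULT_CODEBOOK
    codebook_size: Optional[int] = None
    mixed_precision: bool = False         # hypothetical stand-in for the real mixed-precision setting

    def __post_init__(self) -> None:
        if self.codebook not in VALID_CODEBOOKS:
            raise ValueError(
                f"unknown codebook {self.codebook!r}; expected one of {VALID_CODEBOOKS}"
            )
        if self.codebook == "e8p":
            # e8p is the fixed-grid RVQ recipe: uniform integer bpw only.
            if self.mixed_precision:
                raise ValueError("e8p does not support mixed precision")
            if self.codebook_size is not None:
                raise ValueError("e8p does not take codebook_size")
            if self.bpw not in (2, 3, 4):
                raise ValueError(f"e8p requires bpw 2, 3 or 4, got {self.bpw}")

    def _shared_flags(self) -> List[str]:
        flags = ["--bpw", f"{self.bpw:g}"]
        if self.codebook_size is not None:
            flags += ["--codebook-size", str(self.codebook_size)]
        # Omit the flag for the default so existing commands do not change.
        if self.codebook != DEFAULT_CODEBOOK:
            flags += ["--codebook", self.codebook]
        return flags

    @property
    def output_name(self) -> str:
        base = self.model.rstrip("/").split("/")[-1]
        return f"{base}-GLQ-{self.bpw:g}bpw{NAME_TAGS.get(self.codebook, '')}"


job = QuantJob(model="gemma-4-12B-it", bpw=3, codebook="e8p")
print(job._shared_flags())   # ['--bpw', '3', '--codebook', 'e8p']
print(job.output_name)       # gemma-4-12B-it-GLQ-3bpw-e8p
```

Validating in `__post_init__` means a bad JOBS entry fails when the job list is constructed, before any quantization run starts.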

Tests

  • 7 new unit tests in tests/test_batch_progress.py (flag forwarding, name tagging, the four validation guards, default-omits-flag). 21/21 pass.
  • Dry-run verified: a gemma-4-12B-it @ 3bpw e8p job builds `glq-quantize … --bpw 3 --codebook e8p --streaming` and outputs to gemma-4-12B-it-GLQ-3bpw-e8p.

Follows the 0.6.5 e8p kernel work (PR #15). The smoke worker already gates limit_mm_per_prompt on the checkpoint architecture, so e8p checkpoints (text-only or multimodal) go through the smoke test unchanged.

The batch quantizer only forwarded --codebook-size (a shell knob); e8p was
reachable only via the raw extra_args escape hatch. Make it first-class:

- QuantJob.codebook ("e8_shell" default | "e8_relaxed" | "e8p") -> forwarded
  as --codebook in _shared_flags (omitted when default, so existing shell
  commands are byte-identical).
- Validation: e8p is the fixed-grid RVQ recipe -> uniform integer bpw 2/3/4
  only (rejects mixed precision and codebook_size with clear messages);
  unknown codebook names rejected.
- output_name tags non-default codebooks: ...-GLQ-<n>bpw-e8p / -relaxed.
- Docstring + JOBS example updated with an e8p entry.
- 7 unit tests (flag forwarding, name tag, the four validation guards,
  default-omits-flag); 21/21 pass.

Verified by dry-run: a gemma-4-12B-it@3bpw e8p job builds
`glq-quantize ... --bpw 3 --codebook e8p --streaming` and outputs to
gemma-4-12B-it-GLQ-3bpw-e8p.

cnygaard merged commit d3cb612 into main on Jun 22, 2026
3 checks passed
cnygaard added a commit that referenced this pull request Jun 23, 2026
Block-diagonal E8P: the E8P codebook path no longer forces a full power-of-2
Hadamard. It pads each linear dim to a sum of pow2 blocks (each >= 64 cols /
16 rows) instead of the next power of two, so non-pow2 models stop bloating —
gemma-4-31B E8P 3bpw drops from ~27 GB to ~11 GB (qidxs 22.5 -> 10.7 GiB) while
the tensor-core decode + RVQ are untouched, still one fused op so FULL cudagraph
still pays off. Serves on vLLM 0.23 FULL cudagraph; 1.5-1.9x faster decode than
the shell codebook at matched quality. Folds in three fixes surfaced by
validation: vLLM cudagraph buffer sizing, an HF-eager / large-batch shared-mem
OOB (single-block RHT on a non-pow2 m_pad), and pow2<->block-diag back-compat
loading (re-derives dims from the loaded Qidxs_e8p shape, so both legacy pow2
and new block-diag checkpoints load on every path). A GLQ_E8P_POW2 quantize
toggle keeps the legacy full-Hadamard E8P available — measured ~1 PPL better at
2bpw, ~0.26 at 3bpw, so block-diag is near-free at >=3bpw.
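
To illustrate the padding difference described in this commit message, here is a rough sketch that compares "next power of two" against "sum of power-of-two blocks". The decomposition rule used here (round up to a multiple of the minimum block, then read off the binary expansion) is an assumption; the commit message does not say how the blocks are chosen. The function names and the dimension 5376 are made up for the example.

```python
from typing import List


def next_pow2(n: int) -> int:
    """Smallest power of two >= n (the legacy full-Hadamard padding)."""
    return 1 << (n - 1).bit_length()


def pow2_blocks(dim: int, min_block: int = 64) -> List[int]:
    """Split dim into descending power-of-two blocks, each >= min_block.

    Assumed rule: round dim up to a multiple of min_block (itself a power
    of two), then take one block per set bit of the padded size.
    """
    if dim <= 0:
        raise ValueError("dim must be positive")
    padded = -(-dim // min_block) * min_block   # ceil to a multiple of min_block
    blocks = []
    bit = 1 << (padded.bit_length() - 1)
    while bit >= min_block:
        if padded & bit:
            blocks.append(bit)
        bit >>= 1
    return blocks


dim = 5376                        # illustrative non-pow2 dimension
print(next_pow2(dim))             # 8192
print(pow2_blocks(dim))           # [4096, 1024, 256]
print(sum(pow2_blocks(dim)))      # 5376
```

For the row dimension the same helper could be called with `min_block=16`, matching the ">= 64 cols / 16 rows" limits quoted above. In this example the padded size falls from 8192 to 5376 along one dimension, which is the kind of saving behind the smaller checkpoints reported in the commit message.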

Also includes first-class E8P support in the batch_quantization tool (#17).