feat(batch): first-class e8p codebook in batch_quantization #17
Merged
The batch quantizer only forwarded --codebook-size (a shell knob); e8p was
reachable only via the raw extra_args escape hatch. Make it first-class:
- QuantJob.codebook ("e8_shell" default | "e8_relaxed" | "e8p") -> forwarded
as --codebook in _shared_flags (omitted when default, so existing shell
commands are byte-identical).
- Validation: e8p is the fixed-grid RVQ recipe -> uniform integer bpw 2/3/4
only (rejects mixed precision and codebook_size with clear messages);
unknown codebook names rejected.
- output_name tags non-default codebooks: ...-GLQ-<n>bpw-e8p / -relaxed.
- Docstring + JOBS example updated with an e8p entry.
- 7 unit tests (flag forwarding, name tag, the four validation guards,
default-omits-flag); 21/21 pass.
Verified by dry-run: a gemma-4-12B-it@3bpw e8p job builds
`glq-quantize ... --bpw 3 --codebook e8p --streaming` and outputs to
gemma-4-12B-it-GLQ-3bpw-e8p.
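The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in the repository: the `model` and `mixed_precision` field names, the error messages, and the standalone `shared_flags` method are assumptions, and only the `--bpw`, `--codebook`, and `--codebook-size` flags and the codebook names come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_CODEBOOKS = ("e8_shell", "e8_relaxed", "e8p")
DEFAULT_CODEBOOK = "e8_shell"
# Suffix appended to the output name for non-default codebooks.
NAME_TAGS = {"e8_relaxed": "-relaxed", "e8p": "-e8p"}


@dataclass
class QuantJob:
    model: str                            # hypothetical field name
    bpw: float                            # target bits per weight
    codebook: str = DEFAULT_CODEBOOK
    codebook_size: Optional[int] = None   # shell-only knob
    mixed_precision: bool = False         # hypothetical field name

    def __post_init__(self) -> None:
        # Guard 1: unknown codebook names are rejected.
        if self.codebook not in VALID_CODEBOOKS:
            raise ValueError(f"unknown codebook {self.codebook!r}")
        if self.codebook == "e8p":
            # Guards 2-4: e8p is a fixed-grid recipe, uniform integer bpw only.
            if self.mixed_precision:
                raise ValueError("e8p does not support mixed precision")
            if self.codebook_size is not None:
                raise ValueError("codebook_size is a shell knob, not valid for e8p")
            if self.bpw not in (2, 3, 4):
                raise ValueError("e8p requires integer bpw of 2, 3, or 4")

    def shared_flags(self) -> List[str]:
        flags = ["--bpw", f"{self.bpw:g}"]
        if self.codebook_size is not None:
            flags += ["--codebook-size", str(self.codebook_size)]
        # Omit --codebook for the default so existing commands do not change.
        if self.codebook != DEFAULT_CODEBOOK:
            flags += ["--codebook", self.codebook]
        return flags

    @property
    def output_name(self) -> str:
        tag = NAME_TAGS.get(self.codebook, "")
        return f"{self.model}-GLQ-{self.bpw:g}bpw{tag}"


job = QuantJob(model="gemma-4-12B-it", bpw=3, codebook="e8p")
print(job.shared_flags())  # ['--bpw', '3', '--codebook', 'e8p']
print(job.output_name)     # gemma-4-12B-it-GLQ-3bpw-e8p
```

Validating in `__post_init__` means a bad job fails when the job list is built rather than partway through a long batch run.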
cnygaard added a commit that referenced this pull request on Jun 23, 2026
Block-diagonal E8P: the E8P codebook path no longer forces a full power-of-2 Hadamard. It pads each linear dim to a sum of pow2 blocks (each >= 64 cols / 16 rows) instead of the next power of two, so non-pow2 models stop bloating: gemma-4-31B E8P 3bpw drops from ~27 GB to ~11 GB (qidxs 22.5 -> 10.7 GiB). The tensor-core decode + RVQ are untouched, and the path is still one fused op, so FULL cudagraph still pays off. Serves on vLLM 0.23 FULL cudagraph; 1.5-1.9x faster decode than the shell codebook at matched quality.

Folds in three fixes surfaced by validation: vLLM cudagraph buffer sizing, an HF-eager / large-batch shared-mem OOB (single-block RHT on a non-pow2 m_pad), and pow2<->block-diag back-compat loading (re-derives dims from the loaded Qidxs_e8p shape, so both legacy pow2 and new block-diag checkpoints load on every path).

A GLQ_E8P_POW2 quantize toggle keeps the legacy full-Hadamard E8P available (measured ~1 PPL better at 2bpw, ~0.26 at 3bpw), so block-diag is near-free at >=3bpw.

Also includes first-class E8P support in the batch_quantization tool (#17).
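To illustrate why padding to a sum of power-of-two blocks is smaller than padding to the next power of two, here is a small sketch. The decomposition rule (round up to a multiple of the minimum block, then take the binary expansion) is an assumption chosen for illustration, and the commit does not state the exact rule it uses. The function names and the example dimension 5376 are hypothetical.

```python
from typing import List


def next_pow2(n: int) -> int:
    """Smallest power of two >= n (legacy full-Hadamard padding)."""
    return 1 << (n - 1).bit_length()


def pow2_blocks(n: int, min_block: int = 64) -> List[int]:
    """Split n into descending power-of-two blocks, each >= min_block.

    Assumed rule: round n up to a multiple of min_block (itself a power
    of two), then read the blocks off the binary expansion.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if min_block <= 0 or min_block & (min_block - 1):
        raise ValueError("min_block must be a power of two")
    padded = -(-n // min_block) * min_block  # ceil to a multiple of min_block
    return [1 << i
            for i in range(padded.bit_length() - 1, -1, -1)
            if (padded >> i) & 1]


dim = 5376  # illustrative non-pow2 dimension
print(pow2_blocks(dim))       # [4096, 1024, 256]
print(sum(pow2_blocks(dim)))  # 5376
print(next_pow2(dim))         # 8192
```

In this example the block-diagonal padding adds no columns at all, while the next-power-of-two padding inflates the dimension by about 52%.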
Summary
Makes the e8p codebook a first-class option in `scripts/batch_quantization.py`. Previously the tool only forwarded `--codebook-size` (a shell knob), so e8p was reachable only through the raw `extra_args` escape hatch.

Changes
- `QuantJob.codebook` field: `"e8_shell"` (default) | `"e8_relaxed"` | `"e8p"`; forwarded as `--codebook` in `_shared_flags` only when non-default (existing shell commands stay byte-identical).
- Validation (`__post_init__`): e8p is the fixed-grid RVQ recipe → uniform integer bpw 2/3/4 only. Rejects mixed precision and `codebook_size`, and rejects unknown codebook names, all with clear messages.
- `output_name` tags non-default codebooks: `…-GLQ-<n>bpw-e8p` (and `-relaxed`).
- Docstring + `JOBS` example updated with an e8p entry.

Tests
- 7 unit tests in `tests/test_batch_progress.py` (flag forwarding, name tagging, the four validation guards, default-omits-flag). 21/21 pass.
- Verified by dry-run: a `gemma-4-12B-it @ 3bpw e8p` job builds `glq-quantize … --bpw 3 --codebook e8p --streaming` → `gemma-4-12B-it-GLQ-3bpw-e8p`.

Follows the 0.6.5 e8p kernel work (PR #15). The smoke worker already gates `limit_mm_per_prompt` on the checkpoint architecture, so e8p (text-only or multimodal) smoke-tests unchanged.