E8P-RVQ on vLLM with CUDA graphs (single fused op, 2/3/4 bpw) #12
Merged
Wire the E8P (padded-D̂8 tensor-core) codebook into glq_vllm so `--codebook
e8p` checkpoints serve under vLLM's FULL cudagraph path, sharing the one
validated forward with HF inference.
- config: GLQvLLMConfig carries `codebook` ("e8_shell"|"e8p"), threaded into
GLQLinearMethod via get_quant_method.
- custom_ops: register decode_matvec_e8p / decompress_packed_e8p /
lookupmatmul_e81b_k8 / decompress_e81b_packed as torch.ops.glq.* with fake
impls, so the e8p apply traces fullgraph and cudagraph-captures.
- quantized_linear: extract E8RHTLinear._e8p_linear_apply as the shared,
torch.ops-dispatched core (HF _forward_e8p + the vLLM apply both call it).
Anchor the B>1 decompressed weights to the activation device so the
fullgraph fake-tensor trace keeps a consistent device on the residual add.
- linear_method: e8p create_weights (single + fused GLQShardedParameter
shards), _ensure_codebook(e8p) device singletons, process_weights device
move + per-(shard) scalar/flag cache, and an apply dispatch to
_glq_apply_e8p (one call per shard, concatenated).
- codebook_e8p: build the synthesised grid and the round() matmul in fp32 so
codebook construction is independent of vLLM's fp16 default dtype.
Key fix: register the stage-2 residual buffer (Qidxs2_e8p at bpw>=4,
Qidxs2_e81b at bpw==3) at its FINAL shape in create_weights. vLLM manages and
restores registered-parameter .data after process_weights, so the empty(0)
sentinel + loader empty_like() resize stranded stage-2 on CPU and every
move-back was reverted, aborting the decompress kernel at capture. Full-size
registration takes the loader's in-place copy_ branch (the create_weights CUDA
storage vLLM keeps), matching Qidxs_e8p and the shell's Qidxs2.
Validated: 4bpw SmolLM3-3B serves on vLLM 0.23, coherent output, 51 decode
CUDA graphs captured.
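
The stage-2 buffer failure can be pictured with a toy model. This is a minimal sketch, not vLLM's loader: `Buf` and `load_into` are hypothetical names, and the two branches (in-place copy when shapes match, `empty_like()` resize otherwise) are assumed from the description above.

```python
from dataclasses import dataclass


@dataclass
class Buf:
    """Toy stand-in for a registered parameter's .data (shape and device only)."""
    shape: tuple
    device: str


def load_into(param: Buf, loaded: Buf) -> Buf:
    """Hypothetical loader: copy in place if shapes match, otherwise resize."""
    if param.shape == loaded.shape:
        # In-place copy_: values change, the create_weights storage (and device) stays.
        return param
    # empty_like(loaded): fresh storage allocated where the checkpoint tensor lives.
    return Buf(loaded.shape, loaded.device)


checkpoint = Buf((1024, 64), "cpu")  # illustrative shape, not from the PR

# empty(0) sentinel: shape mismatch, so the buffer ends up on the checkpoint's device.
sentinel = load_into(Buf((0,), "cuda"), checkpoint)
# Full-size registration: shapes match, so the CUDA storage is kept.
full_size = load_into(Buf((1024, 64), "cuda"), checkpoint)

print(sentinel.device, full_size.device)  # cpu cuda
```

In this model only the full-size registration keeps its CUDA storage, which is the behaviour the fix relies on.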
glq_fused_linear_e8p_cuda collapses the per-linear E8P apply (input_rht + N-stage decode/matmul + ×Wscale + output_rht) into ONE opaque custom op, so vLLM's fullgraph dynamo trace sees a single node instead of the ~5-op chain that inductor padded with materialised copies. It's a thin host-side orchestrator reusing the existing kernels (glq_input_rht_cuda / glq_decode_matvec_e8p / glq_decompress_packed_e8p / glq_output_rht_cuda), with no new device kernels.
Covers stage-1 (2bpw) + E8P stage-2 (4bpw); E81B (3bpw) stays on the multi-op path. _e8p_linear_apply prefers it (kill-switch _GLQ_FUSED_E8P_ENABLED), so both HF and vLLM run the one path.
Validated on SmolLM3-3B 4bpw (vLLM 0.23, Blackwell):
- Bit-exact vs the multi-op path: prefill logits max_abs 0.0, identical greedy decode.
- B=1 decode tok/s, vLLM cudagraph: 12.4 -> 85.5 (6.9x). The multi-op chain had made cudagraph 2x SLOWER than eager (compute lost to inductor copies); fused cudagraph is now 2.5x FASTER than eager (85.5 vs 33.7) and 2.7x the HF-eager baseline. Eager also gains (25.4 -> 33.7) from fewer dispatches.
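
The quoted ratios follow directly from the four tok/s figures above. A quick check, where the dict layout and the `ratio` helper are illustrative only (the HF-eager baseline is not stated, so the 2.7x figure is not recomputed):

```python
# tok/s figures quoted above (B=1 decode, SmolLM3-3B 4bpw, vLLM 0.23).
multi_op = {"eager": 25.4, "cudagraph": 12.4}
fused = {"eager": 33.7, "cudagraph": 85.5}


def ratio(a: float, b: float) -> float:
    """Speedup of a over b, rounded to one decimal as in the commit message."""
    return round(a / b, 1)


cudagraph_gain = ratio(fused["cudagraph"], multi_op["cudagraph"])    # fused vs multi-op
multi_op_regression = ratio(multi_op["eager"], multi_op["cudagraph"])  # eager vs cudagraph
fused_vs_eager = ratio(fused["cudagraph"], fused["eager"])           # cudagraph vs eager
eager_gain = ratio(fused["eager"], multi_op["eager"])                # fused vs multi-op

print(cudagraph_gain, multi_op_regression, fused_vs_eager, eager_gain)
# 6.9 2.0 2.5 1.3
```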
glq_fused_linear_e8p_cuda now handles the 3bpw recipe's E81B stage-2 in the single opaque op (previously only stage-1 + E8P stage-2 / 4bpw; E81B fell back to the multi-op apply).
Added qidxs2_e81b + e81b_codebook args; B=1 builds the (8,n_pad) input / (8,m_pad) accumulator and calls glq_lookupmatmul_e81b_k8, B>1 calls glq_decompress_e81b_packed. This mirrors the multi-op _e8p_linear_apply e81b branches and reuses the existing kernels (no new device code). Threaded the two args through the binding, op schema, fake, and the apply call (dropped the not-has_e81b guard; falls back only if the e81b grid is absent).
Validated on a freshly-quantized SmolLM2-135M @ 3bpw e8p (Blackwell):
- Bit-exact vs the multi-op path: prefill max_abs 0.0, identical greedy decode.
- Serves on vLLM 0.23 with FULL cudagraphs, coherent, capture 0.92 GiB.
The single fused op now covers every E8P-RVQ bpw (2/3/4).
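
The per-bpw recipe and the fused-path fallback described in these commits can be summarised as a small sketch. `RECIPES`, `stage2_buffer`, and `use_fused` are hypothetical names for illustration, not the project's API; the bits-per-stage values (E8P = 2, E81B = 1) are an assumption chosen to be consistent with the 2/3/4 bpw totals.

```python
# RVQ stages per bpw, as listed in this PR.
RECIPES = {2: ["E8P"], 3: ["E8P", "E81B"], 4: ["E8P", "E8P"]}

# Assumption: bits per weight contributed by each stage's codebook.
BITS = {"E8P": 2, "E81B": 1}


def stage2_buffer(bpw: int):
    """Name of the stage-2 residual buffer for a recipe, or None at 2 bpw."""
    stages = RECIPES[bpw]
    if len(stages) < 2:
        return None
    return "Qidxs2_e8p" if stages[1] == "E8P" else "Qidxs2_e81b"


def use_fused(enabled: bool, bpw: int, e81b_grid_present: bool) -> bool:
    """Prefer the fused op unless the kill-switch is off or the e81b grid is absent."""
    needs_e81b = "E81B" in RECIPES[bpw]
    return enabled and (not needs_e81b or e81b_grid_present)


for bpw, stages in RECIPES.items():
    total = sum(BITS[s] for s in stages)
    print(bpw, stages, total, stage2_buffer(bpw))
```

Under these assumptions each recipe's stage bits sum to its bpw, and only the 3 bpw recipe depends on the e81b grid being available.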
README: add an "E8P codebook (--codebook e8p) — derivative of QuIP#" section (2/3/4 bpw RVQ recipe, fused-op vLLM CUDA-graph serving) and rewrite the Acknowledgments to state plainly that the e8p path is derivative work porting QuIP#'s E8P codebook and tensor-core decode / residual kernels, with full credit and the QuIP# citation. glq_e8p.cu: add an SPDX-License-Identifier and a header crediting QuIP# (Cornell-RelaxML, GPL-3.0) with a GPL §5 modification notice. QuIP# is itself GPL-3.0, so the ported kernels are license-compatible with this project; the header records the provenance and the modifications (renamed glq_*, assembled into a self-contained TU).
Makes the `--codebook e8p` (QuIP#-derived) path serve on vLLM with FULL CUDA graphs, and collapses the per-linear E8P apply into one fused op so cudagraph actually pays off.
What
- `codebook` threaded through config; the 4 e8p ops registered as `torch.ops.glq.*`; `create_weights` (single + fused QKV/gate_up shards), codebook cache, `process_weights` device move, and `apply` dispatch all share the one validated `_e8p_linear_apply` forward.
- `glq_fused_linear_e8p_cuda`: the whole linear (input RHT → E8P tensor-core decode, all RVQ stages → ×Wscale → output RHT) as one opaque op, a thin host-side orchestrator reusing the existing kernels (no new device code). vLLM's fullgraph trace then sees one node, like the shell's `fused_linear`.
- Stages per bpw: 2 [E8P], 3 [E8P, E81B], 4 [E8P, E8P].
- `glq_e8p.cu` header crediting QuIP# (Cornell-RelaxML) per GPL §5.
Why it matters
The multi-op apply regressed under cudagraph (inductor pads the chain with copies). The single fused op fixes it. B=1 decode tok/s, SmolLM3-3B 4bpw, RTX PRO 6000 Blackwell, vLLM 0.23, multi-op -> fused:
- cudagraph: 12.4 -> 85.5 (6.9x)
- eager: 25.4 -> 33.7
Root-cause fix worth flagging
vLLM manages/restores registered-param `.data` after `process_weights`, so the loader's `empty_like()` resize stranded the stage-2 buffer on CPU and every move-back was reverted (kernel aborted at capture). Fix: register the stage-2 buffer full-size so the loader does an in-place `copy_`, matching the shell's `Qidxs2` and the always-cuda `Qidxs_e8p`.
Validation
- Bit-exact vs the multi-op path (prefill `max_abs 0.0`, identical greedy decode) on 4bpw SmolLM3-3B and a freshly-quantized 3bpw SmolLM2-135M.
- `_GLQ_FUSED_E8P_ENABLED` kill-switch (also the A/B-test lever).
Attribution
The `--codebook e8p` path is derivative work that ports QuIP#'s E8P codebook and tensor-core decode/residual kernels. Credit + citation are in the README and the `glq_e8p.cu` header.