E8P-RVQ on vLLM with CUDA graphs (single fused op, 2/3/4 bpw) by cnygaard · Pull Request #12 · cnygaard/glq

cnygaard · 2026-06-21T22:59:21Z

Makes the --codebook e8p (QuIP#-derived) path serve on vLLM with FULL CUDA
graphs, and collapses the per-linear E8P apply into one fused op so cudagraph
actually pays off.

What

- vLLM wiring: E8P checkpoints serve under the GLQ vLLM plugin. The codebook is threaded through config; the 4 e8p ops are registered as torch.ops.glq.*; create_weights (single + fused QKV/gate_up shards), the codebook cache, the process_weights device move, and the apply dispatch all share the one validated _e8p_linear_apply forward.
- Fused single-call op glq_fused_linear_e8p_cuda: the whole linear (input RHT → E8P tensor-core decode, all RVQ stages → ×Wscale → output RHT) as one opaque op, a thin host-side orchestrator reusing the existing kernels (no new device code). vLLM's fullgraph trace then sees one node, like the shell's fused_linear.
- Covers every E8P-RVQ bit-width: 2 [E8P], 3 [E8P, E81B], 4 [E8P, E8P].
- Docs + attribution: README E8P section, and an SPDX/GPL header on glq_e8p.cu crediting QuIP# (Cornell-RelaxML) per GPL §5.
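The bit-width coverage listed above can be written down as a small lookup. This is only an illustrative sketch: the names `RECIPES`, `STAGE_BITS` and `stages_for` are hypothetical, and the per-stage widths (2 bits for an E8P stage, 1 bit for an E81B stage) are an assumption inferred from the recipe totals rather than something this PR states.

```python
# Hypothetical names for illustration; not identifiers from the GLQ codebase.
# ASSUMPTION: an E8P stage contributes 2 bits/weight, an E81B stage 1 bit/weight.
STAGE_BITS = {"E8P": 2, "E81B": 1}

# Recipes exactly as listed in the PR description: bpw -> RVQ stages.
RECIPES = {
    2: ["E8P"],
    3: ["E8P", "E81B"],
    4: ["E8P", "E8P"],
}


def stages_for(bpw: int) -> list:
    """Return the RVQ stage list for a supported bit-width."""
    try:
        return RECIPES[bpw]
    except KeyError:
        raise ValueError(f"unsupported E8P-RVQ bit-width: {bpw}") from None


# Sanity check: under the assumed widths, each recipe sums to its bpw.
for bpw, stages in RECIPES.items():
    assert sum(STAGE_BITS[s] for s in stages) == bpw
```

Under that assumption the stage list alone determines which stage-2 kernel, if any, the fused op has to call.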

Why it matters

The multi-op apply regressed under cudagraph (inductor pads the chain with
copies). The single fused op fixes it — B=1 decode, SmolLM3-3B 4bpw, RTX PRO
6000 Blackwell, vLLM 0.23:

| mode | multi-op (tok/s) | fused (tok/s) |
| --- | --- | --- |
| cudagraph | 12.4 | 85.5 (6.9×) |
| eager | 25.4 | 33.7 |
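As a quick arithmetic check, the ratios implied by these four measurements can be recomputed directly. The snippet uses only the numbers in the table; the dictionary layout and key names are illustrative.

```python
# Throughput figures copied from the table (tok/s, B=1 decode).
tok_s = {
    ("cudagraph", "multi-op"): 12.4,
    ("cudagraph", "fused"): 85.5,
    ("eager", "multi-op"): 25.4,
    ("eager", "fused"): 33.7,
}

speedups = {
    # Fused vs multi-op under cudagraph (the headline number).
    "fused_vs_multi_cudagraph": tok_s[("cudagraph", "fused")] / tok_s[("cudagraph", "multi-op")],
    # Fused vs multi-op in eager mode.
    "fused_vs_multi_eager": tok_s[("eager", "fused")] / tok_s[("eager", "multi-op")],
    # How much eager beat cudagraph on the multi-op path (the regression).
    "eager_vs_cudagraph_multi": tok_s[("eager", "multi-op")] / tok_s[("cudagraph", "multi-op")],
    # How much cudagraph beats eager once the op is fused.
    "cudagraph_vs_eager_fused": tok_s[("cudagraph", "fused")] / tok_s[("eager", "fused")],
}

for name, value in speedups.items():
    print(f"{name}: {value:.1f}x")
```

The first ratio rounds to the 6.9× shown in the table. The third shows the multi-op path running about 2× slower under cudagraph than in eager mode, which is the regression the fused op removes.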

Root-cause fix worth flagging

vLLM manages/restores registered-param .data after process_weights, so the
loader's empty_like() resize stranded the stage-2 buffer on CPU and every
move-back was reverted (kernel aborted at capture). Fix: register the stage-2
buffer full-size so the loader does an in-place copy_ — matching the
shell's Qidxs2 and the always-cuda Qidxs_e8p.
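The failure mode is easier to see in a toy model. The sketch below uses plain Python objects in place of tensors, and the snapshot/restore step is an assumed simplification of the behaviour described above, not vLLM's actual code. `Storage`, `Param`, `load`, `process_weights` and `run` are all hypothetical names.

```python
class Storage:
    """Toy stand-in for tensor storage: a device tag plus values."""

    def __init__(self, device, values):
        self.device = device
        self.values = list(values)


class Param:
    """Toy registered parameter; `.data` is a reference to a Storage."""

    def __init__(self, storage):
        self.data = storage


def load(param, loaded):
    """Toy loader: in-place copy if sizes match, else rebind to new CPU storage."""
    if len(param.data.values) == len(loaded):
        param.data.values[:] = loaded        # copy_-like: same storage, same device
    else:
        param.data = Storage("cpu", loaded)  # empty_like-style resize: fresh CPU storage


def process_weights(param):
    """Toy device move, wrapped in an ASSUMED snapshot/restore of `.data`."""
    snapshot = param.data
    if param.data.device != "cuda":
        param.data = Storage("cuda", param.data.values)  # attempted move-back
    param.data = snapshot  # the framework restores what it was tracking


def run(registered_size, loaded):
    param = Param(Storage("cuda", [0] * registered_size))  # create_weights
    load(param, loaded)
    process_weights(param)
    return param.data.device, param.data.values


print(run(0, [1, 2, 3]))  # empty sentinel -> ('cpu', [1, 2, 3])
print(run(3, [1, 2, 3]))  # full-size      -> ('cuda', [1, 2, 3])
```

With the empty sentinel the loader rebinds `.data` to CPU storage and the later move is undone. With full-size registration the values land in the storage allocated at create_weights, so there is nothing to move and nothing to revert.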

Validation

- Bit-exact fused vs multi-op across 2/3/4 bpw (prefill max_abs 0.0, identical greedy decode): 4bpw SmolLM3-3B and a freshly-quantized 3bpw SmolLM2-135M.
- E8P serves on vLLM 0.23 with FULL cudagraphs, coherent, at 3bpw and 4bpw.
- Behind a _GLQ_FUSED_E8P_ENABLED kill-switch (also the A/B-test lever).
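A bit-exactness check of the kind described here compares two things: the maximum absolute difference between the logits and the greedy token choice at each step. The following is a minimal sketch with nested lists standing in for logit tensors. The helper names are hypothetical and this is not the project's test code.

```python
def max_abs_diff(a, b):
    """Largest |a_i - b_i| over two equal-length flat sequences."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def greedy_tokens(logits_per_step):
    """Argmax token id for each decode step."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits_per_step]


def is_bit_exact(ref, test):
    """True when logits match exactly and greedy decode is identical."""
    if len(ref) != len(test):
        return False
    flat_ref = [v for row in ref for v in row]
    flat_test = [v for row in test for v in row]
    if len(flat_ref) != len(flat_test):
        return False
    return max_abs_diff(flat_ref, flat_test) == 0.0 and greedy_tokens(ref) == greedy_tokens(test)


# Toy logits: two steps, vocabulary of three tokens.
multi_op = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]
fused = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]
print(is_bit_exact(multi_op, fused))
```

In an A/B run, the two inputs would come from the same prompt with the kill-switch on and off.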

Attribution

The --codebook e8p path is derivative work that ports QuIP#'s E8P codebook
and tensor-core decode/residual kernels. Credit + citation are in the README
and the glq_e8p.cu header.

Wire the E8P (padded-D̂8 tensor-core) codebook into glq_vllm so `--codebook e8p` checkpoints serve under vLLM's FULL cudagraph path, sharing the one validated forward with HF inference.

- config: GLQvLLMConfig carries `codebook` ("e8_shell"|"e8p"), threaded into GLQLinearMethod via get_quant_method.
- custom_ops: register decode_matvec_e8p / decompress_packed_e8p / lookupmatmul_e81b_k8 / decompress_e81b_packed as torch.ops.glq.* with fake impls, so the e8p apply traces fullgraph and cudagraph-captures.
- quantized_linear: extract E8RHTLinear._e8p_linear_apply as the shared, torch.ops-dispatched core (HF _forward_e8p + the vLLM apply both call it). Anchor the B>1 decompressed weights to the activation device so the fullgraph fake-tensor trace keeps a consistent device on the residual add.
- linear_method: e8p create_weights (single + fused GLQShardedParameter shards), _ensure_codebook(e8p) device singletons, process_weights device move + per-(shard) scalar/flag cache, and an apply dispatch to _glq_apply_e8p (one call per shard, concatenated).
- codebook_e8p: build the synthesised grid and the round() matmul in fp32 so codebook construction is independent of vLLM's fp16 default dtype.

Key fix: register the stage-2 residual buffer (Qidxs2_e8p at bpw>=4, Qidxs2_e81b at bpw==3) at its FINAL shape in create_weights. vLLM manages and restores registered-parameter .data after process_weights, so the empty(0) sentinel + loader empty_like() resize stranded stage-2 on CPU and every move-back was reverted, aborting the decompress kernel at capture. Full-size registration takes the loader's in-place copy_ branch (the create_weights CUDA storage vLLM keeps), matching Qidxs_e8p and the shell's Qidxs2.

Validated: 4bpw SmolLM3-3B serves on vLLM 0.23, coherent output, 51 decode CUDA graphs captured.

glq_fused_linear_e8p_cuda collapses the per-linear E8P apply (input_rht + N-stage decode/matmul + ×Wscale + output_rht) into ONE opaque custom op, so vLLM's fullgraph dynamo trace sees a single node instead of the ~5-op chain inductor padded with materialised copies. It's a thin host-side orchestrator reusing the existing kernels (glq_input_rht_cuda / glq_decode_matvec_e8p / glq_decompress_packed_e8p / glq_output_rht_cuda), with no new device kernels. Covers stage-1 (2bpw) + E8P stage-2 (4bpw); E81B (3bpw) stays on the multi-op path. _e8p_linear_apply prefers it (kill-switch _GLQ_FUSED_E8P_ENABLED), so both HF and vLLM run the one path.

Validated on SmolLM3-3B 4bpw (vLLM 0.23, Blackwell):

- Bit-exact vs the multi-op path: prefill logits max_abs 0.0, identical greedy decode.
- B=1 decode tok/s, vLLM cudagraph: 12.4 -> 85.5 (6.9x). The multi-op chain had made cudagraph 2x SLOWER than eager (compute lost to inductor copies); fused cudagraph is now 2.5x FASTER than eager (85.5 vs 33.7) and 2.7x the HF-eager baseline. Eager also gains (25.4 -> 33.7) from fewer dispatches.
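The shape of such a host-side orchestrator can be sketched in plain Python. This is a toy model only. The Hadamard transform, the 2-wide codebooks, the sign placement and every name here (`fwht`, `input_rht`, `decode_matvec`, `output_rht`, `fused_linear`) are assumptions made for illustration, not the GLQ kernels or their signatures. The point is that the fused entry point calls the same steps in the same order, so its output matches the step-by-step chain exactly.

```python
import math


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = list(v)
    n, h = len(v), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [x * s for x in v]


def input_rht(x, signs):
    """Toy randomized Hadamard transform: sign flip, then Hadamard."""
    return fwht([xi * si for xi, si in zip(x, signs)])


def output_rht(y, signs):
    """Toy output transform: Hadamard, then sign flip."""
    return [yi * si for yi, si in zip(fwht(y), signs)]


def decode_matvec(idx, codebook, x):
    """Decode each weight row from codebook indices and dot it with x."""
    out = []
    for row in idx:
        w = [c for code in row for c in codebook[code]]
        out.append(sum(wi * xi for wi, xi in zip(w, x)))
    return out


def fused_linear(x, stages, wscale, sv, su):
    """One entry point: input RHT -> all RVQ stages -> scale -> output RHT."""
    h = input_rht(x, sv)
    acc = [0.0] * len(stages[0][0])
    for idx, codebook in stages:
        acc = [a + p for a, p in zip(acc, decode_matvec(idx, codebook, h))]
    return output_rht([a * wscale for a in acc], su)


# Toy data: 2x4 weight, two RVQ stages, codebook entries of width 2.
CB1 = [(0.5, 0.5), (0.5, -0.5), (-0.5, 0.5), (-0.5, -0.5)]
CB2 = [(0.25, 0.0), (0.0, 0.25)]
STAGES = [([[0, 3], [1, 2]], CB1), ([[1, 0], [0, 1]], CB2)]
X, SV, SU, WSCALE = [1.0, -2.0, 0.5, 3.0], [1, -1, 1, 1], [1, -1], 0.75

# The multi-op chain: the caller sees every intermediate step.
h = input_rht(X, SV)
acc = [0.0, 0.0]
for idx, cb in STAGES:
    acc = [a + p for a, p in zip(acc, decode_matvec(idx, cb, h))]
y_chain = output_rht([a * WSCALE for a in acc], SU)

y_fused = fused_linear(X, STAGES, WSCALE, SV, SU)
print(y_fused == y_chain)
```

Because the fused function adds no arithmetic of its own, the equality is exact. That mirrors the bit-exact result reported for the real op, where the difference from the chain is only in what the graph tracer sees.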

glq_fused_linear_e8p_cuda now handles the 3bpw recipe's E81B stage-2 in the single opaque op (previously only stage-1 + E8P stage-2 / 4bpw; E81B fell back to the multi-op apply).

Added qidxs2_e81b + e81b_codebook args; B=1 builds the (8,n_pad) input / (8,m_pad) accumulator and calls glq_lookupmatmul_e81b_k8, B>1 calls glq_decompress_e81b_packed, mirroring the multi-op _e8p_linear_apply e81b branches and reusing the existing kernels (no new device code). Threaded the two args through the binding, op schema, fake, and the apply call (dropped the not-has_e81b guard; falls back only if the e81b grid is absent).

Validated on a freshly-quantized SmolLM2-135M @ 3bpw e8p (Blackwell):

- Bit-exact vs the multi-op path: prefill max_abs 0.0, identical greedy decode.
- Serves on vLLM 0.23 with FULL cudagraphs, coherent, capture 0.92 GiB.

The single fused op now covers every E8P-RVQ bpw (2/3/4).

- README: add an "E8P codebook (--codebook e8p) — derivative of QuIP#" section (2/3/4 bpw RVQ recipe, fused-op vLLM CUDA-graph serving) and rewrite the Acknowledgments to state plainly that the e8p path is derivative work porting QuIP#'s E8P codebook and tensor-core decode / residual kernels, with full credit and the QuIP# citation.
- glq_e8p.cu: add an SPDX-License-Identifier and a header crediting QuIP# (Cornell-RelaxML, GPL-3.0) with a GPL §5 modification notice.

QuIP# is itself GPL-3.0, so the ported kernels are license-compatible with this project; the header records the provenance and the modifications (renamed glq_*, assembled into a self-contained TU).

cnygaard added 5 commits June 21, 2026 23:14

docs(e8p): soften QuIP# citation request to opt-in

6cf723b

cnygaard merged commit 5fafe05 into main Jun 21, 2026

cnygaard deleted the feat/e8p-vllm-cudagraph branch June 21, 2026 23:09

cnygaard mentioned this pull request Jun 22, 2026

release: 0.6.4 #14

Merged


