E8P-RVQ on vLLM with CUDA graphs (single fused op, 2/3/4 bpw) #12

Merged: cnygaard merged 5 commits into main from feat/e8p-vllm-cudagraph on Jun 21, 2026

Makes the --codebook e8p (QuIP#-derived) path serve on vLLM with FULL CUDA
graphs, and collapses the per-linear E8P apply into one fused op so cudagraph
actually pays off.

What

  • vLLM wiring — E8P checkpoints serve under the GLQ vLLM plugin: codebook
    threaded through config; the 4 e8p ops registered as torch.ops.glq.*;
    create_weights (single + fused QKV/gate_up shards), codebook cache,
    process_weights device move, and apply dispatch all share the one
    validated _e8p_linear_apply forward.
  • Fused single-call op glq_fused_linear_e8p_cuda — the whole linear
    (input RHT → E8P tensor-core decode, all RVQ stages → ×Wscale → output RHT)
    as one opaque op, a thin host-side orchestrator reusing the existing kernels
    (no new device code). vLLM's fullgraph trace then sees one node, like the
    shell's fused_linear.
  • Covers every E8P-RVQ bit-width: 2 [E8P], 3 [E8P, E81B],
    4 [E8P, E8P].
  • Docs + attribution: README E8P section, and an SPDX/GPL header on
    glq_e8p.cu crediting QuIP# (Cornell-RelaxML) per GPL §5.
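
To illustrate the "one node instead of a chain" idea, here is a minimal pure-Python sketch, not the actual kernels. The `RECIPES` table is taken from the list above; everything else (`traced`, `fwht`, `stage_matvec`, dense per-stage weights standing in for codebook decode, the `TRACE` list standing in for a graph tracer) is a hypothetical toy. Both paths run the same helper functions in the same order, so their outputs match exactly; only the number of recorded nodes differs.

```python
import random

# Bits per weight -> ordered RVQ stages, as listed in this PR.
RECIPES = {2: ["E8P"], 3: ["E8P", "E81B"], 4: ["E8P", "E8P"]}

TRACE = []  # toy stand-in for the nodes a graph tracer would record


def traced(name, fn):
    """Wrap fn so that each call is recorded as one node in TRACE."""
    def wrapper(*args):
        TRACE.append(name)
        return fn(*args)
    return wrapper


def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of 2)."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    norm = len(v) ** -0.5
    return [x * norm for x in v]


def rht(v, signs):
    """Toy randomized Hadamard transform: sign flip, then Hadamard."""
    return fwht([x * s for x, s in zip(v, signs)])


def stage_matvec(w, x, acc):
    """Add one RVQ stage's contribution (W_stage @ x) into acc."""
    return [a + sum(wi * xi for wi, xi in zip(row, x)) for a, row in zip(acc, w)]


def scale(v, wscale):
    return [x * wscale for x in v]


def _pipeline(x, stages, wscale, su, sv, k_in, k_stage, k_scale, k_out):
    """input RHT -> all RVQ stages -> x Wscale -> output RHT."""
    h = k_in(x, su)
    acc = [0.0] * len(stages[0])
    for w in stages:
        acc = k_stage(w, h, acc)
    return k_out(k_scale(acc, wscale), sv)


def apply_multi_op(x, stages, wscale, su, sv):
    # Every step is its own traced op, so the tracer sees a chain.
    return _pipeline(x, stages, wscale, su, sv,
                     traced("input_rht", rht), traced("stage", stage_matvec),
                     traced("scale", scale), traced("output_rht", rht))


def _fused(x, stages, wscale, su, sv):
    # Thin orchestrator: same helpers, but none of them is traced on its own.
    return _pipeline(x, stages, wscale, su, sv, rht, stage_matvec, scale, rht)


apply_fused = traced("fused_linear", _fused)

rng = random.Random(0)
n = 8
x = [rng.uniform(-1, 1) for _ in range(n)]
su = [rng.choice((-1.0, 1.0)) for _ in range(n)]
sv = [rng.choice((-1.0, 1.0)) for _ in range(n)]
stages = [[[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
          for _ in RECIPES[4]]

TRACE.clear()
y_multi = apply_multi_op(x, stages, 0.5, su, sv)
multi_nodes = list(TRACE)

TRACE.clear()
y_fused = apply_fused(x, stages, 0.5, su, sv)
fused_nodes = list(TRACE)

max_abs = max(abs(a - b) for a, b in zip(y_multi, y_fused))
print(len(multi_nodes), len(fused_nodes), max_abs)
```

In this toy the 4bpw recipe yields a five-node chain for the multi-op path and a single node for the fused wrapper, which mirrors the shape of the change without saying anything about real kernel performance.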

Why it matters

The multi-op apply regressed under cudagraph: inductor pads the op chain with
materialised copies, which made cudagraph about 2× slower than eager. The single
fused op fixes it. Measured at B=1 decode, SmolLM3-3B 4bpw, RTX PRO 6000
Blackwell, vLLM 0.23:

mode        multi-op (tok/s)   fused (tok/s)
cudagraph   12.4               85.5 (6.9×)
eager       25.4               33.7
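
For reference, the ratios quoted in this PR can be recomputed from the four throughput figures above. The dictionary layout and variable names are only for illustration.

```python
# Decode throughput in tok/s, copied from the table above.
tok_s = {
    ("cudagraph", "multi-op"): 12.4,
    ("cudagraph", "fused"): 85.5,
    ("eager", "multi-op"): 25.4,
    ("eager", "fused"): 33.7,
}

# Fused vs multi-op under cudagraph (the headline speedup).
fused_gain_cudagraph = tok_s[("cudagraph", "fused")] / tok_s[("cudagraph", "multi-op")]
# How much slower multi-op cudagraph was than multi-op eager.
multi_op_cudagraph_penalty = tok_s[("eager", "multi-op")] / tok_s[("cudagraph", "multi-op")]
# Fused cudagraph vs fused eager.
fused_cudagraph_vs_eager = tok_s[("cudagraph", "fused")] / tok_s[("eager", "fused")]
# Fused vs multi-op in eager mode.
fused_gain_eager = tok_s[("eager", "fused")] / tok_s[("eager", "multi-op")]

for name, value in [
    ("fused vs multi-op (cudagraph)", fused_gain_cudagraph),
    ("eager vs cudagraph (multi-op)", multi_op_cudagraph_penalty),
    ("cudagraph vs eager (fused)", fused_cudagraph_vs_eager),
    ("fused vs multi-op (eager)", fused_gain_eager),
]:
    print(f"{name}: {value:.1f}x")
```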

Root-cause fix worth flagging

vLLM manages/restores registered-param .data after process_weights, so the
loader's empty_like() resize stranded the stage-2 buffer on CPU and every
move-back was reverted (kernel aborted at capture). Fix: register the stage-2
buffer full-size so the loader does an in-place copy_ — matching the
shell's Qidxs2 and the always-cuda Qidxs_e8p.
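
A toy model of that failure mode, assuming only what the paragraph above describes. The `Buf` and `Param` classes, the `loader`, `process_weights` and `framework_run` functions, and the "restore" step are hypothetical stand-ins, not vLLM code. The point is that a resize rebinds `.data` to new CPU storage, while an in-place copy keeps the storage that was registered on CUDA.

```python
class Buf:
    """Toy tensor storage: a list of values plus a device tag."""
    def __init__(self, n, device):
        self.values = [0] * n
        self.device = device


class Param:
    """Toy registered parameter whose .data can be rebound."""
    def __init__(self, data):
        self.data = data


def loader(param, loaded):
    """Hypothetical loader: resize on shape mismatch, else copy in place."""
    if len(param.data.values) != len(loaded.values):
        # empty_like-style resize: new storage on the checkpoint's device (CPU).
        param.data = Buf(len(loaded.values), loaded.device)
    param.data.values[:] = loaded.values  # in-place copy


def process_weights(param):
    """Move the buffer to CUDA if it is not already there."""
    if param.data.device != "cuda":
        moved = Buf(len(param.data.values), "cuda")
        moved.values[:] = param.data.values
        param.data = moved


def framework_run(param, loaded):
    """Toy framework: remembers .data before process_weights and restores it after."""
    loader(param, loaded)
    managed = param.data
    process_weights(param)
    param.data = managed  # the restore reverts any rebinding done in process_weights
    return param.data.device


checkpoint = Buf(4, "cpu")
checkpoint.values[:] = [3, 1, 4, 1]

# Zero-size sentinel: the loader resizes, and the buffer ends up stranded on CPU.
sentinel = Param(Buf(0, "cuda"))
sentinel_device = framework_run(sentinel, checkpoint)

# Full-size registration: the loader copies in place into the CUDA storage.
full = Param(Buf(4, "cuda"))
full_device = framework_run(full, checkpoint)

print(sentinel_device, full_device)
```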

Validation

  • Bit-exact fused vs multi-op across 2/3/4 bpw (prefill max_abs 0.0,
    identical greedy decode) — 4bpw SmolLM3-3B and a freshly-quantized 3bpw
    SmolLM2-135M.
  • E8P serves on vLLM 0.23 with FULL cudagraphs, coherent, at 3bpw and 4bpw.
  • Behind a _GLQ_FUSED_E8P_ENABLED kill-switch (also the A/B-test lever).

Attribution

The --codebook e8p path is derivative work that ports QuIP#'s E8P codebook
and tensor-core decode/residual kernels. Credit + citation are in the README
and the glq_e8p.cu header.

cnygaard added 5 commits June 21, 2026 23:14
Wire the E8P (padded-D̂8 tensor-core) codebook into glq_vllm so `--codebook
e8p` checkpoints serve under vLLM's FULL cudagraph path, sharing the one
validated forward with HF inference.

- config: GLQvLLMConfig carries `codebook` ("e8_shell"|"e8p"), threaded into
  GLQLinearMethod via get_quant_method.
- custom_ops: register decode_matvec_e8p / decompress_packed_e8p /
  lookupmatmul_e81b_k8 / decompress_e81b_packed as torch.ops.glq.* with fake
  impls, so the e8p apply traces fullgraph and cudagraph-captures.
- quantized_linear: extract E8RHTLinear._e8p_linear_apply as the shared,
  torch.ops-dispatched core (HF _forward_e8p + the vLLM apply both call it).
  Anchor the B>1 decompressed weights to the activation device so the
  fullgraph fake-tensor trace keeps a consistent device on the residual add.
- linear_method: e8p create_weights (single + fused GLQShardedParameter
  shards), _ensure_codebook(e8p) device singletons, process_weights device
  move + per-(shard) scalar/flag cache, and an apply dispatch to
  _glq_apply_e8p (one call per shard, concatenated).
- codebook_e8p: build the synthesised grid and the round() matmul in fp32 so
  codebook construction is independent of vLLM's fp16 default dtype.

Key fix: register the stage-2 residual buffer (Qidxs2_e8p at bpw>=4,
Qidxs2_e81b at bpw==3) at its FINAL shape in create_weights. vLLM manages and
restores registered-parameter .data after process_weights, so the empty(0)
sentinel + loader empty_like() resize stranded stage-2 on CPU and every
move-back was reverted, aborting the decompress kernel at capture. Full-size
registration takes the loader's in-place copy_ branch (the create_weights CUDA
storage vLLM keeps), matching Qidxs_e8p and the shell's Qidxs2.

Validated: 4bpw SmolLM3-3B serves on vLLM 0.23, coherent output, 51 decode
CUDA graphs captured.

glq_fused_linear_e8p_cuda collapses the per-linear E8P apply (input_rht +
N-stage decode/matmul + ×Wscale + output_rht) into ONE opaque custom op, so
vLLM's fullgraph dynamo trace sees a single node instead of the ~5-op chain
inductor padded with materialised copies. It's a thin host-side orchestrator
reusing the existing kernels (glq_input_rht_cuda / glq_decode_matvec_e8p /
glq_decompress_packed_e8p / glq_output_rht_cuda) — no new device kernels.
Covers stage-1 (2bpw) + E8P stage-2 (4bpw); E81B (3bpw) stays on the multi-op
path. _e8p_linear_apply prefers it (kill-switch _GLQ_FUSED_E8P_ENABLED), so
both HF and vLLM run the one path.

Validated on SmolLM3-3B 4bpw (vLLM 0.23, Blackwell):
- Bit-exact vs the multi-op path: prefill logits max_abs 0.0, identical greedy
  decode.
- B=1 decode tok/s, vLLM cudagraph: 12.4 -> 85.5 (6.9x). The multi-op chain
  had made cudagraph 2x SLOWER than eager (compute lost to inductor copies);
  fused cudagraph is now 2.5x FASTER than eager (85.5 vs 33.7) and 2.7x the
  HF-eager baseline. Eager also gains (25.4 -> 33.7) from fewer dispatches.

glq_fused_linear_e8p_cuda now handles the 3bpw recipe's E81B stage-2 in the
single opaque op (previously only stage-1 + E8P stage-2 / 4bpw; E81B fell back
to the multi-op apply). Added qidxs2_e81b + e81b_codebook args; B=1 builds the
(8,n_pad) input / (8,m_pad) accumulator and calls glq_lookupmatmul_e81b_k8,
B>1 calls glq_decompress_e81b_packed — mirroring the multi-op _e8p_linear_apply
e81b branches, reusing the existing kernels (no new device code). Threaded the
two args through the binding, op schema, fake, and the apply call (dropped the
not-has_e81b guard; falls back only if the e81b grid is absent).

Validated on a freshly-quantized SmolLM2-135M @ 3bpw e8p (Blackwell):
- Bit-exact vs the multi-op path: prefill max_abs 0.0, identical greedy decode.
- Serves on vLLM 0.23 with FULL cudagraphs, coherent, capture 0.92 GiB.
The single fused op now covers every E8P-RVQ bpw (2/3/4).

README: add an "E8P codebook (--codebook e8p) — derivative of QuIP#" section
(2/3/4 bpw RVQ recipe, fused-op vLLM CUDA-graph serving) and rewrite the
Acknowledgments to state plainly that the e8p path is derivative work porting
QuIP#'s E8P codebook and tensor-core decode / residual kernels, with full
credit and the QuIP# citation.

glq_e8p.cu: add an SPDX-License-Identifier and a header crediting QuIP#
(Cornell-RelaxML, GPL-3.0) with a GPL §5 modification notice. QuIP# is itself
GPL-3.0, so the ported kernels are license-compatible with this project; the
header records the provenance and the modifications (renamed glq_*, assembled
into a self-contained TU).
@cnygaard cnygaard merged commit 5fafe05 into main Jun 21, 2026
@cnygaard cnygaard deleted the feat/e8p-vllm-cudagraph branch June 21, 2026 23:09