release: 0.6.5 #16

Merged
cnygaard merged 1 commit into main from release/0.6.5 on Jun 22, 2026
@cnygaard

Owner

Bumps version to 0.6.5. Ships the compressed batched E8P tensor-core GEMM (B>1 decode cliff fix, 3.2x at B=32) and bit-exact determinism for the whole E8P kernel surface (B=1 GEMV + batched GEMM), validated end-to-end across 2/3/4 bpw. Merged PR #15.
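
As a rough illustration of the batched path described above, the toy Python sketch below decodes each compressed weight tile once and reuses it for every token, walking the batch in groups of at most 8 (the number of tokens the PR says fit in the idle mma.sync n-columns). This is not the `glq_matmul_e8p` kernel: `decode_tile`, `batched_matvec`, the scalar `codebook`, and the one-row-per-tile layout are all hypothetical stand-ins, and how the real kernel iterates over groups beyond 8 tokens is an assumption.

```python
def decode_tile(codes, codebook, counter):
    """Toy decode: map codebook indices to weights and count the call."""
    counter["decodes"] += 1
    return [codebook[c] for c in codes]


def batched_matvec(weight_codes, codebook, xs, n_cols=8):
    """Compute y[t][r] = dot(W[r], xs[t]) for every token t in the batch.

    weight_codes: list of tiles (here, one row per tile) of codebook indices.
    xs: list of B input vectors, one per token.
    Returns (outputs, number_of_tile_decodes).
    """
    counter = {"decodes": 0}
    out = [[0.0] * len(weight_codes) for _ in xs]
    for r, codes in enumerate(weight_codes):
        # Decode the tile once, then reuse it for the whole batch.
        w = decode_tile(codes, codebook, counter)
        # Walk the batch in groups of at most n_cols tokens (assumed grouping).
        for start in range(0, len(xs), n_cols):
            for t in range(start, min(start + n_cols, len(xs))):
                out[t][r] = sum(wi * xi for wi, xi in zip(w, xs[t]))
    return out, counter["decodes"]
```

In this sketch the decode count depends only on the number of tiles, not on the batch size, which is the property the PR credits for removing the B>1 cliff.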

Compressed batched E8P tensor-core GEMM fixes the B>1 (batched) decode cliff:
the fused E8P linear's B>1 branch kept fully decompressing every weight to dense
fp16 each step (17x behind bf16 at B=32); the new glq_matmul_e8p packs up to 8
tokens into the decode GEMV's idle mma.sync n-columns and decodes each weight
tile once across the batch (B=32: 2.5 -> 8.0 per-seq tok/s, 3.2x on a 12B/3bpw model).
Deterministic scratch+reduce makes the whole E8P kernel surface (B=1 GEMV +
batched GEMM) bit-exact run-to-run. Validated end-to-end across 2/3/4 bpw.
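
To show why a scratch+reduce scheme can be bit-exact, the Python sketch below contrasts two ways of combining partial sums. It assumes (the PR does not spell this out) that each partial is written to its own scratch slot and the slots are then summed in a fixed index order; all function names here are hypothetical. Floating-point addition is not associative, so accumulating in arrival order, as an atomic add would, can change the result from run to run.

```python
import random


def split_k_partials(w_row, x, num_splits):
    """Each hypothetical block computes a partial dot product over its K-slice."""
    k = len(w_row)
    step = (k + num_splits - 1) // num_splits
    return [
        sum(w * v for w, v in zip(w_row[s:s + step], x[s:s + step]))
        for s in range(0, k, step)
    ]


def reduce_atomic_style(partials, rng):
    """Accumulate in arrival order, which may differ between runs."""
    order = list(range(len(partials)))
    rng.shuffle(order)
    acc = 0.0
    for i in order:
        acc += partials[i]
    return acc


def reduce_scratch_style(partials, rng):
    """Store partials in per-split slots, then sum the slots in index order."""
    scratch = [0.0] * len(partials)
    order = list(range(len(partials)))
    rng.shuffle(order)  # arrival order still varies...
    for i in order:
        scratch[i] = partials[i]  # ...but each partial lands in a fixed slot
    acc = 0.0
    for v in scratch:  # fixed reduction order
        acc += v
    return acc
```

Calling `reduce_scratch_style` with differently seeded `random.Random` instances yields the same bits every time, whereas `reduce_atomic_style` carries no such guarantee.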
@cnygaard merged commit e2aac6d into main on Jun 22, 2026
3 checks passed