release: 0.6.5 by cnygaard · Pull Request #16 · cnygaard/glq

cnygaard · 2026-06-22T21:25:54Z

Bumps version to 0.6.5. Ships the compressed batched E8P tensor-core GEMM (B>1 decode cliff fix, 3.2x at B=32) and bit-exact determinism for the whole E8P kernel surface (B=1 GEMV + batched GEMM), validated end-to-end across 2/3/4 bpw. Merged PR #15.
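Bit-exact run-to-run output generally requires a fixed floating-point summation order, because float addition is not associative. The sketch below illustrates that general idea in plain Python. It is not the GLQ kernel: the function names are hypothetical, and it assumes "scratch+reduce" means each worker writes its partial sum to its own scratch slot, followed by a final pass that sums the slots in a fixed order.

```python
from itertools import permutations


def accumulate_in_arrival_order(partials, arrival_order):
    """Mimic atomic-style accumulation: the sum order follows scheduling."""
    total = 0.0
    for i in arrival_order:
        total += partials[i]
    return total


def scratch_then_reduce(partials, arrival_order):
    """Write each partial to its own slot, then sum slots in index order."""
    scratch = [0.0] * len(partials)
    for i in arrival_order:      # arrival order still varies between runs...
        scratch[i] = partials[i]
    total = 0.0
    for value in scratch:        # ...but the reduction order never does
        total += value
    return total


# Values chosen so that rounding makes the summation order visible.
partials = [1e16, 1.0, -1e16, 1.0]
orders = list(permutations(range(len(partials))))

racy = {accumulate_in_arrival_order(partials, o) for o in orders}
fixed = {scratch_then_reduce(partials, o) for o in orders}

print("arrival-order results:", sorted(racy))
print("scratch+reduce results:", sorted(fixed))
```

The arrival-order variant yields more than one distinct result across orderings, while the scratch variant always yields the same value, at the cost of extra scratch memory and a second pass.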

The compressed batched E8P tensor-core GEMM fixes the decode cliff at B>1 (batched decode). Previously, the fused E8P linear's B>1 branch still fully decompressed every weight to dense fp16 on each step, leaving it 17x behind bf16 at B=32. The new glq_matmul_e8p packs up to 8 tokens into the decode GEMV's otherwise idle mma.sync n-columns and decodes each weight tile once across the batch. At B=32 on a 12B/3bpw model, per-sequence throughput rises from 2.5 to 8.0 tok/s (3.2x). A deterministic scratch+reduce makes the whole E8P kernel surface (B=1 GEMV + batched GEMM) bit-exact run-to-run. Validated end-to-end across 2/3/4 bpw.
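The idea of decoding each weight tile once and reusing it across the batch can be sketched in plain Python. This is only an illustration under stated assumptions, not the glq_matmul_e8p kernel: the four-entry `CODEBOOK`, `decode_tile`, and `batched_matmul` are hypothetical stand-ins, and the inner chunk loop merely models "up to 8 tokens per pass" rather than real mma.sync n-columns.

```python
CODEBOOK = [-1.0, -0.5, 0.5, 1.0]  # toy stand-in, not the E8P codebook
TOKENS_PER_PASS = 8                # from the description: up to 8 tokens per pass


def decode_tile(codes):
    """Expand a tile of codebook indices into dense weight rows."""
    return [[CODEBOOK[c] for c in row] for row in codes]


def batched_matmul(tiles, x_batch):
    """Multiply every token by every tile, decoding each tile exactly once."""
    decodes = 0
    out = [[] for _ in x_batch]
    for codes in tiles:
        weights = decode_tile(codes)  # decoded once, reused by all tokens
        decodes += 1
        for start in range(0, len(x_batch), TOKENS_PER_PASS):
            stop = min(start + TOKENS_PER_PASS, len(x_batch))
            for t in range(start, stop):  # tokens sharing one pass
                for row in weights:
                    out[t].append(sum(w * x for w, x in zip(row, x_batch[t])))
    return out, decodes


# Two tiles of 2 output rows x 4 input columns, and a batch of 32 tokens.
tiles = [
    [[0, 1, 2, 3], [3, 3, 0, 0]],
    [[2, 2, 2, 2], [1, 0, 3, 2]],
]
x_batch = [[float(t), 1.0, 0.0, -1.0] for t in range(32)]

out, decodes = batched_matmul(tiles, x_batch)
print("tile decodes:", decodes, "for batch size", len(x_batch))
```

In this sketch the number of decodes equals the number of tiles regardless of batch size, which is the property that lets decode cost be amortized across the batch.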

cnygaard merged commit e2aac6d into main Jun 22, 2026
3 checks passed

cnygaard merged 1 commit into main from release/0.6.5
