release: 0.6.5 #16
Merged
Compressed batched E8P tensor-core GEMM. Fixes the B>1 (batched) decode cliff. Previously, the B>1 branch of the fused E8P linear fully decompressed every weight to dense fp16 on each step, which left it 17x behind bf16 at B=32. The new glq_matmul_e8p packs up to 8 tokens into the otherwise idle mma.sync n-columns of the decode GEMV and decodes each weight tile once across the batch. At B=32 on a 12B/3bpw model, per-sequence throughput rises from 2.5 to 8.0 tok/s (3.2x).

Deterministic scratch+reduce makes the whole E8P kernel surface (B=1 GEMV + batched GEMM) bit-exact run-to-run.

Validated end-to-end across 2/3/4 bpw.
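To illustrate why decoding a weight tile once per group of tokens helps, here is a minimal pure-Python sketch. It is not the glq_matmul_e8p kernel: the codebook, the function names, and the choice to decode once per group of 8 tokens are all assumptions made for illustration, and real E8P decoding works differently. It only counts how many decode calls each strategy issues and checks that both produce the same outputs.

```python
TOKENS_PER_PASS = 8  # assumed from "up to 8 tokens" in the description

# Hypothetical codebook: each code expands to two weights.
# Real E8P decoding is different; this only stands in for "decode".
CODEBOOK = {0: (-1.0, -0.5), 1: (-0.5, 0.5), 2: (0.5, 1.0), 3: (1.0, -1.0)}


def decode_row(codes, counter):
    """Expand one compressed weight row and count the decode."""
    counter[0] += 1
    row = []
    for c in codes:
        row.extend(CODEBOOK[c])
    return row


def per_token_gemv(packed, xs):
    """Decode every row again for every token (one GEMV per token)."""
    counter = [0]
    out = []
    for x in xs:
        y = []
        for codes in packed:
            w = decode_row(codes, counter)
            y.append(sum(wi * xi for wi, xi in zip(w, x)))
        out.append(y)
    return out, counter[0]


def batched_gemm(packed, xs):
    """Decode each row once per group of tokens and reuse it for the group."""
    counter = [0]
    out = [[0.0] * len(packed) for _ in xs]
    for start in range(0, len(xs), TOKENS_PER_PASS):
        group = range(start, min(start + TOKENS_PER_PASS, len(xs)))
        for r, codes in enumerate(packed):
            w = decode_row(codes, counter)  # decoded once for the whole group
            for t in group:
                out[t][r] = sum(wi * xi for wi, xi in zip(w, xs[t]))
    return out, counter[0]


packed = [[0, 1], [2, 3], [3, 0]]  # 3 rows, 4 weights per row after decode
xs = [[float(t + k) for k in range(4)] for t in range(32)]  # B = 32 tokens

ref, naive_decodes = per_token_gemv(packed, xs)
got, batched_decodes = batched_gemm(packed, xs)
print(naive_decodes, batched_decodes, got == ref)
```

In this toy setup the number of decodes drops by the group size while the outputs stay identical, which is the reuse the description refers to. The measured speedup in the PR comes from the actual kernel, not from anything this sketch models.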
Bumps version to 0.6.5. Ships the compressed batched E8P tensor-core GEMM (B>1 decode cliff fix, 3.2x at B=32) and bit-exact determinism for the whole E8P kernel surface (B=1 GEMV + batched GEMM), validated end-to-end across 2/3/4 bpw. Merged PR #15.
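The bit-exactness claim rests on floating-point addition not being associative: if partial sums are accumulated in whatever order they arrive, the result can change between runs. A scratch+reduce scheme can be sketched as each producer writing its partial to its own slot, followed by a reduction in a fixed order. The snippet below is a pure-Python illustration under that assumption; the function names are hypothetical, and the PR does not state which accumulation scheme the new one replaced.

```python
def accumulate_in_arrival_order(partials, arrival):
    """Add partials in the order they happen to arrive (order-dependent)."""
    total = 0.0
    for i in arrival:
        total += partials[i]
    return total


def scratch_then_reduce(partials, arrival):
    """Write each partial to its own scratch slot, then sum in fixed order."""
    scratch = [0.0] * len(partials)
    for i in arrival:
        scratch[i] = partials[i]  # slot index is fixed, arrival order is not
    total = 0.0
    for v in scratch:  # always slot 0, 1, 2, ... regardless of arrival
        total += v
    return total


# Values chosen so that rounding makes the sum depend on the order.
partials = [1e16, 1.0, -1e16, 1.0]
arrival_orders = [[0, 1, 2, 3], [0, 2, 1, 3], [3, 1, 0, 2]]

arrival_results = [accumulate_in_arrival_order(partials, o) for o in arrival_orders]
scratch_results = [scratch_then_reduce(partials, o) for o in arrival_orders]
print(arrival_results)
print(scratch_results)
```

The fixed-order reduction gives the same value for every arrival order, at the cost of extra scratch memory and a second pass.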