feat(pz2): G32 stage 2 — Metal literal decode 65 GB/s, but the splice is the wall (GPU decode killed with data)#154
Merged
Conversation
Zero-GPU transcoder + scalar decoder for the pz2-G32 candidate (gpu-path-research.md #1, stage 1): repack the pz2 literal section into 32 lane-pinned sub-streams (literal i -> lane i%32, same shared canonical Huffman table), packed into 32-bit words interleaved in exact decode-read order via encoder-side simulation of the refill schedule (fetch one word when a lane is below 32 live bits; <=1 word/lane/round, inside the GDeflate <=2-word budget). Sequence section byte-identical. - pz2::transcode_g32 / pz2::decode_g32 (spike API, not production wire) - decode() splice loop factored out and shared by both decoders - examples/pz2_g32_probe.rs: per-file size delta, byte-exact round-trip, ST decode timing on Silesia Silesia (12 files, shipped 2 MiB auto-greedy recipe): aggregate delta +0.0066pp (kill line 0.5pp -> PASS), round-trip OK everywhere; scalar decode of the 32-lane layout runs 0.45-0.96x of the 8-lane decoder (worst on literal-heavy sao/x-ray). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…seline probe Round-based portable + NEON (aarch64) decoders for the G32 literal layout, spike-only probe hooks, and examples/pz2_g32_cpu_simd.rs measuring the literal phase and full decode ST/all-cores vs the shipped 8-lane decoder. Result: NEON doubles the stage-1 scalar decode (e.g. sao 553 -> 1209 MB/s) but the 32-lane wire is NOT a CPU win: 0.55-0.61x the shipped 8-lane literal phase ST, 5.0 vs 8.2 GB/s all-cores. The honest GPU denominator is 5.0 GB/s (NEON all-cores, this wire). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…~65 GB/s kernels/pz2_g32_lit.metal: one simdgroup per tile, 32 symbols/round via simd_ballot + popcount-prefix word fetch, 4 KB table in threadgroup memory (8 simdgroups/threadgroup sharing one table). examples/pz2_g32_metal.rs host probe: 551 tiles from real Silesia pz2 blocks, persistent shared buffers, byte-exact round-trip, GPUStartTime/GPUEndTime timing with chained dispatches to hold DVFS steady. Result: 0.50 ms for 32.6 MB literals = ~65 GB/s effective, ~5-6x the NEON all-cores same-wire baseline (9-10.5 GB/s) — far above the 5 GB/s kill line. Gate 3 (cooperative splice) unblocked. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t, splice 79ms vs CPU 4.2ms on subset Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… with data Full-corpus (12 Silesia files, 211.9 MB, byte-exact end-to-end): - Gate 2 literal kernel: 64.75 GB/s (0.1% spread) — 13x over the kill line - Gate 3 splice: 53.43 ms = 3.97 GB/s; end-to-end 3.92 GB/s - CPU all-cores pz2 decode: 11.59 ms = 18.28 GB/s -> GPU = 0.21x. KILL. Phase split: splice is 53.4 of 54.0 ms; phase A (3 serial sequence-Huffman chains/block, 318 concurrent) is 38% of it, but even free it leaves a 33 ms copy loop = 2.9x slower than the entire CPU decode. The in-block LZ copy chain caps parallelism at block count; structural, not an optimization gap. Generalizes to all decode-side GPU candidates. Findings doc + feedback note for the CLAUDE.md dead-ends update included. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ration data Code (production-path review): splice_into_arena's lits-slack invariant is now a hard InvalidInput check (was debug_assert — compiled out in release, a silent heap overread for a future exact-sized caller); dead prefix arm dropped from the spike splice wrapper; #[doc(hidden)] on transcode_g32/ decode_g32 to match sibling spike symbols. Docs (spike/docs review): ran the missing --dup saturation experiment on the full corpus — dup=16 (1696 TGs, 3.4 GB in flight) lifts GPU end-to-end to 19.65 GB/s = 1.02x CPU (parity, never above). Rewrote the 'structural' section around the corrected mechanism (latency-bound ~5% occupancy at dup=1, shared-memory parity ceiling at saturation; the free-phase-A subtraction was non-additive), fixed NEON baseline provenance, narrowed the generalization claim to the defensible version, rewrote the feedback note so the CLAUDE.md dead-end promotion carries the saturated-parity framing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stage 2 of the pz2-G32 GPU-decode investigation (stage 1: the 32-lane GDeflate-style wire costs +0.0066pp — format is free). Three gates, run on the full Silesia corpus with byte-exact round-trip verification. Net verdict: GPU block decode is killed with data — gate 2's spectacular kernel cannot save an end-to-end that loses 0.21x to the CPU. This was the last surviving GPU-decode candidate from
docs/design-docs/gpu-path-research.md; the question is now closed on M5-class hardware with measurements instead of extrapolation.Gate verdicts
Why the splice is structural
The splice is 53.4 of the 54.0 ms end-to-end. Phase A (3 serial sequence-Huffman chains per block — only 318 concurrent chains) is 38% of it; even if phase A were free, the remaining 33 ms copy loop alone is 2.9x slower than the entire CPU decode. The in-block LZ copy chain caps parallelism at block count (106); fixing that means smaller blocks (ratio loss) or splice checkpoints (= the pz2-GA candidate, already Pareto-rejected). Generalizes to all decode-side GPU candidates (#3–#6 in the research report). Full analysis:
docs/design-docs/pz2-g32-stage2-findings.md.Contents
kernels/pz2_g32_lit.metal,kernels/pz2_g32_splice.metal— MSL kernels (SKIP_PHASE_Bisolates phase A for attribution)examples/pz2_g32_metal.rs— host probe: real pz2 blocks, persistent shared buffers, NEON + CPU baselines, byte-exact verification, DVFS-stable GPU timingexamples/pz2_g32_cpu_simd.rs— gate-1 NEON/portable decoderstranscode_g32/decode_g32/decode_g32_simdinsrc/pz2.rs— spike-only, no wire/container/default changesdocs/design-docs/pz2-g32-stage2-findings.md+.claude/feedback/2026-06-10-pz2-g32-stage2-kill.md(CLAUDE.md dead-ends amendment: GPU entropy decode is NOT slow — 65 GB/s with a codesigned wire — what kills GPU block decode is the splice + unified-memory bandwidth sharing)Rebase note: master's #150 arena decode rewrote
decode_with_prefixwhile this branch had extracted the splice for G32 reuse; resolved by keeping the arena design verbatim and extracting its splice loop intosplice_into_arena(behavior-identical;decode_into_arenanow calls it, spike decoders use a thin Vec wrapper). pz2 tests, quick suite, and GPU round-trips re-verified post-rebase. (Branch is-rebasedbecause force-push is disallowed;claude/pz2-g32-stage2on origin is the stale pre-rebase state.)🤖 Generated with Claude Code