feat(pz2): G32 stage 2 — Metal literal decode 65 GB/s, but the splice is the wall (GPU decode killed with data) by ChrisLundquist · Pull Request #154 · ChrisLundquist/libpz

ChrisLundquist · 2026-06-10T21:57:07Z

Summary

Stage 2 of the pz2-G32 GPU-decode investigation (stage 1 found that the 32-lane GDeflate-style wire costs +0.0066pp, so the format is effectively free). Three gates, run on the full Silesia corpus with byte-exact round-trip verification. Net verdict: GPU block decode is killed with data. Gate 2's literal kernel is very fast, but it cannot save an end-to-end path that runs at only 0.21x the CPU all-cores decode. This was the last surviving GPU-decode candidate from docs/design-docs/gpu-path-research.md; the question is now closed on M5-class hardware with measurements instead of extrapolation.

Gate verdicts

| Gate | Question | Verdict |
|------|----------|---------|
| 1 | Is the 32-lane wire a CPU win via NEON? | No: 0.55–0.61x the shipped 8-lane literal phase ST (NEON does double the scalar decode of this wire: sao 553→1209 MB/s) |
| 2 | Can Metal decode the literal phase >5 GB/s? | PASS: 64.75 GB/s (0.50 ms / 32.6 MB, 0.1% spread; simdgroup ballot + popcount-prefix fetch, 4 KB table in threadgroup memory) |
| 3 | Does end-to-end GPU block decode beat CPU all-cores? | KILL: 0.21x (GPU 54.01 ms vs CPU 11.59 ms = 18.28 GB/s on 211.9 MB) |
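
The gate-3 figures can be cross-checked with a few lines of arithmetic. The sketch below is not part of the PR; it assumes decimal units (1 MB = 1e6 bytes, 1 GB = 1e9 bytes), which is what the quoted numbers are consistent with, and the function names are hypothetical.

```rust
/// Throughput in GB/s for `megabytes` MB processed in `millis` ms.
/// Assumes decimal units: MB / ms = 1e6 B / 1e-3 s = 1e9 B/s = GB/s.
fn gb_per_s(megabytes: f64, millis: f64) -> f64 {
    megabytes / millis
}

/// Speed of the GPU path relative to the CPU path (values below 1.0 mean the GPU is slower).
fn gpu_over_cpu(gpu_ms: f64, cpu_ms: f64) -> f64 {
    cpu_ms / gpu_ms
}

fn main() {
    let corpus_mb = 211.9; // full Silesia corpus, from the gate table
    let cpu_ms = 11.59; // CPU all-cores decode
    let gpu_ms = 54.01; // GPU end-to-end decode

    println!("CPU: {:.2} GB/s", gb_per_s(corpus_mb, cpu_ms));
    println!("GPU: {:.2} GB/s", gb_per_s(corpus_mb, gpu_ms));
    println!("GPU/CPU: {:.2}x", gpu_over_cpu(gpu_ms, cpu_ms));
}
```

With the table's inputs this gives roughly 18.28 GB/s for the CPU and a GPU/CPU ratio of about 0.21, matching the gate-3 verdict.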

Why the splice is structural

The splice accounts for 53.4 of the 54.0 ms end-to-end. Phase A (3 serial sequence-Huffman chains per block, so only 318 concurrent chains) is 38% of it; even if phase A were free, the remaining 33 ms copy loop alone is 2.9x slower than the entire CPU decode. The in-block LZ copy chain caps parallelism at the block count (106); fixing that means smaller blocks (ratio loss) or splice checkpoints (= the pz2-GA candidate, already Pareto-rejected). This generalizes to all decode-side GPU candidates (#3–#6 in the research report). Full analysis: docs/design-docs/pz2-g32-stage2-findings.md.
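
The phase-split arithmetic in this section can be reproduced as below. This is only a sketch of the numbers as stated here; the last commit in this PR notes that the free-phase-A subtraction turned out to be non-additive, so treat the result as the original estimate rather than a measured bound. Function names are hypothetical.

```rust
/// Splice time left over if phase A cost nothing, given phase A's share of the splice.
fn copy_loop_ms(splice_ms: f64, phase_a_share: f64) -> f64 {
    splice_ms * (1.0 - phase_a_share)
}

/// Number of serial chains that can run concurrently across all blocks.
fn concurrent_chains(blocks: u32, chains_per_block: u32) -> u32 {
    blocks * chains_per_block
}

fn main() {
    let splice_ms = 53.4; // splice time on the full corpus
    let cpu_ms = 11.59; // entire CPU all-cores decode

    let copy_ms = copy_loop_ms(splice_ms, 0.38);
    println!("copy loop without phase A: {:.1} ms", copy_ms);
    println!("relative to CPU decode: {:.1}x slower", copy_ms / cpu_ms);
    println!("concurrent chains: {}", concurrent_chains(106, 3));
}
```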

Contents

kernels/pz2_g32_lit.metal, kernels/pz2_g32_splice.metal: MSL kernels (SKIP_PHASE_B isolates phase A for attribution)
examples/pz2_g32_metal.rs: host probe with real pz2 blocks, persistent shared buffers, NEON + CPU baselines, byte-exact verification, DVFS-stable GPU timing
examples/pz2_g32_cpu_simd.rs: gate-1 NEON/portable decoders
transcode_g32/decode_g32/decode_g32_simd in src/pz2.rs: spike-only, no wire/container/default changes
docs/design-docs/pz2-g32-stage2-findings.md + .claude/feedback/2026-06-10-pz2-g32-stage2-kill.md (CLAUDE.md dead-ends amendment: GPU entropy decode is NOT slow, reaching 65 GB/s with a co-designed wire; what kills GPU block decode is the splice + unified-memory bandwidth sharing)

Rebase note: master's #150 arena decode rewrote decode_with_prefix while this branch had extracted the splice for G32 reuse. The conflict was resolved by keeping the arena design verbatim and extracting its splice loop into splice_into_arena (behavior-identical; decode_into_arena now calls it, and the spike decoders use a thin Vec wrapper). pz2 tests, the quick suite, and GPU round-trips were re-verified post-rebase. (The branch name carries a -rebased suffix because force-push is disallowed; claude/pz2-g32-stage2 on origin is the stale pre-rebase state.)

🤖 Generated with Claude Code

Zero-GPU transcoder + scalar decoder for the pz2-G32 candidate (gpu-path-research.md #1, stage 1): repack the pz2 literal section into 32 lane-pinned sub-streams (literal i -> lane i%32, same shared canonical Huffman table), packed into 32-bit words interleaved in exact decode-read order via encoder-side simulation of the refill schedule (fetch one word when a lane is below 32 live bits; <=1 word/lane/round, inside the GDeflate <=2-word budget). Sequence section byte-identical.

- pz2::transcode_g32 / pz2::decode_g32 (spike API, not production wire)
- decode() splice loop factored out and shared by both decoders
- examples/pz2_g32_probe.rs: per-file size delta, byte-exact round-trip, ST decode timing on Silesia

Silesia (12 files, shipped 2 MiB auto-greedy recipe): aggregate delta +0.0066pp (kill line 0.5pp -> PASS), round-trip OK everywhere; scalar decode of the 32-lane layout runs 0.45-0.96x of the 8-lane decoder (worst on literal-heavy sao/x-ray).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
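
The lane-pinning rule (literal i goes to lane i%32) can be illustrated in isolation. The sketch below is not the pz2 transcoder: it works on raw bytes and skips the Huffman coding, the 32-bit word packing, and the refill-order interleave. It only shows that the i%32 assignment is trivially invertible. All names are hypothetical.

```rust
const LANES: usize = 32;

/// Distribute literals across lanes: literal `i` goes to lane `i % LANES`.
fn split_into_lanes(literals: &[u8]) -> Vec<Vec<u8>> {
    let mut lanes = vec![Vec::new(); LANES];
    for (i, &byte) in literals.iter().enumerate() {
        lanes[i % LANES].push(byte);
    }
    lanes
}

/// Inverse of `split_into_lanes`: literal `i` is entry `i / LANES` of lane `i % LANES`.
fn merge_lanes(lanes: &[Vec<u8>], total: usize) -> Vec<u8> {
    (0..total).map(|i| lanes[i % LANES][i / LANES]).collect()
}

fn main() {
    let literals: Vec<u8> = (0..100u32).map(|i| (i * 7 % 251) as u8).collect();
    let lanes = split_into_lanes(&literals);
    let restored = merge_lanes(&lanes, literals.len());
    assert_eq!(restored, literals);
    println!("round-trip OK across {} lanes", lanes.len());
}
```

Because each lane decodes an independent sub-stream, one decode round yields up to 32 symbols, one per lane, which is the property the later GPU kernel relies on.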

…seline probe

Round-based portable + NEON (aarch64) decoders for the G32 literal layout, spike-only probe hooks, and examples/pz2_g32_cpu_simd.rs measuring the literal phase and full decode ST/all-cores vs the shipped 8-lane decoder.

Result: NEON doubles the stage-1 scalar decode (e.g. sao 553 -> 1209 MB/s) but the 32-lane wire is NOT a CPU win: 0.55-0.61x the shipped 8-lane literal phase ST, 5.0 vs 8.2 GB/s all-cores. The honest GPU denominator is 5.0 GB/s (NEON all-cores, this wire).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…~65 GB/s

kernels/pz2_g32_lit.metal: one simdgroup per tile, 32 symbols/round via simd_ballot + popcount-prefix word fetch, 4 KB table in threadgroup memory (8 simdgroups/threadgroup sharing one table).

examples/pz2_g32_metal.rs host probe: 551 tiles from real Silesia pz2 blocks, persistent shared buffers, byte-exact round-trip, GPUStartTime/GPUEndTime timing with chained dispatches to hold DVFS steady.

Result: 0.50 ms for 32.6 MB literals = ~65 GB/s effective, ~5-6x the NEON all-cores same-wire baseline (9-10.5 GB/s), far above the 5 GB/s kill line. Gate 3 (cooperative splice) unblocked.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
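
The ballot + popcount-prefix fetch can be emulated on the CPU to show the indexing idea. This is a sketch under assumptions, not the Metal kernel: it assumes the refill rule from the stage-1 commit (a lane fetches one word when it holds fewer than 32 live bits), LSB-first bit buffers, and a single shared word cursor per tile. All names are hypothetical.

```rust
const LANES: usize = 32;

/// Emulates a simdgroup ballot: bit `lane` is set when that lane needs a refill.
fn ballot(live_bits: &[u32; LANES]) -> u32 {
    let mut mask = 0u32;
    for (lane, &bits) in live_bits.iter().enumerate() {
        if bits < 32 {
            mask |= 1u32 << lane;
        }
    }
    mask
}

/// Popcount prefix: how many lower-numbered lanes are also refilling this round.
fn prefix_offset(mask: u32, lane: usize) -> u32 {
    (mask & ((1u32 << lane) - 1)).count_ones()
}

/// One refill round. Each refilling lane reads the word at `cursor + prefix_offset`,
/// so the words consumed in a round are contiguous. Returns the advanced cursor.
fn refill_round(
    words: &[u32],
    cursor: usize,
    bit_buf: &mut [u64; LANES],
    live_bits: &mut [u32; LANES],
) -> usize {
    let mask = ballot(live_bits);
    for lane in 0..LANES {
        if mask & (1u32 << lane) != 0 {
            let word = words[cursor + prefix_offset(mask, lane) as usize];
            // live_bits[lane] < 32 here, so the shifted word fits in the 64-bit buffer.
            bit_buf[lane] |= (word as u64) << live_bits[lane];
            live_bits[lane] += 32;
        }
    }
    cursor + mask.count_ones() as usize
}

fn main() {
    let words = [0xAAAA_AAAAu32, 0xBBBB_BBBB, 0xCCCC_CCCC];
    let mut bit_buf = [0u64; LANES];
    let mut live_bits = [40u32; LANES];
    for lane in [1, 5, 31] {
        live_bits[lane] = 10; // these three lanes are running low
    }
    let cursor = refill_round(&words, 0, &mut bit_buf, &mut live_bits);
    println!("words consumed this round: {}", cursor);
}
```

The point of the prefix count is that no lane needs a per-lane read pointer: the ballot mask alone determines which word each lane takes, provided the encoder wrote the words in the same order.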

… with data

Full-corpus (12 Silesia files, 211.9 MB, byte-exact end-to-end):

- Gate 2 literal kernel: 64.75 GB/s (0.1% spread), 13x over the kill line
- Gate 3 splice: 53.43 ms = 3.97 GB/s; end-to-end 3.92 GB/s
- CPU all-cores pz2 decode: 11.59 ms = 18.28 GB/s -> GPU = 0.21x. KILL.

Phase split: splice is 53.4 of 54.0 ms; phase A (3 serial sequence-Huffman chains/block, 318 concurrent) is 38% of it, but even free it leaves a 33 ms copy loop = 2.9x slower than the entire CPU decode. The in-block LZ copy chain caps parallelism at block count; structural, not an optimization gap. Generalizes to all decode-side GPU candidates. Findings doc + feedback note for the CLAUDE.md dead-ends update included.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ration data

Code (production-path review): splice_into_arena's lits-slack invariant is now a hard InvalidInput check (was debug_assert, which is compiled out in release: a silent heap overread for a future exact-sized caller); dead prefix arm dropped from the spike splice wrapper; #[doc(hidden)] on transcode_g32/decode_g32 to match sibling spike symbols.

Docs (spike/docs review): ran the missing --dup saturation experiment on the full corpus: dup=16 (1696 TGs, 3.4 GB in flight) lifts GPU end-to-end to 19.65 GB/s = 1.02x CPU (parity, never above). Rewrote the 'structural' section around the corrected mechanism (latency-bound ~5% occupancy at dup=1, shared-memory parity ceiling at saturation; the free-phase-A subtraction was non-additive), fixed NEON baseline provenance, narrowed the generalization claim to the defensible version, rewrote the feedback note so the CLAUDE.md dead-end promotion carries the saturated-parity framing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
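
The difference between a debug_assert and a hard check can be sketched as follows. This is not the code from src/pz2.rs: the slack size, the error enum, and the function are hypothetical, and the invariant is assumed to be "the literal buffer has SLACK readable bytes past the end of any copy".

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    InvalidInput,
}

/// Hypothetical padding that wide copies are allowed to read past the requested range.
const SLACK: usize = 16;

/// Copies `len` literals starting at `pos`, rejecting input that violates the slack invariant.
/// A `debug_assert!` in place of this check would vanish in release builds.
fn copy_literals(
    lits: &[u8],
    pos: usize,
    len: usize,
    out: &mut Vec<u8>,
) -> Result<(), DecodeError> {
    let needed = pos
        .checked_add(len)
        .and_then(|end| end.checked_add(SLACK))
        .ok_or(DecodeError::InvalidInput)?;
    if needed > lits.len() {
        return Err(DecodeError::InvalidInput);
    }
    out.extend_from_slice(&lits[pos..pos + len]);
    Ok(())
}

fn main() {
    let lits = vec![7u8; 32];
    let mut out = Vec::new();
    println!("{:?}", copy_literals(&lits, 0, 16, &mut out));
    println!("{:?}", copy_literals(&lits, 8, 16, &mut out));
}
```

The checked additions also turn an overflowing position or length into an error instead of a wrapped index.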

Chris Lundquist and others added 7 commits June 10, 2026 14:49

docs: pz2-G32 stage-1 findings — PASS at +0.0066pp aggregate

c70101c

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

wip(pz2): G32 gate-3 cooperative splice kernel — end-to-end byte-exac…

e746d58

…t, splice 79ms vs CPU 4.2ms on subset Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ChrisLundquist merged commit eaa0069 into master Jun 10, 2026
4 checks passed

ChrisLundquist mentioned this pull request Jun 10, 2026

docs: maintainer pass — consume feedback backlog, reconcile GPU dead-ends + stale numbers #156

Merged

ChrisLundquist merged 7 commits into master from claude/pz2-g32-stage2-rebased
