feat(pz2): G32 stage 2 — Metal literal decode 65 GB/s, but the splice is the wall (GPU decode killed with data)#154

Merged: ChrisLundquist merged 7 commits into master from claude/pz2-g32-stage2-rebased on Jun 10, 2026

Summary

Stage 2 of the pz2-G32 GPU-decode investigation (stage 1 found that the 32-lane GDeflate-style wire costs +0.0066pp, so the format is effectively free). Three gates, run on the full Silesia corpus with byte-exact round-trip verification. Net verdict: GPU block decode is killed with data. Gate 2's spectacular kernel cannot save an end-to-end path that runs at only 0.21x the CPU's throughput. This was the last surviving GPU-decode candidate from docs/design-docs/gpu-path-research.md; the question is now closed on M5-class hardware with measurements instead of extrapolation.

Gate verdicts

| Gate | Question | Verdict |
|------|----------|---------|
| 1 | Is the 32-lane wire a CPU win via NEON? | No: 0.55–0.61x the shipped 8-lane literal phase ST (NEON does double the scalar decode of this wire: sao 553→1209 MB/s) |
| 2 | Can Metal decode the literal phase >5 GB/s? | PASS: 64.75 GB/s (0.50 ms / 32.6 MB, 0.1% spread; simdgroup ballot + popcount-prefix fetch, 4 KB table in threadgroup memory) |
| 3 | Does end-to-end GPU block decode beat CPU all-cores? | KILL: 0.21x (GPU 54.01 ms vs CPU 11.59 ms = 18.28 GB/s on 211.9 MB) |
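
The throughput and ratio figures for gate 3 follow directly from the quoted sizes and times. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes); the helper name `gb_per_s` is illustrative and not part of this PR:

```rust
/// Throughput in GB/s from a size in MB and a time in ms.
/// With decimal units, MB/ms equals GB/s (1e6 B / 1e-3 s = 1e9 B/s).
fn gb_per_s(size_mb: f64, time_ms: f64) -> f64 {
    size_mb / time_ms
}

fn main() {
    let corpus_mb = 211.9; // full Silesia corpus, as quoted for gate 3
    let cpu_ms = 11.59; // CPU all-cores decode
    let gpu_ms = 54.01; // GPU end-to-end decode

    println!("CPU: {:.2} GB/s", gb_per_s(corpus_mb, cpu_ms));
    println!("GPU: {:.2} GB/s", gb_per_s(corpus_mb, gpu_ms));
    // Same data on both sides, so the speed ratio is the inverse time ratio.
    println!("GPU/CPU: {:.2}x", cpu_ms / gpu_ms);
}
```

Under those unit assumptions this reproduces the 18.28 GB/s CPU figure and the 0.21x verdict.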

Why the splice is structural

The splice is 53.4 of the 54.0 ms end-to-end. Phase A (3 serial sequence-Huffman chains per block — only 318 concurrent chains) is 38% of it; even if phase A were free, the remaining 33 ms copy loop alone is 2.9x slower than the entire CPU decode. The in-block LZ copy chain caps parallelism at block count (106); fixing that means smaller blocks (ratio loss) or splice checkpoints (= the pz2-GA candidate, already Pareto-rejected). Generalizes to all decode-side GPU candidates (#3#6 in the research report). Full analysis: docs/design-docs/pz2-g32-stage2-findings.md.
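
To illustrate why the in-block copy chain resists parallelism, here is a minimal sketch of a generic LZ splice. The `Seq` layout and the `splice` signature are assumptions for illustration, not pz2's actual sequence format or its `splice_into_arena` code. The point is only that each match may read bytes written by any earlier sequence in the same block, so sequences within a block have to be applied in order:

```rust
/// Hypothetical sequence: emit `lit_len` literals, then copy `match_len`
/// bytes starting `offset` bytes back in the output produced so far.
struct Seq {
    lit_len: usize,
    match_len: usize,
    offset: usize,
}

/// Serial splice of literals and matches; returns None on malformed input.
fn splice(lits: &[u8], seqs: &[Seq]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut lit_pos = 0usize;
    for s in seqs {
        let end = lit_pos.checked_add(s.lit_len)?;
        out.extend_from_slice(lits.get(lit_pos..end)?);
        lit_pos = end;
        if s.match_len > 0 {
            if s.offset == 0 || s.offset > out.len() {
                return None;
            }
            // The source range depends on everything written so far, and it
            // may overlap the bytes being appended (offset < match_len).
            let start = out.len() - s.offset;
            for i in 0..s.match_len {
                let b = out[start + i];
                out.push(b);
            }
        }
    }
    out.extend_from_slice(&lits[lit_pos..]); // trailing literals, if any
    Some(out)
}

fn main() {
    let seqs = [Seq { lit_len: 3, match_len: 6, offset: 3 }];
    let out = splice(b"abc", &seqs).unwrap();
    println!("{}", String::from_utf8_lossy(&out));
}
```

Because sequence k's copy source is only known to be valid once sequences 0..k have been written, the natural unit of parallel work in this sketch is a whole block, which matches the block-count cap described above.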

Contents

  • kernels/pz2_g32_lit.metal, kernels/pz2_g32_splice.metal — MSL kernels (SKIP_PHASE_B isolates phase A for attribution)
  • examples/pz2_g32_metal.rs — host probe: real pz2 blocks, persistent shared buffers, NEON + CPU baselines, byte-exact verification, DVFS-stable GPU timing
  • examples/pz2_g32_cpu_simd.rs — gate-1 NEON/portable decoders
  • transcode_g32/decode_g32/decode_g32_simd in src/pz2.rs — spike-only, no wire/container/default changes
  • docs/design-docs/pz2-g32-stage2-findings.md + .claude/feedback/2026-06-10-pz2-g32-stage2-kill.md (CLAUDE.md dead-ends amendment: GPU entropy decode is NOT slow — 65 GB/s with a codesigned wire — what kills GPU block decode is the splice + unified-memory bandwidth sharing)

Rebase note: master's #150 arena decode rewrote decode_with_prefix while this branch had extracted the splice for G32 reuse; resolved by keeping the arena design verbatim and extracting its splice loop into splice_into_arena (behavior-identical; decode_into_arena now calls it, spike decoders use a thin Vec wrapper). pz2 tests, quick suite, and GPU round-trips re-verified post-rebase. (Branch is -rebased because force-push is disallowed; claude/pz2-g32-stage2 on origin is the stale pre-rebase state.)

🤖 Generated with Claude Code

Chris Lundquist and others added 7 commits June 10, 2026 14:49
Zero-GPU transcoder + scalar decoder for the pz2-G32 candidate
(gpu-path-research.md #1, stage 1): repack the pz2 literal section into
32 lane-pinned sub-streams (literal i -> lane i%32, same shared canonical
Huffman table), packed into 32-bit words interleaved in exact decode-read
order via encoder-side simulation of the refill schedule (fetch one word
when a lane is below 32 live bits; <=1 word/lane/round, inside the
GDeflate <=2-word budget). Sequence section byte-identical.
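
The lane pinning and refill-order simulation described above can be sketched as follows. This is an illustration under stated assumptions, not the `transcode_g32`/`decode_g32` implementation: the two-length prefix code stands in for the shared canonical Huffman table, the function names are hypothetical, and the lanes are processed one after another rather than in parallel.

```rust
const LANES: usize = 32;

/// Toy prefix code (an assumption, not pz2's Huffman table): bytes < 16 take
/// 5 bits (flag 0 + 4 bits), all others take 9 bits (flag 1 + 8 bits).
fn encode_sym(b: u8) -> (u64, u32) {
    if b < 16 { ((b as u64) << 1, 5) } else { (((b as u64) << 1) | 1, 9) }
}

fn decode_sym(buf: u64) -> (u8, u32) {
    if buf & 1 == 0 { (((buf >> 1) & 0xF) as u8, 5) } else { (((buf >> 1) & 0xFF) as u8, 9) }
}

/// Pack literal i into lane i % LANES, then emit each lane's 32-bit words in
/// the order a round-based decoder will fetch them.
fn transcode(lits: &[u8]) -> Vec<u32> {
    // 1. Build one LSB-first word stream per lane.
    let mut words: Vec<Vec<u32>> = vec![Vec::new(); LANES];
    let mut acc = [0u64; LANES];
    let mut nbits = [0u32; LANES];
    for (i, &b) in lits.iter().enumerate() {
        let lane = i % LANES;
        let (code, len) = encode_sym(b);
        acc[lane] |= code << nbits[lane];
        nbits[lane] += len;
        if nbits[lane] >= 32 {
            words[lane].push(acc[lane] as u32);
            acc[lane] >>= 32;
            nbits[lane] -= 32;
        }
    }
    for lane in 0..LANES {
        if nbits[lane] > 0 {
            words[lane].push(acc[lane] as u32);
        }
    }
    // 2. Simulate the decoder's refill rule (fetch one word when a lane is
    //    below 32 live bits) to fix the interleave order.
    let mut out = Vec::new();
    let mut live = [0u32; LANES];
    let mut next = [0usize; LANES];
    for (i, &b) in lits.iter().enumerate() {
        let lane = i % LANES;
        if live[lane] < 32 {
            // Zero-pad when the decoder would fetch past the lane's last word.
            out.push(words[lane].get(next[lane]).copied().unwrap_or(0));
            next[lane] += 1;
            live[lane] += 32;
        }
        live[lane] -= encode_sym(b).1;
    }
    out
}

/// Decode `n` literals by replaying the same refill rule over one stream.
fn decode(stream: &[u32], n: usize) -> Option<Vec<u8>> {
    let mut out = Vec::with_capacity(n);
    let mut buf = [0u64; LANES];
    let mut live = [0u32; LANES];
    let mut cursor = 0usize;
    for i in 0..n {
        let lane = i % LANES;
        if live[lane] < 32 {
            let w = *stream.get(cursor)?;
            cursor += 1;
            buf[lane] |= (w as u64) << live[lane];
            live[lane] += 32;
        }
        let (sym, len) = decode_sym(buf[lane]);
        buf[lane] >>= len;
        live[lane] -= len;
        out.push(sym);
    }
    Some(out)
}

fn main() {
    let data: Vec<u8> = (0..1000u32).map(|i| (i * 7 % 251) as u8).collect();
    let stream = transcode(&data);
    assert_eq!(decode(&stream, data.len()), Some(data));
    println!("round-trip ok, {} words", stream.len());
}
```

Since the encoder replays the decoder's schedule exactly, the decoder needs no per-lane offsets: a single read cursor over the interleaved stream is enough.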

- pz2::transcode_g32 / pz2::decode_g32 (spike API, not production wire)
- decode() splice loop factored out and shared by both decoders
- examples/pz2_g32_probe.rs: per-file size delta, byte-exact round-trip,
  ST decode timing on Silesia

Silesia (12 files, shipped 2 MiB auto-greedy recipe): aggregate delta
+0.0066pp (kill line 0.5pp -> PASS), round-trip OK everywhere; scalar
decode of the 32-lane layout runs 0.45-0.96x of the 8-lane decoder
(worst on literal-heavy sao/x-ray).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…seline probe

Round-based portable + NEON (aarch64) decoders for the G32 literal layout,
spike-only probe hooks, and examples/pz2_g32_cpu_simd.rs measuring the
literal phase and full decode ST/all-cores vs the shipped 8-lane decoder.

Result: NEON doubles the stage-1 scalar decode (e.g. sao 553 -> 1209 MB/s)
but the 32-lane wire is NOT a CPU win: 0.55-0.61x the shipped 8-lane
literal phase ST, 5.0 vs 8.2 GB/s all-cores. The honest GPU denominator
is 5.0 GB/s (NEON all-cores, this wire).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…~65 GB/s

kernels/pz2_g32_lit.metal: one simdgroup per tile, 32 symbols/round via
simd_ballot + popcount-prefix word fetch, 4 KB table in threadgroup memory
(8 simdgroups/threadgroup sharing one table). examples/pz2_g32_metal.rs
host probe: 551 tiles from real Silesia pz2 blocks, persistent shared
buffers, byte-exact round-trip, GPUStartTime/GPUEndTime timing with
chained dispatches to hold DVFS steady.
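
The ballot plus popcount-prefix fetch can be emulated in scalar code to show the indexing idea. This is a sketch in Rust rather than MSL, and `refill_offsets` is a hypothetical name, not a function from the kernel. It assumes each lane that is below 32 live bits fetches exactly one word per round, and that words are laid out in lane order within a round:

```rust
/// Scalar emulation of one refill round across a 32-lane simdgroup.
/// Returns, per lane, the stream index of the word it fetches (None if the
/// lane does not refill this round) and advances the shared `cursor`.
fn refill_offsets(live_bits: &[u32; 32], cursor: &mut usize) -> [Option<usize>; 32] {
    // "Ballot": bit l is set when lane l is below 32 live bits.
    let mut mask = 0u32;
    for (lane, &bits) in live_bits.iter().enumerate() {
        if bits < 32 {
            mask |= 1u32 << lane;
        }
    }
    let mut offsets = [None; 32];
    for lane in 0..32 {
        if mask & (1u32 << lane) != 0 {
            // Popcount prefix: how many lower-numbered lanes also fetch.
            let below = mask & ((1u32 << lane) - 1);
            offsets[lane] = Some(*cursor + below.count_ones() as usize);
        }
    }
    // Every lane advances the cursor by the same amount, so the lanes stay
    // in agreement without any further communication.
    *cursor += mask.count_ones() as usize;
    offsets
}

fn main() {
    let mut live = [40u32; 32];
    live[1] = 10;
    live[4] = 10;
    live[31] = 10;
    let mut cursor = 100usize;
    let offsets = refill_offsets(&live, &mut cursor);
    println!("lane 1 -> {:?}, lane 4 -> {:?}, lane 31 -> {:?}", offsets[1], offsets[4], offsets[31]);
    println!("cursor after round: {cursor}");
}
```

On the GPU the two loops would collapse into per-lane operations, with the mask coming from a single ballot; the sketch serializes them only to stay runnable on a CPU.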

Result: 0.50 ms for 32.6 MB literals = ~65 GB/s effective, ~5-6x the NEON
all-cores same-wire baseline (9-10.5 GB/s) — far above the 5 GB/s kill
line. Gate 3 (cooperative splice) unblocked.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…t, splice 79ms vs CPU 4.2ms on subset

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… with data

Full-corpus (12 Silesia files, 211.9 MB, byte-exact end-to-end):
- Gate 2 literal kernel: 64.75 GB/s (0.1% spread) — 13x over the kill line
- Gate 3 splice: 53.43 ms = 3.97 GB/s; end-to-end 3.92 GB/s
- CPU all-cores pz2 decode: 11.59 ms = 18.28 GB/s -> GPU = 0.21x. KILL.

Phase split: splice is 53.4 of 54.0 ms; phase A (3 serial sequence-Huffman
chains/block, 318 concurrent) is 38% of it, but even free it leaves a 33 ms
copy loop = 2.9x slower than the entire CPU decode. The in-block LZ copy
chain caps parallelism at block count; structural, not an optimization gap.
Generalizes to all decode-side GPU candidates. Findings doc + feedback note
for the CLAUDE.md dead-ends update included.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ration data

Code (production-path review): splice_into_arena's lits-slack invariant is
now a hard InvalidInput check (was debug_assert — compiled out in release,
a silent heap overread for a future exact-sized caller); dead prefix arm
dropped from the spike splice wrapper; #[doc(hidden)] on transcode_g32/
decode_g32 to match sibling spike symbols.
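
The difference between the old `debug_assert` and the new hard check can be sketched as below. Everything here is hypothetical: the `DecodeError` type, the `SLACK` width, and `checked_lit_run` are stand-ins chosen for illustration, and the real `splice_into_arena` invariant may be stated differently.

```rust
#[derive(Debug, PartialEq)]
enum DecodeError {
    InvalidInput,
}

/// Assumed over-read width of a wide literal copy (hypothetical value).
const SLACK: usize = 16;

/// Returns the literal run `lits[pos..pos + len]`, but only if the buffer
/// also has SLACK readable bytes past the run for a wide copy to touch.
fn checked_lit_run(lits: &[u8], pos: usize, len: usize) -> Result<&[u8], DecodeError> {
    // A debug_assert! at this point is compiled out when debug assertions
    // are disabled (the default for release builds), so an exact-sized
    // buffer would pass silently. A hard check runs in every build.
    let end = pos.checked_add(len).ok_or(DecodeError::InvalidInput)?;
    let need = end.checked_add(SLACK).ok_or(DecodeError::InvalidInput)?;
    if need > lits.len() {
        return Err(DecodeError::InvalidInput);
    }
    Ok(&lits[pos..end])
}

fn main() {
    let lits = vec![0u8; 32];
    println!("{:?}", checked_lit_run(&lits, 0, 16).map(|r| r.len()));
    println!("{:?}", checked_lit_run(&lits, 0, 17).map(|r| r.len()));
}
```

The cost is one comparison per call, which is why promoting the invariant to a real error is usually cheap next to the copy it guards.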

Docs (spike/docs review): ran the missing --dup saturation experiment on the
full corpus — dup=16 (1696 TGs, 3.4 GB in flight) lifts GPU end-to-end to
19.65 GB/s = 1.02x CPU (parity, never above). Rewrote the 'structural'
section around the corrected mechanism (latency-bound ~5% occupancy at
dup=1, shared-memory parity ceiling at saturation; the free-phase-A
subtraction was non-additive), fixed NEON baseline provenance, narrowed the
generalization claim to the defensible version, rewrote the feedback note so
the CLAUDE.md dead-end promotion carries the saturated-parity framing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist merged commit eaa0069 into master Jun 10, 2026
4 checks passed