docs: GPU-path research report + num-G kill and LDM census evidence by ChrisLundquist · Pull Request #149 · ChrisLundquist/libpz

ChrisLundquist · 2026-06-10T19:17:07Z

Summary

Consolidates the 2026-06-10 GPU-path research into master.

The multi-agent research run (4 repo-history readers, 5 web researchers, 3 adversarial verification lenses per candidate, 37 agents total) concluded that libpz's GPU-entropy dead end measured a specific low-parallelism stream topology rather than GPU entropy coding itself. On M5 unified memory, however, the GPU has no bandwidth moat over 18 CPU cores already delivering 12.1 GiB/s pz2 decode. The encode side is therefore settled (hybrid GPU-match + CPU parse stands), and only one decode path is credible: a GDeflate-shaped evolution of pz2's wire (pz2-G32), with everything else spike-gated or pareto-rejected.

This PR adds:

- the final report (docs/design-docs/gpu-path-research.md);
- the stage-0 num-bitpack kill evidence: findings doc, numeric::bitpack probe code, and examples/num_bitpack_probe.rs (deterministic, exits 1/KILL: +18.34/+9.67/+2.09/+47.06pp vs per-plane FSE on sao/x-ray/mr/nci);
- the LDM census evidence: docs/design-docs/ldm-census-findings.md and examples/ldm_census.rs (PASS: 14.15%/56.08% verified >1 MiB-offset coverage on tarball-class corpora, zstd --long=30 bar at 25.26%/12.99%);
- a reconciliation postscript noting that the pz2d dict tier shipped in #146–#148 implements the census's recommended immutable depth-1 architecture but is segment-scoped (<32 MiB reach), so the census's key finding (duplicate mass at 256–540 MiB offsets) remains uncaptured and the whole-input-scoped front-end is still open.

Spike verdicts

| Spike (from report §5) | Branch | Verdict |
| --- | --- | --- |
| num-G stage 0 (vertical bit-pack vs per-plane FSE) | `claude/spike-num-bitpack` | KILL: fails the ≤+2pp gate on all four target files (evidence in this PR) |
| LDM census (CPU long-range dedup) | `claude/spike-ldm-census` | PASS: 4.7–18.7x above the 3% kill line on tarballs; route to whole-input immutable-dict front-end (evidence in this PR) |
| pz2-G32 stage 1 (32-lane literal relayout) | `claude/spike-pz2-g32` | stage-1 PASS (+0.0066pp aggregate); in flight, lands in its own PR |
| gpu-ibwt (sampled-LF cursor chase) | `claude/spike-ibwt-cursors` | GPU passes its gate but rejected; CPU K=8 multi-cursor rider proposed; in flight, lands in its own PR |
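
The two evidence-backed verdicts reduce to simple threshold arithmetic on the numbers quoted in the summary. The sketch below only re-derives that arithmetic; the function and constant names are hypothetical and are not taken from the probe code in this PR.

```rust
// Hypothetical names for illustration; thresholds and figures are the ones quoted in this PR.

/// num-G stage-0 gate: bit-pack may cost at most +2 percentage points vs per-plane FSE.
const NUM_G_GATE_PP: f64 = 2.0;
/// LDM census kill line: below 3% coverage on tarball-class corpora means KILL.
const LDM_KILL_LINE_PCT: f64 = 3.0;

/// True if a size delta (in percentage points) stays within the stage-0 gate.
fn num_g_passes(delta_pp: f64) -> bool {
    delta_pp <= NUM_G_GATE_PP
}

/// How many times above the kill line a coverage figure sits.
fn ldm_margin(coverage_pct: f64) -> f64 {
    coverage_pct / LDM_KILL_LINE_PCT
}

fn main() {
    // Deltas vs per-plane FSE, as reported by the num-bitpack probe.
    let num_g = [("sao", 18.34), ("x-ray", 9.67), ("mr", 2.09), ("nci", 47.06)];
    for (name, delta) in num_g.iter() {
        let verdict = if num_g_passes(*delta) { "pass" } else { "KILL" };
        println!("{:<6} {:+.2}pp -> {}", name, delta, verdict);
    }

    // Verified >1 MiB-offset coverage on the two tarball-class inputs.
    for (name, cov) in [("rustup-one.tar", 14.15), ("rustup-two.tar", 56.08)].iter() {
        println!("{:<15} {:.2}% = {:.1}x the kill line", name, cov, ldm_margin(*cov));
    }
}
```

Even the closest file (mr at +2.09pp) misses the ≤+2pp gate, while the tarball coverage figures land at roughly 4.7x and 18.7x the 3% line, matching the range in the table.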

Test plan

- ./scripts/test.sh green (fmt, clippy --all-targets incl. the two new examples, build, 744 tests; one webgpu perf-comparison test flaked under shared-machine load and passes in isolation)
- Both probes re-run against current master: num-bitpack reproduces the findings-doc numbers exactly and exits KILL; ldm_census reproduces 6.18% Silesia-blob coverage exactly

🤖 Generated with Claude Code

ndzip-style entropy-stage candidate for the Num pipeline: 32-value x 8-bitplane transpose + zero-word elimination with a per-group presence bitmap, exact inverse, plus probe_block() instrumentation that runs the shipping front-end (stride sweep + FSE-gated transforms) and counts bytes per plane under both entropy stages. Probe-only; nothing on the wire changes. Result: KILL on all four target files (sao +18.3pp, x-ray +9.7pp, mr +2.1pp, nci +47.1pp vs per-plane FSE). Findings doc follows. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
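
To make the candidate entropy stage concrete, here is a minimal sketch of a 32-value x 8-bitplane transpose with zero-word elimination and a per-group presence bitmap, plus its exact inverse. It assumes 8-bit input values and a one-byte bitmap followed by little-endian 32-bit plane words; the function names and the byte layout are illustrative assumptions, not the code in numeric::bitpack.

```rust
// Illustrative sketch only: names and wire layout are assumptions, not libpz's.
const GROUP: usize = 32;

/// Pack one group of 32 bytes: transpose into 8 bitplane words (bit i of
/// plane p is bit p of value i), then emit a presence bitmap followed by
/// only the nonzero plane words.
fn pack_group(vals: &[u8; GROUP], out: &mut Vec<u8>) {
    let mut planes = [0u32; 8];
    for (i, &v) in vals.iter().enumerate() {
        for p in 0..8 {
            planes[p] |= (((v >> p) & 1) as u32) << i;
        }
    }
    let mut presence = 0u8;
    for (p, &w) in planes.iter().enumerate() {
        if w != 0 {
            presence |= 1 << p;
        }
    }
    out.push(presence);
    for &w in planes.iter().filter(|&&w| w != 0) {
        out.extend_from_slice(&w.to_le_bytes());
    }
}

/// Exact inverse of `pack_group`; returns the number of input bytes consumed.
fn unpack_group(input: &[u8], vals: &mut [u8; GROUP]) -> usize {
    let presence = input[0];
    let mut pos = 1;
    *vals = [0u8; GROUP];
    for p in 0..8 {
        if (presence & (1 << p)) != 0 {
            let w = u32::from_le_bytes([input[pos], input[pos + 1], input[pos + 2], input[pos + 3]]);
            pos += 4;
            for i in 0..GROUP {
                vals[i] |= (((w >> i) & 1) as u8) << p;
            }
        }
    }
    pos
}

fn main() {
    // Small values: only the two lowest bitplanes are populated.
    let mut vals = [0u8; GROUP];
    for (i, v) in vals.iter_mut().enumerate() {
        *v = (i % 4) as u8;
    }
    let mut packed = Vec::new();
    pack_group(&vals, &mut packed);

    let mut back = [0u8; GROUP];
    let used = unpack_group(&packed, &mut back);
    assert_eq!(used, packed.len());
    assert_eq!(back, vals);
    println!("{} bytes -> {} bytes", GROUP, packed.len());
}
```

In this layout a group costs 1 byte when all values are zero and 33 bytes when all eight planes are populated, so the scheme only pays off when whole planes are empty within a group.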

Per-plane byte tables for sao/x-ray/mr/nci, oracle analysis (FSE wins every plane on 3/4 files; mr low plane is the lone 0.983x bitpack win), mechanism writeup, and the verdict that the CPU-SIMD next gate is not reached. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…on tarball-class corpora CPU-only census probe (examples/ldm_census.rs): exact 8-byte fingerprints at stride 32, sort, nearest >1MiB same-fp pairing, byte-verified greedy extension, bitmap-deduped coverage. Results (min_match=64): silesia.blob 6.18%, rustup-one.tar 14.15%, rustup-two.tar 56.08% — kill line was <3% on tarball-class. zstd -3 --long=30 confirms realizability: -3.78pp on single-toolchain tar, -16.85pp (2.30x) on dual-toolchain tar vs zstd -3; --long=27's 128MiB window captures <25% of the win. Recommendation: immutable depth-1 front-loaded-dictionary form, gated on heavy-tailed match lengths. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
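
The census steps listed in this commit message can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in examples/ldm_census.rs: the function names are hypothetical, both sides of a match are sampled at the same stride (so only stride-aligned offsets are detected), extension runs both backward and forward, and a `Vec<bool>` stands in for the coverage bitmap. The probe's reported parameters (stride 32, >1 MiB offset, min_match=64) are passed as arguments so the demo can run on a tiny input.

```rust
/// Count bytes covered by byte-verified matches whose source lies more than
/// `min_offset` bytes behind the destination. Hypothetical sketch.
fn census(data: &[u8], stride: usize, min_offset: usize, min_match: usize) -> usize {
    // 1. Exact 8-byte fingerprints (the raw bytes as a u64) at every stride-th position.
    let mut fps: Vec<(u64, usize)> = (0..data.len().saturating_sub(7))
        .step_by(stride)
        .map(|p| {
            let mut w = [0u8; 8];
            w.copy_from_slice(&data[p..p + 8]);
            (u64::from_le_bytes(w), p)
        })
        .collect();
    // 2. Sort so equal fingerprints are adjacent and ordered by position.
    fps.sort_unstable();

    let mut covered = vec![false; data.len()];
    let mut run_start = 0;
    while run_start < fps.len() {
        let fp = fps[run_start].0;
        let mut run_end = run_start;
        while run_end < fps.len() && fps[run_end].0 == fp {
            run_end += 1;
        }
        let run = &fps[run_start..run_end];
        // 3. Pair each position with the nearest earlier same-fingerprint
        //    position that is more than min_offset bytes behind it.
        let mut eligible = 0; // run[..eligible] all satisfy cur - pos > min_offset
        for &(_, cur) in run {
            while eligible < run.len() && run[eligible].1 + min_offset < cur {
                eligible += 1;
            }
            if eligible == 0 {
                continue;
            }
            let src = run[eligible - 1].1;
            // 4. Byte-verified greedy extension in both directions.
            let mut back = 0;
            while back < src && data[src - back - 1] == data[cur - back - 1] {
                back += 1;
            }
            let mut fwd = 0;
            while cur + fwd < data.len() && data[src + fwd] == data[cur + fwd] {
                fwd += 1;
            }
            // 5. Mark the destination range; overlapping matches are deduplicated.
            if back + fwd >= min_match {
                for c in &mut covered[cur - back..cur + fwd] {
                    *c = true;
                }
            }
        }
        run_start = run_end;
    }
    covered.iter().filter(|&&c| c).count()
}

/// 512 bytes of distinct 8-byte words followed by a repeat of the first 256 bytes.
fn synthetic() -> Vec<u8> {
    let mut data = Vec::new();
    for k in (0u64..64).chain(0..32) {
        // Multiplying by an odd constant is a bijection on u64, so words are distinct.
        data.extend_from_slice(&k.wrapping_mul(0x9E37_79B9_7F4A_7C15).to_le_bytes());
    }
    data
}

fn main() {
    let data = synthetic();
    // Scaled-down parameters; the probe reports stride 32, >1 MiB offsets, min_match 64.
    let covered = census(&data, 32, 300, 64);
    println!("covered {} of {} bytes", covered, data.len());
}
```

On the synthetic input the repeated 256-byte block sits 512 bytes after its source, so it is counted once despite being found from eight different sampled positions, which is the role the coverage bitmap plays.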


Chris Lundquist and others added 5 commits June 10, 2026 12:07

docs: GPU-path research final report (2026-06-10 multi-agent run)

f320a93

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

docs(ldm): reconciliation postscript — pz2d (#146-#148) covers the ar…

7924cf4

…chitecture, not the 256-540 MiB offsets Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ChrisLundquist merged commit 49d116a into master Jun 10, 2026
4 checks passed

ChrisLundquist merged 5 commits into master from claude/gpu-research-findings
