docs: GPU-path research report + num-G kill and LDM census evidence #149

Merged: ChrisLundquist merged 5 commits into master from claude/gpu-research-findings on Jun 10, 2026

@ChrisLundquist

Owner

Summary

Consolidates the 2026-06-10 GPU-path research into master. The multi-agent research run (4 repo-history readers, 5 web researchers, 3 adversarial verification lenses per candidate, 37 agents total) concluded that libpz's GPU-entropy dead end measured a specific low-parallelism stream topology rather than GPU entropy coding itself, but that on M5 unified memory the GPU has no bandwidth moat over 18 CPU cores already delivering 12.1 GiB/s pz2 decode. The encode side is therefore settled (hybrid GPU-match + CPU parse stands), and only one decode path is credible: a GDeflate-shaped evolution of pz2's wire (pz2-G32). Everything else is spike-gated or pareto-rejected.

This PR adds:

  • The final report (docs/design-docs/gpu-path-research.md).
  • The stage-0 num-bitpack kill evidence: findings doc, numeric::bitpack probe code, and examples/num_bitpack_probe.rs (deterministic, exits 1/KILL: +18.34/+9.67/+2.09/+47.06pp vs per-plane FSE on sao/x-ray/mr/nci).
  • The LDM census evidence: docs/design-docs/ldm-census-findings.md and examples/ldm_census.rs (PASS: 14.15%/56.08% verified >1 MiB-offset coverage on tarball-class corpora; zstd --long=30 bar at 25.26%/12.99%).
  • A reconciliation postscript noting that the pz2d dict tier shipped in #146#148 implements the census's recommended immutable depth-1 architecture but is segment-scoped (<32 MiB reach). The census's key finding, duplicate mass at 256–540 MiB offsets, therefore remains uncaptured, and the whole-input-scoped front-end is still open.

Spike verdicts

| Spike (from report §5) | Branch | Verdict |
| --- | --- | --- |
| num-G stage 0 (vertical bit-pack vs per-plane FSE) | claude/spike-num-bitpack | KILL: fails the ≤+2pp gate on all four target files (evidence in this PR) |
| LDM census (CPU long-range dedup) | claude/spike-ldm-census | PASS: 4.7–18.7x above the 3% kill line on tarballs; route to whole-input immutable-dict front-end (evidence in this PR) |
| pz2-G32 stage 1 (32-lane literal relayout) | claude/spike-pz2-g32 | stage-1 PASS (+0.0066pp aggregate); in flight, lands in its own PR |
| gpu-ibwt (sampled-LF cursor chase) | claude/spike-ibwt-cursors | GPU passes its gate but rejected; CPU K=8 multi-cursor rider proposed; in flight, lands in its own PR |
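
The two evidence-backed verdicts above reduce to simple threshold arithmetic. The sketch below recomputes them from the figures quoted in this PR; the function and constant names are illustrative and do not come from the repo, and it assumes the num-G gate means "size delta of at most +2pp" and the LDM kill line means "coverage below 3%".

```rust
/// Size deltas reported in this PR for sao, x-ray, mr, nci
/// (vertical bit-pack vs per-plane FSE), in percentage points.
pub const NUM_G_DELTAS_PP: [f64; 4] = [18.34, 9.67, 2.09, 47.06];

/// True when a size delta (in percentage points) stays within the gate.
/// Assumption: the gate is inclusive, i.e. delta <= gate passes.
pub fn passes_pp_gate(delta_pp: f64, gate_pp: f64) -> bool {
    delta_pp <= gate_pp
}

/// How many times a measured coverage percentage exceeds the kill line.
pub fn multiple_of_kill_line(coverage_pct: f64, kill_line_pct: f64) -> f64 {
    coverage_pct / kill_line_pct
}
```

With a 2.0pp gate every entry of `NUM_G_DELTAS_PP` fails (mr misses by 0.09pp), and `multiple_of_kill_line(14.15, 3.0)` and `multiple_of_kill_line(56.08, 3.0)` come out at roughly 4.7 and 18.7, matching the range quoted for the tarball corpora.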

Test plan

  • ./scripts/test.sh green (fmt, clippy --all-targets incl. the two new examples, build, 744 tests; one webgpu perf-comparison test flaked under shared-machine load and passes in isolation)
  • Both probes re-run against current master: num-bitpack reproduces the findings-doc numbers exactly and exits KILL; ldm_census reproduces 6.18% Silesia-blob coverage exactly

🤖 Generated with Claude Code

Chris Lundquist and others added 5 commits June 10, 2026 12:07
ndzip-style entropy-stage candidate for the Num pipeline: 32-value x 8-bitplane
transpose + zero-word elimination with a per-group presence bitmap, exact
inverse, plus probe_block() instrumentation that runs the shipping front-end
(stride sweep + FSE-gated transforms) and counts bytes per plane under both
entropy stages. Probe-only; nothing on the wire changes.

Result: KILL on all four target files (sao +18.3pp, x-ray +9.7pp, mr +2.1pp,
nci +47.1pp vs per-plane FSE). Findings doc follows.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
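
For readers unfamiliar with the candidate described in the commit message above, here is a minimal sketch of a 32-value x 8-bitplane transpose with zero-word elimination and a per-group presence bitmap, plus its exact inverse. It is not the `numeric::bitpack` code from this PR: `pack_group` and `unpack_group` are hypothetical names, and the layout (one `u8` presence bitmap followed by one `u32` per non-zero plane) is an assumption made for illustration.

```rust
/// Transpose 32 bytes into 8 bitplane words and drop the all-zero planes.
/// Returns (presence bitmap, non-zero plane words in ascending plane order).
/// Hypothetical layout, not the wire format of any shipping pipeline.
pub fn pack_group(values: &[u8; 32]) -> (u8, Vec<u32>) {
    let mut planes = [0u32; 8];
    for (i, &v) in values.iter().enumerate() {
        for (p, plane) in planes.iter_mut().enumerate() {
            // Bit p of value i becomes bit i of plane word p.
            *plane |= (((v >> p) & 1) as u32) << i;
        }
    }
    let mut presence = 0u8;
    let mut words = Vec::new();
    for (p, &w) in planes.iter().enumerate() {
        if w != 0 {
            presence |= 1u8 << p; // mark plane p as stored
            words.push(w);
        }
    }
    (presence, words)
}

/// Exact inverse of `pack_group`. Returns None if the number of words
/// does not match the number of bits set in the presence bitmap.
pub fn unpack_group(presence: u8, words: &[u32]) -> Option<[u8; 32]> {
    if words.len() != presence.count_ones() as usize {
        return None;
    }
    let mut out = [0u8; 32];
    let mut next = words.iter();
    for p in 0..8u32 {
        if (presence & (1u8 << p)) != 0 {
            let w = *next.next()?;
            for (i, o) in out.iter_mut().enumerate() {
                // Bit i of plane word p goes back to bit p of value i.
                *o |= (((w >> i) & 1) as u8) << p;
            }
        }
    }
    Some(out)
}
```

In this sketch a group costs 1 byte plus 4 bytes per non-zero plane, so a group in which all eight planes are populated takes 33 bytes for 32 input bytes, while a group of values below 4 takes 9 bytes.
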
Per-plane byte tables for sao/x-ray/mr/nci, oracle analysis (FSE wins every
plane on 3/4 files; mr low plane is the lone 0.983x bitpack win), mechanism
writeup, and the verdict that the CPU-SIMD next gate is not reached.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…on tarball-class corpora

CPU-only census probe (examples/ldm_census.rs): exact 8-byte fingerprints
at stride 32, sort, nearest >1MiB same-fp pairing, byte-verified greedy
extension, bitmap-deduped coverage.

Results (min_match=64): silesia.blob 6.18%, rustup-one.tar 14.15%,
rustup-two.tar 56.08% — kill line was <3% on tarball-class. zstd -3
--long=30 confirms realizability: -3.78pp on single-toolchain tar,
-16.85pp (2.30x) on dual-toolchain tar vs zstd -3; --long=27's 128MiB
window captures <25% of the win. Recommendation: immutable depth-1
front-loaded-dictionary form, gated on heavy-tailed match lengths.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
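
The census steps listed in the commit message above (fingerprint, sort, pair, verify and extend, dedupe coverage) can be sketched as follows. This is an illustration under stated assumptions, not examples/ldm_census.rs: the function name and parameters are hypothetical, the coverage "bitmap" is a plain `Vec<bool>`, and both sides of a pair are sampled at the same stride, so only duplicates whose offset is a multiple of the stride are visible. Whether the real probe shares that limitation is not stated in this PR.

```rust
/// Count bytes covered by verified long-range duplicates.
/// `stride`: sampling step for 8-byte fingerprints (the PR uses 32).
/// `min_offset`: pairs must be strictly farther apart than this (the PR uses 1 MiB).
/// `min_match`: minimum verified match length to count (the PR reports min_match=64).
pub fn long_range_coverage(
    data: &[u8],
    stride: usize,
    min_offset: usize,
    min_match: usize,
) -> usize {
    assert!(stride > 0);
    // 1. Exact fingerprints: the 8 bytes themselves, read as a u64.
    let mut samples: Vec<(u64, usize)> = Vec::new();
    let mut pos = 0;
    while pos + 8 <= data.len() {
        let mut b = [0u8; 8];
        b.copy_from_slice(&data[pos..pos + 8]);
        samples.push((u64::from_le_bytes(b), pos));
        pos += stride;
    }
    // 2. Sort by (fingerprint, position) so equal fingerprints form runs.
    samples.sort_unstable();

    let mut covered = vec![false; data.len()];
    let mut start = 0;
    while start < samples.len() {
        let mut end = start + 1;
        while end < samples.len() && samples[end].0 == samples[start].0 {
            end += 1;
        }
        let run = &samples[start..end];
        for (k, &(_, dst)) in run.iter().enumerate() {
            // 3. Nearest earlier position farther away than min_offset.
            //    Linear scan: quadratic on long runs, acceptable for a sketch.
            let found = run[..k].iter().rev().find(|&&(_, s)| dst - s > min_offset);
            if let Some(&(_, src)) = found {
                // 4. Byte-verified greedy extension, forward then backward.
                let mut fwd = 0;
                while dst + fwd < data.len() && data[src + fwd] == data[dst + fwd] {
                    fwd += 1;
                }
                let mut back = 0;
                while back < src && data[src - back - 1] == data[dst - back - 1] {
                    back += 1;
                }
                // 5. Mark coverage; overlapping matches are counted once.
                if back + fwd >= min_match {
                    for c in &mut covered[dst - back..dst + fwd] {
                        *c = true;
                    }
                }
            }
        }
        start = end;
    }
    covered.iter().filter(|&&c| c).count()
}
```

Dividing the returned count by `data.len()` gives a coverage percentage comparable in spirit to the figures above. Only the later copy of each duplicate is marked, since the earlier copy is what a long-range matcher would reference.
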
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…chitecture, not the 256-540 MiB offsets

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist merged commit 49d116a into master Jun 10, 2026
4 checks passed