perf(pz2d): arena decode — 36.6 ms blob decode (−13%), corrected bottleneck diagnosis by ChrisLundquist · Pull Request #150 · ChrisLundquist/libpz

ChrisLundquist · 2026-06-10T19:30:21Z

What

Task #12 part 1: the Pz2d arena decode. pz2::decode_into_arena is the new splice core — it takes an arena that already holds the dict, never writes below it, and lets callers truncate-and-reuse it across blocks. decode_with_prefix is now a thin wrapper (zero behavior change for every existing caller). The container's wave 2 fans out narrowly (available_parallelism/4 workers per segment, strided blocks) so each worker pays one dict copy amortized over its share; wave 1 decodes the dict chain directly into the output vec with zero prefix copies.
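The truncate-and-reuse pattern and the strided fan-out described above can be sketched as follows. This is a minimal illustration, not the libpz code: the `Token` type, `decode_into_arena_sketch`, and `decode_blocks_strided` are hypothetical names invented here, the toy token stream stands in for the real pz2 format, and the real `pz2::decode_into_arena` signature is not shown in this PR text. The sketch uses safe byte-wise copies instead of the unsafe wildcopy splice.

```rust
use std::thread;

/// Toy token (hypothetical, not the pz2 format): a literal byte, or a
/// back-reference `dist` bytes behind the current end of the arena.
#[derive(Clone, Copy)]
enum Token {
    Lit(u8),
    Match { dist: usize, len: usize },
}

/// Appends decoded bytes after whatever the arena already holds (the dict).
/// Reads may reach into the dict region; writes only ever append above it.
fn decode_into_arena_sketch(arena: &mut Vec<u8>, tokens: &[Token]) -> Result<(), &'static str> {
    for &t in tokens {
        match t {
            Token::Lit(b) => arena.push(b),
            Token::Match { dist, len } => {
                if dist == 0 || dist > arena.len() {
                    return Err("back-reference out of range");
                }
                let start = arena.len() - dist;
                for i in 0..len {
                    // Byte-wise so overlapping matches (dist < len) replicate correctly.
                    let b = arena[start + i];
                    arena.push(b);
                }
            }
        }
    }
    Ok(())
}

type WorkerOut = Result<Vec<(usize, Vec<u8>)>, &'static str>;

/// Worker `w` decodes blocks w, w + workers, w + 2*workers, ... It copies the
/// dict once, then truncates the arena back to the dict between blocks.
fn decode_blocks_strided(
    dict: &[u8],
    blocks: &[Vec<Token>],
    workers: usize,
) -> Result<Vec<Vec<u8>>, &'static str> {
    let workers = workers.max(1);
    let results: Vec<WorkerOut> = thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                s.spawn(move || -> WorkerOut {
                    let mut arena = dict.to_vec(); // the one dict copy for this worker
                    let mut done = Vec::new();
                    for i in (w..blocks.len()).step_by(workers) {
                        arena.truncate(dict.len()); // drop the previous block, keep the dict
                        decode_into_arena_sketch(&mut arena, &blocks[i])?;
                        done.push((i, arena[dict.len()..].to_vec()));
                    }
                    Ok(done)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    let mut out = vec![Vec::new(); blocks.len()];
    for r in results {
        for (i, bytes) in r? {
            out[i] = bytes;
        }
    }
    Ok(out)
}
```

A caller following the PR's narrow fan-out might pick the worker count as `(std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1) / 4).max(1)`; how libpz actually derives and schedules it is not shown here.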

Measured (202 MB Silesia blob, warm-cache hyperfine, M5 Max)

variant	ratio	decode wall	user CPU	sys CPU
pz2d v1 (master, same day)	30.48%	42.1 ms	250 ms	72 ms
pz2d v2 (this PR)	30.48%	36.6 ms	190 ms	40 ms
pz2 non-dict (both)	31.04%	17.8 ms	—	—
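As a quick check of the headline figure, the relative changes between the v1 and v2 rows can be recomputed from the table. The helper name `percent_change` is made up for this illustration.

```rust
/// Relative change from `before` to `after`, in percent (negative = faster).
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

With the table's values, `percent_change(42.1, 36.6)` is about −13.1 (decode wall), `percent_change(250.0, 190.0)` is −24 (user CPU), and `percent_change(72.0, 40.0)` is about −44.4 (sys).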

pz2's non-dict path is wall-identical after fixing a self-inflicted allocation gotcha: fresh decode buffers must come from vec![0; n] (calloc → pre-zeroed pages), not with_capacity + resize (explicit memset, double page touch) — that alone was +6 ms. Feedback note included.
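The two allocation patterns contrasted above look like this in plain Rust. The function names are hypothetical and the timing helper is only a rough probe, not the hyperfine setup used for the numbers in this PR. In current standard-library implementations `vec![0u8; n]` requests zeroed memory from the allocator, while `with_capacity` + `resize` allocates and then writes the zeros explicitly.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Fresh zeroed buffer via the allocator's zeroed path (typically calloc).
fn fresh_zeroed(n: usize) -> Vec<u8> {
    vec![0u8; n]
}

/// Same contents, but the zeros are written explicitly after reserving.
fn reserve_then_resize(n: usize) -> Vec<u8> {
    let mut v = Vec::with_capacity(n);
    v.resize(n, 0u8);
    v
}

/// Times allocation plus one write per 4 KiB stride, so that pages are
/// actually touched. The 4 KiB stride is an assumption; real page sizes vary.
fn time_alloc(make: fn(usize) -> Vec<u8>, n: usize) -> Duration {
    let start = Instant::now();
    let mut buf = make(n);
    for i in (0..n).step_by(4096) {
        buf[i] = 1;
    }
    black_box(&buf);
    start.elapsed()
}
```

Both functions return identical buffers; only the way the zeroed pages are obtained differs, which is where the measured gap comes from according to the PR.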

The diagnosis correction (the real finding)

§11b projected ~17-20 ms on the assumption that decode was memory-traffic-bound. It is instead DRAM-latency-bound on the format's random dict reads: an outer-thread sweep hits a hard ~40 ms floor (t=6: 40.6 ms, t=7: 41.9 ms, t=18: 46.0 ms) at only ~5 GB/s aggregate. Single-threaded (ST) per-segment decode is 16.8 ms, which matches the model, but it inflates 2.4× under concurrency. W=1 inner workers, which is traffic-equivalent to the two-region splice, does not beat W=4, so the two-region splice is recorded as not worth its unsafe complexity. A 4 MiB-dict build measures 30.73% ratio / 31.3 ms decode / 20.1 s encode (a curve point if a dict-size header field is ever added; 16 MiB stays shipped as the max-ratio tier). Full ledger in clean-slate-codec.md §11c.
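The quoted figures are mutually consistent under two assumptions made here for illustration: that the ~5 GB/s aggregate is the 202 MB blob divided by the ~40 ms floor, and that the 2.4× inflation applies directly to the 16.8 ms single-threaded per-segment time. The helper names below are hypothetical.

```rust
/// Aggregate throughput in GB/s for `bytes` processed in `millis` milliseconds.
fn throughput_gb_per_s(bytes: f64, millis: f64) -> f64 {
    (bytes / 1e9) / (millis / 1e3)
}

/// Per-segment time once concurrent decoders inflate each other's latency.
fn inflated_ms(single_thread_ms: f64, inflation: f64) -> f64 {
    single_thread_ms * inflation
}
```

Under those assumptions, `throughput_gb_per_s(202e6, 40.0)` gives about 5.05 GB/s and `inflated_ms(16.8, 2.4)` gives about 40.3 ms, which lines up with the observed floor.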

Safety

The unsafe wildcopy splice changed in buffer ownership only, not in copy discipline: all bounds validations and slack proofs are preserved, the invariant comments are re-derived, and the dict region is provably never stored to. Soaked: release 180 s + 120 s on final code, debug 45 s, with zero panics across 30k round-trips / 1.2M mutated decodes / 240k garbage decodes. Suite: 745 + 597 tests green, including the new test_arena_reuse_round_trip; fmt/clippy clean; blob round-trips verified for pz2 and pz2d.

Next (task #12 part 2): encode-side concurrent cache inflation (23.7 s, 2.9×) — dict chain caps / sampled insertion.

🤖 Generated with Claude Code

…leneck diagnosis

Wave-2 v1 re-zeroed an 18 MiB buffer and re-copied the 16 MiB dict for every block (~3.3 GB traffic/segment fleet). Now pz2::decode_into_arena takes an arena already holding the dict, never writes below it, and callers truncate-and-reuse it; decode_with_prefix is a thin wrapper. The container's wave 2 fans out narrowly (parallelism/4 workers per segment, strided blocks) so each worker pays one amortized dict copy, and wave 1 decodes the chain directly into the output vec.

Blob (warm-cache hyperfine): pz2d 42.1 -> 36.6 ms at unchanged 30.48%, user CPU 250 -> 190 ms, sys 72 -> 40 ms. pz2 non-dict path unchanged (17.8 ms, re-anchored same day) after fixing an allocation gotcha: fresh decode buffers must use vec![0; n] (calloc, lazy faulting), not with_capacity + resize (explicit memset); that alone was +6 ms.

The 17-20 ms projection in §11b was wrong: thread/worker sweeps show decode is DRAM-latency-bound on the format's random dict reads (~5 GB/s aggregate at a hard ~40 ms floor across t=6..18), not traffic-bound. W=1 (traffic-equivalent to a two-region splice) does not beat W=4, so the two-region splice is recorded as not worth building. A 4 MiB-dict build measures 30.73% / 31.3 ms / 20.1 s, a real curve point if a dict-size header is ever added; 16 MiB stays shipped as the max-ratio tier. Ledger: clean-slate-codec.md §11c.

Soaked: release 180 s + 120 s on final code, debug 45 s, zero panics. Suite: 745 + 597 green, fmt/clippy clean, blob round-trips verified.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ChrisLundquist merged commit d1f82c9 into master Jun 10, 2026
4 checks passed

ChrisLundquist deleted the claude/pz2d-arena-decode branch June 10, 2026 19:30

This was referenced Jun 10, 2026

perf(pz2d): cap frozen-dict chain walks at 16 links — encode −44% for +0.06pp #152

Merged

feat(pz2): G32 stage 2 — Metal literal decode 65 GB/s, but the splice is the wall (GPU decode killed with data) #154

Merged

ChrisLundquist merged 1 commit into master from claude/pz2d-arena-decode
