Skip to content

perf(pz2d): arena decode — 36.6 ms blob decode (−13%), corrected bottleneck diagnosis#150

Merged
ChrisLundquist merged 1 commit into
masterfrom
claude/pz2d-arena-decode
Jun 10, 2026
Merged

perf(pz2d): arena decode — 36.6 ms blob decode (−13%), corrected bottleneck diagnosis#150
ChrisLundquist merged 1 commit into
masterfrom
claude/pz2d-arena-decode

Conversation

@ChrisLundquist

Copy link
Copy Markdown
Owner

What

Task #12 part 1: the Pz2d arena decode. pz2::decode_into_arena is the new splice core — it takes an arena that already holds the dict, never writes below it, and lets callers truncate-and-reuse it across blocks. decode_with_prefix is now a thin wrapper (zero behavior change for every existing caller). The container's wave 2 fans out narrowly (available_parallelism/4 workers per segment, strided blocks) so each worker pays one dict copy amortized over its share; wave 1 decodes the dict chain directly into the output vec with zero prefix copies.

Measured (202 MB Silesia blob, warm-cache hyperfine, M5 Max)

ratio dec wall user CPU sys
pz2d v1 (master, same day) 30.48% 42.1 ms 250 ms 72 ms
pz2d v2 (this PR) 30.48% 36.6 ms 190 ms 40 ms
pz2 non-dict (both) 31.04% 17.8 ms

pz2's non-dict path is wall-identical after fixing a self-inflicted allocation gotcha: fresh decode buffers must come from vec![0; n] (calloc → pre-zeroed pages), not with_capacity + resize (explicit memset, double page touch) — that alone was +6 ms. Feedback note included.

The diagnosis correction (the real finding)

§11b projected ~17-20 ms assuming decode was memory-traffic-bound. It is DRAM-latency-bound on the format's random dict reads: outer-thread sweep hits a hard ~40 ms floor (t=6: 40.6, t=7: 41.9, t=18: 46.0) at only ~5 GB/s aggregate; ST per-segment is 16.8 ms (on model) with 2.4× concurrent inflation. W=1 inner workers — traffic-equivalent to the two-region splice — does not beat W=4, so the two-region splice is recorded as not worth its unsafe complexity. A 4 MiB-dict build measures 30.73% / 31.3 ms / 20.1 s enc (curve point if a dict-size header field is ever added; 16 MiB stays shipped as the max-ratio tier). Full ledger in clean-slate-codec.md §11c.

Safety

The unsafe wildcopy splice changed (buffer ownership, not copy discipline — all bounds validations and slack proofs preserved, invariant comments re-derived; the dict region is provably never stored to). Soaked: release 180 s + 120 s on final code, debug 45 s — zero panics across 30k round-trips / 1.2M mutated decodes / 240k garbage decodes. Suite: 745 + 597 tests green incl. new test_arena_reuse_round_trip; fmt/clippy clean; blob round-trips verified for pz2 and pz2d.

Next (task #12 part 2): encode-side concurrent cache inflation (23.7 s, 2.9×) — dict chain caps / sampled insertion.

🤖 Generated with Claude Code

…leneck diagnosis

Wave-2 v1 re-zeroed an 18 MiB buffer and re-copied the 16 MiB dict for
every block (~3.3 GB traffic/segment fleet). Now pz2::decode_into_arena
takes an arena already holding the dict, never writes below it, and
callers truncate-and-reuse it; decode_with_prefix is a thin wrapper.
The container's wave 2 fans out narrowly (parallelism/4 workers per
segment, strided blocks) so each worker pays one amortized dict copy,
and wave 1 decodes the chain directly into the output vec.

Blob (warm-cache hyperfine): pz2d 42.1 -> 36.6 ms at unchanged 30.48%,
user CPU 250 -> 190 ms, sys 72 -> 40 ms. pz2 non-dict path unchanged
(17.8 ms, re-anchored same day) after fixing an allocation gotcha:
fresh decode buffers must use vec![0; n] (calloc, lazy faulting), not
with_capacity + resize (explicit memset) — that alone was +6 ms.

The 17-20 ms projection in §11b was wrong: thread/worker sweeps show
decode is DRAM-latency-bound on the format's random dict reads
(~5 GB/s aggregate at a hard ~40 ms floor across t=6..18), not
traffic-bound. W=1 (traffic-equivalent to a two-region splice) does
not beat W=4, so the two-region splice is recorded as not worth
building. A 4 MiB-dict build measures 30.73% / 31.3 ms / 20.1 s — a
real curve point if a dict-size header is ever added; 16 MiB stays
shipped as the max-ratio tier. Ledger: clean-slate-codec.md §11c.

Soaked: release 180 s + 120 s on final code, debug 45 s, zero panics.
Suite: 745 + 597 green, fmt/clippy clean, blob round-trips verified.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ChrisLundquist ChrisLundquist merged commit d1f82c9 into master Jun 10, 2026
4 checks passed
@ChrisLundquist ChrisLundquist deleted the claude/pz2d-arena-decode branch June 10, 2026 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant