perf(pz2d): arena decode, 36.6 ms blob decode (−13%), corrected bottleneck diagnosis #150
Merged
…leneck diagnosis
Wave-2 v1 re-zeroed an 18 MiB buffer and re-copied the 16 MiB dict for every block (~3.3 GB traffic/segment fleet). Now pz2::decode_into_arena takes an arena already holding the dict, never writes below it, and callers truncate-and-reuse it; decode_with_prefix is a thin wrapper. The container's wave 2 fans out narrowly (parallelism/4 workers per segment, strided blocks) so each worker pays one amortized dict copy, and wave 1 decodes the chain directly into the output vec.
Blob (warm-cache hyperfine): pz2d 42.1 -> 36.6 ms at unchanged 30.48%, user CPU 250 -> 190 ms, sys 72 -> 40 ms. pz2 non-dict path unchanged (17.8 ms, re-anchored same day) after fixing an allocation gotcha: fresh decode buffers must use vec![0; n] (calloc, lazy faulting), not with_capacity + resize (explicit memset); that alone was +6 ms.
The 17-20 ms projection in §11b was wrong: thread/worker sweeps show decode is DRAM-latency-bound on the format's random dict reads (~5 GB/s aggregate at a hard ~40 ms floor across t=6..18), not traffic-bound. W=1 (traffic-equivalent to a two-region splice) does not beat W=4, so the two-region splice is recorded as not worth building. A 4 MiB-dict build measures 30.73% / 31.3 ms / 20.1 s, a real curve point if a dict-size header is ever added; 16 MiB stays shipped as the max-ratio tier.
Ledger: clean-slate-codec.md §11c. Soaked: release 180 s + 120 s on final code, debug 45 s, zero panics. Suite: 745 + 597 green, fmt/clippy clean, blob round-trips verified.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
Task #12 part 1: the Pz2d arena decode.
pz2::decode_into_arena is the new splice core: it takes an arena that already holds the dict, never writes below it, and lets callers truncate-and-reuse it across blocks. decode_with_prefix is now a thin wrapper (zero behavior change for every existing caller). The container's wave 2 fans out narrowly (available_parallelism/4 workers per segment, strided blocks) so each worker pays one dict copy amortized over its share; wave 1 decodes the dict chain directly into the output vec with zero prefix copies.
Measured (202 MB Silesia blob, warm-cache hyperfine, M5 Max)
pz2d blob decode: 42.1 -> 36.6 ms at unchanged 30.48%; user CPU 250 -> 190 ms; sys 72 -> 40 ms.
pz2's non-dict path is wall-identical after fixing a self-inflicted allocation gotcha: fresh decode buffers must come from vec![0; n] (calloc, lazy faulting), not with_capacity + resize (explicit memset, double page touch). That alone was +6 ms. Feedback note included.
The diagnosis correction (the real finding)
§11b projected ~17-20 ms assuming decode was memory-traffic-bound. It is DRAM-latency-bound on the format's random dict reads: outer-thread sweep hits a hard ~40 ms floor (t=6: 40.6, t=7: 41.9, t=18: 46.0) at only ~5 GB/s aggregate; ST per-segment is 16.8 ms (on model) with 2.4× concurrent inflation. W=1 inner workers — traffic-equivalent to the two-region splice — does not beat W=4, so the two-region splice is recorded as not worth its unsafe complexity. A 4 MiB-dict build measures 30.73% / 31.3 ms / 20.1 s enc (curve point if a dict-size header field is ever added; 16 MiB stays shipped as the max-ratio tier). Full ledger in clean-slate-codec.md §11c.
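To make the arena contract and the strided fan-out concrete, here is a minimal sketch under stated assumptions. It is not the pz2 implementation: the toy Block type (a list of dict copies), the function signatures, and decode_strided / decode_all are hypothetical and exist only for illustration. What it does show is the shape described under What: each worker copies the dict once, decodes blocks w, w + W, w + 2W, ... and truncates the arena back to the dict between blocks, and the decoder only ever appends above the dict.

```rust
// Hypothetical sketch: the toy block format and all signatures here are
// illustrative assumptions, not the real pz2 API.

/// A toy "block": a list of (offset into dict, length) copies.
type Block = Vec<(usize, usize)>;

/// Decode one block by appending to `arena`, which must already hold the
/// dict in `arena[..dict_len]`. Only appends; never writes below `dict_len`.
fn decode_into_arena(arena: &mut Vec<u8>, dict_len: usize, block: &Block) -> Option<()> {
    for &(off, len) in block {
        let end = off.checked_add(len)?;
        if end > dict_len {
            return None; // reject references outside the dict
        }
        arena.extend_from_within(off..end);
    }
    Some(())
}

/// Worker `w` of `workers` decodes blocks w, w + workers, w + 2*workers, ...
/// It pays one dict copy, then reuses the arena for every block in its share.
fn decode_strided(
    dict: &[u8],
    blocks: &[Block],
    w: usize,
    workers: usize,
) -> Option<Vec<(usize, Vec<u8>)>> {
    let mut arena = dict.to_vec(); // the single dict copy for this worker
    let mut out = Vec::new();
    for (i, block) in blocks.iter().enumerate().skip(w).step_by(workers) {
        arena.truncate(dict.len()); // drop the previous block's output, keep the dict
        decode_into_arena(&mut arena, dict.len(), block)?;
        out.push((i, arena[dict.len()..].to_vec()));
    }
    Some(out)
}

/// Fan out over `workers` scoped threads and reassemble blocks in order.
fn decode_all(dict: &[u8], blocks: &[Block], workers: usize) -> Option<Vec<Vec<u8>>> {
    assert!(workers > 0, "step_by requires a non-zero stride");
    let mut result = vec![Vec::new(); blocks.len()];
    let parts: Vec<_> = std::thread::scope(|s| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = (0..workers)
            .map(|w| s.spawn(move || decode_strided(dict, blocks, w, workers)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    for part in parts {
        for (i, bytes) in part? {
            result[i] = bytes;
        }
    }
    Some(result)
}
```

In this sketch the number of dict copies equals the worker count rather than the block count, which is the amortization the PR describes; per the sweep above, cutting copies further (W=1) did not help because the floor is latency, not traffic.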
Safety
The unsafe wildcopy splice changed in buffer ownership only, not copy discipline: all bounds validations and slack proofs are preserved, invariant comments re-derived, and the dict region is provably never stored to. Soaked: release 180 s + 120 s on final code, debug 45 s; zero panics across 30k round-trips / 1.2M mutated decodes / 240k garbage decodes. Suite: 745 + 597 tests green incl. new test_arena_reuse_round_trip; fmt/clippy clean; blob round-trips verified for pz2 and pz2d.
Next (task #12 part 2): encode-side concurrent cache inflation (23.7 s, 2.9×); dict chain caps / sampled insertion.
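For reference, the allocation gotcha noted under Measured comes down to two ways of getting a zeroed buffer. The sketch below is illustrative only: the helper names fresh_zeroed and fresh_resized are hypothetical, and the +6 ms figure is the PR's measurement on its own workload, not something this snippet reproduces.

```rust
/// Preferred form: for a zero element, `vec![0; n]` requests already-zeroed
/// memory from the allocator, so a large buffer can be backed by pages that
/// are only faulted in when first touched.
fn fresh_zeroed(n: usize) -> Vec<u8> {
    vec![0u8; n]
}

/// The pattern the PR moved away from: reserve uninitialized capacity, then
/// explicitly write zeros over all `n` bytes, touching every page up front.
fn fresh_resized(n: usize) -> Vec<u8> {
    let mut buf = Vec::with_capacity(n);
    buf.resize(n, 0u8);
    buf
}
```

Both helpers return identical contents; the difference is only in when and how the pages get touched, which is why it shows up in wall time rather than in tests.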
🤖 Generated with Claude Code