perf(bw): K=8 multi-cursor inverse BWT - 1.55x ST / 2.66x all-cores CLI decode #151
Merged
…LI decode

The serial LF chase is latency-bound (one dependent load per byte). Run 8 interleaved cursors per bw block over a packed LF array ((byte<<24)|lf, one load per step), each decoding its own 1/8 output segment backwards, so the load latency becomes memory-level parallelism. K=8, not 16: reproducible aarch64 register-spill cliff at K=16/32.

Cursor start states cannot be derived at decode time (the derivation is itself a full serial chase; measured), so the encoder reads them off the suffix array (free) and stores K-1=7 u32 samples in the block header behind BW_CURSORS_FLAG (bit 31 of the primary_index field, mirroring the BW_ZRLE_FLAG retrofit). Legacy streams decode unchanged via the serial path; regression-tested against a container fixture produced by master's binary. GPU-encoded blocks derive samples via a one-off encode-time chase.

Silesia blob (211.9 MB, CLI end-to-end, median of 7): decode 2.867->1.846 s single-thread (1.55x), 0.412->0.155 s all-cores (2.66x). Wire cost 28 B per 1 MiB block = +0.00273% raw; ratio unchanged at 27.8%. Encode unchanged within noise.

bbw (per-factor inversion) scoped out as follow-up. Spike: docs/design-docs/ibwt-cursor-findings.md (branch claude/spike-ibwt-cursors).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
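The header change described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the project's code: the little-endian byte order, the placement of the samples directly after the `primary_index` field, and the helper names `pack_header` / `unpack_header` are all assumptions made for the example. Only the flag position (bit 31) and the sample count (K-1 = 7 u32 values) come from the text.

```python
import struct

BW_CURSORS_FLAG = 1 << 31   # bit 31 of the primary_index field (from the commit message)
NUM_SAMPLES = 7             # K - 1 cursor start samples for K = 8

def pack_header(primary_index, samples=None):
    """Hypothetical layout: u32 primary_index, then (if flagged) 7 u32 samples."""
    if primary_index >= BW_CURSORS_FLAG:
        raise ValueError("primary_index must leave bit 31 free")
    if samples is None:
        # Legacy form: no flag, no samples.
        return struct.pack("<I", primary_index)
    if len(samples) != NUM_SAMPLES:
        raise ValueError("expected exactly 7 samples")
    return struct.pack("<I7I", primary_index | BW_CURSORS_FLAG, *samples)

def unpack_header(buf):
    """Return (primary_index, samples or None, bytes consumed)."""
    (field,) = struct.unpack_from("<I", buf, 0)
    primary_index = field & (BW_CURSORS_FLAG - 1)   # clear the flag bit
    if field & BW_CURSORS_FLAG:
        samples = list(struct.unpack_from("<7I", buf, 4))
        return primary_index, samples, 4 + 4 * NUM_SAMPLES
    return primary_index, None, 4                    # legacy stream: serial decode path

legacy = pack_header(1234)
flagged = pack_header(1234, [10, 20, 30, 40, 50, 60, 70])
print(len(flagged) - len(legacy))   # extra bytes per flagged block
```

A decoder written this way reads old streams unchanged, because a legacy `primary_index` never has bit 31 set.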
Productionizes the CPU multi-cursor inverse-BWT rider from the claude/spike-ibwt-cursors spike (findings doc included in this PR: docs/design-docs/ibwt-cursor-findings.md).

What

The serial LF chase in inverse BWT is latency-bound: one dependent load per output byte into a block-sized working set. This PR runs K=8 interleaved LF cursors per bw block, each decoding its own 1/8 output segment backwards over a packed LF array ((byte << 24) | lf, one load per step), converting that load latency into memory-level parallelism.

Deriving the cursor start states at decode time is a measured dead end (the derivation is a full serial chase), so the encoder computes the K−1 start samples from the suffix array (free at encode time) and stores them in the block header.
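The data flow of the multi-cursor chase can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the PR's implementation: the BWT is built by naively sorting rotations, the function names are hypothetical, and Python cannot show the memory-level-parallelism gain. It only shows how K start rows read off the suffix array let K cursors each decode one output segment backwards through a packed `(byte << 24) | lf` array.

```python
def bwt_encode(data: bytes, k: int = 8):
    """Naive rotation-sort BWT (illustration only). Returns (last column, k start rows)."""
    n = len(data)
    if n == 0:
        return b"", [0] * k
    doubled = data + data
    sa = sorted(range(n), key=lambda i: doubled[i:i + n])   # rotation start for each row
    last = bytes(data[i - 1] for i in sa)                    # byte preceding each rotation
    row_of = [0] * n
    for row, start in enumerate(sa):
        row_of[start] = row
    # Cursor j starts at the row of the rotation beginning at segment end (j+1)*n//k.
    # The last cursor wraps to rotation 0, i.e. the usual primary index.
    starts = [row_of[((j + 1) * n // k) % n] for j in range(k)]
    return last, starts

def pack_lf(last: bytes):
    """Build packed[i] = (byte << 24) | LF[i]; requires n < 2**24."""
    assert len(last) < (1 << 24)
    next_row = [0] * 256
    for b in last:
        next_row[b] += 1
    total = 0
    for b in range(256):                 # exclusive prefix sum: first row of each byte
        next_row[b], total = total, total + next_row[b]
    packed = []
    for b in last:
        packed.append((b << 24) | next_row[b])
        next_row[b] += 1
    return packed

def inverse_bwt_cursors(last: bytes, starts):
    """Decode with len(starts) interleaved cursors, each filling its segment backwards."""
    k, n = len(starts), len(last)
    packed = pack_lf(last)
    bounds = [j * n // k for j in range(k + 1)]
    out = bytearray(n)
    rows = list(starts)
    pos = bounds[1:]                     # each cursor's write position (one past next byte)
    longest = max(bounds[j + 1] - bounds[j] for j in range(k))
    for _ in range(longest):
        for j in range(k):               # independent chains: one load per cursor per step
            if pos[j] > bounds[j]:       # ragged tail: shorter segments finish early
                entry = packed[rows[j]]
                pos[j] -= 1
                out[pos[j]] = entry >> 24
                rows[j] = entry & 0xFFFFFF
    return bytes(out)

last, starts = bwt_encode(b"abracadabra", k=8)
print(inverse_bwt_cursors(last, starts))
```

Calling `bwt_encode(data, k=1)` yields a single start row (the primary index), so the same function also serves as the serial baseline.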
… (j+1)·n/8, ragged tail ≤ 1 step/cursor); no divisibility requirement.

Wire format (versioned, backward compatible)

bw block header gains a flag bit, mirroring the BW_ZRLE_FLAG retrofit pattern: BW_CURSORS_FLAG = bit 31 of the primary_index field (always free: the index is a block position ≪ 2³¹).
… [4 KiB, 16 MiB]; everything else stays byte-identical to the legacy format.
… src/pipeline/testdata/bw-legacy-master-698193a.pz), and a flag-stripped stage-level test.

Measured (Silesia 211.9 MB blob, CLI end-to-end pz -p bw -d, M5 Max)

7 reps/config, median, master (698193a) vs this branch back-to-back, CPU-only builds:
Single-thread decode: 2.867 s (master) vs 1.846 s (this branch), 1.55x
All-cores decode: 0.412 s (master) vs 0.155 s (this branch), 2.66x
(One earlier all-cores master run showed 0.43–0.60 s under load from sibling agents; the table uses the clean back-to-back run. ST numbers were stable across all runs.)
The spike's 2.6x ST / 4.2x all-cores figures were for the iBWT stage in isolation (LF build + chase); the whole-pipeline CLI numbers above are diluted by the FSE/MTF/zRLE stages, which are untouched.
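As a rough back-of-envelope check of that dilution, Amdahl's law can be inverted to estimate what share of baseline decode time the iBWT stage would need to occupy for the stage-level and CLI-level speedups to agree. This assumes the spike's stage speedups carry over unchanged and that every other stage takes the same time as before, neither of which is confirmed here; the function name is hypothetical and the results are estimates, not measurements.

```python
def implied_stage_fraction(stage_speedup, overall_speedup):
    """Invert Amdahl's law, 1/overall = (1 - f) + f/stage, to solve for f."""
    return (1 - 1 / overall_speedup) / (1 - 1 / stage_speedup)

# Stage-in-isolation speedups (spike) paired with whole-pipeline CLI speedups (this PR).
single_thread = implied_stage_fraction(stage_speedup=2.6, overall_speedup=1.55)
all_cores = implied_stage_fraction(stage_speedup=4.2, overall_speedup=2.66)

print(f"implied iBWT share of baseline decode, single-thread: {single_thread:.0%}")
print(f"implied iBWT share of baseline decode, all-cores:     {all_cores:.0%}")
```

The two estimates need not match, since the stage mix can differ between single-thread and all-cores runs.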
Ratio cost: exactly 28 B per 1 MiB block — +5,796 bytes total on Silesia = +0.00273% of raw, +0.0099% of compressed. Per-file ratios unchanged to 2 decimals (blob 27.79%). Encode time unchanged within noise (sample readoff is a binary search inside the existing SA scan).
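The ratio-cost figures can be reproduced with a few lines of arithmetic. The sketch below uses only numbers quoted above (7 u32 samples, 5,796 added bytes, 211.9 MB raw); treating 211.9 MB as 211.9e6 bytes is an assumption, so the percentage is checked only approximately.

```python
SAMPLE_BYTES = 4           # each sample is a u32
NUM_SAMPLES = 7            # K - 1 for K = 8
BLOCK_BYTES = 1 << 20      # 1 MiB block
ADDED_BYTES = 5796         # total growth reported on Silesia
RAW_BYTES = 211.9e6        # assumption: "211.9 MB" read as decimal megabytes

per_block = SAMPLE_BYTES * NUM_SAMPLES             # 28 bytes per flagged block
flagged_blocks = ADDED_BYTES // per_block          # number of blocks carrying samples
pct_per_full_block = 100 * per_block / BLOCK_BYTES
pct_of_raw = 100 * ADDED_BYTES / RAW_BYTES

print(per_block, flagged_blocks)
print(f"{pct_per_full_block:.5f}% per full 1 MiB block, {pct_of_raw:.5f}% of raw overall")
```

The overall percentage comes out slightly above the per-full-block figure because 207 blocks of exactly 1 MiB would exceed the corpus size, so some flagged blocks must be shorter than 1 MiB while still carrying the full 28 bytes.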
Verification: all 12 Silesia files round-trip byte-exactly (new→new and legacy-stream→new decoder), plus edge cases (empty, 1 byte, n < K, 4095/4096/4097 around the sampling gate, prime lengths, all-zeros 1 MiB block, multi-block ragged tail). Full ./scripts/test.sh suite green.

Scoped out
bbw (per-factor inversion), deferred to a follow-up.
🤖 Generated with Claude Code