
perf(bw): K=8 multi-cursor inverse BWT — 1.55x ST / 2.66x all-cores CLI decode #151

Merged
ChrisLundquist merged 1 commit into master from claude/bw-multicursor-ibwt on Jun 10, 2026



@ChrisLundquist


Productionizes the CPU multi-cursor inverse-BWT rider from the claude/spike-ibwt-cursors spike (findings doc included in this PR: docs/design-docs/ibwt-cursor-findings.md).

What

The serial LF chase in inverse BWT is latency-bound: one dependent load per output byte into a block-sized working set. This PR runs K=8 interleaved LF cursors per bw block — each decoding its own 1/8 output segment backwards over a packed LF array ((byte << 24) | lf, one load per step) — converting that load latency into memory-level parallelism.

Deriving the cursor start states at decode time is a measured dead end (the derivation is a full serial chase), so the encoder computes the K−1 start samples from the suffix array (free at encode time) and stores them in the block header.

  • K=8, deliberately not 16: reproducible aarch64 register-spill cliff at K=16/32 (single-thread chase 788 → 220 MB/s in the spike).
  • Handles any block length (segment boundaries (j+1)·n/8, ragged tail ≤ 1 step/cursor) — no divisibility requirement.
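The data flow of the K-cursor scheme can be sketched as follows. This is a hedged illustration under several assumptions: the names are hypothetical, the BWT is a naive rotation sort, and the samples are read from an inverse of the sorted rotation order (standing in for the suffix array). Python gains no memory-level parallelism from the interleaving; the sketch only shows how segment boundaries, start samples, and the ragged tail fit together.

```python
K = 8  # cursor count chosen in the PR


def build_packed_lf(last: bytes) -> list:
    counts = [0] * 256
    for b in last:
        counts[b] += 1
    nxt, total = [0] * 256, 0
    for b in range(256):
        nxt[b], total = total, total + counts[b]
    packed = []
    for b in last:
        packed.append((b << 24) | nxt[b])
        nxt[b] += 1
    return packed


def bwt_with_samples(data: bytes, k: int = K):
    """Encoder side: BWT plus the k-1 cursor start rows.
    Sample j is the row of the rotation starting at (j+1)*n//k."""
    n = len(data)
    doubled = data + data
    rows = sorted(range(n), key=lambda i: doubled[i:i + n])
    rank = [0] * n                      # inverse of the sorted order
    for row, start in enumerate(rows):
        rank[start] = row
    last = bytes(doubled[i + n - 1] for i in rows)
    primary = rank[0] if n else 0
    samples = [rank[(j + 1) * n // k] for j in range(k - 1)] if n else []
    return last, primary, samples


def inverse_bwt_multicursor(last, primary, samples, k: int = K) -> bytes:
    n = len(last)
    if n == 0:
        return b""
    packed = build_packed_lf(last)
    bounds = [j * n // k for j in range(k + 1)]     # segment j = [bounds[j], bounds[j+1])
    cursors = list(samples) + [primary]             # last cursor starts at primary_index
    write = [bounds[j + 1] - 1 for j in range(k)]   # each segment is filled backwards
    out = bytearray(n)
    for _ in range(n // k):                         # lockstep: k independent chains
        for j in range(k):
            entry = packed[cursors[j]]
            out[write[j]] = entry >> 24
            cursors[j] = entry & 0xFFFFFF
            write[j] -= 1
    for j in range(k):                              # ragged tail: at most one step each
        if write[j] >= bounds[j]:
            out[write[j]] = packed[cursors[j]] >> 24
    return bytes(out)
```

Segment lengths differ by at most one, so the lockstep loop runs `n // k` steps for every cursor and the tail loop finishes the longer segments. Blocks shorter than `k` simply leave some segments empty.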

Wire format (versioned, backward compatible)

bw block header gains a flag bit, mirroring the BW_ZRLE_FLAG retrofit pattern:

[primary_index | BW_CURSORS_FLAG?: u32] [rle_len | BW_ZRLE_FLAG?: u32]
[cursor samples: 7 × u32, iff BW_CURSORS_FLAG] [fse_data…]
  • BW_CURSORS_FLAG = bit 31 of the primary_index field (always free: the index is a block position ≪ 2³¹).
  • Samples are emitted for blocks in [4 KiB, 16 MiB]; everything else stays byte-identical to the legacy format.
  • Legacy streams decode unchanged through the serial single-cursor path. Regression-tested two ways: a checked-in container fixture produced by master's actual binary (src/pipeline/testdata/bw-legacy-master-698193a.pz), and a flag-stripped stage-level test.
  • GPU-encoded blocks (no suffix array exposed) derive samples via a one-off serial chase at encode time, so all new blocks carry samples regardless of backend.
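A rough sketch of how such a header could be written and parsed is shown below. Only the flag position (bit 31 of the first field), the seven u32 samples, and the [4 KiB, 16 MiB] gate come from the description; the little-endian byte order, the function names, and the treatment of the second field as an opaque u32 are assumptions made for illustration.

```python
import struct

BW_CURSORS_FLAG = 1 << 31          # bit 31 of the primary_index field
NUM_SAMPLES = 7                    # K - 1 for K = 8
MIN_SAMPLED = 4 * 1024             # sampling gate, lower bound
MAX_SAMPLED = 16 * 1024 * 1024     # sampling gate, upper bound


def pack_header(primary_index, rle_field, block_len, samples=None) -> bytes:
    """rle_field is passed through untouched (it may carry BW_ZRLE_FLAG).
    Byte order '<' (little-endian) is an assumption."""
    assert primary_index < BW_CURSORS_FLAG
    sampled = samples is not None and MIN_SAMPLED <= block_len <= MAX_SAMPLED
    first = primary_index | (BW_CURSORS_FLAG if sampled else 0)
    header = struct.pack("<II", first, rle_field)
    if sampled:
        assert len(samples) == NUM_SAMPLES
        header += struct.pack("<7I", *samples)
    return header


def unpack_header(buf: bytes):
    """Returns (primary_index, rle_field, samples_or_None, payload_offset)."""
    first, rle_field = struct.unpack_from("<II", buf, 0)
    primary_index = first & (BW_CURSORS_FLAG - 1)
    if first & BW_CURSORS_FLAG:
        samples = list(struct.unpack_from("<7I", buf, 8))
        return primary_index, rle_field, samples, 8 + 4 * NUM_SAMPLES
    return primary_index, rle_field, None, 8   # legacy layout: serial path
```

A decoder following this shape picks the serial single-cursor path whenever `samples` is `None`, which is how a legacy stream would be handled without any version bump.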

Measured (Silesia 211.9 MB blob, CLI end-to-end pz -p bw -d, M5 Max)

7 reps/config, median, master (698193a) vs this branch back-to-back, CPU-only builds:

| Config | master | this PR | speedup | spread |
| --- | --- | --- | --- | --- |
| decode, 1 thread | 2.867 s (74 MB/s) | 1.846 s (115 MB/s) | 1.55x | ≤ 0.7% |
| decode, all cores | 0.412 s (514 MB/s) | 0.155 s (1367 MB/s) | 2.66x | ≤ 1.7% |
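The throughput and speedup columns follow directly from the wall-clock medians and the 211.9 MB input size. A small check, assuming decimal megabytes and using a hypothetical helper name:

```python
SIZE_MB = 211.9  # Silesia blob size quoted in the PR, assumed decimal MB


def throughput_and_speedup(before_s: float, after_s: float, size_mb: float = SIZE_MB):
    """Returns (MB/s before, MB/s after, speedup), rounded as in the table."""
    return (round(size_mb / before_s),
            round(size_mb / after_s),
            round(before_s / after_s, 2))


print(throughput_and_speedup(2.867, 1.846))  # single thread
print(throughput_and_speedup(0.412, 0.155))  # all cores
```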

(One earlier all-cores master run showed 0.43–0.60 s under load from sibling agents; the table uses the clean back-to-back run. ST numbers were stable across all runs.)

The spike's 2.6x ST / 4.2x all-cores figures were for the iBWT stage in isolation (LF build + chase); the whole-pipeline CLI numbers above are diluted by the FSE/MTF/zRLE stages, which are untouched.

Ratio cost: exactly 28 B per 1 MiB block — +5,796 bytes total on Silesia = +0.00273% of raw, +0.0099% of compressed. Per-file ratios unchanged to 2 decimals (blob 27.79%). Encode time unchanged within noise (sample readoff is a binary search inside the existing SA scan).
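The wire-cost figures can be reproduced with a few lines of arithmetic. The sketch assumes each sampled block adds exactly seven 4-byte samples and that the raw size is 211.9 decimal megabytes; the implied count of sampled blocks is derived here, not stated in the PR.

```python
SAMPLES_PER_BLOCK = 7      # K - 1
BYTES_PER_SAMPLE = 4       # u32
TOTAL_ADDED = 5796         # bytes added on Silesia, from the PR
RAW_BYTES = 211.9e6        # assumed decimal MB

per_block = SAMPLES_PER_BLOCK * BYTES_PER_SAMPLE      # 28 bytes
sampled_blocks, remainder = divmod(TOTAL_ADDED, per_block)
pct_of_raw = 100 * TOTAL_ADDED / RAW_BYTES

print(per_block, sampled_blocks, remainder)
print(f"{pct_of_raw:.5f}% of raw")
```

The total divides evenly by 28, which is consistent with every sampled block paying the same fixed cost.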

Verification: all 12 Silesia files round-trip byte-exactly (new→new and legacy-stream→new decoder), plus edge cases (empty, 1 byte, n < K, 4095/4096/4097 around the sampling gate, prime lengths, all-zeros 1 MiB block, multi-block ragged tail). Full ./scripts/test.sh suite green.

Scoped out

  • bbw: also LF-based, but inverts per Lyndon factor (primary index 0 per factor), so it needs a per-factor sampling scheme (variable sample counts on the wire) or factor-interleaving in the decoder — nontrivial; left as follow-up, noted in the findings doc.
  • Packed-LF single-pass build is the new ST bottleneck per the spike (418 MB/s); follow-up lever.
  • GPU iBWT: passed its spike gate but rejected on end-to-end Pareto grounds — see findings doc.

🤖 Generated with Claude Code

…LI decode

The serial LF chase is latency-bound (one dependent load per byte). Run 8
interleaved cursors per bw block over a packed LF array ((byte<<24)|lf, one
load per step), each decoding its own 1/8 output segment backwards — the
load latency becomes memory-level parallelism. K=8, not 16: reproducible
aarch64 register-spill cliff at K=16/32.

Cursor start states cannot be derived at decode time (the derivation is
itself a full serial chase — measured), so the encoder reads them off the
suffix array (free) and stores K-1=7 u32 samples in the block header behind
BW_CURSORS_FLAG (bit 31 of the primary_index field, mirroring the
BW_ZRLE_FLAG retrofit). Legacy streams decode unchanged via the serial
path; regression-tested against a container fixture produced by master's
binary. GPU-encoded blocks derive samples via a one-off encode-time chase.

Silesia blob (211.9 MB, CLI end-to-end, median of 7): decode 2.867->1.846 s
single-thread (1.55x), 0.412->0.155 s all-cores (2.66x). Wire cost 28 B per
1 MiB block = +0.00273% raw; ratio unchanged at 27.8%. Encode unchanged
within noise. bbw (per-factor inversion) scoped out as follow-up.

Spike: docs/design-docs/ibwt-cursor-findings.md (branch claude/spike-ibwt-cursors).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ChrisLundquist merged commit 41d8951 into master on Jun 10, 2026
4 checks passed