perf(bw): K=8 multi-cursor inverse BWT — 1.55x ST / 2.66x all-cores CLI decode by ChrisLundquist · Pull Request #151 · ChrisLundquist/libpz

ChrisLundquist · 2026-06-10T19:32:06Z

Productionizes the CPU multi-cursor inverse-BWT rider from the claude/spike-ibwt-cursors spike (findings doc included in this PR: docs/design-docs/ibwt-cursor-findings.md).

What

The serial LF chase in inverse BWT is latency-bound: one dependent load per output byte into a block-sized working set. This PR runs K=8 interleaved LF cursors per bw block — each decoding its own 1/8 output segment backwards over a packed LF array ((byte << 24) | lf, one load per step) — converting that load latency into memory-level parallelism.

Deriving the cursor start states at decode time is a measured dead end (the derivation is a full serial chase), so the encoder computes the K−1 start samples from the suffix array (free at encode time) and stores them in the block header.

K=8, deliberately not 16: reproducible aarch64 register-spill cliff at K=16/32 (single-thread chase 788 → 220 MB/s in the spike).
Handles any block length (segment boundaries (j+1)·n/8, ragged tail ≤ 1 step/cursor) — no divisibility requirement.
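The scheme above can be sketched end to end in a few dozen lines. This is a minimal illustration, not the code in this PR: it assumes a rotation-based BWT (no sentinel), builds the transform with a naive rotation sort, and all function names (`bwt_rotations`, `cursor_samples`, `build_packed_lf`, `inverse_bwt_multicursor`) are hypothetical. Cursor `j` starts at the row of the rotation beginning at `(j+1)·n/K` and writes its segment backwards; the last cursor starts at the primary index, so only K−1 samples are needed. The packed entry leaves 24 bits for `lf`, which is why this sketch rejects inputs longer than 2^24 bytes.

```rust
const K: usize = 8; // cursor count from the PR

/// Naive BWT over sorted rotations (illustration only, O(n^2 log n)).
/// Returns the last column and `sa`, where `sa[row]` is the start offset
/// of the rotation stored in that row.
fn bwt_rotations(text: &[u8]) -> (Vec<u8>, Vec<usize>) {
    let n = text.len();
    let mut sa: Vec<usize> = (0..n).collect();
    sa.sort_by(|&a, &b| {
        let ra = text[a..].iter().chain(text[..a].iter());
        let rb = text[b..].iter().chain(text[..b].iter());
        ra.cmp(rb)
    });
    let last: Vec<u8> = sa.iter().map(|&p| text[(p + n - 1) % n]).collect();
    (last, sa)
}

/// Encoder side: read the primary index and the K-1 cursor start rows
/// off the suffix array. Sample j is the row whose rotation starts at
/// the segment boundary (j+1)*n/K.
fn cursor_samples(sa: &[usize]) -> (u32, [u32; K - 1]) {
    let n = sa.len();
    let mut primary = 0u32;
    let mut samples = [0u32; K - 1];
    for (row, &p) in sa.iter().enumerate() {
        if p == 0 {
            primary = row as u32;
        }
        for j in 0..K - 1 {
            if p == (j + 1) * n / K {
                samples[j] = row as u32;
            }
        }
    }
    (primary, samples)
}

/// Packed LF array: entry = (byte << 24) | lf, so one load yields both
/// the output byte and the next row.
fn build_packed_lf(last: &[u8]) -> Vec<u32> {
    assert!(last.len() <= 1 << 24, "lf must fit in 24 bits");
    let mut next = [0u32; 256];
    for &b in last {
        next[b as usize] += 1;
    }
    let mut sum = 0u32;
    for slot in next.iter_mut() {
        let count = *slot;
        *slot = sum; // exclusive prefix sum: first row for each byte value
        sum += count;
    }
    let mut packed = Vec::with_capacity(last.len());
    for &b in last {
        let lf = next[b as usize];
        next[b as usize] += 1;
        packed.push(((b as u32) << 24) | lf);
    }
    packed
}

/// Decoder side: K interleaved cursors, each filling its own segment
/// [j*n/K, (j+1)*n/K) from the end towards the start.
fn inverse_bwt_multicursor(last: &[u8], primary: u32, samples: &[u32; K - 1]) -> Vec<u8> {
    let n = last.len();
    let plf = build_packed_lf(last);
    let bound = |j: usize| j * n / K;
    let mut out = vec![0u8; n];
    let mut row = [0usize; K];
    let mut pos = [0usize; K]; // one past the next byte each cursor writes
    for j in 0..K {
        row[j] = if j + 1 == K { primary as usize } else { samples[j] as usize };
        pos[j] = bound(j + 1);
    }
    // Segment lengths differ by at most one, so the guard below only
    // skips a cursor on the final (ragged) step.
    let steps = (0..K).map(|j| bound(j + 1) - bound(j)).max().unwrap_or(0);
    for _ in 0..steps {
        for j in 0..K {
            if pos[j] > bound(j) {
                let e = plf[row[j]]; // the single dependent load per step
                pos[j] -= 1;
                out[pos[j]] = (e >> 24) as u8;
                row[j] = (e & 0x00FF_FFFF) as usize;
            }
        }
    }
    out
}
```

The eight loads in the inner loop are independent of one another, so the CPU can keep them in flight simultaneously; a production version would presumably unroll that loop by hand rather than rely on the compiler.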

Wire format (versioned, backward compatible)

bw block header gains a flag bit, mirroring the BW_ZRLE_FLAG retrofit pattern:

[primary_index | BW_CURSORS_FLAG?: u32] [rle_len | BW_ZRLE_FLAG?: u32]
[cursor samples: 7 × u32, iff BW_CURSORS_FLAG] [fse_data…]

BW_CURSORS_FLAG = bit 31 of the primary_index field (always free: the index is a block position ≪ 2³¹).
Samples are emitted for blocks in [4 KiB, 16 MiB]; everything else stays byte-identical to the legacy format.
Legacy streams decode unchanged through the serial single-cursor path. Regression-tested two ways: a checked-in container fixture produced by master's actual binary (src/pipeline/testdata/bw-legacy-master-698193a.pz), and a flag-stripped stage-level test.
GPU-encoded blocks (no suffix array exposed) derive samples via a one-off serial chase at encode time, so every new block in the sampled size range carries samples regardless of backend.
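As a rough illustration of the flag-bit retrofit, the sketch below writes and parses the header prefix described above. It is not the library's code: little-endian byte order is an assumption, the `Header` struct and both function names are hypothetical, and the second word (`rle_len | BW_ZRLE_FLAG?`) is passed through opaquely because the PR text does not say which bit `BW_ZRLE_FLAG` occupies.

```rust
const BW_CURSORS_FLAG: u32 = 1 << 31; // bit 31 of the primary_index field
const NUM_SAMPLES: usize = 7; // K - 1 for K = 8

/// Hypothetical parsed view of the bw block header prefix.
#[derive(Debug, PartialEq)]
struct Header {
    primary_index: u32,
    rle_word: u32, // rle_len plus its own flag bit, left uninterpreted here
    samples: Option<[u32; NUM_SAMPLES]>,
    payload_offset: usize, // where fse_data starts
}

/// Serialize the header prefix. Byte order is assumed little-endian.
fn encode_header(primary_index: u32, rle_word: u32, samples: Option<[u32; NUM_SAMPLES]>) -> Vec<u8> {
    assert!((primary_index & BW_CURSORS_FLAG) == 0, "bit 31 must be free");
    let first = if samples.is_some() {
        primary_index | BW_CURSORS_FLAG
    } else {
        primary_index // legacy layout: byte-identical to the old format
    };
    let mut out = Vec::new();
    out.extend_from_slice(&first.to_le_bytes());
    out.extend_from_slice(&rle_word.to_le_bytes());
    if let Some(s) = samples {
        for v in s {
            out.extend_from_slice(&v.to_le_bytes());
        }
    }
    out
}

/// Parse the header prefix; returns None if the buffer is too short.
fn decode_header(buf: &[u8]) -> Option<Header> {
    let word = |i: usize| -> Option<u32> {
        let b = buf.get(4 * i..4 * i + 4)?;
        Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    };
    let first = word(0)?;
    let rle_word = word(1)?;
    let mut payload_offset = 8;
    let samples = if (first & BW_CURSORS_FLAG) != 0 {
        let mut s = [0u32; NUM_SAMPLES];
        for (j, slot) in s.iter_mut().enumerate() {
            *slot = word(2 + j)?;
        }
        payload_offset += 4 * NUM_SAMPLES;
        Some(s)
    } else {
        None // legacy stream: decode via the serial single-cursor path
    };
    Some(Header {
        primary_index: first & !BW_CURSORS_FLAG,
        rle_word,
        samples,
        payload_offset,
    })
}
```

A decoder built this way branches once per block on `samples.is_some()`, so legacy streams never touch the multi-cursor path.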

Measured (Silesia 211.9 MB blob, CLI end-to-end `pz -p bw -d`, M5 Max)

7 reps/config, median, master (698193a) vs this branch back-to-back, CPU-only builds:

| Config | master | this PR | speedup | spread |
| --- | --- | --- | --- | --- |
| decode, 1 thread | 2.867 s (74 MB/s) | 1.846 s (115 MB/s) | 1.55x | ≤ 0.7% |
| decode, all cores | 0.412 s (514 MB/s) | 0.155 s (1367 MB/s) | 2.66x | ≤ 1.7% |

(One earlier all-cores master run showed 0.43–0.60 s under load from sibling agents; the table uses the clean back-to-back run. ST numbers were stable across all runs.)

The spike's 2.6x ST / 4.2x all-cores figures were for the iBWT stage in isolation (LF build + chase); the whole-pipeline CLI numbers above are diluted by the FSE/MTF/zRLE stages, which are untouched.

Ratio cost: exactly 28 B per 1 MiB block — +5,796 bytes total on Silesia = +0.00273% of raw, +0.0099% of compressed. Per-file ratios unchanged to 2 decimals (blob 27.79%). Encode time unchanged within noise (sample readoff is a binary search inside the existing SA scan).
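The ratio-cost figures can be checked by hand. The block count below is inferred from the quoted totals rather than stated in the PR, and the raw size uses the rounded 211.9 MB figure, so the last line is approximate:

```latex
\begin{align*}
\text{bytes per sampled block} &= (K-1)\times 4 = 7 \times 4 = 28 \\
\text{sampled blocks} &= 5796 / 28 = 207 \\
\text{overhead vs. raw} &\approx 5796 / (211.9\times 10^{6}) \approx 2.7\times 10^{-5} \approx 0.0027\%
\end{align*}
```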

Verification: all 12 Silesia files round-trip byte-exactly (new→new and legacy-stream→new decoder), plus edge cases (empty, 1 byte, n < K, 4095/4096/4097 around the sampling gate, prime lengths, all-zeros 1 MiB block, multi-block ragged tail). Full ./scripts/test.sh suite green.

Scoped out

bbw: also LF-based, but inverts per Lyndon factor (primary index 0 per factor), so it needs a per-factor sampling scheme (variable sample counts on the wire) or factor-interleaving in the decoder — nontrivial; left as follow-up, noted in the findings doc.
Packed-LF single-pass build is the new ST bottleneck per the spike (418 MB/s); follow-up lever.
GPU iBWT: passed its spike gate but rejected on end-to-end Pareto grounds — see findings doc.

🤖 Generated with Claude Code

…LI decode

The serial LF chase is latency-bound (one dependent load per byte). Run 8 interleaved cursors per bw block over a packed LF array ((byte<<24)|lf, one load per step), each decoding its own 1/8 output segment backwards — the load latency becomes memory-level parallelism. K=8, not 16: reproducible aarch64 register-spill cliff at K=16/32.

Cursor start states cannot be derived at decode time (the derivation is itself a full serial chase — measured), so the encoder reads them off the suffix array (free) and stores K-1=7 u32 samples in the block header behind BW_CURSORS_FLAG (bit 31 of the primary_index field, mirroring the BW_ZRLE_FLAG retrofit). Legacy streams decode unchanged via the serial path; regression-tested against a container fixture produced by master's binary. GPU-encoded blocks derive samples via a one-off encode-time chase.

Silesia blob (211.9 MB, CLI end-to-end, median of 7): decode 2.867->1.846 s single-thread (1.55x), 0.412->0.155 s all-cores (2.66x). Wire cost 28 B per 1 MiB block = +0.00273% raw; ratio unchanged at 27.8%. Encode unchanged within noise.

bbw (per-factor inversion) scoped out as follow-up. Spike: docs/design-docs/ibwt-cursor-findings.md (branch claude/spike-ibwt-cursors).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

ChrisLundquist merged commit 41d8951 into master Jun 10, 2026
4 checks passed

ChrisLundquist mentioned this pull request Jun 10, 2026

docs: maintainer pass — consume feedback backlog, reconcile GPU dead-ends + stale numbers #156

Merged

ChrisLundquist merged 1 commit into master from claude/bw-multicursor-ibwt