perf(validate): SIMD UTF-8 validator + measurement infrastructure (draft for analysis)#50
membphis wants to merge 14 commits into
Mirrors medium_resp.json byte-for-byte (60368 B) but replaces the content field with 15000 × "中 " repetitions. The repeated 3-byte BMP CJK character forces the AVX2 string validator off its ASCII fast path on every chunk, exposing the scalar fallback cost that the upcoming SIMD UTF-8 validator targets.
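A minimal sketch of why this content defeats an ASCII fast path, assuming a 32-byte (AVX2-width) chunk and looking only at the content string, not the surrounding JSON skeleton. The helper names are hypothetical and not taken from the repository.

```rust
/// AVX2 register width in bytes (assumed chunk size for this sketch).
const CHUNK: usize = 32;

/// Builds the content payload: `cycles` repetitions of U+4E2D followed by a space.
/// U+4E2D encodes as 3 bytes in UTF-8, so each repetition is 4 bytes.
fn cjk_content(cycles: usize) -> String {
    "\u{4E2D} ".repeat(cycles)
}

/// Returns (chunks containing at least one non-ASCII byte, total chunks).
fn non_ascii_chunks(bytes: &[u8]) -> (usize, usize) {
    let total = bytes.chunks(CHUNK).count();
    let non_ascii = bytes.chunks(CHUNK).filter(|c| !c.is_ascii()).count();
    (non_ascii, total)
}
```

With 15000 repetitions the payload is 60000 bytes, and every 32-byte chunk of it holds eight CJK characters, so no chunk qualifies for an ASCII-only skip.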
Measures Document::parse_with_options end-to-end across ASCII and CJK fixtures in both EAGER and LAZY mode (4 benches total). Throughput is reported in MB/s. The eager-vs-lazy delta per fixture is the value-level validation cost that future SIMD optimizations target; the ASCII benches serve as a regression guard.
make bench now runs both Rust criterion and Lua-vs-cjson, matching the composition pattern used by make test. make bench-rust is the inner-loop target for SIMD tuning; make bench-lua preserves the existing user-facing comparison harness behavior unchanged.
make bench is now the composite suite; make bench-lua preserves the prior Lua-vs-cjson behavior, which is what the benchmarks page documents. make bench-rust is the new Rust criterion entry point.
cargo build --release --benches catches stale bench source on every PR without paying the cost (or accepting the non-determinism) of actually running benchmarks in CI.
Reflects make bench composition + the bench-rust / bench-lua sub-targets, and lists the new medium_resp_cjk.json fixture under benches/. Caught by the final review on the bench-infrastructure PR.
proptest property: validate_span_scalar and validate_span_avx2 must return byte-identical Result for any byte sequence (2000 cases per CI run). Mirrors tests/scanner_crosscheck.rs pattern. Passes on current code (AVX2 falls back to scalar on non-ASCII, so trivially identical); will catch any divergence introduced by the upcoming SIMD UTF-8 validator rewrite.
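The cross-check idea can be sketched without the proptest crate. The block below is an illustration only: it compares a hand-written scalar UTF-8 validator against `std::str::from_utf8` on pseudo-random inputs. All names are hypothetical, the generator is a plain xorshift rather than proptest's strategies, and it covers UTF-8 well-formedness only (no JSON control-byte or backslash handling).

```rust
/// Bytes chosen to hit UTF-8 boundary cases often (assumption of this sketch).
const ALPHABET: [u8; 16] = [
    0x00, 0x41, 0x7F, 0x80, 0x8F, 0x90, 0xA0, 0xBF,
    0xC0, 0xC2, 0xE0, 0xE4, 0xED, 0xF0, 0xF4, 0xFF,
];

/// Tiny deterministic PRNG; the seed must be non-zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Oracle: Ok(()) or the offset of the first ill-formed sequence.
fn validate_reference(bytes: &[u8]) -> Result<(), usize> {
    std::str::from_utf8(bytes).map(|_| ()).map_err(|e| e.valid_up_to())
}

/// Candidate: a scalar validator written from the well-formed UTF-8 ranges.
fn validate_candidate(bytes: &[u8]) -> Result<(), usize> {
    let mut i = 0;
    while i < bytes.len() {
        // (sequence length, allowed range of the second byte)
        let (len, lo, hi): (usize, u8, u8) = match bytes[i] {
            0x00..=0x7F => {
                i += 1;
                continue;
            }
            0xC2..=0xDF => (2, 0x80, 0xBF),
            0xE0 => (3, 0xA0, 0xBF), // excludes overlong 3-byte forms
            0xE1..=0xEC | 0xEE..=0xEF => (3, 0x80, 0xBF),
            0xED => (3, 0x80, 0x9F), // excludes UTF-16 surrogates
            0xF0 => (4, 0x90, 0xBF), // excludes overlong 4-byte forms
            0xF1..=0xF3 => (4, 0x80, 0xBF),
            0xF4 => (4, 0x80, 0x8F), // excludes values above U+10FFFF
            _ => return Err(i),
        };
        if i + len > bytes.len() || !(lo..=hi).contains(&bytes[i + 1]) {
            return Err(i);
        }
        for k in 2..len {
            if (bytes[i + k] & 0xC0) != 0x80 {
                return Err(i);
            }
        }
        i += len;
    }
    Ok(())
}

/// Runs `cases` random inputs; returns the first input on which the two disagree.
fn crosscheck(cases: usize, seed: u64) -> Result<(), Vec<u8>> {
    let mut rng = XorShift64(seed);
    for _ in 0..cases {
        let len = (rng.next_u64() % 65) as usize;
        let input: Vec<u8> = (0..len)
            .map(|_| ALPHABET[(rng.next_u64() % ALPHABET.len() as u64) as usize])
            .collect();
        if validate_candidate(&input) != validate_reference(&input) {
            return Err(input);
        }
    }
    Ok(())
}
```

Returning the failing input mirrors what a property-test runner does when it records a regression seed.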
Three point tests for the three UTF-8 error classes the validator must reject: truncated multi-byte sequence (0xC3 with no continuation), overlong encoding (0xC0 0x80 = U+0000 in 2 bytes), and UTF-16 surrogate (0xED 0xA0 0x80 = U+D800). Tests pass on the current scalar fallback; they guard against regressions in the upcoming SIMD validator rewrite.
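For reference, the three byte sequences named above can be checked against the standard library's validator. This is a sketch using `std::str::from_utf8` as a stand-in; the project's own validator functions and their signatures are not shown in this PR text, so they are not used here.

```rust
/// Lead byte of a 2-byte sequence with no continuation byte.
const TRUNCATED: &[u8] = &[0xC3];
/// Overlong 2-byte encoding of U+0000.
const OVERLONG_NUL: &[u8] = &[0xC0, 0x80];
/// UTF-8-style encoding of the UTF-16 surrogate U+D800.
const SURROGATE_D800: &[u8] = &[0xED, 0xA0, 0x80];

/// True when the input is not well-formed UTF-8.
fn rejects(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_err()
}
```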
medium_resp_mixed.json: 60368 B, content cycles "中 hello é world 😀 "
(24 B/cycle × 2500 = 60000 B), exercising 1/2/3/4-byte UTF-8 sequences
with frequent script transitions.
medium_resp_emoji.json: 60368 B, content is "😀 " × 12000 (5 B/cycle),
exercising the 4-byte UTF-8 lookup4 path under maximum pressure (80%
high-bit ratio).
Same skeleton and total byte count as medium_resp{,_cjk}.json so the
bench's MB/s numbers are directly comparable across all four fixtures.
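The byte arithmetic in this message can be checked with a short sketch. The constants spell the two cycles with escapes so the encoded widths are unambiguous; the helper names are hypothetical.

```rust
/// "中 hello é world 😀 " written with escapes: U+4E2D (3 B), U+00E9 (2 B), U+1F600 (4 B).
const MIXED_CYCLE: &str = "\u{4E2D} hello \u{E9} world \u{1F600} ";
/// "😀 " written with escapes: 4-byte emoji plus one space.
const EMOJI_CYCLE: &str = "\u{1F600} ";

/// Total content bytes for `reps` repetitions of `cycle`.
fn content_len(cycle: &str, reps: usize) -> usize {
    cycle.len() * reps
}

/// Fraction of bytes with the top bit set (bytes inside multi-byte sequences).
fn high_bit_ratio(cycle: &str) -> f64 {
    let high = cycle.bytes().filter(|&b| (b & 0x80) != 0).count();
    high as f64 / cycle.len() as f64
}

/// Which encoded widths (1, 2, 3, 4 bytes) occur in `cycle`.
fn widths_present(cycle: &str) -> [bool; 4] {
    let mut seen = [false; 4];
    for c in cycle.chars() {
        seen[c.len_utf8() - 1] = true;
    }
    seen
}
```

Both cycles come out at 60000 content bytes, which is what keeps the MB/s figures comparable across fixtures.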
Adds parse/mixed/{eager,lazy} and parse/emoji/{eager,lazy} entries.
Mixed exercises script-transition cost in the validator; emoji
exercises 4-byte UTF-8 pressure. Lazy benches confirm scanner path is
content-agnostic (all four should land within ±3% of each other).
Numbers from this commit form the pre-simd baseline against which the
upcoming AVX2 validator rewrite will be measured.
Replaces the AVX2 fast-path-and-fallback layer with a three-tier
dispatch:
Tier 1: pure printable ASCII chunks skip wholesale (unchanged).
Tier 2: pure UTF-8 chunks (no control/backslash) run Lemire/Keiser
lookup4 — 5 SIMD ops/chunk, no scalar fallback.
Tier 3: chunks with control byte or backslash flush the lookup4
carry, then hand off to the scalar state machine.
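A scalar sketch of the tier decision for one 32-byte chunk, under the assumption that "control byte" means a byte below 0x20. It is stateless, so it ignores the cross-chunk carry and the prev_ended_ascii interlock; the enum and function names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    /// Tier 1: printable ASCII only, chunk can be skipped.
    AsciiSkip,
    /// Tier 2: non-ASCII bytes, but no control byte or backslash.
    SimdUtf8,
    /// Tier 3: at least one control byte or backslash.
    ScalarFallback,
}

/// Classifies a chunk by the byte classes it contains.
fn classify_chunk(chunk: &[u8; 32]) -> Tier {
    let has_ctrl_or_bs = chunk.iter().any(|&b| b < 0x20 || b == b'\\');
    let has_high = chunk.iter().any(|&b| b >= 0x80);
    if has_ctrl_or_bs {
        Tier::ScalarFallback
    } else if has_high {
        Tier::SimdUtf8
    } else {
        Tier::AsciiSkip
    }
}
```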
Carry state across chunks: prev_input (256-bit) feeds shift-with-carry
into lookup4's prev1/prev2/prev3 inputs. err_acc accumulates per-chunk
errors; checked at chunk boundary before Tier 3 handoff and at end of
main loop. prev_ended_ascii flag is the safety interlock for Tier 1.
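The shift-with-carry step can be modelled in scalar code. This is a sketch of the data movement only, assuming 32-byte chunks; the function name is hypothetical and the real code would use AVX2 permutes and byte-align instructions rather than a loop.

```rust
const LANES: usize = 32;

/// Scalar model of "shift right by n bytes, carrying in from the previous chunk".
/// out[i] is the byte that sits n positions before cur[i] in the input stream,
/// so the first n lanes come from the tail of `prev`.
fn shift_with_carry(prev: &[u8; LANES], cur: &[u8; LANES], n: usize) -> [u8; LANES] {
    assert!((1..=3).contains(&n), "lookup4 only needs prev1, prev2 and prev3");
    let mut out = [0u8; LANES];
    for i in 0..LANES {
        out[i] = if i >= n { cur[i - n] } else { prev[LANES - n + i] };
    }
    out
}
```

If a 3-byte character straddles the boundary (lead byte in the last lane of the previous chunk), lane 0 of prev1 holds that lead byte, which is how the continuation bytes at the start of the current chunk get validated against it.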
Lookup tables transcribed verbatim from simdjson's
utf8_lookup4_algorithm.h (Lemire & Keiser, 2020). Correctness verified
by tests/string_validate_crosscheck.rs (2000 proptest cases vs scalar
oracle) and three explicit bad-UTF-8 reject fixtures.
proptest captured the failing seed [0x00 × 30, 0x80, 0x00] that surfaced during the AVX2 lookup4 development — initial transcription of simdjson's tables had errors that this input revealed within seconds of running the cross-check. Committing the seed ensures the exact bug-pattern is replayed on every future run.
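The seed is exactly one 32-byte chunk with a lone continuation byte in lane 30. A sketch that rebuilds it and locates the ill-formed byte with the standard library follows; note the assumption that only UTF-8 well-formedness is checked here, whereas the project's validators may already reject the 0x00 control bytes earlier.

```rust
/// Rebuilds the recorded input: thirty 0x00 bytes, then 0x80, then 0x00.
fn regression_seed() -> Vec<u8> {
    let mut v = vec![0x00u8; 30];
    v.push(0x80);
    v.push(0x00);
    v
}

/// Offset of the first ill-formed UTF-8 sequence, or None if the input is valid.
fn first_invalid_offset(bytes: &[u8]) -> Option<usize> {
    std::str::from_utf8(bytes).err().map(|e| e.valid_up_to())
}
```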
Three code-quality fixes from review of the AVX2 lookup4 commit:
- lookup4_chunk no longer re-broadcasts BYTE1_HIGH / BYTE1_LOW / BYTE2_HIGH on every call; the three table vectors are computed once at the top of validate_span_avx2_impl and passed in.
- prev_ended_ascii's safety role at the Tier 1 skip is now documented inline (was only in the module-level doc).
- The 1-3 byte carry-prefix extraction from prev_input was duplicated in the Tier 3 fallback and the post-loop tail; extracted into a private extract_carry_prefix helper so both sites share one source of truth.
No behavior change; cross-check property test still passes 2000 cases.
The three-tier dispatch in 47b8fb5 broke the compiler's ability to collapse the high/ctrl/bs masks into a single vpmovmskb, regressing parse/ascii/eager by ~24% on Zen 2 (port-0-only vpmovmskb at 1/cycle was the bottleneck).
Fix: in the inner loop, build a 'cb_v = ctrl|bs' vector first, then detect any interesting byte (ctrl|bs|high) via vpor(cb_v, chunk_raw) followed by a single vpmovmskb. Only on the slow path do we compute a second movemask on cb_v to disambiguate Tier 2 (pure UTF-8) from Tier 3 (control byte or backslash).
ASCII inner loop: back to 1 vpmovmskb per chunk. CJK / mixed / emoji paths unchanged (they take the slow path on every chunk, but the single extra movemask there is dwarfed by lookup4's cost).
Cross-check property test still passes 2000 cases against scalar.
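The fused-mask dispatch can be modelled in scalar code. The sketch below emulates vpmovmskb and vpor on byte arrays and counts how many movemasks each path needs; it is an illustration with hypothetical names, not the AVX2 code from the commit.

```rust
const LANES: usize = 32;

/// Scalar model of vpmovmskb: bit i of the result is the top bit of byte i.
fn movemask(v: &[u8; LANES]) -> u32 {
    let mut m = 0u32;
    for (i, &b) in v.iter().enumerate() {
        m |= ((b >> 7) as u32) << i;
    }
    m
}

/// Model of cb_v: 0xFF in lanes holding a control byte (< 0x20) or a backslash.
fn ctrl_or_backslash(chunk: &[u8; LANES]) -> [u8; LANES] {
    let mut out = [0u8; LANES];
    for i in 0..LANES {
        if chunk[i] < 0x20 || chunk[i] == b'\\' {
            out[i] = 0xFF;
        }
    }
    out
}

/// Scalar model of vpor: lane-wise OR.
fn or_bytes(a: &[u8; LANES], b: &[u8; LANES]) -> [u8; LANES] {
    let mut out = [0u8; LANES];
    for i in 0..LANES {
        out[i] = a[i] | b[i];
    }
    out
}

/// Returns (tier, number of movemasks executed) for one chunk.
fn dispatch(chunk: &[u8; LANES]) -> (u8, u32) {
    let cb_v = ctrl_or_backslash(chunk);
    // OR-ing cb_v with the raw chunk sets the top bit for ctrl, backslash and
    // high-bit lanes alike, so one movemask answers "anything interesting?".
    let interesting = movemask(&or_bytes(&cb_v, chunk));
    if interesting == 0 {
        return (1, 1); // fast path: a single movemask
    }
    // Slow path only: a second movemask separates Tier 2 from Tier 3.
    let cb_mask = movemask(&cb_v);
    if cb_mask == 0 { (2, 2) } else { (3, 2) }
}
```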
Performance analysis summary: the 2-4× CJK/mixed/emoji targets are unattainable with SIMD UTF-8 validation alone.
Root cause
LAZY mode (~40 GiB/s on all fixtures) is the hard upper bound for EAGER. The structural scan + depth check alone costs ~1.5 μs for 60 KB. EAGER mode spends ~54.5 μs in …
On uniform CJK, the scalar validator already runs at ~1 cycle/byte. The CPU achieves this via perfect branch prediction (every lead byte starts a 3-byte sequence → same code path every time) and OOO execution running multiple iterations in flight. SIMD lookup4 does ~26 ops per 32 bytes ≈ 1.6 cycles/byte on Zen 2 after accounting for 3× cross-lane permutes ( …
What could still be squeezed out
The 4× spec target was based on an assumption that scalar is "slow", but on uniform non-ASCII data scalar is actually very close to optimal due to branch prediction + OOO. There is no architectural path to 2-4×.
Recommendation
Keep the bench infrastructure commits (valuable), revert or shelve the SIMD validator commits. If CJK parse throughput is a priority, the real 40× leverage is LAZY mode (40 GiB/s): validate on access rather than eagerly.
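The ceiling argument can be made concrete with an Amdahl-style calculation. This sketch takes the ~1.5 μs scan and ~54.5 μs figures above at face value and assumes the 54.5 μs is entirely value-level validation that a faster validator would shrink; the function names are hypothetical.

```rust
/// Throughput in GB/s (10^9 bytes per second) for `bytes` processed in `micros` μs.
fn throughput_gb_per_s(bytes: usize, micros: f64) -> f64 {
    bytes as f64 / (micros * 1e3)
}

/// End-to-end speedup when only the validation share runs `k` times faster.
fn overall_speedup(scan_us: f64, validate_us: f64, k: f64) -> f64 {
    (scan_us + validate_us) / (scan_us + validate_us / k)
}

/// Validation speedup needed to reach `target` end to end.
/// None means the target exceeds the scan-only (LAZY) bound.
fn required_validation_speedup(scan_us: f64, validate_us: f64, target: f64) -> Option<f64> {
    let budget = (scan_us + validate_us) / target - scan_us;
    if budget > 0.0 {
        Some(validate_us / budget)
    } else {
        None
    }
}
```

Under these assumptions a 4× end-to-end gain needs validation itself to run about 4.36× faster, and no validator speedup can push the total past roughly 37× because the scan cost remains.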
Closing: performance ceiling fully characterized. See analysis in the comment above.
Status
Draft for performance analysis. Correctness is verified across all paths; the AVX2 SIMD validator's CJK speedup landed at ~14% rather than the 4× target in the spec. Handing off for deeper analysis before deciding the path forward.
Summary
This branch combines two logically related sub-projects (~14 commits, would normally be 2 stacked PRs):
Bench infrastructure (commits 2db82d7..151d0ea, 6 commits)
- `benches/fixtures/medium_resp_cjk.json`: 60368 B, byte-aligned with the existing ASCII fixture
- `benches/parse_eager.rs`: measures `Document::parse_with_options` end-to-end on the 4-fixture × 2-mode matrix
- `bench` is now composite over `bench-rust` + `bench-lua`
- `cargo build --release --benches` step to prevent bit-rot
SIMD UTF-8 validator (commits c0aaedd..7e77b51, 8 commits)
- `tests/string_validate_crosscheck.rs` (proptest, 2000 cases): scalar ≡ avx2 on arbitrary byte sequences
Bench results (Zen 2 AMD EPYC-Rome, 4 vCPU)
Numbers compared against the pre-SIMD baseline saved at commit `f7f07af`:
ASCII passes. CJK / mixed / emoji eager paths improve, but far below the 4× / 2× / 2× targets in the spec.
What we know about the gap
A debug investigation (see `tests/string_validate_crosscheck.proptest-regressions` for the captured regression seed that helped pin down the algorithm):
Cross-lane permute cost on Zen 2. `lookup4` uses 3 `_mm256_permute2x128_si256` per chunk for the prev1/prev2/prev3 shifts. Zen 2's cross-lane permute has ~3 cycles of latency at 1 cycle throughput. In total, a Tier-2 chunk costs about 26 SIMD ops.
Scalar fallback is faster than expected on uniform CJK content. Branch predictor handles the regular 3-byte sequence pattern well; effective ~1 cycle/byte. The "scalar baseline is slow" assumption that justifies SIMD validation isn't holding up on this specific workload + CPU.
The 24% ASCII regression (now fixed) was traced to the dispatch losing the compiler's mask-fusion optimization. Pre-SIMD code used the signed-cmpgt trick to detect ctrl + high-bit bytes in one `vpmovmskb`; the new 3-tier dispatch broke that by needing `high` separated from `ctrl|bs`. The fix in `7e77b51` restores the fusion via `vpor(cb_v, chunk_raw)` before the single movemask, with a second movemask only on the slow path.
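The signed-compare trick mentioned in the last item relies on bytes with the top bit set being negative when read as `i8`, so a single signed "less than 0x20" test catches both control bytes and high-bit bytes. A scalar sketch with hypothetical names:

```rust
/// Scalar model of the signed compare: true when the byte, read as i8, is below 0x20.
fn signed_lt_0x20(b: u8) -> bool {
    (b as i8) < 0x20
}

/// The property the compare is meant to detect: control byte or high-bit byte.
fn ctrl_or_high(b: u8) -> bool {
    b < 0x20 || b >= 0x80
}

/// Exhaustively checks that the two predicates agree on every byte value.
fn trick_holds_for_all_bytes() -> bool {
    (0u8..=255).all(|b| signed_lt_0x20(b) == ctrl_or_high(b))
}
```

The backslash is not covered by this compare, which is why the three-tier dispatch needs the separate `ctrl|bs` vector.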
What the analyst might look at
Correctness verification
Test plan
🤖 Generated with Claude Code