perf: fuse eager validation passes + PSHUFB string classifier #52
membphis wants to merge 17 commits into
Merge validate_depth, validate_trailing, and validate_eager_values into a single fused pass. Replace AVX2 string validation with PSHUFB nibble-LUT byte classifier. Add AVX-512 dual path. Add SIMD number validation.
8 tasks: PSHUFB classifier, AVX2 string rewrite, AVX-512 path, SIMD number validation, pass fusion, doc.rs wiring, full test verification, CLAUDE.md update.
Add #[allow(dead_code)] to classify.rs module and to validate_trailing / validate_eager_values (kept for tests and planned future use). Fix overindented doc list item.
Pull request overview
This PR optimizes qjson’s eager validation path by (1) fusing multiple post-scan validation passes into a single traversal over the scanner’s indices, and (2) replacing the AVX2 string fast-path with a PSHUFB (nibble-LUT) byte classifier shared across validation routines.
Changes:
- Add `validate_eager_fused` to merge depth checking, trailing-content detection, and grammar/value validation into one O(indices) walk, and wire it into `Document::parse_with_options` for eager mode.
- Introduce a new PSHUFB nibble-LUT classifier module and rewrite AVX2 string validation to use its attention mask.
- Add design/plan documentation for the SIMD + pass-fusion work; update `.gitignore`.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/validate/strings/avx2.rs` | Reworks AVX2 string-span validation to iterate “interesting” bytes using the classifier-derived mask. |
| `src/validate/mod.rs` | Adds `validate_eager_fused` and test coverage for fused behavior; keeps older passes for lazy mode/tests. |
| `src/validate/classify.rs` | New shared PSHUFB nibble-LUT classifier + exhaustive LUT correctness tests. |
| `src/doc.rs` | Switches eager parsing to the fused validator; retains lazy depth-only validation. |
| `docs/superpowers/specs/2026-05-22-fuse-eager-simd-design.md` | Design spec describing pass fusion + SIMD classifier approach. |
| `docs/superpowers/plans/2026-05-22-fuse-eager-simd-plan.md` | Implementation plan/checklist for the overall optimization effort. |
| `.gitignore` | Adds an ignore rule for `docs/superpowers/specs/`. |
    strings::validate_string_span(&buf[pos + 1 .. close])?;

    let cur = stack.last_mut().ok_or(qjson_err::QJSON_PARSE_ERROR)?;
    match *cur {
        CtxKind::ObjAfterOpen | CtxKind::ObjAfterComma => {
            *cur = CtxKind::ObjAfterKey;
        }
        CtxKind::Top
        | CtxKind::ArrAfterOpen
        | CtxKind::ArrAfterComma
        | CtxKind::ObjAfterColon => {
    // token, validate it, then check for trailing content beyond it.
    if matches!(*stack.last().unwrap(), CtxKind::Top) && depth == 0 {
        let mut scan = prev_end;
        while scan < buf.len() && is_ws(buf[scan]) { scan += 1; }
        if scan < buf.len() {
            let mut end = scan;
            while end < buf.len() && !is_ws(buf[end]) { end += 1; }
            validate_scalar(&buf[scan..end])?;
            *stack.last_mut().unwrap() = CtxKind::TopDone;

            let mut p = end;
            while p < buf.len() && is_ws(buf[p]) { p += 1; }
            if p < buf.len() {
                return Err(qjson_err::QJSON_TRAILING_CONTENT);
            }
The eager decode path (`Document::parse_with_options` in `src/doc.rs`) runs **3 independent passes** over the `indices` array after structural scanning:

1. `validate_depth`: depth counting
2. `validate_trailing`: reject trailing non-whitespace
3. `validate_eager_values`: grammar state machine + string validation + number validation

Each pass is a scalar O(indices) walk. Additionally, string validation SIMD (`strings/avx2.rs`) is conservative: it hands off to scalar on the *first* interesting byte found (backslash, control, or high-bit), leaving most of the SIMD register width unused on mixed content. Number validation has no SIMD path at all.
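As a rough illustration of what fusing passes means, the sketch below folds a depth limit, bracket matching, and a trailing-structure check into one loop over structural tokens. It is a simplified model, not the PR's `validate_eager_fused`: the `CheckError` enum, the `fused_check` name, and the bracket-only token stream are assumptions made for the example, and the real validator also handles commas, colons, strings, and scalars.

```rust
/// Hypothetical error kinds for this sketch (not qjson's error codes).
#[derive(Debug, PartialEq)]
enum CheckError {
    TooDeep,
    Mismatch,
    Trailing,
}

/// One walk over bracket tokens that enforces a depth limit, checks that
/// brackets pair up, and rejects structure after the top-level value closes.
fn fused_check(tokens: &[u8], max_depth: usize) -> Result<(), CheckError> {
    let mut stack: Vec<u8> = Vec::new();
    let mut top_done = false;
    for &t in tokens {
        if top_done {
            // The top-level value already closed; anything else is trailing.
            return Err(CheckError::Trailing);
        }
        match t {
            b'{' | b'[' => {
                if stack.len() == max_depth {
                    return Err(CheckError::TooDeep);
                }
                stack.push(t);
            }
            b'}' | b']' => {
                let open = stack.pop().ok_or(CheckError::Mismatch)?;
                let paired = (open == b'{' && t == b'}') || (open == b'[' && t == b']');
                if !paired {
                    return Err(CheckError::Mismatch);
                }
                if stack.is_empty() {
                    top_done = true;
                }
            }
            // This sketch only models brackets.
            _ => return Err(CheckError::Mismatch),
        }
    }
    if stack.is_empty() {
        Ok(())
    } else {
        Err(CheckError::Mismatch)
    }
}
```

Running three separate walks would produce the same verdicts; the single loop simply visits each token once.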
| 3 | Printable ASCII (0x20..0x7E, excluding backslash) |

**Algorithm per 32-byte chunk:**
1. Split each byte into high-nibble and low-nibble via shift + mask.
2. `_mm256_shuffle_epi8(lo_lut, lo_nibble)` and `_mm256_shuffle_epi8(hi_lut, hi_nibble)` (the table is the first operand, the nibble indices the second).
3. AND low and high LUT results → per-byte class bitmask.
4. If any byte has bit 0 set → `QJSON_INVALID_STRING` (control char).
5. If bits 1 and 2 are zero for every byte → pure printable ASCII, advance 32 bytes.
6. Otherwise: scan class bitmask for backslash positions, validate escape sequences; for high-bit bytes, run SIMD-enhanced UTF-8 validation.

Key improvement: the classifier tells us **exactly which bytes need what kind of attention**, rather than a binary "there's a problem here". Multiple backslashes in one chunk are all located without re-scanning. High-bit bytes are identified by position, enabling batch UTF-8 validation.
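The nibble-LUT idea can be checked one byte at a time without any SIMD. The sketch below is a scalar model: it looks up the low and high nibble in two 16-entry tables and ANDs the results, which is what the two shuffles compute for 32 bytes at once. The table contents are assumptions chosen for this example (the PR's `STR_LO_TABLE` and `STR_HI_TABLE` are not shown in this page), and only the three "interesting" class bits are modelled.

```rust
/// Class bits as described in the design notes: bit 0 = control character,
/// bit 1 = backslash, bit 2 = high-bit byte.
const CTRL: u8 = 1 << 0;
const BS: u8 = 1 << 1;
const HIGH: u8 = 1 << 2;

/// Hypothetical high-nibble table: 0x0_ and 0x1_ are control characters,
/// 0x5_ may be a backslash, 0x8_..0xF_ have the high bit set.
const HI_TABLE: [u8; 16] = [
    CTRL, CTRL, 0, 0, 0, BS, 0, 0,
    HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH,
];

/// CTRL and HIGH do not depend on the low nibble, so every entry allows them.
const WILD: u8 = CTRL | HIGH;

/// Hypothetical low-nibble table: only low nibble 0xC can be a backslash (0x5C).
const LO_TABLE: [u8; 16] = [
    WILD, WILD, WILD, WILD, WILD, WILD, WILD, WILD,
    WILD, WILD, WILD, WILD, WILD | BS, WILD, WILD, WILD,
];

/// Scalar model of the two-table lookup: a class bit survives only if both
/// the low-nibble and the high-nibble entries agree on it.
fn classify_byte(b: u8) -> u8 {
    LO_TABLE[(b & 0x0F) as usize] & HI_TABLE[(b >> 4) as usize]
}

/// Direct definition of the same classes, used as a reference.
fn classify_reference(b: u8) -> u8 {
    match b {
        0x00..=0x1F => CTRL,
        b'\\' => BS,
        0x80..=0xFF => HIGH,
        _ => 0,
    }
}
```

Comparing `classify_byte` with `classify_reference` over all 256 byte values is a cheap way to validate a table pair before moving it into SIMD registers.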
ddb53eb to c62b28d
- README: update Status section with fused validation + PSHUFB description
- benchmarks.md: add eager validation micro-benchmark section (13-15% improvement), update observation #3
- CLAUDE.md: update Phase 1 architecture and Layout sections
Note 13-15% eager validation improvement from fused pass + PSHUFB classifier. Clarify that Lua bench numbers already include this.
        } else {
            CtxKind::ArrAfterOpen
        });
    }
Update README and benchmarks.md tables with actual throughput, speedup, and memory delta numbers from make bench on current branch (includes fused validation + decode deduplication).
- Increase ROUNDS from 5 to 10 for noise reduction
- Switch from median to mean ops/s across rounds
- Update all 3 tables (throughput, speedup, memory) with fresh make bench data on current branch
~100 KB array of objects with Chinese text, emoji, and mixed ASCII/CJK field names. Stresses PSHUFB byte classifier (high-bit bytes) and UTF-8 validation path. Pre: 4,720 ops/s → Post: 5,064 ops/s (+7.3%)
100 KB array of objects with Chinese text and emoji. Stresses PSHUFB UTF-8/high-bit classification. Pre-opt: 4,720 ops/s → Post-opt: 4,605 ops/s. qjson memory delta: +26 KB (cjson: +17 MB).
| | 5m | 31.4× | 4.9× | 31.5× | 4.9× | | ||
| | 10m | 29.5× | 3.8× | 31.0× | 4.0× | | ||
|
|
||
| ## Results — memory delta (KB retained after 5 rounds) |
- Replace safe_sub-based truncation with integer multiples of cjk_body to avoid splitting multi-byte sequences
- Skip simdjson for cjk scenario (no_simdjson flag)
- Add safe_sub utility (unused now, kept for potential future use)
- Update cjk-100k data: qjson.parse mean 5,018 ops/s (2.3x vs cjson)
Remove no_simdjson flag. simdjson mean 2,367 ops/s on cjk-100k. qjson.parse/cjson = 2.3×, qjson.parse/simdjson = 2.1×.
    pub(crate) unsafe fn classify_chunk(chunk: __m256i, lo_lut: __m256i, hi_lut: __m256i) -> __m256i {
        let nib_mask = _mm256_set1_epi8(0x0Fu8 as i8);

        // Low nibble of every byte.
        let lo_nibs = _mm256_and_si256(chunk, nib_mask);
        // The 32-bit shift drags bits across byte boundaries; the mask keeps
        // only each byte's own high nibble.
        let hi_shift = _mm256_srli_epi32::<4>(chunk);
        let hi_nibs = _mm256_and_si256(hi_shift, nib_mask);

        // Table lookups: the LUT is the first operand, the nibble indices the second.
        let lo_class = _mm256_shuffle_epi8(lo_lut, lo_nibs);
        let hi_class = _mm256_shuffle_epi8(hi_lut, hi_nibs);

        // A class bit survives only if both nibble lookups agree on it.
        _mm256_and_si256(lo_class, hi_class)
    }

    /// Build a 32-byte `__m256i` from a 16-entry nibble LUT by duplicating
    /// the table into both 128-bit lanes.
    #[cfg(all(target_arch = "x86_64", feature = "avx2"))]
    unsafe fn make_lut(table: &[u8; 16]) -> __m256i {
        let t = table;
        _mm256_setr_epi8(
            t[0] as i8, t[1] as i8, t[2] as i8, t[3] as i8,
            t[4] as i8, t[5] as i8, t[6] as i8, t[7] as i8,
            t[8] as i8, t[9] as i8, t[10] as i8, t[11] as i8,
            t[12] as i8, t[13] as i8, t[14] as i8, t[15] as i8,
            t[0] as i8, t[1] as i8, t[2] as i8, t[3] as i8,
            t[4] as i8, t[5] as i8, t[6] as i8, t[7] as i8,
            t[8] as i8, t[9] as i8, t[10] as i8, t[11] as i8,
            t[12] as i8, t[13] as i8, t[14] as i8, t[15] as i8,
        )
    }

    /// Classify a 32-byte chunk for string validation.
    ///
    /// Returns a bitmask (one bit per byte) where set bits indicate bytes
    /// that have any interesting class bit (CTRL | BS | HIGH). Zero means
    /// the entire chunk is pure printable ASCII without escapes or UTF-8.
    #[cfg(all(target_arch = "x86_64", feature = "avx2"))]
    #[target_feature(enable = "avx2")]
    pub(crate) unsafe fn classify_str_chunk(chunk: __m256i) -> u32 {
        classify_str_mask(chunk)
    }

    /// Returns a bitmask of bytes that match CTRL | BS | HIGH.
    #[cfg(all(target_arch = "x86_64", feature = "avx2"))]
    #[target_feature(enable = "avx2")]
    pub(crate) unsafe fn classify_str_mask(chunk: __m256i) -> u32 {
        let lo_lut = make_lut(&STR_LO_TABLE);
        let hi_lut = make_lut(&STR_HI_TABLE);
        let classes = classify_chunk(chunk, lo_lut, hi_lut);
        // Bytes whose class is zero are uninteresting; collect one bit per byte.
        let zero = _mm256_cmpeq_epi8(classes, _mm256_setzero_si256());
        let zero_mask = _mm256_movemask_epi8(zero) as u32;
        zero_mask ^ 0xFFFF_FFFF // invert: 1 = interesting
    }
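Once a 32-bit "interesting" mask like the one returned above is available, a common way to consume it is to walk its set bits from lowest to highest. The helper below is a minimal sketch of that loop; the name `interesting_offsets` is hypothetical, and the PR's `strings/avx2.rs` may iterate differently.

```rust
/// Return the byte offsets (0..32) whose bits are set in `mask`, lowest first.
fn interesting_offsets(mut mask: u32) -> Vec<usize> {
    let mut out = Vec::with_capacity(mask.count_ones() as usize);
    while mask != 0 {
        // Index of the lowest set bit is the offset of the next interesting byte.
        out.push(mask.trailing_zeros() as usize);
        // Clear the lowest set bit so the next iteration finds the following one.
        mask &= mask - 1;
    }
    out
}
```

For example, a mask of `0b1010_0001` yields offsets 0, 5 and 7, so a chunk with several backslashes is handled without rescanning the bytes in between.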
| Numbers below come from one such run. | ||
|
|
||
| ## Results — throughput (median ops/s) | ||
|
|
||
| Each row is "parse + access request fields" on the named payload. | ||
|
|
||
| | Scenario | Size | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` | | ||
| |---|---:|---:|---:|---:|---:|---:| | ||
| | small | 2.1 KB | 94,075 | 108,108 | 127,214 | 120,398 | 203,666 | | ||
| | medium | 60.4 KB | 9,041 | 83,043 | 123,487 | 214,500 | 214,408 | | ||
| | github-100k | 100 KB | 2,238 | 2,047 | 6,010 | 5,994 | 6,701 | | ||
| | 100k | 100 KB | 5,302 | 32,248 | 109,649 | 102,564 | 114,548 | | ||
| | 200k | 200 KB | 2,659 | 19,040 | 90,090 | 92,251 | 106,383 | | ||
| | 500k | 500 KB | 1,052 | 7,062 | 34,722 | 35,336 | 37,453 | | ||
| | 1m | 1.00 MB | 517 | 3,538 | 16,520 | 16,988 | 17,261 | | ||
| | 2m | 2.00 MB | 258 | 2,026 | 9,021 | 8,580 | 9,033 | | ||
| | 5m | 5.00 MB | 102 | 663 | 2,982 | 3,728 | 3,829 | | ||
| | 10m | 10.00 MB | 50 | 402 | 1,899 | 1,918 | 1,925 | | ||
| | interleaved (100k/200k/500k/1m, cycled) | — | 1,141 | 9,544 | 34,043 | 33,611 | 32,752 | | ||
| |---|---|---:|---:|---:|---:|---:|---:| | ||
| | small | 2.1 KB | 100,127 | 109,588 | 130,867 | 105,038 | 210,886 | | ||
| | medium | 60.4 KB | 8,701 | 77,936 | 135,700 | 177,650 | 164,142 | | ||
| | github-100k | 100 KB | 2,106 | 2,247 | 5,964 | 5,900 | 6,321 | | ||
| | cjk-100k | 99 KB | 2,203 | 2,367 | 4,965 | 5,363 | 6,063 | | ||
| | 100k | 100 KB | 4,985 | 32,232 | 130,621 | 125,348 | 145,613 | |
| @@ -115,18 +117,19 @@ | |||
| from the last round may still be included. | |||
|
|
|||
| | Scenario | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` | | |||
| |---|---:|---:|---:|---:|---:| | |||
| | small | +15,493 | +15,500 | +4,066 | +15,116 | +11,140 | | |||
| | medium | +1,955 | +2,660 | +333 | +1,114 | +1,120 | | |||
| | github-100k | +12,018 | +3,527 | +14 | +536 | +230 | | |||
| | 100k | +485 | +748 | +67 | +692 | +229 | | |||
| | 200k | +392 | +523 | +34 | +346 | +112 | | |||
| | 500k | +577 | +630 | +14 | +139 | +45 | | |||
| | 1m | +1,082 | +1,121 | +10 | +104 | +34 | | |||
| | 2m | +1,155 | +1,248 | +14 | +208 | +45 | | |||
| | 5m | +1,316 | +1,538 | +14 | +400 | +45 | | |||
| | 10m | +1,583 | +2,014 | +14 | +708 | +45 | | |||
| | interleaved | +3,356 | +4,404 | +268 | +2,771 | +897 | | |||
| |---|---|---:|---:|---:|---:|---:| | |||
| | small | -2,359 | +8,055 | +8,159 | +8,643 | +2,701 | | |||
| ### Speed-up vs. baselines | ||
|
|
||
| | Scenario | `qjson.parse` / cjson | `qjson.parse` / simdjson | `qjson.decode + access content` / cjson | `qjson.decode + access content` / simdjson | | ||
| |---|---:|---:|---:|---:| | ||
| | small | 1.4× | 1.2× | 1.3× | 1.1× | | ||
| | medium | 13.7× | 1.5× | 23.7× | 2.6× | | ||
| | github-100k | 2.7× | 2.9× | 2.7× | 2.9× | | ||
| | 100k | 20.7× | 3.4× | 19.3× | 3.2× | | ||
| | 200k | 33.9× | 4.7× | 34.7× | 4.8× | | ||
| | 500k | 33.0× | 4.9× | 33.6× | 5.0× | | ||
| | 1m | 32.0× | 4.7× | 32.9× | 4.8× | | ||
| | 2m | 35.0× | 4.5× | 33.3× | 4.2× | | ||
| | 5m | 29.2× | 4.5× | 36.5× | 5.6× | | ||
| | 10m | 38.0× | 4.7× | 38.4× | 4.8× | | ||
| |---|---|---:|---:|---:|---:| | ||
| | small | 1.3× | 1.2× | 1.0× | 1.0× | | ||
| | medium | 15.6× | 1.7× | 20.4× | 2.3× | | ||
| | github-100k | 2.8× | 2.7× | 2.8× | 2.6× | | ||
| | cjk-100k | 2.3× | 2.1× | 2.4× | 2.3× | | ||
| | 100k | 26.2× | 4.1× | 25.1× | 3.9× | |
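The speed-up columns appear to be plain ratios of ops/s figures rounded to one decimal. A short sketch, assuming that reading, reproduces the cjk-100k row from the throughput numbers quoted in this PR (cjson 2,203, simdjson 2,367, `qjson.parse` 4,965, `qjson.decode + access content` 5,363); the `speedup` helper is hypothetical.

```rust
/// Format the ratio of two throughput figures the way the table shows it.
fn speedup(candidate_ops: f64, baseline_ops: f64) -> String {
    format!("{:.1}×", candidate_ops / baseline_ops)
}
```

With those inputs, `speedup(4965.0, 2203.0)` gives `2.3×` and `speedup(4965.0, 2367.0)` gives `2.1×`, matching the first two cells of the cjk-100k row.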
- Fused validator: check trailing content before string/scalar validation
to preserve old validate_trailing error-code precedence
(e.g. '\"\\q" x' → QJSON_TRAILING_CONTENT, not QJSON_INVALID_STRING)
- Fused validator: detect TopDone+structural as QJSON_TRAILING_CONTENT
(e.g. '42 {}' → trailing, not PARSE_ERROR)
- classify_str_mask: precompute LUT vectors as 32-byte aligned statics,
load with _mm256_load_si256 instead of rebuilding per call
- benchmarks.md: fix table separator column counts, update
'5 rounds'→'10 rounds', 'median'→'mean' in section titles
- Add 3 regression tests for error-code precedence
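The "32-byte aligned statics" item can be illustrated without intrinsics. The sketch below builds a lane-duplicated table at compile time and stores it in a static whose type forces 32-byte alignment, which is the alignment `_mm256_load_si256` requires. `Aligned32`, `dup_lanes`, and the example table values are assumptions for illustration, not the PR's actual definitions.

```rust
/// Wrapper that forces 32-byte alignment on the contained bytes.
#[repr(C, align(32))]
struct Aligned32([u8; 32]);

/// Copy a 16-entry nibble table into both 128-bit halves at compile time.
const fn dup_lanes(table: [u8; 16]) -> Aligned32 {
    let mut out = [0u8; 32];
    let mut i = 0;
    while i < 16 {
        out[i] = table[i];
        out[i + 16] = table[i];
        i += 1;
    }
    Aligned32(out)
}

/// Placeholder table contents, used only to make the layout visible.
const EXAMPLE_TABLE: [u8; 16] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];

/// Built once by the compiler, so no per-call table construction is needed.
static EXAMPLE_LUT: Aligned32 = dup_lanes(EXAMPLE_TABLE);
```

The table is duplicated because the AVX2 byte shuffle looks up within each 128-bit lane independently, so both lanes need their own copy.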
Summary
Fuse 3 post-scan eager validation passes (validate_depth, validate_trailing, validate_eager_values) into a single O(indices) traversal, and replace the old AVX2 string validator with a PSHUFB nibble-LUT byte classifier.
Changes
- `validate/mod.rs`, `doc.rs`: merge depth checking and trailing-content detection into the grammar state machine. Eliminates 2 redundant indices traversals.
- `classify.rs`: new shared byte-classification module using `_mm256_shuffle_epi8` nibble-LUT. Classifies all 32 bytes in a chunk simultaneously instead of doing 3 separate SIMD compares.
- `strings/avx2.rs`: uses classifier instead of "find-first-interesting-then-scalar" approach. Escape sequences and UTF-8 triggers are processed in-batch.

Performance (1MB JSON, 10-run avg ± stddev)
Testing