perf: fuse eager validation passes + PSHUFB string classifier by membphis · Pull Request #52 · api7/lua-qjson

membphis · 2026-05-22T17:32:02Z

Summary

Fuse 3 post-scan eager validation passes (validate_depth, validate_trailing, validate_eager_values) into a single O(indices) traversal, and replace the old AVX2 string validator with a PSHUFB nibble-LUT byte classifier.

Changes

Pass fusion (validate/mod.rs, doc.rs): merge depth checking and trailing-content detection into the grammar state machine. Eliminates 2 redundant indices traversals.
PSHUFB classifier (classify.rs): new shared byte-classification module using _mm256_shuffle_epi8 nibble-LUT. Classifies all 32 bytes in a chunk simultaneously instead of doing 3 separate SIMD compares.
AVX2 string validator rewrite (strings/avx2.rs): uses classifier instead of "find-first-interesting-then-scalar" approach. Escape sequences and UTF-8 triggers are processed in-batch.
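The pass fusion in the first item can be illustrated with a minimal sketch. It assumes a simplified model in which the scanner yields the byte offsets of bracket characters only; `fused_check`, `FusedError`, and `max_depth` are hypothetical names for illustration, not qjson's API, and the grammar and scalar checks are left out.

```rust
/// Hypothetical error type for this sketch (not qjson's error codes).
#[derive(Debug, PartialEq)]
enum FusedError {
    TooDeep,
    Unbalanced,
    TrailingContent,
}

/// Walk the structural offsets once, checking nesting depth, bracket balance,
/// and trailing content in the same loop. `indices` is assumed to hold the
/// ascending byte offsets of structural characters in `buf`.
fn fused_check(buf: &[u8], indices: &[usize], max_depth: usize) -> Result<(), FusedError> {
    let mut stack: Vec<u8> = Vec::new();
    let mut root_closed = false;
    for &i in indices {
        // Any structural byte after the root value closed is trailing content.
        if root_closed {
            return Err(FusedError::TrailingContent);
        }
        match buf[i] {
            open @ (b'{' | b'[') => {
                if stack.len() == max_depth {
                    return Err(FusedError::TooDeep);
                }
                stack.push(open);
            }
            close @ (b'}' | b']') => {
                let expected = if close == b'}' { b'{' } else { b'[' };
                if stack.pop() != Some(expected) {
                    return Err(FusedError::Unbalanced);
                }
                root_closed = stack.is_empty();
            }
            _ => {} // commas, colons, quotes: grammar checks omitted here
        }
    }
    if !stack.is_empty() {
        return Err(FusedError::Unbalanced);
    }
    if root_closed {
        // The loop guarantees the last offset is the root's closing bracket.
        let last = *indices.last().unwrap();
        let tail = &buf[last + 1..];
        if tail.iter().any(|&b| !matches!(b, b' ' | b'\t' | b'\n' | b'\r')) {
            return Err(FusedError::TrailingContent);
        }
    }
    Ok(())
}
```

For example, `fused_check(b"[] x", &[0, 1], 8)` reports `TrailingContent` without a second walk over `indices`.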

Performance (1MB JSON, 10-run avg ± stddev)

| Payload | Before | After | Improvement |
|---|---:|---:|---:|
| GitHub-style REST API (pure ASCII) | 1688 ± 97 us | 1462 ± 39 us | 13.4% |
| Escape-heavy (\n \t \ \uXXXX) | 912 ± 77 us | 776 ± 30 us | 14.9% |

Testing

All 312 existing tests pass (default features + scalar-only)
Clippy clean (-D warnings)
Scanner crosscheck (scalar vs AVX2) passes
JSONTestSuite conformance preserved
Third-party fixture tests (cJSON, simdjson) pass

coderabbitai · 2026-05-22T17:32:10Z


📥 Commits

Reviewing files that changed from the base of the PR and between 8e49d18 and 3f6822a.

📒 Files selected for processing (9)

CLAUDE.md
README.md
benches/lua_bench.lua
docs/benchmarks.md
src/decode/number.rs
src/doc.rs
src/validate/classify.rs
src/validate/mod.rs
src/validate/strings/avx2.rs


Copilot

Pull request overview

This PR optimizes qjson’s eager validation path by (1) fusing multiple post-scan validation passes into a single traversal over the scanner’s indices, and (2) replacing the AVX2 string fast-path with a PSHUFB (nibble-LUT) byte classifier shared across validation routines.

Changes:

Add validate_eager_fused to merge depth checking, trailing-content detection, and grammar/value validation into one O(indices) walk, and wire it into Document::parse_with_options for eager mode.
Introduce a new PSHUFB nibble-LUT classifier module and rewrite AVX2 string validation to use its attention mask.
Add design/plan documentation for the SIMD + pass-fusion work; update .gitignore.

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| `src/validate/strings/avx2.rs` | Reworks AVX2 string-span validation to iterate “interesting” bytes using the classifier-derived mask. |
| `src/validate/mod.rs` | Adds `validate_eager_fused` and test coverage for fused behavior; keeps older passes for lazy mode/tests. |
| `src/validate/classify.rs` | New shared PSHUFB nibble-LUT classifier + exhaustive LUT correctness tests. |
| `src/doc.rs` | Switches eager parsing to the fused validator; retains lazy depth-only validation. |
| `docs/superpowers/specs/2026-05-22-fuse-eager-simd-design.md` | Design spec describing pass fusion + SIMD classifier approach. |
| `docs/superpowers/plans/2026-05-22-fuse-eager-simd-plan.md` | Implementation plan/checklist for the overall optimization effort. |
| `.gitignore` | Adds an ignore rule for `docs/superpowers/specs/`. |

+                strings::validate_string_span(&buf[pos + 1 .. close])?;
+
+                let cur = stack.last_mut().ok_or(qjson_err::QJSON_PARSE_ERROR)?;
+                match *cur {
+                    CtxKind::ObjAfterOpen | CtxKind::ObjAfterComma => {
+                        *cur = CtxKind::ObjAfterKey;
+                    }
+                    CtxKind::Top
+                    | CtxKind::ArrAfterOpen
+                    | CtxKind::ArrAfterComma
+                    | CtxKind::ObjAfterColon => {


+    // token, validate it, then check for trailing content beyond it.
+    if matches!(*stack.last().unwrap(), CtxKind::Top) && depth == 0 {
+        let mut scan = prev_end;
+        while scan < buf.len() && is_ws(buf[scan]) { scan += 1; }
+        if scan < buf.len() {
+            let mut end = scan;
+            while end < buf.len() && !is_ws(buf[end]) { end += 1; }
+            validate_scalar(&buf[scan..end])?;
+            *stack.last_mut().unwrap() = CtxKind::TopDone;
+
+            let mut p = end;
+            while p < buf.len() && is_ws(buf[p]) { p += 1; }
+            if p < buf.len() {
+                return Err(qjson_err::QJSON_TRAILING_CONTENT);
+            }


+pub(crate) unsafe fn classify_str_mask(chunk: __m256i) -> u32 {
+    let lo_lut     = make_lut(&STR_LO_TABLE);
+    let hi_lut     = make_lut(&STR_HI_TABLE);
+    let classes    = classify_chunk(chunk, lo_lut, hi_lut);


+The eager decode path (`Document::parse_with_options` in `src/doc.rs`) runs **3 independent passes** over the `indices` array after structural scanning:
+
+1. `validate_depth` — depth counting
+2. `validate_trailing` — reject trailing non-whitespace
+3. `validate_eager_values` — grammar state machine + string validation + number validation
+
+Each pass is a scalar O(indices) walk. Additionally, string validation SIMD (`strings/avx2.rs`) is conservative: it hands off to scalar on the *first* interesting byte found (backslash, control, or high-bit), leaving most of the SIMD register width unused on mixed content. Number validation has no SIMD path at all.


+| 3   | Printable ASCII (0x20..0x7E, excluding backslash) |
+
+**Algorithm per 32-byte chunk:**
+1. Split each byte into high-nibble and low-nibble via shift + mask.
+2. `_mm256_shuffle_epi8(lo_lut, lo_nibble)` and `_mm256_shuffle_epi8(hi_lut, hi_nibble)` (table first, nibble indices second).
+3. AND low and high LUT results → per-byte class bitmask.
+4. If any bit 0 set → `QJSON_INVALID_STRING` (control char).
+5. If bits 1 and 2 are zero → pure printable ASCII, advance 32 bytes.
+6. Otherwise: scan class bitmask for backslash positions, validate escape sequences; for high-bit bytes, run SIMD-enhanced UTF-8 validation.
+
+Key improvement: the classifier tells us **exactly which bytes need what kind of attention**, rather than a binary "there's a problem here". Multiple backslashes in one chunk are all located without re-scanning. High-bit bytes are identified by position, enabling batch UTF-8 validation.
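The per-chunk algorithm above can be modeled in scalar Rust, which makes the two-table AND easier to check than the SIMD form. This is a sketch under stated assumptions: the bit assignments (`CTRL`, `BS`, `HIGH`) and the table contents are illustrative values chosen to satisfy the description, not the actual `STR_LO_TABLE` / `STR_HI_TABLE` from `classify.rs`, and `attention_mask` is a hypothetical stand-in for the 32-lane classifier.

```rust
const CTRL: u8 = 1 << 0; // control byte, 0x00..=0x1F
const BS: u8 = 1 << 1; // backslash, 0x5C
const HIGH: u8 = 1 << 2; // high-bit byte, 0x80..=0xFF

/// Indexed by the low nibble. CTRL and HIGH do not depend on the low nibble,
/// so they are set everywhere; BS is only possible when the low nibble is 0xC.
const LO_TABLE: [u8; 16] = {
    let mut t = [CTRL | HIGH; 16];
    t[0xC] = CTRL | HIGH | BS;
    t
};

/// Indexed by the high nibble: 0x0/0x1 are controls, 0x5 may be a backslash,
/// 0x8..=0xF are high-bit bytes.
const HI_TABLE: [u8; 16] = [
    CTRL, CTRL, 0, 0, 0, BS, 0, 0,
    HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH, HIGH,
];

/// Class bits for one byte: the AND of the two table lookups.
fn classify_byte(b: u8) -> u8 {
    LO_TABLE[(b & 0x0F) as usize] & HI_TABLE[(b >> 4) as usize]
}

/// Scalar model of the 32-lane classifier: bit `i` of the result is set
/// when byte `i` of the chunk has any class bit set.
fn attention_mask(chunk: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in chunk.iter().enumerate() {
        if classify_byte(b) != 0 {
            mask |= 1u32 << i;
        }
    }
    mask
}
```

A byte such as `P` (0x50) shares its high nibble with the backslash but has low nibble 0, so the AND clears the `BS` bit; that is why two lookups suffice where the earlier approach needed separate compares.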


- README: update Status section with fused validation + PSHUFB description
- benchmarks.md: add eager validation micro-benchmark section (13-15% improvement), update observation #3
- CLAUDE.md: update Phase 1 architecture and Layout sections

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

+                        } else {
+                            CtxKind::ArrAfterOpen
+                        });
+                    }


+pub(crate) unsafe fn classify_str_mask(chunk: __m256i) -> u32 {
+    let lo_lut     = make_lut(&STR_LO_TABLE);
+    let hi_lut     = make_lut(&STR_HI_TABLE);
+    let classes    = classify_chunk(chunk, lo_lut, hi_lut);
+    let zero       = _mm256_cmpeq_epi8(classes, _mm256_setzero_si256());
+    let zero_mask  = _mm256_movemask_epi8(zero) as u32;
+    zero_mask ^ 0xFFFF_FFFF   // invert: 1 = interesting
+}


Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

+pub(crate) unsafe fn classify_str_mask(chunk: __m256i) -> u32 {
+    let lo_lut     = make_lut(&STR_LO_TABLE);
+    let hi_lut     = make_lut(&STR_HI_TABLE);
+    let classes    = classify_chunk(chunk, lo_lut, hi_lut);


 | Scenario | Size | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` |
-|---|---:|---:|---:|---:|---:|---:|
-| small      |   2.1 KB |  94,075 | 108,108 | 127,214 | 120,398 | 203,666 |
-| medium     |  60.4 KB |   9,041 |  83,043 | 123,487 | 214,500 | 214,408 |
-| github-100k |   100 KB |   2,238 |   2,047 |   6,010 |   5,994 |   6,701 |
-| 100k       |   100 KB |   5,302 |  32,248 | 109,649 | 102,564 | 114,548 |
-| 200k       |   200 KB |   2,659 |  19,040 |  90,090 |  92,251 | 106,383 |
-| 500k       |   500 KB |   1,052 |   7,062 |  34,722 |  35,336 |  37,453 |
-| 1m         |  1.00 MB |     517 |   3,538 |  16,520 |  16,988 |  17,261 |
-| 2m         |  2.00 MB |     258 |   2,026 |   9,021 |   8,580 |   9,033 |
-| 5m         |  5.00 MB |     102 |     663 |   2,982 |   3,728 |   3,829 |
-| 10m        | 10.00 MB |      50 |     402 |   1,899 |   1,918 |   1,925 |
-| interleaved (100k/200k/500k/1m, cycled) | — | 1,141 | 9,544 | 34,043 | 33,611 | 32,752 |
+|---|---|---:|---:|---:|---:|---:|---:|
+| small      |   2.1 KB | 100,127 | 109,588 | 130,867 | 105,038 | 210,886 |


-| 2m     | 35.0× |  4.5× | 33.3× |  4.2× |
-| 5m     | 29.2× |  4.5× | 36.5× |  5.6× |
-| 10m    | 38.0× |  4.7× | 38.4× |  4.8× |
+|---|---|---:|---:|---:|---:|


 | Scenario | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` |
-|---|---:|---:|---:|---:|---:|
-| small      | +15,493 | +15,500 | +4,066 | +15,116 | +11,140 |
-| medium     |  +1,955 |  +2,660 |   +333 |  +1,114 |  +1,120 |
-| github-100k | +12,018 | +3,527 |    +14 |    +536 |    +230 |
-| 100k       |    +485 |   +748 |    +67 |    +692 |    +229 |
-| 200k       |    +392 |   +523 |    +34 |    +346 |    +112 |
-| 500k       |    +577 |   +630 |    +14 |    +139 |     +45 |
-| 1m         |  +1,082 | +1,121 |    +10 |    +104 |     +34 |
-| 2m         |  +1,155 | +1,248 |    +14 |    +208 |     +45 |
-| 5m         |  +1,316 | +1,538 |    +14 |    +400 |     +45 |
-| 10m        |  +1,583 | +2,014 |    +14 |    +708 |     +45 |
-| interleaved | +3,356 | +4,404 |   +268 |  +2,771 |    +897 |
+|---|---|---:|---:|---:|---:|---:|
+| small      | -2,359 |  +8,055 |  +8,159 |  +8,643 |  +2,701 |


Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

+pub(crate) unsafe fn classify_str_mask(chunk: __m256i) -> u32 {
+    let lo_lut     = make_lut(&STR_LO_TABLE);
+    let hi_lut     = make_lut(&STR_HI_TABLE);
+    let classes    = classify_chunk(chunk, lo_lut, hi_lut);
+    let zero       = _mm256_cmpeq_epi8(classes, _mm256_setzero_si256());
+    let zero_mask  = _mm256_movemask_epi8(zero) as u32;
+    zero_mask ^ 0xFFFF_FFFF   // invert: 1 = interesting


-| 5m         |  5.00 MB |     102 |     663 |   2,982 |   3,728 |   3,829 |
-| 10m        | 10.00 MB |      50 |     402 |   1,899 |   1,918 |   1,925 |
-| interleaved (100k/200k/500k/1m, cycled) | — | 1,141 | 9,544 | 34,043 | 33,611 | 32,752 |
+|---|---|---:|---:|---:|---:|---:|---:|


+| 5m     | 31.4× |  4.9× | 31.5× |  4.9× |
+| 10m    | 29.5× |  3.8× | 31.0× |  4.0× |

 ## Results — memory delta (KB retained after 5 rounds)


-| 5m         |  +1,316 | +1,538 |    +14 |    +400 |     +45 |
-| 10m        |  +1,583 | +2,014 |    +14 |    +708 |     +45 |
-| interleaved | +3,356 | +4,404 |   +268 |  +2,771 |    +897 |
+|---|---|---:|---:|---:|---:|---:|


@@ -80,33 +80,35 @@ Numbers below come from one such run.
 Each row is "parse + access request fields" on the named payload.


- Replace safe_sub-based truncation with integer multiples of cjk_body to avoid splitting multi-byte sequences
- Skip simdjson for cjk scenario (no_simdjson flag)
- Add safe_sub utility (unused now, kept for potential future use)
- Update cjk-100k data: qjson.parse mean 5,018 ops/s (2.3x vs cjson)

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.

+    // token, validate it, then check for trailing content beyond it.
+    if matches!(*stack.last().unwrap(), CtxKind::Top) && depth == 0 {
+        let mut scan = prev_end;
+        while scan < buf.len() && is_ws(buf[scan]) { scan += 1; }
+        if scan < buf.len() {
+            let mut end = scan;
+            while end < buf.len() && !is_ws(buf[end]) { end += 1; }
+            validate_scalar(&buf[scan..end])?;
+            *stack.last_mut().unwrap() = CtxKind::TopDone;
+
+            let mut p = end;
+            while p < buf.len() && is_ws(buf[p]) { p += 1; }
+            if p < buf.len() {
+                return Err(qjson_err::QJSON_TRAILING_CONTENT);
+            }


+pub(crate) unsafe fn classify_chunk(chunk: __m256i, lo_lut: __m256i, hi_lut: __m256i) -> __m256i {
+    let nib_mask = _mm256_set1_epi8(0x0Fu8 as i8);
+
+    let lo_nibs   = _mm256_and_si256(chunk, nib_mask);
+    let hi_shift  = _mm256_srli_epi32::<4>(chunk);
+    let hi_nibs   = _mm256_and_si256(hi_shift, nib_mask);
+
+    let lo_class = _mm256_shuffle_epi8(lo_lut, lo_nibs);
+    let hi_class = _mm256_shuffle_epi8(hi_lut, hi_nibs);
+
+    _mm256_and_si256(lo_class, hi_class)
+}
+
+/// Build a 32-byte `__m256i` from a 16-entry nibble LUT by duplicating
+/// the table into both 128-bit lanes.
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+unsafe fn make_lut(table: &[u8; 16]) -> __m256i {
+    let t = table;
+    _mm256_setr_epi8(
+        t[0]  as i8, t[1]  as i8, t[2]  as i8, t[3]  as i8,
+        t[4]  as i8, t[5]  as i8, t[6]  as i8, t[7]  as i8,
+        t[8]  as i8, t[9]  as i8, t[10] as i8, t[11] as i8,
+        t[12] as i8, t[13] as i8, t[14] as i8, t[15] as i8,
+        t[0]  as i8, t[1]  as i8, t[2]  as i8, t[3]  as i8,
+        t[4]  as i8, t[5]  as i8, t[6]  as i8, t[7]  as i8,
+        t[8]  as i8, t[9]  as i8, t[10] as i8, t[11] as i8,
+        t[12] as i8, t[13] as i8, t[14] as i8, t[15] as i8,
+    )
+}
+
+/// Classify a 32-byte chunk for string validation.
+///
+/// Returns a bitmask (one bit per byte) where set bits indicate bytes
+/// that have any interesting class bit (CTRL | BS | HIGH). Zero means
+/// the entire chunk is pure printable ASCII without escapes or UTF-8.
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+#[target_feature(enable = "avx2")]
+pub(crate) unsafe fn classify_str_chunk(chunk: __m256i) -> u32 {
+    classify_str_mask(chunk)
+}
+
+/// Returns a bitmask of bytes that match CTRL | BS | HIGH.
+#[cfg(all(target_arch = "x86_64", feature = "avx2"))]
+#[target_feature(enable = "avx2")]
+pub(crate) unsafe fn classify_str_mask(chunk: __m256i) -> u32 {
+    let lo_lut     = make_lut(&STR_LO_TABLE);
+    let hi_lut     = make_lut(&STR_HI_TABLE);
+    let classes    = classify_chunk(chunk, lo_lut, hi_lut);
+    let zero       = _mm256_cmpeq_epi8(classes, _mm256_setzero_si256());
+    let zero_mask  = _mm256_movemask_epi8(zero) as u32;
+    zero_mask ^ 0xFFFF_FFFF   // invert: 1 = interesting
+}
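One way a caller might consume the returned mask is to visit only the set bits, lowest first, so that several backslashes or high-bit bytes in one chunk are found without rescanning. The helper below is a hypothetical illustration, not code from this PR; a real validator would more likely handle each position inline than collect them into a `Vec`.

```rust
/// Collect the byte positions flagged in a 32-bit attention mask, lowest first.
fn interesting_positions(mut mask: u32) -> Vec<u32> {
    let mut out = Vec::new();
    while mask != 0 {
        // Index of the lowest set bit = position of the next flagged byte.
        out.push(mask.trailing_zeros());
        // Clear the lowest set bit; safe because mask is non-zero here.
        mask &= mask - 1;
    }
    out
}
```

The loop runs once per flagged byte rather than once per byte in the chunk, so a chunk with no flagged bytes costs a single comparison.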


 Numbers below come from one such run.

 ## Results — throughput (median ops/s)

 Each row is "parse + access request fields" on the named payload.

 | Scenario | Size | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` |
-|---|---:|---:|---:|---:|---:|---:|
-| small      |   2.1 KB |  94,075 | 108,108 | 127,214 | 120,398 | 203,666 |
-| medium     |  60.4 KB |   9,041 |  83,043 | 123,487 | 214,500 | 214,408 |
-| github-100k |   100 KB |   2,238 |   2,047 |   6,010 |   5,994 |   6,701 |
-| 100k       |   100 KB |   5,302 |  32,248 | 109,649 | 102,564 | 114,548 |
-| 200k       |   200 KB |   2,659 |  19,040 |  90,090 |  92,251 | 106,383 |
-| 500k       |   500 KB |   1,052 |   7,062 |  34,722 |  35,336 |  37,453 |
-| 1m         |  1.00 MB |     517 |   3,538 |  16,520 |  16,988 |  17,261 |
-| 2m         |  2.00 MB |     258 |   2,026 |   9,021 |   8,580 |   9,033 |
-| 5m         |  5.00 MB |     102 |     663 |   2,982 |   3,728 |   3,829 |
-| 10m        | 10.00 MB |      50 |     402 |   1,899 |   1,918 |   1,925 |
-| interleaved (100k/200k/500k/1m, cycled) | — | 1,141 | 9,544 | 34,043 | 33,611 | 32,752 |
+|---|---|---:|---:|---:|---:|---:|---:|
+| small      |   2.1 KB | 100,127 | 109,588 | 130,867 | 105,038 | 210,886 |
+| medium     |  60.4 KB |   8,701 |  77,936 | 135,700 | 177,650 | 164,142 |
+| github-100k |   100 KB |   2,106 |   2,247 |   5,964 |   5,900 |   6,321 |
+| cjk-100k  |    99 KB |   2,203 |   2,367 |   4,965 |   5,363 |   6,063 |
+| 100k       |   100 KB |   4,985 |  32,232 | 130,621 | 125,348 | 145,613 |


@@ -115,18 +117,19 @@
 from the last round may still be included.

 | Scenario | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` |
-|---|---:|---:|---:|---:|---:|
-| small      | +15,493 | +15,500 | +4,066 | +15,116 | +11,140 |
-| medium     |  +1,955 |  +2,660 |   +333 |  +1,114 |  +1,120 |
-| github-100k | +12,018 | +3,527 |    +14 |    +536 |    +230 |
-| 100k       |    +485 |   +748 |    +67 |    +692 |    +229 |
-| 200k       |    +392 |   +523 |    +34 |    +346 |    +112 |
-| 500k       |    +577 |   +630 |    +14 |    +139 |     +45 |
-| 1m         |  +1,082 | +1,121 |    +10 |    +104 |     +34 |
-| 2m         |  +1,155 | +1,248 |    +14 |    +208 |     +45 |
-| 5m         |  +1,316 | +1,538 |    +14 |    +400 |     +45 |
-| 10m        |  +1,583 | +2,014 |    +14 |    +708 |     +45 |
-| interleaved | +3,356 | +4,404 |   +268 |  +2,771 |    +897 |
+|---|---|---:|---:|---:|---:|---:|
+| small      | -2,359 |  +8,055 |  +8,159 |  +8,643 |  +2,701 |


 ### Speed-up vs. baselines

 | Scenario | `qjson.parse` / cjson | `qjson.parse` / simdjson | `qjson.decode + access content` / cjson | `qjson.decode + access content` / simdjson |
-|---|---:|---:|---:|---:|
-| small  |  1.4× |  1.2× |  1.3× |  1.1× |
-| medium | 13.7× |  1.5× | 23.7× |  2.6× |
-| github-100k | 2.7× |  2.9× | 2.7× |  2.9× |
-| 100k   | 20.7× |  3.4× | 19.3× |  3.2× |
-| 200k   | 33.9× |  4.7× | 34.7× |  4.8× |
-| 500k   | 33.0× |  4.9× | 33.6× |  5.0× |
-| 1m     | 32.0× |  4.7× | 32.9× |  4.8× |
-| 2m     | 35.0× |  4.5× | 33.3× |  4.2× |
-| 5m     | 29.2× |  4.5× | 36.5× |  5.6× |
-| 10m    | 38.0× |  4.7× | 38.4× |  4.8× |
+|---|---|---:|---:|---:|---:|
+| small  |  1.3× |  1.2× |  1.0× |  1.0× |
+| medium | 15.6× |  1.7× | 20.4× |  2.3× |
+| github-100k | 2.8× |  2.7× | 2.8× |  2.6× |
+| cjk-100k  | 2.3× |  2.1× | 2.4× |  2.3× |
+| 100k   | 26.2× |  4.1× | 25.1× |  3.9× |


- Fused validator: check trailing content before string/scalar validation to preserve old validate_trailing error-code precedence (e.g. '\"\\q" x' → QJSON_TRAILING_CONTENT, not QJSON_INVALID_STRING)
- Fused validator: detect TopDone+structural as QJSON_TRAILING_CONTENT (e.g. '42 {}' → trailing, not PARSE_ERROR)
- classify_str_mask: precompute LUT vectors as 32-byte aligned statics, load with _mm256_load_si256 instead of rebuilding per call
- benchmarks.md: fix table separator column counts, update '5 rounds'→'10 rounds', 'median'→'mean' in section titles
- Add 3 regression tests for error-code precedence
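The aligned-static idea in the third item might look roughly like the following. This is a sketch, not the code from the commit: `Lut32`, `dup_lanes`, and `EXAMPLE_TABLE` are hypothetical names, and the table contents are placeholders. It relies on two properties of the intrinsics: `_mm256_load_si256` requires a 32-byte aligned address, and `_mm256_shuffle_epi8` shuffles within each 128-bit lane independently, so the 16-entry table has to appear in both halves.

```rust
/// 32-byte aligned wrapper so the table can be read with an aligned 256-bit load.
#[repr(C, align(32))]
struct Lut32([u8; 32]);

/// Duplicate a 16-entry nibble table into both 128-bit halves at compile time.
const fn dup_lanes(table: [u8; 16]) -> Lut32 {
    let mut out = [0u8; 32];
    let mut i = 0;
    while i < 16 {
        out[i] = table[i];
        out[i + 16] = table[i];
        i += 1;
    }
    Lut32(out)
}

/// Placeholder contents for illustration only (not qjson's STR_LO_TABLE).
const EXAMPLE_TABLE: [u8; 16] = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15];

/// Built once at compile time; nothing is rebuilt per call.
static EXAMPLE_LUT: Lut32 = dup_lanes(EXAMPLE_TABLE);
```

With a static like this, the per-call cost drops from materializing the vector with `_mm256_setr_epi8` to a single aligned load from `EXAMPLE_LUT.0.as_ptr()`.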

Copilot AI review requested due to automatic review settings May 22, 2026 17:32

Copilot started reviewing on behalf of membphis May 22, 2026 17:32

membphis added 6 commits May 22, 2026 17:34

docs: add eager SIMD optimization design spec

88bf464

Merge validate_depth, validate_trailing, and validate_eager_values into a single fused pass. Replace AVX2 string validation with PSHUFB nibble-LUT byte classifier. Add AVX-512 dual path. Add SIMD number validation.

docs: add fuse-eager SIMD implementation plan

9e9ac75

8 tasks: PSHUFB classifier, AVX2 string rewrite, AVX-512 path, SIMD number validation, pass fusion, doc.rs wiring, full test verification, CLAUDE.md update.

feat: add PSHUFB nibble-LUT byte classifier module

f6c9524

perf: rewrite AVX2 string validation with PSHUFB classifier

bc6f8a6

perf: add validate_eager_fused merging depth+trailing+grammar

f06c5fa

chore: fix clippy dead_code and doc warnings

af73d5b

Add #[allow(dead_code)] to classify.rs module and to validate_trailing / validate_eager_values (kept for tests and planned future use). Fix overindented doc list item.

Copilot AI reviewed May 22, 2026

chore: fix clippy warnings from rebase (collapsed if, same_item_push,…

c62b28d

… PI literal)

membphis force-pushed the fuse-eager-passes branch from ddb53eb to c62b28d Compare May 22, 2026 17:38

chore: remove docs/superpowers and gitignore entry

5d0bc0f

Copilot AI review requested due to automatic review settings May 22, 2026 17:40

Copilot started reviewing on behalf of membphis May 22, 2026 17:40

membphis added 2 commits May 22, 2026 17:41

docs: add Rust micro-benchmark reference to README

4d52f6b

Note 13-15% eager validation improvement from fused pass + PSHUFB classifier. Clarify that Lua bench numbers already include this.

Copilot AI reviewed May 22, 2026

membphis added 2 commits May 22, 2026 17:46

docs: update benchmark data from current make bench run

38a6be3

Update README and benchmarks.md tables with actual throughput, speedup, and memory delta numbers from make bench on current branch (includes fused validation + decode deduplication).

docs: refresh bench data with 10-round mean values

24ff77d

- Increase ROUNDS from 5 to 10 for noise reduction
- Switch from median to mean ops/s across rounds
- Update all 3 tables (throughput, speedup, memory) with fresh make bench data on current branch

Copilot AI review requested due to automatic review settings May 22, 2026 17:54

Copilot started reviewing on behalf of membphis May 22, 2026 17:55

Copilot AI reviewed May 22, 2026

membphis added 2 commits May 22, 2026 18:04

bench: add cjk-100k scenario with CJK+emoji content

9f5f0f1

~100 KB array of objects with Chinese text, emoji, and mixed ASCII/CJK field names. Stresses PSHUFB byte classifier (high-bit bytes) and UTF-8 validation path. Pre: 4,720 ops/s → Post: 5,064 ops/s (+7.3%)

docs: add cjk-100k CJK+emoji benchmark row to tables

96c13c9

100 KB array of objects with Chinese text and emoji. Stresses PSHUFB UTF-8/high-bit classification. Pre-opt: 4,720 ops/s → Post-opt: 4,605 ops/s. qjson memory delta: +26 KB (cjson: +17 MB).

Copilot AI review requested due to automatic review settings May 22, 2026 18:06

Copilot started reviewing on behalf of membphis May 22, 2026 18:06

Copilot AI reviewed May 22, 2026

membphis added 2 commits May 22, 2026 18:23

bench: add simdjson data for cjk-100k scenario

708625b

Remove no_simdjson flag. simdjson mean 2,367 ops/s on cjk-100k. qjson.parse/cjson = 2.3×, qjson.parse/simdjson = 2.1×.

Copilot AI review requested due to automatic review settings May 22, 2026 18:27

Copilot started reviewing on behalf of membphis May 22, 2026 18:27

Copilot AI reviewed May 22, 2026

membphis wants to merge 17 commits into main from fuse-eager-passes

2 participants
