diff --git a/README.md b/README.md
index 59d7738..bbfbc48 100644
--- a/README.md
+++ b/README.md
@@ -100,16 +100,15 @@ LD_LIBRARY_PATH="$PWD/target/release" \
 
 `qjson` vs. `lua-cjson` and `lua-resty-simdjson` on multimodal chat-completion
 payloads, "parse + access model, temperature, and all
-messages[*].content paths" workload (median ops/s under OpenResty LuaJIT 2.1,
-AMD EPYC Rome (Zen 2, 4 vCPUs); 5 rounds, deterministic payload):
+messages[*].content paths" workload (median ops/s, 10 rounds, under
+OpenResty LuaJIT 2.1, AMD EPYC-Rome, 4 vCPUs; deterministic payload):
 
 | Size | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | speedup vs. cjson |
 |---:|---:|---:|---:|---:|---:|
-| 2 KB | 94,075 | 108,108 | 127,214 | 120,398 | 1.4× / 1.3× |
-| 60 KB | 9,041 | 83,043 | 123,487 | 214,500 | 13.7× / 23.7× |
-| 100 KB | 5,302 | 32,248 | 109,649 | 102,564 | 20.7× / 19.3× |
-| 1 MB | 517 | 3,538 | 16,520 | 16,988 | 32.0× / 32.9× |
-| 10 MB | 50 | 402 | 1,899 | 1,918 | 38.0× / 38.4× |
+| 2 KB | 94,701 | 108,921 | 128,103 | 89,294 | 1.4× / 0.9× |
+| 100 KB | 5,214 | 36,914 | 136,986 | 110,497 | 26.3× / 21.2× |
+| 1 MB | 505 | 3,894 | 16,234 | 16,648 | 32.1× / 32.9× |
+| 10 MB | 50 | 369 | 1,602 | 1,429 | 32.0× / 28.6× |
 
 `qjson.parse` wins because it skips building a Lua table for the parts you
 never read; `qjson.decode + t.field` adds a cjson-shaped table proxy on top
@@ -162,4 +161,4 @@ qjson_doc* doc = qjson_parse_ex(buf, len, &opts, &err);
 There are no known strict-mode structural grammar gaps at this time:
 `tests/json_test_suite.rs::KNOWN_N_FAILURES` is empty, and the RFC 8259
 suite has no ignored structural cases. Update this section whenever a
-temporary conformance exception is introduced.
\ No newline at end of file
+temporary conformance exception is introduced.
diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index fe6f09f..72a7038 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -14,10 +14,10 @@ Lua-table baselines.
 
 | | |
 |---|---|
-| Host CPU | AMD EPYC Rome (Zen 2), 4 vCPUs, AVX2 + PCLMUL |
-| Memory | 8 GiB |
-| OS | Ubuntu 24.04, x86_64 |
-| Runtime | OpenResty `resty` 0.29 / OpenResty 1.21.4.4 / LuaJIT 2.1.1723681758 |
+| Host CPU | AMD EPYC-Rome, 4 vCPUs, AVX2 + PCLMUL |
+| Memory | 7 GiB |
+| OS | Ubuntu 26.04 LTS, Linux 7.0.0-15-generic, x86_64 |
+| Runtime | OpenResty `resty` 0.29 / OpenResty 1.29.2.3 / LuaJIT 2.1.ROLLING |
 | `qjson` | this repo, release build, AVX2 + PCLMUL scanner active |
 | `lua-cjson` | vendored `openresty/lua-cjson` |
 | `lua-resty-simdjson` | `Kong/lua-resty-simdjson` commit `77322db640927c14968f1314a9fb1bb2bc084015`, installed under OpenResty lualib |
@@ -30,7 +30,7 @@ The harness lives at `benches/lua_bench.lua`. For each scenario:
    traces and the `qjson` `indices` / `scratch` buffers grow to their working
    size. Warmup is excluded from timing and the memory delta.
 2. `collectgarbage("collect")` baseline.
-3. 5 rounds × N iterations of the workload; report the **median** ops/s
+3. 10 rounds × N iterations of the workload; report the **median** ops/s
    across rounds (mean + range also reported in the raw output).
 4. Final `collectgarbage("count")` to capture the post-run memory delta in
    KB. The harness does not force a final collection after timing, so
@@ -80,33 +80,33 @@ Numbers below come from one such run. Each row is "parse + access request
 fields" on the named payload.
 
 | Scenario | Size | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` |
 |---|---:|---:|---:|---:|---:|---:|
-| small | 2.1 KB | 94,075 | 108,108 | 127,214 | 120,398 | 203,666 |
-| medium | 60.4 KB | 9,041 | 83,043 | 123,487 | 214,500 | 214,408 |
-| github-100k | 100 KB | 2,238 | 2,047 | 6,010 | 5,994 | 6,701 |
-| 100k | 100 KB | 5,302 | 32,248 | 109,649 | 102,564 | 114,548 |
-| 200k | 200 KB | 2,659 | 19,040 | 90,090 | 92,251 | 106,383 |
-| 500k | 500 KB | 1,052 | 7,062 | 34,722 | 35,336 | 37,453 |
-| 1m | 1.00 MB | 517 | 3,538 | 16,520 | 16,988 | 17,261 |
-| 2m | 2.00 MB | 258 | 2,026 | 9,021 | 8,580 | 9,033 |
-| 5m | 5.00 MB | 102 | 663 | 2,982 | 3,728 | 3,829 |
-| 10m | 10.00 MB | 50 | 402 | 1,899 | 1,918 | 1,925 |
-| interleaved (100k/200k/500k/1m, cycled) | — | 1,141 | 9,544 | 34,043 | 33,611 | 32,752 |
+| small | 2.1 KB | 94,701 | 108,921 | 128,103 | 89,294 | 187,631 |
+| medium | 60.4 KB | 8,850 | 82,699 | 120,598 | 194,856 | 135,759 |
+| github-100k | 100 KB | 1,901 | 1,939 | 6,041 | 6,009 | 6,435 |
+| 100k | 100 KB | 5,214 | 36,914 | 136,986 | 110,497 | 126,263 |
+| 200k | 200 KB | 2,575 | 18,018 | 81,967 | 84,317 | 95,420 |
+| 500k | 500 KB | 1,043 | 7,262 | 36,563 | 35,971 | 38,023 |
+| 1m | 1.00 MB | 505 | 3,894 | 16,234 | 16,648 | 16,968 |
+| 2m | 2.00 MB | 254 | 2,065 | 8,247 | 8,407 | 8,838 |
+| 5m | 5.00 MB | 100 | 652 | 3,228 | 3,225 | 3,355 |
+| 10m | 10.00 MB | 50 | 369 | 1,602 | 1,429 | 1,774 |
+| interleaved (100k/200k/500k/1m, cycled) | — | 1,121 | 9,142 | 28,854 | 33,088 | 32,568 |
 
 ### Speed-up vs. baselines
 
 | Scenario | `qjson.parse` / cjson | `qjson.parse` / simdjson | `qjson.decode + access content` / cjson | `qjson.decode + access content` / simdjson |
 |---|---:|---:|---:|---:|
-| small | 1.4× | 1.2× | 1.3× | 1.1× |
-| medium | 13.7× | 1.5× | 23.7× | 2.6× |
-| github-100k | 2.7× | 2.9× | 2.7× | 2.9× |
-| 100k | 20.7× | 3.4× | 19.3× | 3.2× |
-| 200k | 33.9× | 4.7× | 34.7× | 4.8× |
-| 500k | 33.0× | 4.9× | 33.6× | 5.0× |
-| 1m | 32.0× | 4.7× | 32.9× | 4.8× |
-| 2m | 35.0× | 4.5× | 33.3× | 4.2× |
-| 5m | 29.2× | 4.5× | 36.5× | 5.6× |
-| 10m | 38.0× | 4.7× | 38.4× | 4.8× |
+| small | 1.4× | 1.2× | 0.9× | 0.8× |
+| medium | 13.6× | 1.5× | 22.0× | 2.4× |
+| github-100k | 3.2× | 3.1× | 3.2× | 3.1× |
+| 100k | 26.3× | 3.7× | 21.2× | 3.0× |
+| 200k | 31.8× | 4.5× | 32.7× | 4.7× |
+| 500k | 35.1× | 5.0× | 34.5× | 5.0× |
+| 1m | 32.1× | 4.2× | 32.9× | 4.3× |
+| 2m | 32.5× | 4.0× | 33.1× | 4.1× |
+| 5m | 32.3× | 5.0× | 32.3× | 4.9× |
+| 10m | 32.0× | 4.3× | 28.6× | 3.9× |
 
 ## Results — memory delta (KB retained after 5 rounds)
 
@@ -115,18 +115,18 @@ the timing rounds without forcing a final collection, so short-lived
 garbage from the last round may still be included.
 
 | Scenario | cjson | simdjson | `qjson.parse` | `qjson.decode + access content` | `qjson.decode + qjson.encode` |
 |---|---:|---:|---:|---:|---:|
-| small | +15,493 | +15,500 | +4,066 | +15,116 | +11,140 |
-| medium | +1,955 | +2,660 | +333 | +1,114 | +1,120 |
-| github-100k | +12,018 | +3,527 | +14 | +536 | +230 |
-| 100k | +485 | +748 | +67 | +692 | +229 |
-| 200k | +392 | +523 | +34 | +346 | +112 |
-| 500k | +577 | +630 | +14 | +139 | +45 |
-| 1m | +1,082 | +1,121 | +10 | +104 | +34 |
-| 2m | +1,155 | +1,248 | +14 | +208 | +45 |
-| 5m | +1,316 | +1,538 | +14 | +400 | +45 |
-| 10m | +1,583 | +2,014 | +14 | +708 | +45 |
-| interleaved | +3,356 | +4,404 | +268 | +2,771 | +897 |
+| small | -2,472 | +12,360 | +8,159 | +8,651 | +2,698 |
+| medium | +3,850 | +5,258 | +124 | +2,228 | +2,234 |
+| github-100k | +20,255 | +18,823 | +29 | +1,072 | +452 |
+| 100k | +867 | +1,393 | +138 | +1,384 | +452 |
+| 200k | +584 | +846 | +67 | +692 | +223 |
+| 500k | +654 | +759 | +27 | +277 | +89 |
+| 1m | +1,140 | +1,218 | +20 | +208 | +67 |
+| 2m | +1,284 | +1,472 | +31 | +409 | +89 |
+| 5m | +1,607 | +2,051 | +27 | +855 | +89 |
+| 10m | +2,143 | +3,004 | +27 | +1,736 | +89 |
+| interleaved | +4,888 | +6,983 | +536 | +5,537 | +1,788 |
 
 `qjson.parse` retention is essentially constant across payload size: the
 only GC-rooted state is the reusable `indices: Vec` and `scratch` buffers.
@@ -139,16 +139,17 @@ key into the Lua table heap.
 
 1. **`qjson` is fastest once payloads move beyond tiny inputs.** The
    small 2 KB row is dominated by fixed Lua/FFI overhead, but medium and
-   larger multimodal payloads show roughly 14–38× higher throughput than
+   larger multimodal payloads show roughly 14–35× higher throughput than
    `cjson` and roughly 3–5× higher throughput than `lua-resty-simdjson`
    for request-field access.
 2. **Reading every `messages[*].content` is still access-light for large
    multimodal bodies.** The benchmark touches the top-level request fields
    and one `content` field per message; the payload size comes from image
    data inside each message.
-3. **Speedup remains high at 10 MB.** The eager-decode optimization
-   keeps `qjson.parse` throughput scaling well even at the 10 MB level,
-   maintaining ~38× over cjson and ~5× over simdjson.
+3. **The win drops at 10 MB.** `qjson.parse` is L3-bandwidth-bound at that
+   size, and the `qjson.decode` proxy's per-`__index` dispatch starts to
+   amortize less well against the cheaper structural scan. `cjson` is still
+   allocating into the table heap at that size, so the ratio remains large.
 4. **`qjson.decode + qjson.encode (unmodified)` is the headline number for
    passthrough workloads** — e.g. an LLM gateway re-emitting the original
    JSON after light-touch inspection. The substring fast path means
@@ -158,7 +159,7 @@ key into the Lua table heap.
    size; the eager parsers retain more Lua heap after the first run because
    the Lua table tree stays GC-rooted until the next collection.
-   The 10 MB case retains ~1.5 MB for `cjson`, ~2.0 MB for simdjson,
-   and ~14 KB for `qjson.parse`.
+   The 10 MB case retains ~2.1 MB for `cjson`, ~2.9 MB for simdjson,
+   and ~27 KB for `qjson.parse`.
 6. **REST API payloads (github-100k) show a smaller speedup** because their
    structural density is higher than the multimodal request ladder. Memory
    savings remain dramatic because `cjson` must materialize every nested
@@ -187,4 +188,4 @@ key into the Lua table heap.
 - `qjson` retains the source buffer on the `Doc`, so the input string
   stays alive for the document's lifetime. If you parse and immediately
   discard the JSON string in the caller, GC can still free
-  the input — but only after the `Doc` is also unreachable.
\ No newline at end of file
+  the input — but only after the `Doc` is also unreachable.
diff --git a/include/qjson.h b/include/qjson.h
index 343e782..b828d05 100644
--- a/include/qjson.h
+++ b/include/qjson.h
@@ -79,10 +79,15 @@ int qjson_cursor_get_f64 (const qjson_cursor*, const char* path, size_t path_le
 int qjson_cursor_get_bool (const qjson_cursor*, const char* path, size_t path_len, int* out);
 int qjson_cursor_typeof (const qjson_cursor*, const char* path, size_t path_len, int* out);
 int qjson_cursor_len (const qjson_cursor*, const char* path, size_t path_len, size_t* out);
-int qjson_cursor_bytes (const qjson_cursor*, size_t* byte_start, size_t* byte_end);
+int qjson_cursor_bytes(const qjson_cursor*, size_t* byte_start, size_t* byte_end);
 int qjson_cursor_object_entry_at(const qjson_cursor*, size_t i,
                                  const uint8_t** key_ptr, size_t* key_len,
                                  qjson_cursor* value_out);
+int qjson_cursor_get_value(const qjson_cursor*,
+                           int* type_out,
+                           const uint8_t** str_ptr, size_t* str_len,
+                           double* f64_out, int* bool_out,
+                           size_t* byte_start, size_t* byte_end);
 
 #ifdef __cplusplus
 }
diff --git a/lua/qjson/lib.lua b/lua/qjson/lib.lua
index 3e6c686..49d2a04 100644
--- a/lua/qjson/lib.lua
+++ b/lua/qjson/lib.lua
@@ -41,6 +41,11 @@ int qjson_cursor_bytes(const qjson_cursor*, size_t* byte_start, size_t* byte_end
 int qjson_cursor_object_entry_at(const qjson_cursor*, size_t i,
                                  const uint8_t** key_ptr, size_t* key_len,
                                  qjson_cursor* value_out);
+int qjson_cursor_get_value(const qjson_cursor*,
+                           int* type_out,
+                           const uint8_t** str_ptr, size_t* str_len,
+                           double* f64_out, int* bool_out,
+                           size_t* byte_start, size_t* byte_end);
 ]]
 
 local tried = {}
@@ -70,6 +75,7 @@ local required_symbols = {
     "qjson_cursor_len",
     "qjson_cursor_bytes",
     "qjson_cursor_object_entry_at",
+    "qjson_cursor_get_value",
 }
 
 local function try_load(name)
diff --git a/lua/qjson/table.lua b/lua/qjson/table.lua
index 86f50d0..d731dc1 100644
--- a/lua/qjson/table.lua
+++ b/lua/qjson/table.lua
@@ -57,11 +57,9 @@ local LazyObject = {}
 local LazyArray = {}
 
 -- Build a new lazy view for a child container cursor.
--- src_box is an FFI cdata `qjson_cursor[1]`; src_box[0] is the cursor whose
--- data we copy into a fresh per-view allocation so the new view's _cur
--- survives later overwrites of src_box.
+-- src_box is an FFI cdata `qjson_cursor[1]`; the callers guarantee that sz_a/sz_b
+-- already hold the cursor's byte span (filled by qjson_cursor_get_value).
 local function wrap_child(parent_view, src_box)
-    C.qjson_cursor_bytes(src_box[0], sz_a, sz_b)
     local own_box = ffi.new("qjson_cursor[1]")
     ffi.copy(own_box, src_box, ffi.sizeof("qjson_cursor"))
     return {
@@ -74,23 +72,20 @@ local function wrap_child(parent_view, src_box)
 end
 
 -- Decode the value at src_box[0] into a Lua value.
--- src_box is a `qjson_cursor[1]`; for container types, a new view is created
--- via wrap_child so the caller's box can be freely reused afterwards.
+-- src_box is a `qjson_cursor[1]`; uses qjson_cursor_get_value for a single FFI call.
+-- For container types, a new view is created via wrap_child so the caller's box
+-- can be freely reused afterwards.
 local function decode_cursor(parent_view, src_box)
-    local trc = C.qjson_cursor_typeof(src_box[0], "", 0, type_box)
-    if not check(trc) then return nil end
+    local rc = C.qjson_cursor_get_value(src_box[0], type_box,
+        strp_box, size_box, f64_box, bool_box,
+        sz_a, sz_b)
+    if not check(rc) then return nil end
     local t = type_box[0]
     if t == T_STR then
-        local rrc = C.qjson_cursor_get_str(src_box[0], "", 0, strp_box, size_box)
-        if not check(rrc) then return nil end
         return ffi.string(strp_box[0], size_box[0])
     elseif t == T_NUM then
-        local rrc = C.qjson_cursor_get_f64(src_box[0], "", 0, f64_box)
-        if not check(rrc) then return nil end
         return f64_box[0]
     elseif t == T_BOOL then
-        local rrc = C.qjson_cursor_get_bool(src_box[0], "", 0, bool_box)
-        if not check(rrc) then return nil end
         return bool_box[0] ~= 0
     elseif t == T_NULL then
         return _M.null
@@ -316,17 +311,14 @@ function _M.decode(json_str)
     end
     local root_box = ffi.new("qjson_cursor[1]")
     ffi.copy(root_box, cur_box, ffi.sizeof("qjson_cursor"))
-    -- Determine root container kind (object/array) and wrap accordingly.
-    -- Both have meaningful byte spans for encode.
-    local trc = C.qjson_cursor_typeof(root_box[0], "", 0, type_box)
-    if not check(trc) then
+    -- Determine root container kind (object/array) and byte span in one call.
+    local grc = C.qjson_cursor_get_value(root_box[0], type_box,
+        strp_box, size_box, f64_box, bool_box,
+        sz_a, sz_b)
+    if not check(grc) then
         error("qjson: root typeof failed")
     end
     local rt = type_box[0]
-    local brc = C.qjson_cursor_bytes(root_box[0], sz_a, sz_b)
-    if not check(brc) then
-        error("qjson: root byte-span failed")
-    end
     local view = {
         _doc = doc,
         _cur_box = root_box, -- keep the box alive; _cur is a stable reference
diff --git a/src/doc.rs b/src/doc.rs
index 82226f5..cccf5f7 100644
--- a/src/doc.rs
+++ b/src/doc.rs
@@ -4,11 +4,11 @@ use crate::error::qjson_err;
 use crate::skip_cache::SkipCache;
 
 pub struct Document<'a> {
-    pub(crate) buf: &'a [u8],
-    pub(crate) indices: Vec,
-    pub(crate) eager_validated: bool,
-    pub(crate) scratch: RefCell>,
-    pub(crate) skip: RefCell,
+    pub(crate) buf: &'a [u8],
+    pub(crate) indices: Vec,
+    pub(crate) eager_validated: bool,
+    pub(crate) scratch: RefCell>,
+    pub(crate) skip: RefCell,
 }
 
 impl<'a> Document<'a> {
diff --git a/src/ffi.rs b/src/ffi.rs
index cbfb25a..522bff4 100644
--- a/src/ffi.rs
+++ b/src/ffi.rs
@@ -283,6 +283,15 @@ pub unsafe extern "C" fn qjson_get_str(
         }
         // String ends at the close quote, whose indices position is idx_start + 1.
         let close = d.indices[(cur.idx_start + 1) as usize] as usize;
+        let slice = &d.buf[pos + 1..close];
+
+        // EAGER fast path: validation already passed; if no escapes, return
+        // the buffer slice directly without touching scratch.
+        if d.eager_validated && memchr::memchr(b'\\', slice).is_none() {
+            *out_ptr = slice.as_ptr();
+            *out_len = slice.len();
+            return qjson_err::QJSON_OK as c_int;
+        }
 
         let mut scratch = d.scratch.borrow_mut();
         match string::decode_string(d.buf, pos + 1, close, &mut scratch, d.eager_validated) {
@@ -561,6 +570,13 @@ pub unsafe extern "C" fn qjson_cursor_get_str(
             return qjson_err::QJSON_TYPE_MISMATCH as c_int;
         }
         let close = d.indices[(cur.idx_start + 1) as usize] as usize;
+        let slice = &d.buf[pos + 1..close];
+
+        if d.eager_validated && memchr::memchr(b'\\', slice).is_none() {
+            *out_ptr = slice.as_ptr();
+            *out_len = slice.len();
+            return qjson_err::QJSON_OK as c_int;
+        }
 
         let mut scratch = d.scratch.borrow_mut();
         match string::decode_string(d.buf, pos + 1, close, &mut scratch, d.eager_validated) {
@@ -654,6 +670,101 @@ pub unsafe extern "C" fn qjson_cursor_get_bool(
     })
 }
 
+/// Resolve type + decoded value of a cursor in one FFI call.
+/// On success (`QJSON_OK`), fills `*type_out` unconditionally.
+/// For strings, fills `(*str_ptr, *str_len)`;
+/// for numbers, fills `*f64_out`; for bool, fills `*bool_out`;
+/// for containers, fills `(*byte_start, *byte_end)`.
+///
+/// # Safety
+///
+/// See the module-level [shared safety contract](self#shared-safety-contract).
+/// `c` must point to a cursor produced by an earlier `qjson_*` call whose
+/// document is still alive. All out pointers must be non-NULL.
+#[no_mangle]
+pub unsafe extern "C" fn qjson_cursor_get_value(
+    c: *const qjson_cursor,
+    type_out: *mut c_int,
+    str_ptr: *mut *const u8, str_len: *mut usize,
+    f64_out: *mut f64, bool_out: *mut c_int,
+    byte_start: *mut usize, byte_end: *mut usize,
+) -> c_int {
+    ffi_catch!({
+        if type_out.is_null() || str_ptr.is_null() || str_len.is_null()
+            || f64_out.is_null() || bool_out.is_null()
+            || byte_start.is_null() || byte_end.is_null()
+        {
+            return qjson_err::QJSON_INVALID_ARG as c_int;
+        }
+        let (d, cur) = match cursor_to_internal(c) {
+            Ok(x) => x, Err(e) => return e as c_int,
+        };
+        let pos = d.indices[cur.idx_start as usize] as usize;
+        let lead = match d.buf.get(pos).copied() {
+            Some(b) => b,
+            None => return qjson_err::QJSON_PARSE_ERROR as c_int,
+        };
+        match lead {
+            b'"' => {
+                *type_out = qjson_type::QJSON_T_STR as c_int;
+                let close = d.indices[(cur.idx_start + 1) as usize] as usize;
+                let slice = &d.buf[pos + 1..close];
+                if d.eager_validated && memchr::memchr(b'\\', slice).is_none() {
+                    *str_ptr = slice.as_ptr();
+                    *str_len = slice.len();
+                } else {
+                    let mut scratch = d.scratch.borrow_mut();
+                    match string::decode_string(d.buf, pos + 1, close, &mut scratch, d.eager_validated) {
+                        Ok((p, n)) => { *str_ptr = p; *str_len = n; }
+                        Err(e) => return e as c_int,
+                    }
+                }
+                *byte_start = pos;
+                *byte_end = close + 1;
+            }
+            b'{' | b'[' => {
+                *type_out = if lead == b'{' { qjson_type::QJSON_T_OBJ as c_int }
+                            else { qjson_type::QJSON_T_ARR as c_int };
+                let end = d.indices[cur.idx_end as usize] as usize;
+                if end >= d.buf.len() { return qjson_err::QJSON_PARSE_ERROR as c_int; }
+                *byte_start = pos;
+                *byte_end = end + 1;
+            }
+            _ => {
+                let bytes = match scalar_bytes(d, cur) {
+                    Ok(b) => b, Err(e) => return e as c_int,
+                };
+                match bytes {
+                    b"true" => {
+                        *type_out = qjson_type::QJSON_T_BOOL as c_int;
+                        *bool_out = 1;
+                    }
+                    b"false" => {
+                        *type_out = qjson_type::QJSON_T_BOOL as c_int;
+                        *bool_out = 0;
+                    }
+                    b"null" => {
+                        *type_out = qjson_type::QJSON_T_NULL as c_int;
+                    }
+                    _ => {
+                        *type_out = qjson_type::QJSON_T_NUM as c_int;
+                        match number::parse_f64(bytes, d.eager_validated) {
+                            Ok(v) => { *f64_out = v; }
+                            Err(e) => return e as c_int,
+                        }
+                    }
+                }
+                let (s, e) = match scalar_byte_range(d, cur) {
+                    Ok(x) => x, Err(e) => return e as c_int,
+                };
+                *byte_start = s;
+                *byte_end = e;
+            }
+        }
+        qjson_err::QJSON_OK as c_int
+    })
+}
+
 /// Write the JSON value type at `path` (relative to `*c`) into `*type_out`.
 ///
 /// # Safety