perf: comprehensive Scala Native render pipeline optimization by He-Pin · Pull Request #776 · databricks/sjsonnet

He-Pin · 2026-04-13T07:16:41Z

Motivation:

This PR is the complete rebased render/string optimization branch for preserving the full #776 experiment on top of current master. It is intentionally broad and should remain draft/source material before we split the useful pieces into focused PRs.

Key Design Decision:

Keep the full rebased experiment available on GitHub first, then split from it.
Preserve compatibility-first behavior: any optimization that changes Jsonnet semantics must be excluded during later split PRs.
Treat this PR as an archive/integration branch, not as a ready-to-merge unit.
Use focused follow-up PRs for individual JIT/Native-friendly ideas with clean benchmark proof.

Modification:

Rebased the full render pipeline branch onto upstream/master at 0ae7b78a93c4e643f9bcfb6dd1d99d9fe7a522a9.
Resolved current renderer/SWAR conflicts across JVM, Native, JS, and shared renderer code.
Preserved the existing split/source ideas: SWAR/chunked escaping, char renderer paths, format scanning/assembly, ascii-safe propagation, stdlib string helpers, and manifest rendering work.
Included the rebase repair for the std.format no-spec/single-spec path so offset-based scanned formats do not create invalid null strings.

Benchmark Results:

JMH command:

./mill --no-server -j 1 bench.runRegressions \
  bench/resources/cpp_suite/large_string_template.jsonnet \
  bench/resources/cpp_suite/large_string_join.jsonnet \
  bench/resources/cpp_suite/bench.09.jsonnet \
  bench/resources/jdk17_suite/repeat_format.jsonnet \
  bench/resources/go_suite/manifestTomlEx.jsonnet \
  bench/resources/go_suite/manifestJsonEx.jsonnet \
  bench/resources/go_suite/substr.jsonnet \
  bench/resources/go_suite/parseInt.jsonnet

| Case | master ms/op | #776 ms/op | Result |
| --- | ---: | ---: | --- |
| `large_string_template.jsonnet` | 1.265 | 1.205 | +5.0% ops/ms |
| `large_string_join.jsonnet` | 0.554 | 0.397 | +39.5% ops/ms |
| `bench.09.jsonnet` | 0.042 | 0.046 | regression/risk |
| `repeat_format.jsonnet` | 0.202 | 0.170 | +18.8% ops/ms |
| `manifestTomlEx.jsonnet` | 0.068 | 0.073 | regression/risk |
| `manifestJsonEx.jsonnet` | 0.055 | 0.054 | neutral/slightly positive |
| `substr.jsonnet` | 0.057 | 0.048 | +18.8% ops/ms |
| `parseInt.jsonnet` | 0.032 | 0.033 | neutral/slightly negative |
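
For readers checking the Result column: the percentages are consistent with a throughput (ops/ms) gain computed from the two ms/op figures. The helper below is a hypothetical sketch of that arithmetic, not part of the benchmark harness.

```scala
object BenchDelta {
  // Throughput gain in percent, given ms/op before and after (lower ms/op is better).
  // ops/ms is the reciprocal of ms/op, so the gain is master/pr - 1.
  def opsGainPercent(masterMsPerOp: Double, prMsPerOp: Double): Double =
    (masterMsPerOp / prMsPerOp - 1.0) * 100.0
}
```

For example, `BenchDelta.opsGainPercent(0.554, 0.397)` lands close to the +39.5% reported for `large_string_join.jsonnet`.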

Scala Native hyperfine, 30 runs, --shell=none, compared against source-built jrsonnet 0.5.0-pre98:

| Case | master | #776 | jrsonnet | Result |
| --- | ---: | ---: | ---: | --- |
| `large_string_template.jsonnet` | 12.5 +/- 1.1 ms | 12.0 +/- 1.4 ms | 5.8 +/- 0.8 ms | PR slightly faster, jrsonnet still faster |
| `large_string_join.jsonnet` | 8.0 +/- 0.7 ms | 7.3 +/- 0.7 ms | 8.4 +/- 5.4 ms | PR faster than master in this run |
| `repeat_format.jsonnet` | 6.3 +/- 1.4 ms | 5.2 +/- 0.4 ms | 4.1 +/- 0.3 ms | PR faster, jrsonnet still faster |
| `manifestTomlEx.jsonnet` | 5.6 +/- 0.9 ms | 5.8 +/- 0.5 ms | 3.5 +/- 0.7 ms | small PR regression/risk |
| `substr.jsonnet` | 5.5 +/- 0.6 ms | 5.6 +/- 0.5 ms | 3.7 +/- 0.3 ms | neutral/slightly negative Native |
| `parseInt.jsonnet` | 5.2 +/- 0.5 ms | 5.1 +/- 0.4 ms | 3.7 +/- 0.4 ms | neutral/slightly positive Native |

Analysis:

The full branch has real positive signals, especially large_string_join, repeat_format, and some JVM string/format paths. It also carries broad code movement and measurable regressions/risks (bench.09, manifestTomlEx, and noisy Native short-run cases). Therefore this should stay draft and be used as the source for focused splits rather than merged wholesale.

References:

Current rebased head: He-Pin/sjsonnet@5d596ebb
Current base: 0ae7b78
Prior focused split: perf: skip builders for single-spec formats #833
Earlier split context: perf: chunk long string byte escaping #809

Result:

Complete rebase is pushed. Keep this PR draft/source material; next step is to split the positive pieces from this rebased head into smaller PRs with isolated tests and benchmark gates.

He-Pin · 2026-04-14T15:09:45Z

I think string join can be improved further with an AST rewrite, but I want to do that after this is merged.

He-Pin · 2026-04-14T15:28:35Z

Benchmark summary

Environment: Apple Silicon, macOS | Tool: hyperfine --warmup 5 --min-runs 20 -N | sjsonnet: Scala Native (current branch, including the PR #776 optimizations) | jrsonnet: 0.5.0-pre98 (built from source)

Reliable benchmarks (>20 ms runtime, startup overhead does not dominate)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Ratio | Winner |
| --- | ---: | ---: | --- | --- |
| comparsion_for_primitives | 37.6 | 214.5 | sjsonnet 5.71x faster | sjsonnet |
| inheritance_recursion | 60.7 | 120.2 | sjsonnet 1.98x faster | sjsonnet |
| simple_recursive_call | 28.8 | 52.6 | sjsonnet 1.83x faster | sjsonnet |
| realistic_2 | 89.4 | 101.7 | sjsonnet 1.14x faster | sjsonnet |
| std_reverse | 21.6 | 23.5 | tie (1.09x) | tie |

Medium (10-20 ms)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Ratio | Winner |
| --- | ---: | ---: | --- | --- |
| std_base64_byte_array | 9.8 | 18.2 | sjsonnet 1.86x faster | sjsonnet |
| std_base64decodebytes | 14.1 | 20.5 | sjsonnet 1.45x faster | sjsonnet |
| big_object | 10.5 | 11.6 | sjsonnet 1.10x faster | sjsonnet |
| realistic_1 | 9.3 | 11.9 | sjsonnet 1.27x faster | sjsonnet |

Small (<10 ms, startup overhead dominates)

| Benchmark | sjsonnet (ms) | jrsonnet (ms) | Ratio | Winner |
| --- | ---: | ---: | --- | --- |
| comparsion_for_array | 6.3 | 12.8 | sjsonnet 2.02x faster | sjsonnet |
| foldl_string_concat | 5.4 | 8.6 | sjsonnet 1.59x faster | sjsonnet |
| std_foldl | 6.2 | 7.4 | sjsonnet 1.19x faster | sjsonnet |
| large_string_join | 6.8 | 5.4 | jrsonnet 1.26x faster | jrsonnet |
| array_sorts | 8.2 | 5.5 | jrsonnet 1.49x faster | jrsonnet |
| std_base64 | 7.8 | 4.2 | jrsonnet 1.86x faster | jrsonnet |
| std_base64decode | 7.3 | 5.3 | jrsonnet 1.36x faster | jrsonnet |
| std_manifestjsonex | 6.4 | 4.1 | jrsonnet 1.54x faster | jrsonnet |
| std_manifesttomlex | 6.5 | 3.6 | jrsonnet 1.82x faster | jrsonnet |
| std_parseint | 6.1 | 3.6 | jrsonnet 1.70x faster | jrsonnet |
| std_substr | 6.2 | 4.2 | jrsonnet 1.45x faster | jrsonnet |
| string_strips | 5.7 | 3.9 | jrsonnet 1.48x faster | jrsonnet |
| tail_call | 5.9 | 3.7 | jrsonnet 1.57x faster | jrsonnet |
| inheritance_function_recursion | 5.0 | 2.9 | jrsonnet 1.74x faster | jrsonnet |

Motivation:

Combined review of PR databricks#776 + databricks#778 identified ~130 lines of duplicated SWAR string rendering and long-to-char conversion code, plus two missing overflow checks in StringModule.

Modification:

- Extract renderQuotedStringSWAR as a protected method in BaseCharRenderer, delegate from MaterializeJsonRenderer (removes ~60 lines of duplication)
- Make escapeCharInline protected, remove the duplicate in Renderer
- Consolidate Renderer.visitFloat64 onto the inherited writeLongDirect, remove the standalone RenderUtils.appendLong (~40 lines)
- Add a totalLen > Int.MaxValue guard in the Join pre-sized allocation
- Add Long overflow detection in parseDigits
- Leverage the _asciiSafe flag in Substr/Join to skip redundant scans

Result:

Net -132 lines. All tests pass across JVM/JS/Native/WASM.

He-Pin · 2026-04-26T11:36:17Z

Reviewed and keeping this as a follow-up, not part of this PR. #776 is scoped to the render/materialization pipeline and Scala Native-friendly SWAR/direct rendering paths. An AST rewrite for string join would be a separate optimization because it changes the optimization boundary earlier in the pipeline and should get its own focused benchmark/compatibility review.

Motivation:

PR databricks#776 already propagates _asciiSafe through parser literals, base64, joins, and substrings, but MaterializeJsonRenderer still sent those known-safe strings through the chunked char renderer, allocating a temporary char array and scanning for escapes. The hand-written parseInt path also rejected Long.MinValue, which the previous Long.parseLong-based implementation accepted.

Modification:

Add a char-renderer fast path for known ASCII-safe strings and use it in the fused MaterializeJsonRenderer. Let std.length trust _asciiSafe before scanning, and switch parseDigits to negative accumulation so Long.MinValue is accepted while positive overflow remains rejected.

Result:

Known ASCII-safe strings skip allocation and escape scanning in char materialization and std.length. parseInt keeps the overflow guard without regressing the Long.MinValue boundary.
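
The negative-accumulation idea can be illustrated with a minimal sketch. The object and method names are hypothetical, the sketch handles base 10 only, and it follows the same approach as the JDK's `Long.parseLong` rather than reproducing the actual parseDigits code in this branch.

```scala
object ParseDigitsSketch {
  /** Parses an optionally '-'-prefixed base-10 integer; None on malformed input or overflow. */
  def parseLong(s: String): Option[Long] = {
    if (s.isEmpty) return None
    val negative = s.charAt(0) == '-'
    var i = if (negative) 1 else 0
    if (i == s.length) return None
    // Accumulate as a negative number: the negative range is one wider than the
    // positive range, so Long.MinValue is representable during accumulation.
    val limit = if (negative) Long.MinValue else -Long.MaxValue
    val multMin = limit / 10
    var acc = 0L
    while (i < s.length) {
      val d = s.charAt(i) - '0'
      if (d < 0 || d > 9) return None
      if (acc < multMin) return None   // acc * 10 would overflow
      acc *= 10
      if (acc < limit + d) return None // acc - d would pass the limit
      acc -= d
      i += 1
    }
    Some(if (negative) acc else -acc)
  }
}
```

With this shape, `"-9223372036854775808"` parses to `Long.MinValue`, while `"9223372036854775808"` is rejected as positive overflow.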

He-Pin · 2026-04-26T20:24:34Z

Small follow-up pushed in 76d7bc4c:

Reuse _asciiSafe in MaterializeJsonRenderer, so known-safe strings skip the temporary char[] allocation and escape scan in the fused char rendering path.
Let std.length trust _asciiSafe before running the ASCII scan.
Fix the hand-written parseInt overflow path to preserve the previous Long.MinValue boundary while still rejecting positive overflow.

Validation run locally:

./mill --no-server 'sjsonnet.jvm[3.3.7].reformat'
./mill --no-server 'sjsonnet.jvm[3.3.7].checkFormat'
./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' sjsonnet.Std0150FunctionsTests
./mill --no-server 'sjsonnet.jvm[3.3.7].test.testOnly' sjsonnet.RendererTests
./mill --no-server 'sjsonnet.jvm[3.3.7].test'
./mill --no-server 'sjsonnet.js[3.3.7].test'
./mill --no-server 'sjsonnet.native[3.3.7].compile'
git diff --check

Motivation:

Split the JMH-positive, JDK17/JIT/GC-friendly long-string rendering piece out of #776. Keep this PR focused on byte rendering for long strings that contain JSON escapes; this does not include the broader format, stdlib, compareStrings, or Scala Native experiments from #776.

Modification:

- Add `CharSWAR.findFirstEscapeChar(byte[], from, to)` on JVM, Scala.js, and Scala Native.
- In `BaseByteRenderer`, keep the existing UTF-8 byte array for long strings, locate escape bytes, bulk-copy clean chunks with `System.arraycopy`, and escape only matching bytes inline.
- Precompute the exact escaped output length, reserve `ByteBuilder` once, then write directly to the backing byte array. This removes repeated `ensureLength`/`appendUnsafeC` calls from the dirty long-string loop.
- Use a static byte hex table for `\u00XX` control escapes.

JIT / GC shape:

- Hot code stays in simple `while` loops, `System.arraycopy`, and small private helpers.
- No reflection, no internal JDK APIs, no closures/iterators in the rendering loop.
- No per-chunk or per-escape objects are allocated by this follow-up; the existing per-long-string UTF-8 byte array remains the only temporary for this path.
- I tested a no-allocation ASCII scalar path, but rejected it because it regressed `large_string_template` and `large_string_join` JMH.

Notable results only:

JMH target run, same machine, same command shape on `upstream/master` and this branch:

`./mill -i bench.runRegressions bench/resources/cpp_suite/large_string_template.jsonnet bench/resources/cpp_suite/large_string_join.jsonnet`

| Benchmark | upstream/master | PR | Delta |
| --- | ---: | ---: | ---: |
| `large_string_template` | 1.552 ms/op | 1.154 ms/op | -25.6% / 1.34x faster |

Scala Native hyperfine, release-full native binary, 20 runs:

| Benchmark | upstream/master | PR | Delta |
| --- | ---: | ---: | ---: |
| `large_string_template` | 10.5 +/- 0.2 ms | 9.6 +/- 0.3 ms | -8.6% / 1.09x faster |

`large_string_join` was rechecked as a guardrail and stayed neutral, so it is intentionally omitted from the result tables.

Verification:

- `./mill -i 'sjsonnet.jvm[3.3.7].compile'`
- `./mill -i 'sjsonnet.jvm[3.3.7].test'`
- `./mill -i 'sjsonnet.js[3.3.7].compile' 'sjsonnet.native[3.3.7].compile'`
- `./mill -i 'sjsonnet.native[3.3.7].nativeLink'`
- `./mill -i __.checkFormat`
- `git diff --check`
- Focused JMH and Native hyperfine commands above

References:

- Split from #776
- Base: `b4c667d55d82d7c50c2103db967c33bebb0c2c98`
- Head: `ff70b63e`
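
The chunked approach described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the `BaseByteRenderer` code: the object and method names are hypothetical, a plain output array stands in for `ByteBuilder`, and a scalar loop stands in for the SWAR scan.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object ChunkedEscapeSketch {
  // Static hex table for \u00XX control escapes.
  private val Hex: Array[Byte] = "0123456789abcdef".getBytes(UTF_8)

  // JSON escape bytes: controls (< 0x20), '"' (0x22), '\' (0x5c).
  private def needsEscape(b: Byte): Boolean = {
    val u = b & 0xff // UTF-8 lead/continuation bytes are >= 0x80 and never match
    u < 0x20 || u == 0x22 || u == 0x5c
  }

  // Character after the backslash for two-byte escapes, or 0 when \u00XX is needed.
  private def shortEscape(u: Int): Int = u match {
    case 0x22 => '"'.toInt
    case 0x5c => '\\'.toInt
    case 0x0a => 'n'.toInt
    case 0x0d => 'r'.toInt
    case 0x09 => 't'.toInt
    case 0x08 => 'b'.toInt
    case 0x0c => 'f'.toInt
    case _    => 0
  }

  /** Scalar stand-in for the SWAR scan: first escape index in [from, to), or -1. */
  def findFirstEscape(src: Array[Byte], from: Int, to: Int): Int = {
    var i = from
    while (i < to) { if (needsEscape(src(i))) return i; i += 1 }
    -1
  }

  /** Returns `s` as a quoted JSON string in UTF-8 bytes. */
  def renderQuoted(s: String): Array[Byte] = {
    val src = s.getBytes(UTF_8)
    // Pass 1: exact escaped length, so the output is reserved once.
    var outLen = src.length + 2
    var i = 0
    while (i < src.length) {
      if (needsEscape(src(i))) outLen += (if (shortEscape(src(i) & 0xff) != 0) 1 else 5)
      i += 1
    }
    val out = new Array[Byte](outLen)
    var o = 0
    out(o) = '"'.toByte; o += 1
    // Pass 2: bulk-copy clean chunks, escape only the matching bytes inline.
    var pos = 0
    while (pos < src.length) {
      val hit = findFirstEscape(src, pos, src.length)
      val end = if (hit < 0) src.length else hit
      System.arraycopy(src, pos, out, o, end - pos)
      o += end - pos
      pos = end
      if (hit >= 0) {
        val u = src(hit) & 0xff
        val se = shortEscape(u)
        out(o) = '\\'.toByte
        if (se != 0) { out(o + 1) = se.toByte; o += 2 }
        else {
          out(o + 1) = 'u'.toByte; out(o + 2) = '0'.toByte; out(o + 3) = '0'.toByte
          out(o + 4) = Hex(u >> 4); out(o + 5) = Hex(u & 0xf)
          o += 6
        }
        pos = hit + 1
      }
    }
    out(o) = '"'.toByte
    out
  }
}
```

The two-pass shape trades one extra read of the input for a single allocation and no bounds growth inside the copy loop.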

He-Pin · 2026-05-08T05:17:43Z

Closing obsolete broad draft. The useful render work has been or should be split into smaller focused PRs with current docs-aligned data; this branch is now conflicting and too broad to carry forward as-is.

He-Pin · 2026-05-08T05:21:39Z

Reopened. This broad branch still conflicts heavily with current renderer/SWAR code and overlaps later split PRs. Keep as draft/source material for extracting smaller PRs rather than closing it as negative.

He-Pin · 2026-05-08T05:35:53Z

Rebase retry against current upstream/master still conflicts at the first renderer/SWAR commit: sjsonnet/src-js/sjsonnet/CharSWAR.scala, sjsonnet/src-native/sjsonnet/CharSWAR.scala, and sjsonnet/src/sjsonnet/BaseByteRenderer.scala. Keeping this as draft/source material; not closing because this is not a negative benchmark result.

Motivation:

String comparison (compareStringsByCodepoint) and long string rendering are hot paths in sort-heavy and render-heavy Jsonnet workloads. The comparison used per-char charAt() virtual dispatch, preventing JIT vectorization. Long string rendering used a binary scan (clean→bulk copy, dirty→full reprocess from position 0).

Modification:

1. compareStrings: bulk getChars() + tight array loop enabling JIT auto-vectorization (AVX2/SSE). Surrogate check deferred to the mismatch point only (O(1) vs O(n)). ThreadLocal buffers on JVM, local alloc on Native, scalar fallback on JS.
2. findFirstEscapeChar: SWAR scan returning the position (not a boolean).
3. visitLongString: chunked rendering: find the escape position, arraycopy the clean prefix, escape inline, repeat. Avoids re-processing the entire string when only a few chars need escaping.

Result:

All tests pass across JVM (Scala 3.3.7, 2.13.18) and JS. All benchmark regressions pass. Endian-safe (SWAR operates on independent byte lanes).
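
A position-returning SWAR scan can be sketched as below. This is a generic illustration of the technique using well-known bit tricks, with hypothetical names and explicit little-endian packing; it is not the `CharSWAR` implementation from this branch.

```scala
object SwarEscapeSketch {
  private final val Ones  = 0x0101010101010101L
  private final val Highs = 0x8080808080808080L

  // Sets the high bit of each zero lane. Borrows can flag lanes above a true
  // hit, but the lowest flagged lane is always a real match.
  private def zeroLanes(x: Long): Long = (x - Ones) & ~x & Highs

  // Flags lanes holding a JSON escape byte: < 0x20, '"' (0x22) or '\' (0x5c).
  private def escapeLanes(w: Long): Long =
    ((w - Ones * 0x20) & ~w & Highs) |
      zeroLanes(w ^ (Ones * 0x22)) |
      zeroLanes(w ^ (Ones * 0x5c))

  /** Index of the first escape byte in [from, to), or -1 if the range is clean. */
  def findFirstEscape(src: Array[Byte], from: Int, to: Int): Int = {
    var i = from
    while (i + 8 <= to) {
      // Pack 8 bytes so that src(i) lands in the lowest lane.
      var w = 0L
      var k = 7
      while (k >= 0) { w = (w << 8) | (src(i + k) & 0xffL); k -= 1 }
      val m = escapeLanes(w)
      if (m != 0L) return i + (java.lang.Long.numberOfTrailingZeros(m) >>> 3)
      i += 8
    }
    // Scalar tail for the last 0..7 bytes.
    while (i < to) {
      val u = src(i) & 0xff
      if (u < 0x20 || u == 0x22 || u == 0x5c) return i
      i += 1
    }
    -1
  }
}
```

Returning the position rather than a boolean is what lets the renderer copy the clean prefix and resume scanning after the escaped byte instead of restarting from the beginning.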

Replace per-call `new Array[Char](n)` allocation with module-level pre-allocated buffers in Scala Native's compareStrings. Safe because Scala Native is single-threaded (mirrors the JVM ThreadLocal approach).

Motivation:

manifestJsonEx/manifestTomlEx used the generic Visitor interface for char-based rendering, missing the fused direct-walk optimization that ByteRenderer already had. Additionally, char-based string rendering (BaseCharRenderer, MaterializeJsonRenderer) did a binary hasEscapeChar check → char-by-char RenderUtils.escapeChar fallback, while ByteRenderer had proper chunked SWAR scanning → bulk arraycopy → inline escape.

Modification:

- Add materializeDirect(Val) to MaterializeJsonRenderer, mirroring ByteRenderer's fused materializer with valTag-based switch dispatch
- Replace visitNonNullString in BaseCharRenderer with chunked rendering: findFirstEscapeCharChar → bulk arraycopy clean segments → escapeCharInline
- Add renderQuotedString to MaterializeJsonRenderer with the same chunked pattern
- Add findFirstEscapeCharChar(char[]) to all 3 CharSWAR platform impls
- Wire ManifestModule to use renderer.materializeDirect instead of Materializer.apply0 + Visitor interface

Result:

manifestJsonEx gap reduced from 2.15x to ~1.4x slower vs jrsonnet. realistic_2 flipped from 1.62x slower to 1.12x faster.

…afe propagation

Motivation:

String-heavy stdlib operations (substr, length, join, parseInt) had unnecessary overhead on Scala Native: codePointCount/offsetByCodePoints O(n) scans for ASCII strings, StringBuilder resize churn for join, exception-based parseInt via Long.parseLong.

Modification:

- Add an ASCII fast path to Length and Substr using CharSWAR.isAllAscii: skip codePointCount/offsetByCodePoints for ASCII-only strings (99% case)
- Pre-sized char[] assembly for std.join: a two-pass approach calculates the exact output length, then copies with getChars, with zero resize overhead
- Hand-written parseDigits loop for parseInt/parseOctal/parseHex: no exception setup, no intermediate allocation, single pass
- Propagate the _asciiSafe flag: the parser sets it on ASCII string literals, Val.Str.concat preserves it when both children are ASCII-safe, join propagates it through all elements

Result:

substr gap reduced from 2.03x to ~1.07x. parseint from 1.80x to ~1.0x. large_string_join from 1.81x to ~1.27x. realistic_2 benefits from the combined improvements.
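
The two-pass join described above can be sketched as follows. The names are hypothetical and the error type is an assumption; the sketch also includes a length guard of the kind a later commit in this PR adds (`totalLen > Int.MaxValue`).

```scala
object JoinSketch {
  /** Joins `parts` with `sep` using one exactly sized char array. */
  def join(sep: String, parts: Array[String]): String = {
    if (parts.length == 0) return ""
    // Pass 1: exact output length, accumulated as Long so overflow is detectable.
    var total = sep.length.toLong * (parts.length - 1)
    var i = 0
    while (i < parts.length) { total += parts(i).length; i += 1 }
    if (total > Int.MaxValue)
      throw new IllegalArgumentException("joined string too long") // assumed error type
    // Pass 2: copy every piece straight into its final position.
    val out = new Array[Char](total.toInt)
    var pos = 0
    i = 0
    while (i < parts.length) {
      if (i > 0) { sep.getChars(0, sep.length, out, pos); pos += sep.length }
      val p = parts(i)
      p.getChars(0, p.length, out, pos)
      pos += p.length
      i += 1
    }
    new String(out)
  }
}
```

Because the array is sized up front, no intermediate buffer is ever grown or copied, which is the resize churn the commit message attributes to StringBuilder.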

Motivation:

Format.format() used a StringBuilder, which starts small and resizes multiple times for large output. The large_string_template benchmark (591KB template, 256 interpolations) showed a 2.78x gap vs jrsonnet.

Modification:

- Three-pass approach: compute formatted values into a String array, calculate the exact total output length, allocate a char[] and copy with getChars; this eliminates StringBuilder resize/copy overhead
- Add direct Val dispatch in the format loop: skip Materializer for common types (Str, Num, Bool, Null) to avoid the ujson.Value roundtrip

Result:

large_string_template gap reduced from 2.78x to ~1.88x. The remaining gap is dominated by Scala Native startup overhead (~7ms vs Rust ~1ms); pure computation time is within ~1ms of jrsonnet.

Motivation:

CI fails on two issues: (1) an unused `alwaysinline` import in Native CharSWAR.scala, (2) `\uXXXX` sequences in comments are parsed as unicode escapes in Scala 2.12, causing compilation errors.

Modification:

- Remove the unused `scala.scalanative.annotation.alwaysinline` import
- Escape backslash-u sequences in comments across BaseByteRenderer and Renderer

Result:

Full test suite passes across all platforms and Scala versions.

Motivation:

The full PR databricks#776 rebase introduced a duplicate JVM escape scan overload, missed the char renderer hex table, and allowed no-spec format strings to return null through the offset-based RuntimeFormat path.

Modification:

Remove the duplicate JVM byte-array escape scan, expose HEX_CHARS for BaseCharRenderer, and return the original source string for RuntimeFormat entries with no format specs.

Result:

The rebased branch compiles and the full cross-platform Mill test matrix passes locally.

References:

Upstream PR: databricks#776

Motivation:

The rebased format optimization improved large templates but initially regressed the short repeat_format regression case because single-placeholder formats paid for an unnecessary final char-array assembly step.

Modification:

Return the already formatted value directly when a format string has exactly one spec and no static literal characters.

Result:

repeat_format improves from 0.188 ms/op on upstream master to 0.148 ms/op on this branch in the local JMH gate, while the full Mill test matrix remains green.

References:

Upstream PR: databricks#776

He-Pin · 2026-05-10T10:02:33Z

Complete rebase pushed to renderOpt-clean at 5d596ebb on top of current upstream/master (0ae7b78a). Local __.reformat + __.test passed, representative JMH and Native hyperfine data are now in the PR body. I kept the PR as draft/source material because the full branch still mixes positive signals with regression risks; next step is splitting from this rebased head.

Motivation:

std.substr on long ASCII strings repeatedly pays codepoint-offset scans even when parser-time analysis can prove the literal is printable ASCII and JSON-render safe.

Modification:

Mark long ASCII JSON-safe literals with the existing _asciiSafe flag using a single platform CharSWAR scan, propagate the flag through string concatenation, and let std.length/std.substr use direct UTF-16 length/substring only for proven-safe values. Add UnicodeHandlingTests coverage for long ASCII length/substr boundaries and concat propagation.

Result:

Focused JVM JMH improves go_suite/substr from 0.056 ms/op to 0.046-0.047 ms/op with split_resolve unchanged and realistic2 in the same noise range. Scala Native hyperfine is neutral against master on the same case.

References:

Extracted from ideas in databricks#776, especially commit a190a80 (ASCII fast paths and asciiSafe propagation), narrowed to avoid the broader join/parseInt changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
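
To make the flag propagation concrete, here is a small sketch. The `Str` case class and helper names are hypothetical stand-ins, not the real `Val.Str`; the sketch also flags every literal rather than only long ones, and assumes non-negative `from` and `len`.

```scala
object AsciiSafeSketch {
  // Hypothetical stand-in for a string value carrying a cached safety flag.
  final case class Str(value: String, asciiSafe: Boolean)

  // Printable ASCII with no JSON escapes: 0x20..0x7e except '"' and '\'.
  def scan(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c < 0x20 || c > 0x7e || c == '"' || c == '\\') return false
      i += 1
    }
    true
  }

  // The flag is computed once, when the literal is created.
  def literal(s: String): Str = Str(s, scan(s))

  // Concatenation stays safe only when both sides are known safe.
  def concat(a: Str, b: Str): Str = Str(a.value + b.value, a.asciiSafe && b.asciiSafe)

  // For safe strings, UTF-16 length equals the code point count.
  def length(s: Str): Int =
    if (s.asciiSafe) s.value.length else s.value.codePointCount(0, s.value.length)

  // Code-point based substring with a direct path for proven-safe values.
  def substr(s: Str, from: Int, len: Int): Str = {
    val v = s.value
    if (s.asciiSafe) {
      val start = math.min(from, v.length)
      val end = math.min(start.toLong + len, v.length.toLong).toInt
      Str(v.substring(start, end), asciiSafe = true)
    } else {
      val cps = v.codePointCount(0, v.length)
      val startCp = math.min(from, cps)
      val endCp = math.min(startCp.toLong + len, cps.toLong).toInt
      val start = v.offsetByCodePoints(0, startCp)
      val end = v.offsetByCodePoints(start, endCp - startCp)
      literal(v.substring(start, end))
    }
  }
}
```

The flag is conservative: a `false` value only means "not proven safe", so the slow path must stay correct for every input.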

Motivation:

PR #776 showed that format-heavy workloads benefit when the format path avoids unnecessary intermediate assembly. This split keeps only the smallest safe idea: short format strings with exactly one specifier and no static literal text (for example `%08d`, `%010x`, `%-20s`, `%20s`) do not need a `StringBuilder` after the formatted value has already been computed.

Key Design Decision:

Keep the existing generic formatting implementation and all validation/arity checks. The optimization only bypasses appending to `StringBuilder` after the single formatted value is known, so format semantics and error behavior stay unchanged.

Modification:

- Detect `specBits.length == 1 && parsed.staticChars == 0` in `Format.format`.
- Avoid allocating/appending a `StringBuilder` for that case.
- Return the computed formatted value directly after the existing too-many/too-few value checks.

Benchmark Results:

JMH (`./mill -j 1 bench.runRegressions ...`, ms/op lower is better; ops/ms higher is better):

| Case | master ms/op | PR ms/op | master ops/ms | PR ops/ms | Delta |
|---|---:|---:|---:|---:|---:|
| `repeat_format` | 0.190 | 0.133 | 5.263 | 7.519 | +42.9% |
| `large_string_template` guard | 1.155 | 1.160 | 0.866 | 0.862 | -0.4% noisy/neutral |

Scala Native hyperfine (`hyperfine --warmup 10 --min-runs 50 -N`, ms lower is better):

| Case | master native | PR native | jrsonnet | Result |
|---|---:|---:|---:|---|
| `repeat_format` | 6.4 ± 0.9 ms | 6.4 ± 0.7 ms | 5.6 ± 1.0 ms | Native neutral; JMH target positive |
| `large_string_template` guard | 12.6 ± 7.4 ms | 11.4 ± 0.8 ms | 5.4 ± 0.9 ms | No Native regression observed |

Analysis:

The target case is dominated by many short format expressions. Returning the already computed formatted string removes a redundant builder allocation/append path on the JVM. The guard case does not use this single-spec/no-static-literal path and remains effectively unchanged within benchmark noise.

References:

- Source idea: #776
- Split branch commit: He-Pin/sjsonnet@1f58504d

Result:

- `./mill -j 1 __.reformat && ./mill -j 1 __.test` passed locally.
- Draft PR split from the broader #776 optimization branch to keep the change reviewable and avoid carrying unrelated Native template work.
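
The shape of the single-spec fast path can be sketched as follows. `Parsed` and `format` here are hypothetical simplifications (the real `Format.format` works on `specBits` and has its own spec handling); Scala's `String.format` is used only as a stand-in formatter.

```scala
object SingleSpecSketch {
  // Hypothetical parsed format: literal segments interleaved with specs,
  // with literals.length == specs.length + 1.
  final case class Parsed(literals: Array[String], specs: Array[String]) {
    val staticChars: Int = literals.map(_.length).sum
  }

  def format(p: Parsed, values: Array[Any]): String = {
    // Arity validation runs before the fast path, so error behavior is unchanged.
    if (values.length != p.specs.length)
      throw new IllegalArgumentException("wrong number of format values")
    // Fast path: one spec and no literal text, so the formatted value is the result.
    if (p.specs.length == 1 && p.staticChars == 0)
      return p.specs(0).format(values(0))
    // Generic path: assemble literals and formatted values in a builder.
    val sb = new java.lang.StringBuilder(p.literals(0))
    var i = 0
    while (i < p.specs.length) {
      sb.append(p.specs(i).format(values(i)))
      sb.append(p.literals(i + 1))
      i += 1
    }
    sb.toString
  }
}
```

Both paths produce the same string for a single-spec input; the fast path only skips the builder allocation and the final copy.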

Motivation:

The fused renderer fallback entered the generic materializer with a fresh context, losing active object cycle tracking once the recursive depth limit was reached.

Modification:

Expose the stackless materializer fallback inside sjsonnet and route char/byte direct renderers through it with the existing MaterializeContext. Add regression coverage for manifestJson and ByteRenderer with a low recursive depth limit.

Result:

Deep direct rendering preserves recursion detection while retaining the stackless fallback path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

He-Pin force-pushed the renderOpt-clean branch from 5512f52 to 3042124 Compare April 13, 2026 07:28

He-Pin marked this pull request as draft April 13, 2026 07:50

He-Pin mentioned this pull request Apr 13, 2026

perf: SIMD-accelerated FastBase64 for Scala Native via C FFI #749

Merged

He-Pin force-pushed the renderOpt-clean branch from 3042124 to 3ac67a1 Compare April 14, 2026 14:16

He-Pin changed the title ~~perf: SWAR string comparison and chunked escape rendering~~ perf: comprehensive Scala Native render pipeline optimization Apr 14, 2026

He-Pin marked this pull request as ready for review April 14, 2026 14:19

He-Pin commented Apr 14, 2026

Comment thread sjsonnet/src-js/sjsonnet/CharSWAR.scala

He-Pin marked this pull request as draft April 14, 2026 17:32

He-Pin force-pushed the renderOpt-clean branch from a4dde27 to e38e8c4 Compare April 18, 2026 09:59

He-Pin force-pushed the renderOpt-clean branch from bf0e393 to 58759aa Compare April 25, 2026 08:41

He-Pin force-pushed the renderOpt-clean branch from 58759aa to 2e42f76 Compare April 26, 2026 10:47

He-Pin closed this Apr 26, 2026

He-Pin reopened this Apr 26, 2026

He-Pin marked this pull request as ready for review April 26, 2026 11:05

He-Pin marked this pull request as draft April 26, 2026 11:10

He-Pin marked this pull request as ready for review April 26, 2026 11:19

He-Pin marked this pull request as draft April 26, 2026 20:22

He-Pin force-pushed the renderOpt-clean branch from 76d7bc4 to ef97244 Compare April 28, 2026 22:10

He-Pin mentioned this pull request Apr 30, 2026

perf: chunk long string byte escaping #809

Merged

He-Pin closed this May 8, 2026

He-Pin reopened this May 8, 2026

He-Pin and others added 13 commits May 10, 2026 12:40

perf: use pre-allocated char buffers for Native compareStrings

8c1219f

style: apply scalafmt to CharSWAR Scala sources

0501961

test: drop stale parseInt overflow expectation

a853f3a

perf: avoid temp char arrays for clean strings

11d1733

He-Pin mentioned this pull request May 10, 2026

perf: skip builders for single-spec formats #833

Merged

He-Pin force-pushed the renderOpt-clean branch from ef97244 to 5d596eb Compare May 10, 2026 10:01

He-Pin mentioned this pull request May 10, 2026

perf: add ASCII-safe substr fast path #834

Open

He-Pin mentioned this pull request May 11, 2026

perf: stack validated kube-prometheus optimizations #836

Closed

He-Pin wants to merge 14 commits into databricks:master from He-Pin:renderOpt-clean
