SIMD acceleration for rapidxml byte-scan hot paths by NelsonVides · Pull Request #81 · esl/exml

NelsonVides · 2026-04-09T19:52:22Z

This is controversial and ugly: it vendors in SIMDe, it's a super verbose change, and C++26 is going to standardise most of this mess into hopefully something cleaner (though C++ has a history of making a worse mess of great ideas, and compilers won't be ready until something like 2028, if not later). Still, results range from no change (and no regressions) up to 5x improvements. The improvement grows with the length of the string that doesn't need escapes (big bodies, CDATA, etc.).

I'm experimenting with SIMD in general, might close this later :)

Add SIMDe-based 128-bit (SSE2/NEON) fast paths for
rapidxml::xml_document::skip() and the longer delimiter scans (comment,
CDATA, PI, declaration end). Each predicate dispatches to a SIMD
specialization via int/long overload tag, falling back to the original
scalar loop when no specialization exists.
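The int/long overload-tag dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `whitespace_pred`, `digit_pred`, `has_fast_path`, `skip_impl` and `path` are hypothetical names, and the "fast" overload here only marks where a 16-byte loop would go.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical predicates, standing in for rapidxml's lookup-table predicates.
struct whitespace_pred {
    static bool test(unsigned char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }
};
struct digit_pred {
    static bool test(unsigned char c) { return c >= '0' && c <= '9'; }
};

// Hypothetical trait: which predicates have a specialized fast path.
template <class Pred> struct has_fast_path : std::false_type {};
template <> struct has_fast_path<whitespace_pred> : std::true_type {};

enum class path { scalar, fast };

// Preferred overload: the literal 0 passed by skip() is an int, so this is an
// exact match. It only exists when has_fast_path<Pred> is true.
template <class Pred,
          typename std::enable_if<has_fast_path<Pred>::value, int>::type = 0>
path skip_impl(const unsigned char*& p, const unsigned char* end, int) {
    // A real fast path would consume 16 bytes per iteration here.
    while (p != end && Pred::test(*p)) ++p;
    return path::fast;
}

// Fallback overload: needs an int -> long conversion, so it ranks lower and is
// picked only when the overload above is removed by SFINAE.
template <class Pred>
path skip_impl(const unsigned char*& p, const unsigned char* end, long) {
    while (p != end && Pred::test(*p)) ++p;
    return path::scalar;
}

// Single entry point: callers never choose the path themselves.
template <class Pred>
path skip(const unsigned char*& p, const unsigned char* end) {
    return skip_impl<Pred>(p, end, 0);
}
```

Calling `skip<whitespace_pred>` resolves to the `int` overload, while `skip<digit_pred>` silently falls back to the `long` one, so adding a new specialization never requires touching call sites.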

- New c_src/simd_skip.hpp with 11 predicate specializations, scalar
  prefix for short runs (< 16 bytes), and a single-movemask OR-reduction
  pattern
- rapidxml.hpp: forward declarations + skip() dispatch hooks; trim
  trailing whitespace and comment/CDATA/PI/declaration scans wired to
  SIMD helpers
- exml.cpp: 15-byte zero tail padding so 16-byte SIMD loads stay in
  bounds; parse_next adjusted to use the logical (pre-padding) buffer
  end
- rebar.config: -I c_src/simde for exml_nif.so (baseline NIF
  exml_nif_base.so stays scalar for A/B benchmarking)

Uses 128-bit lane width via SIMDe so the same code maps 1:1 to both SSE2
(x86_64) and NEON (AArch64/Apple Silicon) without platform-specific
ifdefs or compile flags.
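To illustrate why the 15-byte zero tail padding keeps 16-byte loads in bounds, and what a single-movemask OR-reduction looks like, here is a rough sketch. It is an assumption-laden stand-in rather than the contents of simd_skip.hpp: it uses raw SSE2 intrinsics (with a scalar fallback on other targets) instead of SIMDe so that it is self-contained, it omits the scalar prefix for short runs, and `pad_for_simd`, `skip_whitespace` and `kPad` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

constexpr std::size_t kPad = 15;  // a 16-byte load at the last byte reads 15 past it

// Hypothetical helper: copy the input and append 15 zero bytes.
inline std::vector<unsigned char> pad_for_simd(const std::string& in) {
    std::vector<unsigned char> buf(in.begin(), in.end());
    buf.resize(in.size() + kPad, 0);
    return buf;
}

inline bool is_ws(unsigned char c) {
    return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

// Returns the index of the first non-whitespace byte in [0, len), or len.
// Precondition: p points at a buffer of at least len + kPad readable bytes.
inline std::size_t skip_whitespace(const unsigned char* p, std::size_t len) {
    std::size_t i = 0;
#ifdef SKETCH_HAVE_SSE2
    const __m128i sp  = _mm_set1_epi8(' ');
    const __m128i tab = _mm_set1_epi8('\t');
    const __m128i nl  = _mm_set1_epi8('\n');
    const __m128i cr  = _mm_set1_epi8('\r');
    while (i < len) {
        // Unaligned 16-byte load; stays in bounds thanks to the tail padding.
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        // OR the four byte-wise comparisons together, then take ONE movemask.
        const __m128i ws = _mm_or_si128(
            _mm_or_si128(_mm_cmpeq_epi8(v, sp), _mm_cmpeq_epi8(v, tab)),
            _mm_or_si128(_mm_cmpeq_epi8(v, nl), _mm_cmpeq_epi8(v, cr)));
        // Invert: a set bit now marks a byte that is NOT whitespace.
        unsigned stop = ~static_cast<unsigned>(_mm_movemask_epi8(ws)) & 0xFFFFu;
        if (stop != 0) {
            std::size_t bit = 0;
            while ((stop & 1u) == 0) { stop >>= 1; ++bit; }  // portable count-trailing-zeros
            const std::size_t pos = i + bit;
            return pos < len ? pos : len;  // a hit inside the padding means "end of input"
        }
        i += 16;
    }
    return len;
#else
    while (i < len && is_ws(p[i])) ++i;  // scalar fallback for non-SSE2 targets
    return i;
#endif
}
```

Because the padding bytes are zero and zero is not whitespace, the scan always stops inside the final block; the result is then clamped to the logical length, which mirrors the point about parse_next using the pre-padding buffer end.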

std::vector's default std::allocator uses malloc/free, which bypasses BEAM memory bookkeeping. Introduce a custom enif_allocator<T> that delegates to enif_alloc/enif_free so all NIF heap usage is visible to erlang:memory() and crash dumps.
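A minimal sketch of what such an allocator might look like is shown below. It is not the enif_allocator<T> from this change: to stay self-contained it routes through two hypothetical stand-ins, `stub_alloc` and `stub_free`, where a real NIF would call `enif_alloc` and `enif_free` from erl_nif.h. The `g_live_blocks` counter exists only to make the routing observable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical stand-ins for enif_alloc/enif_free so the sketch builds
// without erl_nif.h. They count live blocks to show who allocated what.
static std::size_t g_live_blocks = 0;
inline void* stub_alloc(std::size_t n) {
    void* p = std::malloc(n);
    if (p) ++g_live_blocks;
    return p;
}
inline void stub_free(void* p) {
    if (p) --g_live_blocks;
    std::free(p);
}

// Minimal C++11 allocator; std::allocator_traits fills in everything else.
template <class T>
struct enif_allocator_sketch {
    using value_type = T;

    enif_allocator_sketch() noexcept = default;
    template <class U>
    enif_allocator_sketch(const enif_allocator_sketch<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        void* p = stub_alloc(n * sizeof(T));  // enif_alloc in a real NIF
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept {
        stub_free(p);  // enif_free in a real NIF
    }
};

// Stateless allocator: any two instances are interchangeable.
template <class T, class U>
bool operator==(const enif_allocator_sketch<T>&, const enif_allocator_sketch<U>&) noexcept {
    return true;
}
template <class T, class U>
bool operator!=(const enif_allocator_sketch<T>&, const enif_allocator_sketch<U>&) noexcept {
    return false;
}
```

A container declared as `std::vector<unsigned char, enif_allocator_sketch<unsigned char>>` then takes all of its storage from the stand-in functions, which is the property that makes the memory visible to the VM's bookkeeping once the real `enif_*` calls are substituted.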

Replace std::back_insert_iterator<std::vector> with a custom PrintBuffer<Ch, Alloc, InitCap> that checks capacity once per operation and writes via raw pointers (memcpy/memset), enabling vectorization of the hot serialization path.

Key changes in rapidxml_print.hpp:
- PrintBuffer with pluggable std::allocator interface, pre-allocated initial capacity (template parameter), and bulk operations: append, append_fill, reserve_raw+advance, append_literal
- Specialized print_* overloads taking PrintBuffer& directly, each doing a single reserve_raw() for the entire output (e.g. CDATA reserves 9+value+3 bytes in one call) then raw pointer writes
- expand_chars with single worst-case reserve (len*6) and branchless entity expansion via raw pointer
- append_literal<N> helper using array reference for compile-time size deduction on the generic OutIt path

In exml.cpp:
- node_to_binary uses thread_local NifPrintBuffer (backed by enif_allocator) that preserves capacity across calls
- Single memcpy into enif_make_new_binary result
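The reserve-once, then-write-raw idea can be sketched for the CDATA case as follows. This is a simplified assumption of the shape, not the PrintBuffer in rapidxml_print.hpp: `print_buffer_sketch` and `print_cdata` are hypothetical, the allocator and initial-capacity template parameters are left out, and only `reserve_raw` and `advance` are borrowed from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical, much-reduced stand-in for PrintBuffer<Ch, Alloc, InitCap>.
class print_buffer_sketch {
    std::vector<char> buf_;   // backing storage (capacity)
    std::size_t len_ = 0;     // bytes actually written
public:
    // One capacity check: make room for n more bytes, return the write cursor.
    char* reserve_raw(std::size_t n) {
        if (buf_.size() - len_ < n) {
            const std::size_t doubled = buf_.size() * 2;
            const std::size_t needed = len_ + n;
            buf_.resize(doubled > needed ? doubled : needed);
        }
        return buf_.data() + len_;
    }
    // Commit n bytes that were written through the pointer from reserve_raw().
    void advance(std::size_t n) { len_ += n; }
    // Forget the content but keep the capacity for the next call.
    void clear() { len_ = 0; }
    std::size_t capacity() const { return buf_.size(); }
    std::string str() const { return std::string(buf_.data(), len_); }
};

// CDATA output: "<![CDATA[" is 9 bytes and "]]>" is 3, so a single reserve of
// 9 + n + 3 covers the whole node; everything after that is raw memcpy.
inline void print_cdata(print_buffer_sketch& out, const char* value, std::size_t n) {
    char* p = out.reserve_raw(9 + n + 3);
    std::memcpy(p, "<![CDATA[", 9);
    p += 9;
    if (n != 0) std::memcpy(p, value, n);
    p += n;
    std::memcpy(p, "]]>", 3);
    out.advance(9 + n + 3);
}
```

Compared with pushing characters through a back_insert_iterator, the capacity check happens once per node instead of once per character, and `clear()` keeping the storage is the same idea as the thread_local buffer preserving capacity across calls.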

git-subtree-dir: c_src/simde git-subtree-split: 71fd833d9666141edcd1d3c109a80e228303d8d7

NelsonVides added 6 commits April 9, 2026 11:14

benchmark

dcfb2b5

Squashed 'c_src/simde/' content from commit 71fd833d

36efeed

Merge commit '36efeed5d183e204d6f65399a4fc6f01a8742cb0' as 'c_src/simde'

dab142b

NelsonVides self-assigned this Apr 9, 2026

NelsonVides added the WIP Don't review WIP-s! label Apr 9, 2026

DenysGonchar force-pushed the perf/to_binary branch from e6ebf38 to e1c1d96 Compare May 14, 2026 13:49

Base automatically changed from perf/to_binary to master May 14, 2026 14:36

NelsonVides wants to merge 6 commits into master from perf/simd_parse
