SIMD acceleration for rapidxml byte-scan hot paths #81
Draft
NelsonVides wants to merge 6 commits into
std::vector's default std::allocator uses malloc/free, which bypasses BEAM memory bookkeeping. Introduce a custom enif_allocator<T> that delegates to enif_alloc/enif_free so all NIF heap usage is visible to erlang:memory() and crash dumps.
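A minimal sketch of what such an allocator could look like, not the PR's actual code. The `enif_alloc`/`enif_free` definitions below are stand-ins (backed by malloc and a counter) so the example compiles outside a NIF build; in a real NIF they come from `<erl_nif.h>`. The `g_live_blocks` counter is a hypothetical helper that only makes the routing observable.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Stand-ins for the erl_nif.h functions, ASSUMED here for illustration only.
// A real NIF would include <erl_nif.h> and drop these two definitions.
static std::size_t g_live_blocks = 0;  // hypothetical counter, not part of any API
static void* enif_alloc(std::size_t size) { ++g_live_blocks; return std::malloc(size); }
static void enif_free(void* ptr) { --g_live_blocks; std::free(ptr); }

// Minimal C++11 allocator that forwards every request to enif_alloc/enif_free.
template <typename T>
struct enif_allocator {
    using value_type = T;

    enif_allocator() noexcept = default;
    template <typename U>
    enif_allocator(const enif_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        // Guard against overflow in n * sizeof(T).
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        void* p = enif_alloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { enif_free(p); }
};

// The allocator is stateless, so any two instances are interchangeable.
template <typename T, typename U>
bool operator==(const enif_allocator<T>&, const enif_allocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const enif_allocator<T>&, const enif_allocator<U>&) noexcept { return false; }
```

With this in place, `std::vector<unsigned char, enif_allocator<unsigned char>>` takes its storage from the NIF allocator instead of the default `std::allocator`.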
Replace std::back_insert_iterator<std::vector> with a custom PrintBuffer<Ch, Alloc, InitCap> that checks capacity once per operation and writes via raw pointers (memcpy/memset), enabling vectorization of the hot serialization path.

Key changes in rapidxml_print.hpp:
- PrintBuffer with pluggable std::allocator interface, pre-allocated initial capacity (template parameter), and bulk operations: append, append_fill, reserve_raw+advance, append_literal
- Specialized print_* overloads taking PrintBuffer& directly, each doing a single reserve_raw() for the entire output (e.g. CDATA reserves 9+value+3 bytes in one call) then raw pointer writes
- expand_chars with single worst-case reserve (len*6) and branchless entity expansion via raw pointer
- append_literal<N> helper using array reference for compile-time size deduction on the generic OutIt path

In exml.cpp:
- node_to_binary uses thread_local NifPrintBuffer (backed by enif_allocator) that preserves capacity across calls
- Single memcpy into enif_make_new_binary result
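The reserve-once-then-write-raw idea can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `PrintBufferSketch`, `print_cdata_sketch` and `expand_chars_sketch` are hypothetical names, only the operation names (`reserve_raw`, `advance`, `append`, `append_fill`, `append_literal`) come from the commit message, and the entity expansion here uses a plain `switch` rather than the branchless form the commit describes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical, simplified buffer: one capacity check per operation,
// then writes through a raw pointer.
template <typename Ch = char, typename Alloc = std::allocator<Ch>, std::size_t InitCap = 256>
class PrintBufferSketch {
public:
    PrintBufferSketch() : data_(alloc_.allocate(InitCap)), size_(0), cap_(InitCap) {}
    ~PrintBufferSketch() { alloc_.deallocate(data_, cap_); }
    PrintBufferSketch(const PrintBufferSketch&) = delete;
    PrintBufferSketch& operator=(const PrintBufferSketch&) = delete;

    // Ensure room for n more elements and return the write cursor.
    Ch* reserve_raw(std::size_t n) {
        if (cap_ - size_ < n) grow(size_ + n);
        return data_ + size_;
    }
    // Commit n elements previously written through reserve_raw().
    void advance(std::size_t n) { size_ += n; }

    void append(const Ch* src, std::size_t n) {
        std::memcpy(reserve_raw(n), src, n * sizeof(Ch));
        size_ += n;
    }
    void append_fill(Ch ch, std::size_t n) {
        std::fill_n(reserve_raw(n), n, ch);
        size_ += n;
    }
    // The array reference lets the compiler deduce the literal's length.
    template <std::size_t N>
    void append_literal(const Ch (&lit)[N]) { append(lit, N - 1); }

    void clear() { size_ = 0; }  // keeps the capacity for the next call
    const Ch* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    void grow(std::size_t needed) {
        std::size_t new_cap = std::max(cap_ * 2, needed);
        Ch* p = alloc_.allocate(new_cap);
        std::memcpy(p, data_, size_ * sizeof(Ch));
        alloc_.deallocate(data_, cap_);
        data_ = p;
        cap_ = new_cap;
    }
    Alloc alloc_;
    Ch* data_;
    std::size_t size_;
    std::size_t cap_;
};

// CDATA: a single reserve of 9 + len + 3 bytes, then three raw copies.
template <typename Buffer>
void print_cdata_sketch(Buffer& out, const char* value, std::size_t len) {
    char* p = out.reserve_raw(9 + len + 3);
    std::memcpy(p, "<![CDATA[", 9);
    std::memcpy(p + 9, value, len);
    std::memcpy(p + 9 + len, "]]>", 3);
    out.advance(9 + len + 3);
}

// Entity expansion: reserve the worst case (6 bytes per input byte) once,
// write through a raw pointer, then commit only what was actually used.
template <typename Buffer>
void expand_chars_sketch(Buffer& out, const char* src, std::size_t len) {
    char* const start = out.reserve_raw(len * 6);
    char* p = start;
    for (std::size_t i = 0; i < len; ++i) {
        switch (src[i]) {
            case '<':  std::memcpy(p, "&lt;", 4);   p += 4; break;
            case '>':  std::memcpy(p, "&gt;", 4);   p += 4; break;
            case '&':  std::memcpy(p, "&amp;", 5);  p += 5; break;
            case '"':  std::memcpy(p, "&quot;", 6); p += 6; break;
            case '\'': std::memcpy(p, "&apos;", 6); p += 6; break;
            default:   *p++ = src[i];
        }
    }
    out.advance(static_cast<std::size_t>(p - start));
}
```

Because `clear()` resets only the size, a long-lived (for example thread_local) instance would reuse its grown capacity across calls, which is the behaviour the commit message describes for node_to_binary.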
git-subtree-dir: c_src/simde
git-subtree-split: 71fd833d9666141edcd1d3c109a80e228303d8d7
Add SIMDe-based 128-bit (SSE2/NEON) fast paths for rapidxml::xml_document::skip() and the longer delimiter scans (comment, CDATA, PI, declaration end). Each predicate dispatches to a SIMD specialization via int/long overload tag, falling back to the original scalar loop when no specialization exists.

- New c_src/simd_skip.hpp with 11 predicate specializations, scalar prefix for short runs (< 16 bytes), and a single-movemask OR-reduction pattern
- rapidxml.hpp: forward declarations + skip() dispatch hooks; trim trailing whitespace and comment/CDATA/PI/declaration scans wired to SIMD helpers
- exml.cpp: 15-byte zero tail padding so 16-byte SIMD loads stay in bounds; parse_next adjusted to use the logical (pre-padding) buffer end
- rebar.config: -I c_src/simde for exml_nif.so (baseline NIF exml_nif_base.so stays scalar for A/B benchmarking)

Uses 128-bit lane width via SIMDe so the same code maps 1:1 to both SSE2 (x86_64) and NEON (AArch64/Apple Silicon) without platform-specific ifdefs or compile flags.
e6ebf38 to e1c1d96
This is controversial and ugly: it vendors in SIMDe, it's a super verbose change, and C++26 is going to standardise most of this mess into hopefully something cleaner (though C++ has a history of making a worse mess of great ideas, and compilers won't be ready until 2028 or so, if not later). Still, the speedup ranges from none (no changes, no regressions) up to 5x. The improvement grows with the length of the string that doesn't need escapes (big bodies, CDATA, etc.).
I'm experimenting with SIMD in general, might close this later :)