SIMD acceleration for rapidxml byte-scan hot paths #81

Draft: NelsonVides wants to merge 6 commits into master from perf/simd_parse

@NelsonVides
Collaborator

This is controversial and ugly: it vendors in SIMDe, it's a super verbose change, and C++26 is going to standardise most of this mess into hopefully something cleaner (though C++ has a history of making a worse mess of great ideas, and compilers won't be ready until like... 2028, if not later). Still, the result ranges from no change (and no regressions) up to a 5x speedup. The gain grows with the length of the run that doesn't need escapes (big bodies, CDATA, etc.).

I'm experimenting with SIMD in general, might close this later :)


Add SIMDe-based 128-bit (SSE2/NEON) fast paths for
rapidxml::xml_document::skip() and the longer delimiter scans (comment,
CDATA, PI, declaration end). Each predicate dispatches to a SIMD
specialization via int/long overload tag, falling back to the original
scalar loop when no specialization exists.
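
For readers unfamiliar with the int/long overload tag trick, here is a minimal sketch of how such a dispatch can work. All names (`fast_skip`, `skip_impl`, the two predicates, the call counter) are hypothetical and not taken from the PR, and the "fast path" is a plain loop standing in for the vectorized one.

```cpp
#include <cstddef>

// Hypothetical predicates in the style of rapidxml's skip() predicates.
struct whitespace_pred {
    static bool test(char c) { return c == ' ' || c == '\t' || c == '\n' || c == '\r'; }
};
struct digit_pred {
    static bool test(char c) { return c >= '0' && c <= '9'; }
};

// Instrumentation for this sketch only: counts how often the fast path ran.
static int g_fast_path_calls = 0;

// Primary template: no fast path. A predicate opts in by specializing
// fast_skip and providing a static run().
template <class Pred> struct fast_skip {};

template <> struct fast_skip<whitespace_pred> {
    // Stand-in for the SIMD loop; a scalar loop keeps the sketch portable.
    static const char* run(const char* p) {
        ++g_fast_path_calls;
        while (whitespace_pred::test(*p)) ++p;
        return p;
    }
};

// Preferred overload: the literal 0 is an int, so this is an exact match.
// If fast_skip<Pred>::run does not exist, SFINAE removes this overload.
template <class Pred>
auto skip_impl(const char* p, int) -> decltype(fast_skip<Pred>::run(p)) {
    return fast_skip<Pred>::run(p);
}

// Fallback overload: needs an int -> long conversion, so it only wins
// when the overload above has been removed.
template <class Pred>
const char* skip_impl(const char* p, long) {
    while (Pred::test(*p)) ++p;  // original scalar loop
    return p;
}

// Entry point: expects a zero-terminated buffer.
template <class Pred>
const char* skip(const char* p) { return skip_impl<Pred>(p, 0); }
```

With this shape, `skip<whitespace_pred>(p)` lands in the specialization while `skip<digit_pred>(p)` silently falls back to the scalar loop, so adding a new fast path never requires touching the call sites.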

  • New c_src/simd_skip.hpp with 11 predicate specializations, scalar
    prefix for short runs (< 16 bytes), and a single-movemask OR-reduction
    pattern
  • rapidxml.hpp: forward declarations + skip() dispatch hooks; trim
    trailing whitespace and comment/CDATA/PI/declaration scans wired to
    SIMD helpers
  • exml.cpp: 15-byte zero tail padding so 16-byte SIMD loads stay in
    bounds; parse_next adjusted to use the logical (pre-padding) buffer
    end
  • rebar.config: -I c_src/simde for exml_nif.so (baseline NIF
    exml_nif_base.so stays scalar for A/B benchmarking)
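
To make the scalar prefix, the OR-reduction with a single movemask, and the tail padding concrete, here is a rough sketch of a whitespace skip. It is written against raw SSE2 intrinsics rather than SIMDe so that it builds without the vendored headers, and it falls back to a scalar loop elsewhere. The helper names (`make_padded`, `skip_ws`) are made up for illustration, and the padding layout (one terminator plus 15 zero bytes) is my assumption about how the bound works out, not a copy of the PR.

```cpp
#include <cstddef>
#include <string>
#include <vector>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

constexpr std::size_t kBlock = 16;  // bytes per 128-bit load

// Copy `s` and append 16 zero bytes: a terminator plus 15 bytes of padding,
// so a 16-byte load starting at any byte up to the terminator stays in bounds.
std::vector<char> make_padded(const std::string& s) {
    std::vector<char> buf(s.begin(), s.end());
    buf.resize(s.size() + kBlock, '\0');
    return buf;
}

inline bool is_ws(char c) { return c == ' ' || c == '\t' || c == '\n' || c == '\r'; }

// Returns the first non-whitespace byte at or after p.
// Precondition: the buffer is padded as in make_padded().
const char* skip_ws(const char* p) {
    // Scalar prefix: most whitespace runs are short, so avoid SIMD setup.
    for (std::size_t i = 0; i < kBlock; ++i, ++p)
        if (!is_ws(*p)) return p;
#if SKETCH_HAVE_SSE2
    const __m128i sp = _mm_set1_epi8(' ');
    const __m128i tab = _mm_set1_epi8('\t');
    const __m128i nl = _mm_set1_epi8('\n');
    const __m128i cr = _mm_set1_epi8('\r');
    for (;;) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
        // OR the four per-character compares together, then one movemask.
        const __m128i m = _mm_or_si128(
            _mm_or_si128(_mm_cmpeq_epi8(v, sp), _mm_cmpeq_epi8(v, tab)),
            _mm_or_si128(_mm_cmpeq_epi8(v, nl), _mm_cmpeq_epi8(v, cr)));
        const unsigned ws = static_cast<unsigned>(_mm_movemask_epi8(m));
        if (ws != 0xFFFFu) {
            unsigned stop = ~ws & 0xFFFFu;  // bit i set: byte i is not whitespace
            while ((stop & 1u) == 0) { stop >>= 1; ++p; }
            return p;
        }
        p += kBlock;  // all 16 bytes were whitespace, keep going
    }
#else
    while (is_ws(*p)) ++p;  // scalar fallback on targets without SSE2
    return p;
#endif
}
```

The zero terminator is not whitespace, so the loop always stops at or before the logical end; the padding only exists so the last load does not read past the allocation.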

Uses 128-bit vectors via SIMDe so the same code maps 1:1 to both SSE2
(x86_64) and NEON (AArch64/Apple Silicon) without platform-specific
ifdefs or compile flags.

std::vector's default std::allocator uses malloc/free, which bypasses
BEAM memory bookkeeping. Introduce a custom enif_allocator<T> that
delegates to enif_alloc/enif_free so all NIF heap usage is visible to
erlang:memory() and crash dumps.
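
A minimal sketch of what such an allocator can look like. To keep it buildable without `erl_nif.h`, the two `sketch_*` functions are stand-ins I made up; in a real NIF they would be replaced by `enif_alloc` and `enif_free`. The live-allocation counter is also just instrumentation for the sketch.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Stand-ins so the sketch compiles without erl_nif.h.
// In a NIF these would be enif_alloc(size) and enif_free(ptr).
static std::size_t g_live_allocs = 0;
inline void* sketch_alloc(std::size_t n) { ++g_live_allocs; return std::malloc(n); }
inline void sketch_free(void* p) { if (p) --g_live_allocs; std::free(p); }

// Minimal C++11 allocator: value_type, allocate, deallocate, a rebinding
// constructor and equality are all std::allocator_traits needs.
template <class T>
struct enif_allocator {
    using value_type = T;

    enif_allocator() noexcept = default;
    template <class U>
    enif_allocator(const enif_allocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        void* p = sketch_alloc(n * sizeof(T));
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { sketch_free(p); }
};

// Stateless allocator: any two instances can free each other's memory.
template <class T, class U>
bool operator==(const enif_allocator<T>&, const enif_allocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const enif_allocator<T>&, const enif_allocator<U>&) noexcept { return false; }

// Example alias: a byte vector whose storage goes through the allocator.
using nif_bytes = std::vector<unsigned char, enif_allocator<unsigned char>>;
```

Any standard container accepts the allocator as its last template argument, so switching a `std::vector<unsigned char>` over is a one-line type change.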

Replace std::back_insert_iterator<std::vector> with a custom
PrintBuffer<Ch, Alloc, InitCap> that checks capacity once per operation
and writes via raw pointers (memcpy/memset), enabling vectorization of
the hot serialization path.

Key changes in rapidxml_print.hpp:
- PrintBuffer with pluggable std::allocator interface, pre-allocated
  initial capacity (template parameter), and bulk operations: append,
  append_fill, reserve_raw+advance, append_literal
- Specialized print_* overloads taking PrintBuffer& directly, each doing
  a single reserve_raw() for the entire output (e.g. CDATA reserves
  9+value+3 bytes in one call) then raw pointer writes
- expand_chars with single worst-case reserve (len*6) and branchless
  entity expansion via raw pointer
- append_literal<N> helper using array reference for compile-time size
  deduction on the generic OutIt path
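
As an illustration of the worst-case reserve plus branchless expansion, here is one possible table-driven sketch. It is not the code from the PR: it writes into a `std::string` rather than a PrintBuffer, and the lookup table and the fixed 6-byte copy are my own choice of technique. The factor of 6 comes from the longest entities, `&quot;` and `&apos;`.

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <string>

// Replacement text padded to 6 bytes, plus its real length.
struct Entity {
    char text[6];
    unsigned char len;
};

// Table indexed by byte value: most bytes map to themselves (len 1),
// the five XML special characters map to their entities.
inline std::array<Entity, 256> make_entity_table() {
    std::array<Entity, 256> t{};
    for (std::size_t c = 0; c < 256; ++c) {
        t[c].text[0] = static_cast<char>(c);
        t[c].len = 1;
    }
    auto set = [&t](unsigned char c, const char* s) {
        const std::size_t n = std::strlen(s);
        std::memcpy(t[c].text, s, n);
        t[c].len = static_cast<unsigned char>(n);
    };
    set('<', "&lt;");
    set('>', "&gt;");
    set('&', "&amp;");
    set('"', "&quot;");
    set('\'', "&apos;");
    return t;
}

std::string expand_chars(const std::string& in) {
    static const std::array<Entity, 256> table = make_entity_table();
    std::string out(in.size() * 6, '\0');  // single worst-case reservation
    char* const begin = &out[0];
    char* dst = begin;
    for (unsigned char c : in) {
        const Entity& e = table[c];
        // Always copy 6 bytes and advance by the real length: no branch per
        // character. After i characters dst is at most 6*i past begin, so
        // the copy never exceeds the 6*len reservation.
        std::memcpy(dst, e.text, 6);
        dst += e.len;
    }
    out.resize(static_cast<std::size_t>(dst - begin));
    return out;
}
```

The over-copy is harmless because the next iteration overwrites the unused tail, and the final `resize` trims whatever is left.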

In exml.cpp:
- node_to_binary uses thread_local NifPrintBuffer (backed by
  enif_allocator) that preserves capacity across calls
- Single memcpy into enif_make_new_binary result
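
The capacity-preserving buffer can be pictured with the sketch below. Everything in it is a stand-in: `scratch()` and `render()` are hypothetical names, a `std::vector<char>` plays the role of NifPrintBuffer, and the exactly-sized result vector plays the role of the binary that `enif_make_new_binary` would allocate.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// One scratch buffer per thread, created on first use.
inline std::vector<char>& scratch() {
    thread_local std::vector<char> buf;
    return buf;
}

// Stand-in for node_to_binary: serialize into the scratch buffer, then copy
// the finished bytes once into an exactly-sized result.
std::vector<unsigned char> render(const std::string& name) {
    std::vector<char>& buf = scratch();
    buf.clear();  // size drops to 0, capacity from earlier calls is kept
    buf.push_back('<');
    buf.insert(buf.end(), name.begin(), name.end());
    buf.push_back('/');
    buf.push_back('>');

    std::vector<unsigned char> result(buf.size());
    std::memcpy(result.data(), buf.data(), buf.size());  // the single copy
    return result;
}
```

After the first large document, later calls on the same thread no longer need to grow the buffer, so steady-state serialization costs one copy and no reallocation.
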

git-subtree-dir: c_src/simde
git-subtree-split: 71fd833d9666141edcd1d3c109a80e228303d8d7

NelsonVides self-assigned this Apr 9, 2026
NelsonVides added the WIP Don't review WIP-s! label Apr 9, 2026
Base automatically changed from perf/to_binary to master May 14, 2026 14:36