perf: reduce hot-path cache footprint and eliminate O(n) list traversal #3

Open: pevalme wants to merge 1 commit into master from claude/zearch-performance-evaluation-BuO0z

pevalme (Owner) commented Feb 20, 2026

Performance evaluation identified four opportunities; all are implemented:

  1. Compact recently_added / edges arrays (biggest win)
    recently_added[MAX_REGEX_SIZE][MAX_REGEX_SIZE] was a static 4 MB int array
    and edges[MAX_REGEX_SIZE][MAX_REGEX_SIZE] a static 2 MB short array.
    Both are now dynamically allocated as num_states×num_states after a fast
    state-count pass on the compiled NFA. For a typical 20-state regex this
    reduces the combined footprint from 6 MB to ~2.4 KB (20×20×4 B + 20×20×2 B),
    so the entire working set of the saturation-construction inner loop fits in
    L1 cache instead of repeatedly missing to L2/L3. Access syntax changes from
    [i][f] to the equivalent row-major flat index [i*num_states+f].

  2. O(1) tail pointer for the overflow linked list (types.h / nfa.c)
    TRANSITION_FULL gains last_block / last_index fields that always point to
    the final PAIR node in the overflow list. add_edge() and add_edge_direct()
    now jump directly to the tail instead of traversing the entire list, turning
    repeated edge insertions for rules with many transitions from O(n²) to O(n).

  3. BITIN_BUF_LEN 40 MB → 4 MB (bitin.h)
    The read-ahead buffer for the bit-stream parser was pre-allocated at 40 MB
    (10 485 760 × 4 B). Reducing to 4 MB (1 048 576 × 4 B) frees 36 MB of
    reserved virtual memory; fread() refills the buffer transparently, so
    throughput is expected to be unaffected for compressed files of any size.

  4. BLOCKS_LENGTH 1 → 64 (memory.h)
    The custom PAIR allocator's block-pointer array started with a single slot,
    causing a realloc() every time a new 32 KB block was needed. Starting at
    64 avoids all early reallocations (covers up to ~2 M pairs) at a one-time
    cost of 512 B.
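
To make the index change in item 1 concrete, here is a minimal sketch of the flat layout. The struct and function names (`tables_t`, `tables_init`, `flat`, `tables_bytes`, `tables_free`) are hypothetical and not taken from the patch, and the sketch assumes `num_states > 0`; only the `[i*num_states+f]` indexing and the element types come from the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container; the real code may keep these as globals. */
typedef struct {
    size_t num_states;
    int   *recently_added;  /* num_states * num_states entries */
    short *edges;           /* num_states * num_states entries */
} tables_t;

/* Allocate both tables zero-initialized. Returns 0 on success, -1 on failure. */
static int tables_init(tables_t *t, size_t num_states)
{
    size_t n = num_states * num_states;

    t->num_states = num_states;
    t->recently_added = calloc(n, sizeof *t->recently_added);
    t->edges = calloc(n, sizeof *t->edges);
    if (!t->recently_added || !t->edges) {
        free(t->recently_added);
        free(t->edges);
        t->recently_added = NULL;
        t->edges = NULL;
        return -1;
    }
    return 0;
}

/* Row-major flat index that replaces table[i][f]. */
static size_t flat(const tables_t *t, size_t i, size_t f)
{
    assert(i < t->num_states && f < t->num_states);
    return i * t->num_states + f;
}

/* Combined size of both tables in bytes. */
static size_t tables_bytes(const tables_t *t)
{
    return t->num_states * t->num_states * (sizeof(int) + sizeof(short));
}

static void tables_free(tables_t *t)
{
    free(t->recently_added);
    free(t->edges);
    t->recently_added = NULL;
    t->edges = NULL;
}
```

With 4-byte `int` and 2-byte `short`, `tables_bytes()` for 20 states gives 2400 bytes, matching the ~2.4 KB figure in item 1.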
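
The tail-pointer idea in item 2 can be sketched as follows. This is a simplified illustration using plain `malloc` nodes and a `tail` pointer; the patch itself stores `last_block` / `last_index` into the custom PAIR allocator, and the names `node_t`, `edge_list_t`, `edge_list_append` and `edge_list_free` are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node type standing in for an overflow PAIR. */
typedef struct node {
    int from, to;
    struct node *next;
} node_t;

typedef struct {
    node_t *head;
    node_t *tail;  /* always the last node, or NULL when the list is empty */
    size_t  len;
} edge_list_t;

/* Append in O(1) by linking after the cached tail instead of walking the
 * list from head. Returns 0 on success, -1 if allocation fails. */
static int edge_list_append(edge_list_t *l, int from, int to)
{
    node_t *n = malloc(sizeof *n);
    if (!n)
        return -1;
    n->from = from;
    n->to = to;
    n->next = NULL;

    if (l->tail)
        l->tail->next = n;  /* non-empty list: link after current tail */
    else
        l->head = n;        /* empty list: new node is also the head */
    l->tail = n;
    l->len++;
    return 0;
}

static void edge_list_free(edge_list_t *l)
{
    node_t *n = l->head;
    while (n) {
        node_t *next = n->next;
        free(n);
        n = next;
    }
    l->head = NULL;
    l->tail = NULL;
    l->len = 0;
}
```

Because each append touches only the tail, inserting n edges costs O(n) in total, whereas walking from the head on every insert costs O(n²).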

Other identified opportunities not yet implemented (documented for follow-up):

  • Dynamic allocation of dots / final_states / reached_states / list_dots
    based on num_states (1D arrays; trivial but smaller impact than items 1-2).
  • Dirty-list tracking to replace the O(num_states) memset of num_edges at
    the start of each add_rule() call.
  • Re-enable / complete the SIMD join path in simd.c (currently commented
    out); the simd_find_in_vector infrastructure is already in place.
  • Convert the recursive expand_match() to an iterative form to avoid
    deep-stack cost on highly compressed inputs.
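
For the dirty-list follow-up, one possible shape is sketched below. Nothing here is implemented in this PR: the type `edge_counts_t`, its helper functions, and the use of `int` counters are all assumptions, and the real `num_edges` array may have a different element type.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-rule counters with a list of touched states. */
typedef struct {
    size_t  num_states;
    int    *num_edges;  /* per-state edge counter */
    size_t *dirty;      /* states whose counter is currently nonzero */
    size_t  num_dirty;
} edge_counts_t;

/* Returns 0 on success, -1 on allocation failure. Assumes num_states > 0. */
static int edge_counts_init(edge_counts_t *c, size_t num_states)
{
    c->num_states = num_states;
    c->num_dirty = 0;
    c->num_edges = calloc(num_states, sizeof *c->num_edges);
    c->dirty = calloc(num_states, sizeof *c->dirty);
    if (!c->num_edges || !c->dirty) {
        free(c->num_edges);
        free(c->dirty);
        c->num_edges = NULL;
        c->dirty = NULL;
        return -1;
    }
    return 0;
}

/* Increment a counter; record the state the first time it becomes nonzero.
 * Each state is recorded at most once between resets, so the dirty list
 * never holds more than num_states entries. */
static void edge_counts_bump(edge_counts_t *c, size_t state)
{
    assert(state < c->num_states);
    if (c->num_edges[state]++ == 0)
        c->dirty[c->num_dirty++] = state;
}

/* Clear only the touched counters: O(num_dirty) instead of O(num_states). */
static void edge_counts_reset(edge_counts_t *c)
{
    for (size_t k = 0; k < c->num_dirty; k++)
        c->num_edges[c->dirty[k]] = 0;
    c->num_dirty = 0;
}

static void edge_counts_free(edge_counts_t *c)
{
    free(c->num_edges);
    free(c->dirty);
    c->num_edges = NULL;
    c->dirty = NULL;
}
```

Whether this beats a plain memset depends on how many states a typical add_rule() call touches; for small num_states the memset may already be cheap enough.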

https://claude.ai/code/session_01V88dSSBHw2a7MVygEwuDu1
