perf: reduce hot-path cache footprint and eliminate O(n) list traversal by pevalme · Pull Request #3 · pevalme/zearch

pevalme · 2026-02-20T17:28:55Z

Performance evaluation identified four opportunities; all are implemented:

1. Compact recently_added / edges arrays (biggest win)
   recently_added[MAX_REGEX_SIZE][MAX_REGEX_SIZE] was a static 4 MB int array
   and edges[MAX_REGEX_SIZE][MAX_REGEX_SIZE] a static 2 MB short array.
   Both are now dynamically allocated as num_states×num_states after a fast
   state-count pass on the compiled NFA. For a typical 20-state regex this
   reduces the combined footprint from 6 MB to ~2.4 KB, so the entire working
   set of the saturation-construction inner loop fits in L1 cache instead of
   causing repeated L2/L3 misses. Access syntax changes from [i][f] to the
   equivalent flat index [i*num_states+f].
2. O(1) tail pointer for the overflow linked list (types.h / nfa.c)
   TRANSITION_FULL gains last_block / last_index fields that always point to
   the final PAIR node in the overflow list. add_edge() and add_edge_direct()
   now jump directly to the tail instead of traversing the entire list, turning
   repeated edge insertions for rules with many transitions from O(n²) to O(n).
3. BITIN_BUF_LEN 40 MB → 4 MB (bitin.h)
   The read-ahead buffer for the bit-stream parser was pre-allocated at 40 MB
   (10 485 760 × 4 B). Reducing to 4 MB (1 048 576 × 4 B) eliminates wasted
   virtual memory; fread() refills the buffer transparently, so throughput is
   unaffected for compressed files of any size.
4. BLOCKS_LENGTH 1 → 64 (memory.h)
   The custom PAIR allocator's block-pointer array started with a single slot,
   causing a realloc() every time a new 32 KB block was needed. Starting at
   64 avoids all early reallocations (covers up to ~2 M pairs) at a one-time
   cost of 512 B.
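
The sizing arithmetic behind the compact-array change can be checked with a minimal sketch. It assumes a 4-byte int and a 2-byte short, which is what the 4 MB and 2 MB figures imply for a 1024×1024 table (the value 1024 for MAX_REGEX_SIZE is inferred from those figures, not confirmed). Only recently_added, edges, num_states and the i*num_states+f index come from the description above; the struct and helper names (nfa_tables, tables_bytes, tables_init, cell, tables_free) are hypothetical and not taken from zearch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container; zearch's real declarations may differ. */
typedef struct {
    int    num_states;
    int   *recently_added;  /* num_states * num_states cells */
    short *edges;           /* num_states * num_states cells */
} nfa_tables;

/* Bytes needed by both tables together for a given state count. */
static size_t tables_bytes(int num_states) {
    size_t cells = (size_t)num_states * (size_t)num_states;
    return cells * (sizeof(int) + sizeof(short));
}

/* Allocate both tables zero-initialised (expects num_states >= 1).
   Returns 0 on success, -1 if either allocation fails. */
static int tables_init(nfa_tables *t, int num_states) {
    size_t cells = (size_t)num_states * (size_t)num_states;
    t->num_states = num_states;
    t->recently_added = calloc(cells, sizeof *t->recently_added);
    t->edges = calloc(cells, sizeof *t->edges);
    if (t->recently_added == NULL || t->edges == NULL) {
        free(t->recently_added);
        free(t->edges);
        t->recently_added = NULL;
        t->edges = NULL;
        return -1;
    }
    return 0;
}

/* Flat equivalent of the old [i][f] access: row i, column f. */
static size_t cell(const nfa_tables *t, int i, int f) {
    assert(i >= 0 && i < t->num_states);
    assert(f >= 0 && f < t->num_states);
    return (size_t)i * (size_t)t->num_states + (size_t)f;
}

/* Release both tables. */
static void tables_free(nfa_tables *t) {
    free(t->recently_added);
    free(t->edges);
    t->recently_added = NULL;
    t->edges = NULL;
}
```

Under those type-size assumptions, tables_bytes(20) evaluates to 20 × 20 × (4 + 2) = 2400 bytes, matching the ~2.4 KB figure, while tables_bytes(1024) gives the original 6 MB. A caller would write t.edges[cell(&t, i, f)] where the old code wrote edges[i][f].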

Other identified opportunities not yet implemented (documented for follow-up):

* Dynamic allocation of dots / final_states / reached_states / list_dots
  based on num_states (1D arrays; trivial but smaller impact than items 1-2).
* Dirty-list tracking to replace the O(num_states) memset of num_edges at
  the start of each add_rule() call.
* Re-enable / complete the SIMD join path in simd.c (currently commented
  out); the simd_find_in_vector infrastructure is already in place.
* Recursive expand_match() converted to iterative to avoid deep-stack cost
  on highly-compressed inputs.
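
The dirty-list idea from the follow-up list could look roughly like the sketch below. It is only an illustration of the general technique, not code from this PR: num_edges, num_states and add_rule() are the only names taken from the description, and the edge_counts type and the counts_* helpers are hypothetical. The point is that the reset touches only the counters that were actually incremented since the last reset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper around a per-state edge counter. */
typedef struct {
    int  num_states;
    int *num_edges;  /* one counter per state, all zero after a reset */
    int *dirty;      /* indices of states whose counter is non-zero */
    int  num_dirty;  /* number of valid entries in dirty[] */
} edge_counts;

/* Allocate the counters (expects num_states >= 1).
   Returns 0 on success, -1 if either allocation fails. */
static int counts_init(edge_counts *c, int num_states) {
    c->num_states = num_states;
    c->num_dirty = 0;
    c->num_edges = calloc((size_t)num_states, sizeof *c->num_edges);
    c->dirty = malloc((size_t)num_states * sizeof *c->dirty);
    if (c->num_edges == NULL || c->dirty == NULL) {
        free(c->num_edges);
        free(c->dirty);
        c->num_edges = NULL;
        c->dirty = NULL;
        return -1;
    }
    return 0;
}

/* Count one more edge for `state`. A state is appended to dirty[] only on
   its 0 -> 1 transition, so dirty[] never holds more than num_states entries. */
static void counts_bump(edge_counts *c, int state) {
    assert(state >= 0 && state < c->num_states);
    if (c->num_edges[state]++ == 0)
        c->dirty[c->num_dirty++] = state;
}

/* Reset only the touched counters: O(touched) instead of O(num_states). */
static void counts_reset(edge_counts *c) {
    for (int k = 0; k < c->num_dirty; k++)
        c->num_edges[c->dirty[k]] = 0;
    c->num_dirty = 0;
}

/* Release both arrays. */
static void counts_free(edge_counts *c) {
    free(c->num_edges);
    free(c->dirty);
    c->num_edges = NULL;
    c->dirty = NULL;
}
```

In this sketch counts_reset() would stand in for the memset at the start of add_rule(). Whether it pays off depends on how many states a typical rule touches: if most counters are incremented on every call, a plain memset over a small num_states array may still be faster.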

https://claude.ai/code/session_01V88dSSBHw2a7MVygEwuDu1


pevalme wants to merge 1 commit into master from claude/zearch-performance-evaluation-BuO0z
