perf: reduce hot-path cache footprint and eliminate O(n) list traversal #3
Open
pevalme wants to merge 1 commit into
Performance evaluation identified four opportunities; all are implemented:
1. Compact recently_added / edges arrays (biggest win)
recently_added[MAX_REGEX_SIZE][MAX_REGEX_SIZE] was a static 4 MB int array
and edges[MAX_REGEX_SIZE][MAX_REGEX_SIZE] a static 2 MB short array.
Both are now dynamically allocated as num_states×num_states after a fast
state-count pass on the compiled NFA. For a typical 20-state regex this
reduces the combined footprint from 6 MB to ~2.4 KB (20×20×4 B + 20×20×2 B
= 2,400 B), so the entire working set of the saturation-construction inner
loop fits in L1 cache instead of causing repeated L2/L3 misses. Indexing
changes from [i][f] to the equivalent flat index [i*num_states+f].
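
A minimal sketch of the flat-index layout described in item 1, not the code from this commit. The `tables_t` struct and the `tables_*` helpers are hypothetical names introduced for illustration; only `recently_added`, `edges` and `num_states` come from the description above. It assumes `num_states` is greater than zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical holder for the two per-regex tables. */
typedef struct {
    size_t num_states;
    int   *recently_added;  /* num_states * num_states entries */
    short *edges;           /* num_states * num_states entries */
} tables_t;

/* Allocate both tables zero-initialised. Returns 0 on success, -1 on failure. */
int tables_init(tables_t *t, size_t num_states)
{
    assert(num_states > 0);
    t->num_states = num_states;
    t->recently_added = calloc(num_states * num_states, sizeof *t->recently_added);
    t->edges = calloc(num_states * num_states, sizeof *t->edges);
    if (t->recently_added == NULL || t->edges == NULL) {
        free(t->recently_added);
        free(t->edges);
        t->recently_added = NULL;
        t->edges = NULL;
        return -1;
    }
    return 0;
}

/* Row-major replacement for the old [i][f] subscript. */
size_t tables_index(const tables_t *t, size_t i, size_t f)
{
    assert(i < t->num_states && f < t->num_states);
    return i * t->num_states + f;
}

/* Combined size of both tables in bytes. */
size_t tables_bytes(const tables_t *t)
{
    return t->num_states * t->num_states * (sizeof(int) + sizeof(short));
}

void tables_free(tables_t *t)
{
    free(t->recently_added);
    free(t->edges);
    t->recently_added = NULL;
    t->edges = NULL;
}
```

With 4-byte `int` and 2-byte `short`, `tables_bytes()` for 20 states gives the 2,400 B figure quoted above.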
2. O(1) tail pointer for the overflow linked list (types.h / nfa.c)
TRANSITION_FULL gains last_block / last_index fields that always point to
the final PAIR node in the overflow list. add_edge() and add_edge_direct()
now jump directly to the tail instead of traversing the entire list, turning
repeated edge insertions for rules with many transitions from O(n²) to O(n).
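
The tail-pointer idea in item 2 can be illustrated with a plain singly linked list. This is a simplified sketch: the real code tracks the tail as a `last_block` / `last_index` pair inside the custom PAIR allocator, whereas the hypothetical `overflow_list` below uses an ordinary `malloc`'d node and a raw `tail` pointer.

```c
#include <stdlib.h>

/* Hypothetical node type standing in for a PAIR in the overflow list. */
typedef struct pair_node {
    int state;
    struct pair_node *next;
} pair_node;

/* List header that remembers its last node, so appends never walk the list. */
typedef struct {
    pair_node *head;
    pair_node *tail;
    size_t count;
} overflow_list;

/* Append in O(1). Returns 0 on success, -1 if allocation fails. */
int overflow_append(overflow_list *l, int state)
{
    pair_node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->state = state;
    n->next = NULL;
    if (l->tail != NULL)
        l->tail->next = n;   /* link after the current last node */
    else
        l->head = n;         /* first node in an empty list */
    l->tail = n;
    l->count++;
    return 0;
}

/* Release every node and reset the header. */
void overflow_free(overflow_list *l)
{
    pair_node *n = l->head;
    while (n != NULL) {
        pair_node *next = n->next;
        free(n);
        n = next;
    }
    l->head = NULL;
    l->tail = NULL;
    l->count = 0;
}
```

Appending n nodes this way costs O(n) in total, compared with O(n²) when each append first walks from the head to find the end.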
3. BITIN_BUF_LEN 40 MB → 4 MB (bitin.h)
The read-ahead buffer for the bit-stream parser was pre-allocated at 40 MB
(10 485 760 × 4 B). Reducing to 4 MB (1 048 576 × 4 B) eliminates wasted
virtual memory; fread() refills the buffer transparently, so throughput is
unaffected for compressed files of any size.
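
To illustrate why the buffer size does not limit the input size, here is a sketch of a refill-on-empty reader. It assumes the parser consumes 32-bit words sequentially; `word_reader`, `reader_init`, `reader_next` and the deliberately tiny `BUF_WORDS` are hypothetical and do not reflect the actual layout of bitin.h.

```c
#include <stdint.h>
#include <stdio.h>

#define BUF_WORDS 4  /* tiny on purpose, to force several refills */

/* Hypothetical sequential reader over a FILE*. */
typedef struct {
    FILE *fp;
    uint32_t buf[BUF_WORDS];
    size_t len;  /* number of valid words currently in buf */
    size_t pos;  /* index of the next word to hand out */
} word_reader;

void reader_init(word_reader *r, FILE *fp)
{
    r->fp = fp;
    r->len = 0;
    r->pos = 0;
}

/* Store the next word in *out and return 1, or return 0 at end of input. */
int reader_next(word_reader *r, uint32_t *out)
{
    if (r->pos == r->len) {
        /* Buffer exhausted: refill it from the file. */
        r->len = fread(r->buf, sizeof r->buf[0], BUF_WORDS, r->fp);
        r->pos = 0;
        if (r->len == 0)
            return 0;
    }
    *out = r->buf[r->pos++];
    return 1;
}
```

The caller sees one continuous stream regardless of `BUF_WORDS`; a smaller buffer only changes how often `fread()` is called.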
4. BLOCKS_LENGTH 1 → 64 (memory.h)
The custom PAIR allocator's block-pointer array started with a single slot,
causing a realloc() every time a new 32 KB block was needed. Starting at
64 avoids all early reallocations (covers up to ~2 M pairs) at a one-time
cost of 512 B (64 pointers × 8 B on a 64-bit build).
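
The effect of the initial capacity can be sketched with a small growable pointer table. The `block_table` type, its functions and the doubling growth policy are assumptions made for illustration; the allocator in memory.h may grow differently. The `reallocs` counter exists only to make the difference observable.

```c
#include <stdlib.h>

/* Hypothetical block-pointer table with a configurable starting capacity. */
typedef struct {
    void **blocks;
    size_t used;
    size_t capacity;
    size_t reallocs;  /* number of times the table had to grow */
} block_table;

/* initial must be greater than zero. Returns 0 on success, -1 on failure. */
int block_table_init(block_table *t, size_t initial)
{
    t->blocks = malloc(initial * sizeof *t->blocks);
    if (t->blocks == NULL)
        return -1;
    t->used = 0;
    t->capacity = initial;
    t->reallocs = 0;
    return 0;
}

/* Record a new block pointer, doubling the table when it is full. */
int block_table_push(block_table *t, void *block)
{
    if (t->used == t->capacity) {
        size_t new_cap = t->capacity * 2;
        void **p = realloc(t->blocks, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        t->blocks = p;
        t->capacity = new_cap;
        t->reallocs++;
    }
    t->blocks[t->used++] = block;
    return 0;
}

void block_table_free(block_table *t)
{
    free(t->blocks);
    t->blocks = NULL;
    t->used = 0;
    t->capacity = 0;
}
```

Under this doubling assumption, registering 64 blocks takes six reallocations when starting from one slot and none when starting from 64.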
Other identified opportunities not yet implemented (documented for follow-up):
* Dynamic allocation of dots / final_states / reached_states / list_dots
based on num_states (1D arrays; trivial but smaller impact than items 1-2).
* Dirty-list tracking to replace the O(num_states) memset of num_edges at
the start of each add_rule() call.
* Re-enable / complete the SIMD join path in simd.c (currently commented
out); the simd_find_in_vector infrastructure is already in place.
* Convert the recursive expand_match() to an iterative form to avoid
deep-stack cost on highly compressed inputs.
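
As a possible starting point for the dirty-list follow-up, the sketch below resets only the counters that were actually touched. The `edge_counts` type and its functions are hypothetical; only `num_edges` and `num_states` come from the description. It assumes counters are only ever incremented between resets, so each state enters the dirty list at most once.

```c
#include <stdlib.h>

/* Hypothetical per-state counters with a list of touched indices. */
typedef struct {
    int    *num_edges;   /* one counter per state, all zero after a reset */
    size_t *dirty;       /* states whose counter became non-zero */
    size_t  num_dirty;
    size_t  num_states;
} edge_counts;

/* num_states must be greater than zero. Returns 0 on success, -1 on failure. */
int edge_counts_init(edge_counts *c, size_t num_states)
{
    c->num_edges = calloc(num_states, sizeof *c->num_edges);
    c->dirty = malloc(num_states * sizeof *c->dirty);
    if (c->num_edges == NULL || c->dirty == NULL) {
        free(c->num_edges);
        free(c->dirty);
        return -1;
    }
    c->num_dirty = 0;
    c->num_states = num_states;
    return 0;
}

/* Increment a counter, remembering the state the first time it is touched. */
void edge_counts_bump(edge_counts *c, size_t state)
{
    if (c->num_edges[state]++ == 0)
        c->dirty[c->num_dirty++] = state;
}

/* Clear only the touched counters: O(num_dirty) instead of O(num_states). */
void edge_counts_reset(edge_counts *c)
{
    for (size_t k = 0; k < c->num_dirty; k++)
        c->num_edges[c->dirty[k]] = 0;
    c->num_dirty = 0;
}

void edge_counts_free(edge_counts *c)
{
    free(c->num_edges);
    free(c->dirty);
}
```

Whether this beats a plain memset depends on how many states a typical add_rule() call touches; for small num_states the memset may well be faster.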
https://claude.ai/code/session_01V88dSSBHw2a7MVygEwuDu1