Add Rust reimplementation (zearch-rs) and benchmark script #4
Open
pevalme wants to merge 8 commits into
Complete port of the zearch C tool to Rust, implementing the same saturation construction algorithm for regex matching on Re-Pair grammar-compressed text. Uses libfa via FFI for NFA construction. Also adds benchmark/bench_rs_vs_c.sh to compare C vs Rust performance across 4 corpus sizes and multiple regex patterns. https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
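To illustrate the general idea of saturating an NFA over a Re-Pair grammar, here is a minimal sketch. It is not the code in this PR. It assumes an NFA with at most 64 states, so each row of reachable states fits in one `u64`, and it tracks only state-to-state reachability, ignoring line counting and match flags. The names `compose` and `saturate` and the `(left, right)` rule encoding are hypothetical.

```rust
/// A relation is one bitmask per source state: bit t of rel[q] is set
/// when the symbol can take the NFA from state q to state t.
/// Assumption: at most 64 NFA states.
pub fn compose(left: &[u64], right: &[u64]) -> Vec<u64> {
    left.iter()
        .map(|&targets| {
            let mut out = 0u64;
            let mut rest = targets;
            // Visit every intermediate state reachable through `left`
            // and union in what `right` reaches from there.
            while rest != 0 {
                let mid = rest.trailing_zeros() as usize;
                out |= right[mid];
                rest &= rest - 1; // clear the lowest set bit
            }
            out
        })
        .collect()
}

/// `terminals[i]` is the relation of terminal symbol i (taken from the NFA).
/// Each rule (a, b) defines the next symbol as the pair "a b" and may only
/// refer to symbols defined before it, as in a Re-Pair grammar.
pub fn saturate(terminals: Vec<Vec<u64>>, rules: &[(usize, usize)]) -> Vec<Vec<u64>> {
    let mut rel = terminals;
    for &(a, b) in rules {
        let combined = compose(&rel[a], &rel[b]);
        rel.push(combined);
    }
    rel
}
```

Each rule is processed once, so the cost depends on the grammar size rather than on the length of the decompressed text.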
Includes the Re-Pair compressor source code needed to generate .rp compressed test files for the zearch benchmark. Build artifacts (.o, repair, despair binaries) are gitignored. https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
Three structural changes that eliminate the performance gap:
1. Pack TransitionFull to 16 bytes (matching C's bitfield layout): count(25)|new_lines|is_there|match_|right|left|pairs_used(2). Was 24 bytes, now 16 bytes, a 33% smaller automaton array.
2. Flatten edges array: Vec<Vec<u16>> → Vec<u16> with manual indexing (edges[q * MAX_REGEX_SIZE + k]), eliminating 1024 separate heap allocations and pointer indirections.
3. Build config: LTO thin + target-cpu=native for full LLVM optimization including SIMD auto-vectorization.
Also: unsafe unchecked indexing in the hottest inner loop (compose_left_edge) where bounds are statically guaranteed.
Result: Rust now matches C at all sizes ≥ 100K lines, and ties or beats C at 500K-1M lines on some patterns.
https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
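The first two changes can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the bit positions inside the flag word are guessed from the field order listed above, `PackedFlags` and `FlatEdges` are hypothetical names, and `MAX_REGEX_SIZE` is assumed to be 1024.

```rust
/// Hypothetical 32-bit flag word holding
/// count(25) | new_lines | is_there | match_ | right | left | pairs_used(2).
/// Bit positions are an assumption; only three fields get accessors here.
#[derive(Clone, Copy, Default, Debug, PartialEq, Eq)]
pub struct PackedFlags(u32);

impl PackedFlags {
    const COUNT_MASK: u32 = (1 << 25) - 1; // low 25 bits
    const MATCH_BIT: u32 = 1 << 27;
    const PAIRS_SHIFT: u32 = 30; // top 2 bits

    pub fn count(self) -> u32 {
        self.0 & Self::COUNT_MASK
    }
    pub fn set_count(&mut self, c: u32) {
        self.0 = (self.0 & !Self::COUNT_MASK) | (c & Self::COUNT_MASK);
    }
    pub fn is_match(self) -> bool {
        self.0 & Self::MATCH_BIT != 0
    }
    pub fn set_match(&mut self, on: bool) {
        if on {
            self.0 |= Self::MATCH_BIT;
        } else {
            self.0 &= !Self::MATCH_BIT;
        }
    }
    pub fn pairs_used(self) -> u32 {
        self.0 >> Self::PAIRS_SHIFT
    }
    pub fn set_pairs_used(&mut self, p: u32) {
        self.0 = (self.0 & !(0b11 << Self::PAIRS_SHIFT)) | ((p & 0b11) << Self::PAIRS_SHIFT);
    }
}

/// Assumed value; the message above only implies 1024 rows.
pub const MAX_REGEX_SIZE: usize = 1024;

/// One contiguous buffer instead of Vec<Vec<u16>>: a single allocation
/// and no per-row pointer to chase.
pub struct FlatEdges {
    data: Vec<u16>,
}

impl FlatEdges {
    pub fn new(num_states: usize) -> Self {
        Self { data: vec![0; num_states * MAX_REGEX_SIZE] }
    }
    pub fn get(&self, q: usize, k: usize) -> u16 {
        self.data[q * MAX_REGEX_SIZE + k]
    }
    pub fn set(&mut self, q: usize, k: usize, v: u16) {
        self.data[q * MAX_REGEX_SIZE + k] = v;
    }
}
```

The accessors here stay bounds-checked. Switching a hot loop to unchecked indexing is only sound when the index bound is proven elsewhere.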
Three tables are now deferred from ZearchState::new() to
resize_after_automaton_init() called after fa_compile():
- recently_added: was 4MB (1024×1024×4) zero-fill at startup;
now sized to num_states² (typically 400 bytes for 10-state NFA)
- scratch_bitrow: was 8KB; now num_states*bpr*8 bytes
- reached_states: was 2×4KB; now 2×num_states×4 bytes
At 100K-1M text: tied with C or marginally faster (1.03-1.13x on
complex patterns like [a-z]* and hello.*world).
https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
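A minimal sketch of the deferred sizing described above, assuming `u32` entries for `recently_added` and `reached_states`, `u64` words for `scratch_bitrow`, and `bpr` meaning 64-bit words per row. The element types and the `bpr` formula are assumptions made to match the byte counts in the message; the real struct has more fields.

```rust
pub struct ZearchState {
    pub recently_added: Vec<u32>,
    pub scratch_bitrow: Vec<u64>,
    pub reached_states: [Vec<u32>; 2],
}

impl ZearchState {
    /// Allocates nothing: the NFA size is not known yet.
    pub fn new() -> Self {
        Self {
            recently_added: Vec::new(),
            scratch_bitrow: Vec::new(),
            reached_states: [Vec::new(), Vec::new()],
        }
    }

    /// Called once the automaton is built and `num_states` is known.
    pub fn resize_after_automaton_init(&mut self, num_states: usize) {
        // Assumption: one bit per state, rounded up to whole u64 words.
        let bpr = (num_states + 63) / 64;
        self.recently_added = vec![0; num_states * num_states];
        self.scratch_bitrow = vec![0; num_states * bpr];
        self.reached_states = [vec![0; num_states], vec![0; num_states]];
    }
}
```

For a 10-state NFA this gives 100 `u32` entries in `recently_added`, that is 400 bytes, in line with the figure quoted above.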
At 5M-10M lines, Rust beats C by 5-33% on -c mode:
- 10M hello: 0.040s → 0.034s (1.18x)
- 5M foo.bar: 0.024s → 0.018s (1.33x)
- 10M [a-z]*: 0.037s → 0.036s (1.05x)
https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
Mirrors generate_table.sh's dataset types and all 8 regex patterns
using synthetic data (no real download required). Sizes: 100KB, 1MB.
Results at 1MB:
- Gutenberg: Rust 1.07-1.12x faster on r1(what), r2(HTTP)
- Subtitles: Rust 1.07-1.09x faster on r1(what), r7([a-z]{3}+)
- Logs: tied (1.00-1.07x) across all patterns
- Complex patterns (r5,r6): tied across all datasets
At 100KB, timings are startup-dominated (2ms range), tied or noisy.
https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
Results at 500MB (count mode):
bench_rs_vs_c.sh (synthetic word corpus):
hello: C 0.869s → Rust 0.822s (+6%)
foo.bar: C 0.878s → Rust 0.816s (+8%)
[a-z]*: C 0.915s → Rust 0.822s (+11%)
hello.*world: C 1.171s → Rust 1.170s (tied)
bench_datasets.sh (logs/gutenberg/subtitles, r1-r8):
subtitles r1(what): C 2.022s → Rust 1.790s (+13%)
subtitles r3(.): C 1.729s → Rust 1.589s (+9%)
subtitles r6(date): C 1.646s → Rust 1.483s (+11%)
gutenberg r3(.): C 1.199s → Rust 1.074s (+12%)
logs r3(.): C 0.727s → Rust 0.632s (+15%)
Complex patterns (r5 [a-z]{4}, r7 [a-z]*[a-z]{3}): tied or Rust
marginally slower (<5%) on gutenberg due to larger NFA state count.
-b hello at 500MB: C wins (0.032s vs 0.094s). Early exit means
startup and grammar-parse overhead dominate, not the search itself.
https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
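The percentages above appear consistent with computing the speedup as C time divided by Rust time, minus one. The helper below is a hypothetical illustration of that arithmetic, not part of the benchmark scripts.

```rust
/// Speedup of Rust over C as a percentage: (c / rust - 1) * 100.
/// Positive means Rust was faster.
pub fn speedup_percent(c_secs: f64, rust_secs: f64) -> f64 {
    (c_secs / rust_secs - 1.0) * 100.0
}
```

For example, `speedup_percent(0.869, 0.822)` is about 5.7, which rounds to the +6% reported for `hello`.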
Documents the C vs Rust benchmark results across:
- Three dataset types (logs, gutenberg, subtitles) × r1-r8 patterns
- Sizes from 100KB to 500MB
- Count mode (-c) and boolean mode (-b)
Key findings:
- At 500MB, Rust is 4-15% faster on simple patterns (small NFA)
- Tied or slightly slower on complex patterns (large NFA, gutenberg)
- Boolean early-exit dominated by startup overhead at small sizes
Also documents the four algorithmic differences from the C version: compact recently_added, bitrow scratch buffer, deferred allocation, LTO+native.
https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5