Add Rust reimplementation (zearch-rs) and benchmark script by pevalme · Pull Request #4 · pevalme/zearch

pevalme · 2026-03-16T06:55:58Z

Complete port of the zearch C tool to Rust, implementing the same
saturation construction algorithm for regex matching on Re-Pair
grammar-compressed text. Uses libfa via FFI for NFA construction.

Also adds benchmark/bench_rs_vs_c.sh to compare C vs Rust performance
across 4 corpus sizes and multiple regex patterns.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5


Includes the Re-Pair compressor source code needed to generate .rp compressed test files for the zearch benchmark. Build artifacts (.o, repair, despair binaries) are gitignored. https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5

Three structural changes that eliminate the performance gap:

1. Pack TransitionFull to 16 bytes (matching C's bitfield layout): count(25)|new_lines|is_there|match_|right|left|pairs_used(2). It was 24 bytes and is now 16 bytes, a 33% smaller automaton array.
2. Flatten the edges array: Vec<Vec<u16>> → Vec<u16> with manual indexing (edges[q * MAX_REGEX_SIZE + k]), eliminating 1024 separate heap allocations and pointer indirections.
3. Build config: LTO thin + target-cpu=native for full LLVM optimization, including SIMD auto-vectorization.

Also: unsafe unchecked indexing in the hottest inner loop (compose_left_edge), where bounds are statically guaranteed.

Result: Rust now matches C at all sizes ≥ 100K lines, and ties or beats C at 500K-1M lines on some patterns.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
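A minimal sketch of the first two changes, for illustration only. The type names `PackedFlags` and `FlatEdges`, the bit positions, and the value `MAX_REGEX_SIZE = 1024` are assumptions; only the field widths (25+1+1+1+1+1+2 = 32 bits) and the index formula `q * MAX_REGEX_SIZE + k` come from the commit message. The remaining fields of TransitionFull are not shown.

```rust
/// Assumed value; the commit message only names the constant.
pub const MAX_REGEX_SIZE: usize = 1024;

// Hypothetical bit positions for the fields listed in the commit message.
pub const COUNT_BITS: u32 = 25;
pub const COUNT_MASK: u32 = (1 << COUNT_BITS) - 1;
pub const NEW_LINES_BIT: u32 = 25;
pub const IS_THERE_BIT: u32 = 26;
pub const MATCH_BIT: u32 = 27;
pub const RIGHT_BIT: u32 = 28;
pub const LEFT_BIT: u32 = 29;
pub const PAIRS_USED_SHIFT: u32 = 30;

/// Seven logical fields packed into a single 32-bit word.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct PackedFlags(pub u32);

impl PackedFlags {
    /// Low 25 bits hold the counter.
    pub fn count(self) -> u32 {
        self.0 & COUNT_MASK
    }
    pub fn set_count(&mut self, c: u32) {
        self.0 = (self.0 & !COUNT_MASK) | (c & COUNT_MASK);
    }
    /// Read one of the single-bit flags.
    pub fn flag(self, bit: u32) -> bool {
        ((self.0 >> bit) & 1) == 1
    }
    pub fn set_flag(&mut self, bit: u32, on: bool) {
        if on {
            self.0 |= 1 << bit;
        } else {
            self.0 &= !(1 << bit);
        }
    }
    /// Top 2 bits hold pairs_used.
    pub fn pairs_used(self) -> u32 {
        self.0 >> PAIRS_USED_SHIFT
    }
    pub fn set_pairs_used(&mut self, p: u32) {
        self.0 = (self.0 & !(0b11 << PAIRS_USED_SHIFT)) | ((p & 0b11) << PAIRS_USED_SHIFT);
    }
}

/// One contiguous allocation instead of one Vec per row.
pub struct FlatEdges {
    data: Vec<u16>,
}

impl FlatEdges {
    pub fn new() -> Self {
        FlatEdges { data: vec![0u16; MAX_REGEX_SIZE * MAX_REGEX_SIZE] }
    }
    /// Row-major lookup: row q, column k.
    pub fn get(&self, q: usize, k: usize) -> u16 {
        self.data[q * MAX_REGEX_SIZE + k]
    }
    pub fn set(&mut self, q: usize, k: usize, v: u16) {
        self.data[q * MAX_REGEX_SIZE + k] = v;
    }
}
```

The flat layout trades a multiplication per access for the removal of one pointer dereference, and keeps neighbouring rows adjacent in memory. This sketch uses checked indexing throughout; the unchecked variant mentioned in the commit is not reproduced here.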

Three tables are now deferred from ZearchState::new() to resize_after_automaton_init(), called after fa_compile():

- recently_added: was a 4MB (1024×1024×4) zero-fill at startup; now sized to num_states² (typically 400 bytes for a 10-state NFA)
- scratch_bitrow: was 8KB; now num_states*bpr*8 bytes
- reached_states: was 2×4KB; now 2×num_states×4 bytes

At 100K-1M text: tied with C or marginally faster (1.03-1.13x on complex patterns like [a-z]* and hello.*world).

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
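The deferred sizing can be sketched roughly as follows. This is not the PR's code: the struct name `DeferredTables`, the element types (`u32` entries, `u64` bitrow words), and the reading of `bpr` as "64-bit words per bit row" are assumptions chosen so that the byte counts match the figures in the commit message.

```rust
/// Hypothetical stand-in for the three tables that ZearchState defers.
pub struct DeferredTables {
    pub recently_added: Vec<u32>,
    pub scratch_bitrow: Vec<u64>,
    pub reached_states: [Vec<u32>; 2],
}

impl DeferredTables {
    /// Construct with empty tables: nothing is zero-filled at startup.
    pub fn new() -> Self {
        DeferredTables {
            recently_added: Vec::new(),
            scratch_bitrow: Vec::new(),
            reached_states: [Vec::new(), Vec::new()],
        }
    }

    /// Size the tables once the NFA state count is known.
    pub fn resize_after_automaton_init(&mut self, num_states: usize) {
        // Assumed meaning of bpr: 64-bit words needed for one row of num_states bits.
        let bpr = (num_states + 63) / 64;
        self.recently_added = vec![0; num_states * num_states];
        self.scratch_bitrow = vec![0; num_states * bpr];
        self.reached_states = [vec![0; num_states], vec![0; num_states]];
    }

    /// Total payload bytes held by the three tables.
    pub fn heap_bytes(&self) -> usize {
        self.recently_added.len() * 4
            + self.scratch_bitrow.len() * 8
            + (self.reached_states[0].len() + self.reached_states[1].len()) * 4
    }
}
```

Under these assumptions, a 10-state NFA needs 400 bytes for recently_added, 80 for scratch_bitrow, and 80 for reached_states, compared with several megabytes when every table is sized for the 1024-state maximum.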

At 5M-10M lines, Rust beats C by 5-33% on -c mode:

- 10M hello: 0.040s → 0.034s (1.18x)
- 5M foo.bar: 0.024s → 0.018s (1.33x)
- 10M [a-z]*: 0.037s → 0.036s (1.05x)

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5

Mirrors generate_table.sh's dataset types and all 8 regex patterns using synthetic data (no real download required). Sizes: 100KB, 1MB.

Results at 1MB:

- Gutenberg: Rust 1.07-1.12x faster on r1(what), r2(HTTP)
- Subtitles: Rust 1.07-1.09x faster on r1(what), r7([a-z]{3}+)
- Logs: tied (1.00-1.07x) across all patterns
- Complex patterns (r5,r6): tied across all datasets

At 100KB, timings are startup-dominated (2ms range), tied or noisy.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5

Results at 500MB (count mode):

bench_rs_vs_c.sh (synthetic word corpus):

- hello: C 0.869s → Rust 0.822s (+6%)
- foo.bar: C 0.878s → Rust 0.816s (+8%)
- [a-z]*: C 0.915s → Rust 0.822s (+11%)
- hello.*world: C 1.171s → Rust 1.170s (tied)

bench_datasets.sh (logs/gutenberg/subtitles, r1-r8):

- subtitles r1(what): C 2.022s → Rust 1.790s (+13%)
- subtitles r3(.): C 1.729s → Rust 1.589s (+9%)
- subtitles r6(date): C 1.646s → Rust 1.483s (+11%)
- gutenberg r3(.): C 1.199s → Rust 1.074s (+12%)
- logs r3(.): C 0.727s → Rust 0.632s (+15%)

Complex patterns (r5 [a-z]{4}, r7 [a-z]*[a-z]{3}): tied or Rust marginally slower (<5%) on gutenberg due to larger NFA state count.

-b hello at 500M: C wins (0.032s vs 0.094s). Early exit means startup and grammar-parse overhead dominate, not the search itself.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
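The percentages above appear consistent with the convention (C time / Rust time − 1) × 100, rounded to the nearest integer. A small helper under that assumption (the function name `speedup_percent` is hypothetical and not part of either benchmark script):

```rust
/// Percentage by which the Rust run is faster than the C run,
/// assuming the convention (c / rust - 1) * 100, rounded to nearest.
pub fn speedup_percent(c_secs: f64, rust_secs: f64) -> i64 {
    ((c_secs / rust_secs - 1.0) * 100.0).round() as i64
}
```

For example, `speedup_percent(0.869, 0.822)` gives 6 and `speedup_percent(0.727, 0.632)` gives 15, matching the hello and logs r3(.) rows.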

Documents the C vs Rust benchmark results across:

- Three dataset types (logs, gutenberg, subtitles) × r1-r8 patterns
- Sizes from 100KB to 500MB
- Count mode (-c) and boolean mode (-b)

Key findings:

- At 500MB, Rust is 4-15% faster on simple patterns (small NFA)
- Tied or slightly slower on complex patterns (large NFA, gutenberg)
- Boolean early-exit dominated by startup overhead at small sizes

Also documents the four algorithmic differences from the C version: compact recently_added, bitrow scratch buffer, deferred allocation, LTO+native.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5

claude added 8 commits March 15, 2026 08:10

Extend benchmark to 5M and 10M line files

4fe2d43



pevalme wants to merge 8 commits into master from claude/zearch-rust-implementation-kRjbE
