Add Rust reimplementation (zearch-rs) and benchmark script #4

Open
pevalme wants to merge 8 commits into master from claude/zearch-rust-implementation-kRjbE

@pevalme pevalme commented Mar 16, 2026

Owner

Complete port of the zearch C tool to Rust, implementing the same
saturation construction algorithm for regex matching on Re-Pair
grammar-compressed text. Uses libfa via FFI for NFA construction.

Also adds benchmark/bench_rs_vs_c.sh to compare C vs Rust performance
across 4 corpus sizes and multiple regex patterns.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5

claude added 8 commits March 15, 2026 08:10
Complete port of the zearch C tool to Rust, implementing the same
saturation construction algorithm for regex matching on Re-Pair
grammar-compressed text. Uses libfa via FFI for NFA construction.

Also adds benchmark/bench_rs_vs_c.sh to compare C vs Rust performance
across 4 corpus sizes and multiple regex patterns.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
Includes the Re-Pair compressor source code needed to generate .rp
compressed test files for the zearch benchmark. Build artifacts
(.o, repair, despair binaries) are gitignored.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
Three structural changes that eliminate the performance gap:

1. Pack TransitionFull to 16 bytes (matching C's bitfield layout):
   count(25)|new_lines|is_there|match_|right|left|pairs_used(2)
   was 24 bytes, now 16 bytes — 33% smaller automaton array.

2. Flatten edges array: Vec<Vec<u16>> → Vec<u16> with manual
   indexing (edges[q * MAX_REGEX_SIZE + k]), eliminating 1024
   separate heap allocations and pointer indirections.

3. Build config: LTO thin + target-cpu=native for full LLVM
   optimization including SIMD auto-vectorization.
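
A minimal sketch of the 32-bit flag word from item 1, assuming the fields are packed low-to-high in the order listed (count in the low 25 bits, pairs_used in the top 2). The type name PackedFlags, the bit order, and the accessor names are hypothetical; the remaining 12 bytes of TransitionFull are not described here and are left out.

```rust
/// Hypothetical packed flag word: count(25)|new_lines|is_there|match_|right|left|pairs_used(2).
/// Bit positions are an assumption; only the field widths come from the commit message.
#[derive(Clone, Copy, Default, PartialEq, Eq, Debug)]
struct PackedFlags(u32);

impl PackedFlags {
    const COUNT_BITS: u32 = 25;
    const COUNT_MASK: u32 = (1 << Self::COUNT_BITS) - 1;
    const NEW_LINES: u32 = 1 << 25;
    const IS_THERE: u32 = 1 << 26;
    const MATCH: u32 = 1 << 27;
    const RIGHT: u32 = 1 << 28;
    const LEFT: u32 = 1 << 29;
    const PAIRS_USED_SHIFT: u32 = 30;

    /// Read the 25-bit counter.
    fn count(self) -> u32 {
        self.0 & Self::COUNT_MASK
    }

    /// Overwrite the 25-bit counter, leaving the flag bits untouched.
    fn set_count(&mut self, n: u32) {
        debug_assert!(n <= Self::COUNT_MASK);
        self.0 = (self.0 & !Self::COUNT_MASK) | (n & Self::COUNT_MASK);
    }

    /// Test one of the single-bit flags (pass one of the constants above).
    fn flag(self, bit: u32) -> bool {
        self.0 & bit != 0
    }

    /// Set or clear one of the single-bit flags.
    fn set_flag(&mut self, bit: u32, on: bool) {
        if on {
            self.0 |= bit;
        } else {
            self.0 &= !bit;
        }
    }

    /// Read the 2-bit pairs_used field.
    fn pairs_used(self) -> u32 {
        self.0 >> Self::PAIRS_USED_SHIFT
    }

    /// Overwrite the 2-bit pairs_used field.
    fn set_pairs_used(&mut self, n: u32) {
        let mask = 0b11 << Self::PAIRS_USED_SHIFT;
        self.0 = (self.0 & !mask) | ((n & 0b11) << Self::PAIRS_USED_SHIFT);
    }
}
```

The seven fields add up to exactly 32 bits (25 + 5 + 2), so the whole word fits in one u32.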

Also: unsafe unchecked indexing in the hottest inner loop
(compose_left_edge) where bounds are statically guaranteed.
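
A sketch of the flattened layout from item 2 together with an unchecked read, for illustration only. The Edges wrapper and its method names are hypothetical, and MAX_REGEX_SIZE = 1024 is an assumption inferred from the "1024 separate heap allocations" figure; the real compose_left_edge loop is not reproduced.

```rust
/// Assumed value, inferred from the "1024 separate heap allocations" above.
const MAX_REGEX_SIZE: usize = 1024;

/// Hypothetical wrapper around the flat edge table: one allocation, row-major.
struct Edges {
    data: Vec<u16>,
}

impl Edges {
    /// Allocate the whole table in a single zero-filled Vec.
    fn new() -> Self {
        Edges {
            data: vec![0; MAX_REGEX_SIZE * MAX_REGEX_SIZE],
        }
    }

    /// Bounds-checked read of entry k in row q.
    #[inline]
    fn get(&self, q: usize, k: usize) -> u16 {
        self.data[q * MAX_REGEX_SIZE + k]
    }

    /// Bounds-checked write of entry k in row q.
    #[inline]
    fn set(&mut self, q: usize, k: usize, v: u16) {
        self.data[q * MAX_REGEX_SIZE + k] = v;
    }

    /// Unchecked read for hot loops.
    ///
    /// # Safety
    /// The caller must guarantee q < MAX_REGEX_SIZE and k < MAX_REGEX_SIZE.
    #[inline]
    unsafe fn get_unchecked(&self, q: usize, k: usize) -> u16 {
        debug_assert!(q < MAX_REGEX_SIZE && k < MAX_REGEX_SIZE);
        // SAFETY: with both indices below MAX_REGEX_SIZE, the flat index is
        // below MAX_REGEX_SIZE * MAX_REGEX_SIZE, which is data.len().
        unsafe { *self.data.get_unchecked(q * MAX_REGEX_SIZE + k) }
    }
}
```

Keeping a checked accessor next to the unchecked one makes it easy to swap them while debugging; the debug_assert still catches bad indices in debug builds.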

Result: Rust now matches C at all sizes ≥ 100K lines,
and ties or beats C at 500K-1M lines on some patterns.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
Three tables are now deferred from ZearchState::new() to
resize_after_automaton_init() called after fa_compile():
  - recently_added: was 4MB (1024×1024×4) zero-fill at startup;
    now sized to num_states² (typically 400 bytes for 10-state NFA)
  - scratch_bitrow: was 8KB; now num_states*bpr*8 bytes
  - reached_states: was 2×4KB; now 2×num_states×4 bytes
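
The sizing formulas above can be checked with a small helper. This is only an arithmetic sketch: TableBytes and table_bytes are hypothetical names, the 4-byte element size for recently_added is inferred from the "400 bytes for 10-state NFA" figure, and bpr is taken as a plain parameter because its definition is not given here.

```rust
/// Byte sizes of the three deferred tables, per the formulas quoted above.
#[derive(Debug, PartialEq, Eq)]
struct TableBytes {
    recently_added: usize,
    scratch_bitrow: usize,
    reached_states: usize,
}

/// Compute the table sizes for an automaton with `num_states` states.
/// `bpr` is used exactly as it appears in the formula num_states*bpr*8.
fn table_bytes(num_states: usize, bpr: usize) -> TableBytes {
    TableBytes {
        // num_states² entries, assumed 4 bytes each
        recently_added: num_states * num_states * 4,
        // num_states * bpr entries of 8 bytes each
        scratch_bitrow: num_states * bpr * 8,
        // two arrays of num_states 4-byte entries
        reached_states: 2 * num_states * 4,
    }
}
```

For a 10-state NFA this gives 400 bytes for recently_added and 80 bytes for reached_states; plugging in 1024 states reproduces the old fixed sizes of 4MB and 2×4KB.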

At 100K-1M lines of text, Rust is tied with C or marginally faster
(1.03-1.13x on complex patterns like [a-z]* and hello.*world).

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
At 5M-10M lines, Rust beats C by 5-33% on -c mode:
  10M hello:     0.040s → 0.034s (1.18x)
  5M  foo.bar:   0.024s → 0.018s (1.33x)
  10M [a-z]*:    0.037s → 0.036s (1.05x)
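
The ratios in parentheses appear to be C time divided by Rust time. A small helper for reproducing them from the listed timings follows; the function names are hypothetical, and the inputs are the rounded values shown above, so a row can differ slightly from a ratio computed on unrounded measurements (the third row gives about 1.03 from the rounded timings).

```rust
/// Speedup of Rust over C as a ratio: values above 1.0 mean Rust is faster.
fn speedup(c_secs: f64, rust_secs: f64) -> f64 {
    c_secs / rust_secs
}

/// The same comparison expressed as "percent faster".
fn percent_faster(c_secs: f64, rust_secs: f64) -> f64 {
    (speedup(c_secs, rust_secs) - 1.0) * 100.0
}
```

For example, speedup(0.040, 0.034) is about 1.18 and speedup(0.024, 0.018) is about 1.33, matching the first two rows.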

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
Mirrors generate_table.sh's dataset types and all 8 regex patterns
using synthetic data (no real download required). Sizes: 100KB, 1MB.

Results at 1MB:
- Gutenberg: Rust 1.07-1.12x faster on r1(what), r2(HTTP)
- Subtitles: Rust 1.07-1.09x faster on r1(what), r7([a-z]{3}+)
- Logs: tied (1.00-1.07x) across all patterns
- Complex patterns (r5,r6): tied across all datasets

At 100KB, timings are startup-dominated (2ms range), tied or noisy.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
Results at 500MB (count mode):

bench_rs_vs_c.sh (synthetic word corpus):
  hello:        C 0.869s → Rust 0.822s  (+6%)
  foo.bar:      C 0.878s → Rust 0.816s  (+8%)
  [a-z]*:       C 0.915s → Rust 0.822s  (+11%)
  hello.*world: C 1.171s → Rust 1.170s  (tied)

bench_datasets.sh (logs/gutenberg/subtitles, r1-r8):
  subtitles r1(what):  C 2.022s → Rust 1.790s (+13%)
  subtitles r3(.):     C 1.729s → Rust 1.589s (+9%)
  subtitles r6(date):  C 1.646s → Rust 1.483s (+11%)
  gutenberg r3(.):     C 1.199s → Rust 1.074s (+12%)
  logs r3(.):          C 0.727s → Rust 0.632s (+15%)

Complex patterns (r5 [a-z]{4}, r7 [a-z]*[a-z]{3}): tied or Rust
marginally slower (<5%) on gutenberg due to larger NFA state count.

-b hello at 500MB: C wins (0.032s vs 0.094s). With early exit,
startup and grammar-parse overhead dominate, not the search itself.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5
Documents the C vs Rust benchmark results across:
- Three dataset types (logs, gutenberg, subtitles) × r1-r8 patterns
- Sizes from 100KB to 500MB
- Count mode (-c) and boolean mode (-b)

Key findings:
- At 500MB, Rust is 4-15% faster on simple patterns (small NFA)
- Tied or slightly slower on complex patterns (large NFA, gutenberg)
- Boolean early-exit dominated by startup overhead at small sizes

Also documents the four implementation differences from the C version:
compact recently_added, bitrow scratch buffer, deferred allocation,
and the LTO + target-cpu=native build configuration.

https://claude.ai/code/session_01AzaqtUp52dSP8EwXohtYA5