feat(index): implement BM25 relevance scoring#95
Conversation
Replace binary token-overlap scoring with BM25 (Best Matching 25) probabilistic scoring for match/term/phrase queries.

- Add bm25_idf() for inverse document frequency computation
- Add compute_avg_field_length() for field length normalization
- Add extract_query_terms() to collect query terms for IDF precomputation
- score_match_query now uses full BM25 formula with TF from postings and document frequency from PositionsReader across all segments
- search() precomputes IDF map and avg field length before scoring
- Update match_queries_find_tokens_in_text_fields test to accept BM25 scores (non-exact since scoring now considers term frequency and document frequency, not just token overlap)
- Fix clippy collapsible_if in compute_avg_field_length
- Add #[allow(clippy::too_many_arguments)] to scoring functions
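As background for the commit above, the two BM25 building blocks it names can be sketched as follows. This is a minimal illustration of Lucene-style BM25 under stated assumptions, not the actual `cloudsearch-index` code; the real signatures in `lib.rs` may differ.

```rust
// Hypothetical sketch of the two BM25 building blocks named above; the
// real signatures in cloudsearch-index may differ.
fn bm25_idf(df: usize, n_docs: usize) -> f32 {
    // Lucene-style IDF, clamped at 0 so very common terms cannot go negative.
    ((n_docs as f32 - df as f32 + 0.5) / (df as f32 + 0.5)).ln().max(0.0)
}

fn bm25_term_score(tf: u32, doc_len: f32, avg_len: f32, idf: f32, k1: f32, b: f32) -> f32 {
    // TF saturation via k1, field-length normalization via b.
    let tf = tf as f32;
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))
}

fn main() {
    let idf = bm25_idf(3, 100); // rare term in a 100-doc corpus
    let s1 = bm25_term_score(1, 10.0, 10.0, idf, 1.2, 0.75);
    let s2 = bm25_term_score(2, 10.0, 10.0, idf, 1.2, 0.75);
    let s4 = bm25_term_score(4, 10.0, 10.0, idf, 1.2, 0.75);
    // More occurrences score higher, but with diminishing returns (saturation).
    assert!(s1 < s2 && s2 < s4);
    assert!(s2 - s1 > s4 - s2);
}
```

The diminishing-returns assertion is the key property BM25 adds over binary overlap: a document repeating a term many times is not rewarded linearly.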
Note: Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

Search ranking switched to BM25-style scoring: the search entry now builds a BM25 context (k1, b, per-query-term IDF, average field length) and passes it through revised scoring functions so match/term/phrase/boolean queries compute BM25 TF/IDF with field-length normalization.

Changes: BM25 Search Ranking Enhancement
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 805-806: The BM25 normalization currently computes avg_field_len
using a hardcoded "content" which misnormalizes scores for other fields; in
score_match_query replace the literal "content" with the actual field from the
query (query.field) when calling compute_avg_field_length (and keep the existing
.max(1.0) fallback), so avg_field_len is computed per the field being scored and
normalization is correct; locate compute_avg_field_length and score_match_query
references to update the argument to use query.field (with a safe fallback if
query.field can be empty).
- Around line 794-804: The IDF precompute loop uses extract_query_terms(query,
"") which strips field names and thus fails to extract terms for fielded
Match/Phrase queries, leaving idf_map empty; change the extraction to preserve
the query's field context so term lookup against self.positions_readers works:
call extract_query_terms with the actual field (or a variant that returns
fielded terms) for the incoming Query used by BM25, then compute total_df by
iterating readers and summing pl.docs.len(), compute idf via bm25_idf and insert
into idf_map as before (ensure the functions extract_query_terms, idf_map,
bm25_idf, and positions_readers are used consistently so Match/Phrase terms are
included).
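As a concrete reference for the first finding above, a field-parameterized average length could look roughly like this. The document shape here is invented for illustration; the crate's `searchable_documents` type is certainly different.

```rust
use std::collections::BTreeMap;

// Hypothetical document shape: each document maps field name -> token count.
fn compute_avg_field_length(docs: &[BTreeMap<String, usize>], field: &str) -> f32 {
    let lens: Vec<usize> = docs.iter().filter_map(|d| d.get(field).copied()).collect();
    if lens.is_empty() {
        return 0.0; // caller clamps with .max(1.0), as the review notes
    }
    lens.iter().sum::<usize>() as f32 / lens.len() as f32
}

fn main() {
    let docs = vec![
        BTreeMap::from([("message".to_string(), 4usize)]),
        BTreeMap::from([("message".to_string(), 8usize), ("title".to_string(), 2usize)]),
    ];
    // Normalize against the field actually being scored, not a hardcoded "content".
    assert!((compute_avg_field_length(&docs, "message") - 6.0).abs() < 1e-6);
    // Missing field: 0.0, clamped to 1.0 at the call site.
    assert!((compute_avg_field_length(&docs, "absent").max(1.0) - 1.0).abs() < 1e-6);
}
```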
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5171ac10-b9ef-484c-9ce3-0775dcfc7673
📒 Files selected for processing (1)
rust/crates/cloudsearch-index/src/lib.rs
- Add extract_target_field() helper to get field name from query
- Use actual query field (not hardcoded "content") for avg_field_len
- Deduplicate query_terms via BTreeSet before IDF loop
- Add Bm25Context struct holding idf_map, avg_field_len, k1, b to eliminate clippy::too_many_arguments warnings on scoring functions
- Add score_term_query() for non-fuzzy exact term queries using BM25; falls back to binary scoring for non-string fields (bool, numeric)
- score_phrase_query() now sums per-term BM25 scores instead of returning 1.0 for valid phrases; gap penalty still applies
- Update score_query, score_match_query, score_bool_query to use Bm25Context instead of 4 separate parameters
- Collapse two nested if-let chains per clippy::collapsible_if
Terms never seen in the corpus should contribute 0 to the BM25 score, not 1.0. When a term is missing from the precomputed idf_map it means df=0, and the correct IDF contribution is 0.
Bm25Context now stores avg_field_lens: BTreeMap<String, f32> instead of a single avg_field_len. Each field in searchable_documents gets its own average field length computed and stored at search time. All scoring functions (score_match_query, score_term_query, score_phrase_query) now pass the query field to bm25_term_score, which looks up the field-specific avg_field_len via get_avg_field_len. extract_target_field is still called but result is unused for avg_field_len computation (it's now computed per-field). Marked with underscore prefix to suppress warning.
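A rough outline of that per-field context (hypothetical; the actual struct fields and method names in the crate may differ):

```rust
use std::collections::BTreeMap;

// Hypothetical mirror of the Bm25Context described in the commit message.
struct Bm25Context {
    idf_map: BTreeMap<String, f32>,
    avg_field_lens: BTreeMap<String, f32>,
    k1: f32,
    b: f32,
}

impl Bm25Context {
    // Per-field lookup with a safe fallback so normalization never divides by 0.
    fn get_avg_field_len(&self, field: &str) -> f32 {
        self.avg_field_lens.get(field).copied().unwrap_or(1.0).max(1.0)
    }

    fn bm25_term_score(&self, tf: u32, doc_len: f32, idf: f32, field: &str) -> f32 {
        let (tf, avg) = (tf as f32, self.get_avg_field_len(field));
        idf * (tf * (self.k1 + 1.0)) / (tf + self.k1 * (1.0 - self.b + self.b * doc_len / avg))
    }
}

fn main() {
    let ctx = Bm25Context {
        idf_map: BTreeMap::new(),
        avg_field_lens: BTreeMap::from([("message".to_string(), 20.0)]),
        k1: 1.2,
        b: 0.75,
    };
    // Shorter-than-average documents score higher for the same term frequency.
    let short = ctx.bm25_term_score(2, 10.0, 1.5, "message");
    let long = ctx.bm25_term_score(2, 40.0, 1.5, "message");
    assert!(short > long);
}
```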
Actionable comments posted: 1
♻️ Duplicate comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)
794-808: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

IDF precompute is still disabled for Match/Phrase queries.

The call `extract_query_terms(query, "")` at line 798 passes an empty string as `target_field`. Looking at `extract_query_terms` (lines 1831-1836), Match and Phrase queries have guards checking `mq.field == target_field` and `pq.field == target_field`. Since actual query fields (e.g., "message", "content") never equal `""`, these branches never match, so no terms are extracted.

This means `idf_map` remains empty for Match/Phrase query terms, causing IDF to default to 0.0 (line 1958), which makes BM25 scores effectively 0 for all matching documents.

The existing `get_query_terms` function (lines 2323-2349) extracts terms from Match/Phrase queries unconditionally and should be used instead.

Proposed fix:

```diff
-    let _target_field = extract_target_field(query);
-    let mut idf_map: std::collections::BTreeMap<String, f32> =
-        std::collections::BTreeMap::new();
-    let query_terms: std::collections::BTreeSet<String> =
-        extract_query_terms(query, "").into_iter().collect();
+    let mut idf_map: std::collections::BTreeMap<String, f32> =
+        std::collections::BTreeMap::new();
+    let query_terms: std::collections::BTreeSet<String> =
+        get_query_terms(query).into_iter().collect();
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 794 - 808, The IDF precompute is using extract_query_terms(query, "") which passes an empty target_field so Match/Phrase branches never match; replace that call with get_query_terms(query) (or pass the actual target_field from extract_target_field(query) into extract_query_terms) so query_terms correctly contains Match/Phrase terms before computing idf_map with bm25_idf; update the query_terms initialization used by the idf_map loop (and keep idf_map, bm25_idf, and positions_readers usage unchanged).
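The root cause of this finding can be reduced to a few lines. The types below are simplified stand-ins for the crate's query enum, and the two functions only illustrate the guarded vs. unguarded extraction behavior the review describes.

```rust
// Simplified stand-in for a fielded match query.
struct MatchQuery {
    field: String,
    text: String,
}

// Guarded variant: only extracts terms when the query's field equals target_field.
fn extract_query_terms(q: &MatchQuery, target_field: &str) -> Vec<String> {
    if q.field == target_field {
        q.text.split_whitespace().map(str::to_lowercase).collect()
    } else {
        Vec::new() // fielded queries never match target_field == ""
    }
}

// Unguarded variant: extracts terms regardless of field.
fn get_query_terms(q: &MatchQuery) -> Vec<String> {
    q.text.split_whitespace().map(str::to_lowercase).collect()
}

fn main() {
    let q = MatchQuery {
        field: "message".into(),
        text: "Hello World".into(),
    };
    // The bug in miniature: calling the guarded variant with "" yields no terms,
    // so the IDF map stays empty and scoring falls back to a constant IDF.
    assert!(extract_query_terms(&q, "").is_empty());
    assert_eq!(get_query_terms(&q), vec!["hello", "world"]);
}
```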
🧹 Nitpick comments (3)
rust/crates/cloudsearch-index/src/lib.rs (3)
1817-1826: 💤 Low value

`extract_target_field` is unused and can be removed.

This function is called at line 794 but the result is assigned to `_target_field`, which is never used. Per the PR description, this is intentional to suppress warnings, but the function and call should be removed entirely to avoid dead code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 1817 - 1826, The function extract_target_field is dead code and its sole invocation (where the result is assigned to _target_field) is intentionally unused; remove the extract_target_field function definition from lib.rs and delete the corresponding call that assigns to _target_field so there is no unused function or unused value; ensure no other references to extract_target_field remain and run cargo build to confirm no missing imports or references.
809-825: 🏗️ Heavy lift

Per-field average length computed for all fields on every search.

This code iterates over all documents to collect all unique field names, then for each field calls `compute_avg_field_length`, which iterates over all documents again. For N documents and F unique fields, this is O(N × F) work on every search call.

Consider computing avg_field_lens only for fields actually used in the query (extractable via `extract_target_field` or similar), or caching this computation and updating incrementally on index modifications.

Sketch of optimization:

```rust
// Only compute for the field(s) used by the query
let query_field = extract_target_field(query);
let mut avg_field_lens: std::collections::BTreeMap<String, f32> =
    std::collections::BTreeMap::new();
avg_field_lens.insert(
    query_field.clone(),
    compute_avg_field_length(&self.searchable_documents, &query_field).max(1.0),
);
// For Bool queries, may need to extract multiple fields
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 809 - 825, The current search builds avg_field_lens for every field by scanning self.searchable_documents for all fields and then calling compute_avg_field_length per field, which is O(N×F); change it to only compute averages for fields actually referenced by the query (use extract_target_field or similar to collect target fields from the query/Bool clauses) and build avg_field_lens just for those fields, or implement an incremental cache updated on index changes keyed by field name; update the code paths that reference avg_field_lens (the variable avg_field_lens and calls to compute_avg_field_length) to use the reduced set (or cached values) and ensure a .max(1.0) clamp is still applied.
5731-5753: ⚡ Quick win

Test assertions don't validate BM25 scoring behavior.

The tests only assert `score >= 0.0`, which passes even when BM25 scores are 0 (due to the IDF precompute bug). Consider adding assertions that verify non-zero scores for matching documents, or at least that scores increase with term frequency:

```rust
// Example: verify score is meaningfully positive for a match
assert!(
    hello.hits.hits[0].score.unwrap_or(0.0) > 0.0,
    "BM25 score should be positive for matching documents"
);
```

This would help catch regressions in the scoring logic.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 5731 - 5753, The test currently only asserts scores are >= 0.0 for the MatchQuery results (variables hello and both), which won’t catch a zero-IDF bug; update the assertions around handle.search / SearchRequest / MatchQuery to assert that matching documents have meaningfully positive BM25 scores (e.g., score.unwrap_or(0.0) > 0.0) and where appropriate assert ordering of scores (e.g., doc-1 score > doc-2 score for the shorter field length case) so the test fails if BM25 returns zeros or non-monotonic scores.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 1828-1854: The extract_query_terms function currently skips Match
and Phrase extraction due to the guards using the target_field parameter; remove
the field-equality guards (the "if mq.field == target_field" and "if pq.field ==
target_field" checks) or eliminate the target_field parameter entirely so Match
and Phrase branches always call tokenize; alternatively replace callers to use
get_query_terms (which already extracts across fields) instead of
extract_query_terms. Update the extract_query_terms signature/usages (and any
code invoking it) so Match, Phrase, Term (exact) and Bool branches reliably
return terms regardless of target_field, keeping the Term branch behavior for
fuzziness and the Bool branch recursion.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5573ccb2-96f1-4f32-b17a-c2fc53781eaa
📒 Files selected for processing (1)
rust/crates/cloudsearch-index/src/lib.rs
- Add BM25 scoring for TermsQuery (OR over multiple values), using max IDF among matching terms
- Add SearchQuery::Terms to extract_query_terms for proper IDF computation
- Fix bm25_idf(0, n_docs) to return 1.0 instead of 0.0 (avoids zero scores when postings are empty after refresh but before flush)
- Change IDF fallback from 0.0 to 1.0 across all scoring functions
- Handle numeric and boolean values in TermsQuery via value equality
- Change phrase gap factor from multiplier to divisor to avoid amplifying large BM25 term scores
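The revised IDF behavior described above can be sketched as follows (an illustrative reimplementation, not the exact crate code):

```rust
// Sketch of the revised IDF: empty postings (df == 0) return a neutral 1.0
// instead of 0.0, so freshly refreshed but unflushed documents still score.
fn bm25_idf(df: usize, n_docs: usize) -> f32 {
    if df == 0 {
        return 1.0;
    }
    ((n_docs as f32 - df as f32 + 0.5) / (df as f32 + 0.5)).ln().max(0.0)
}

fn main() {
    assert!((bm25_idf(0, 100) - 1.0).abs() < 1e-6); // no postings yet: neutral score
    assert!(bm25_idf(5, 100) > 0.0); // rare term: positive IDF
    assert!(bm25_idf(90, 100).abs() < 1e-6); // very common term: clamped to 0
}
```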
- Remove unused _target_field variable and extract_target_field function (was never used since extract_query_terms is called with "")
- Remove per-token averaging from score_match_query: BM25 already normalizes via TF saturation, so dividing by matched token count distorted scores
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)
2098-2196: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`score_phrase_query` recomputes BM25 per matching position; hoist it out of the position loop.

The inner BM25 sum at lines 2162–2181 is rebuilt for every `first_pos` of the first term in this document, but `tf`, `doc_len`, `idf`, and `phrase.field` are all invariant within the document — only `max_gap` varies. For a document where the first term appears N times, this is N × (terms × readers) redundant work, plus an extra `tokenize(field_str)` per fallback path. Worse, the inner `for reader in positions_readers` on line 2165 shadows the outer `reader` and re-iterates every segment for every term, every position.

Compute the BM25 term-score sum once per document (outside the position loops), iterate occurrences only to find the smallest `max_gap`, then apply the gap divisor once.

♻️ Sketch of the refactor:

```rust
// Compute the doc-level BM25 sum once.
let mut term_score_sum = 0.0f32;
for term in &query_tokens {
    let mut tf = 0u32;
    for r in positions_readers {
        if let Some(pl) = r.get(term)
            && let Ok(idx) = pl.docs.binary_search_by(|p| p.doc_id.cmp(&doc_id))
        {
            tf += pl.docs[idx].term_freq;
        }
    }
    if tf == 0 {
        tf = field_tokens.iter().filter(|t| *t == term).count() as u32;
    }
    if tf == 0 {
        continue;
    }
    let idf = bm25_ctx.idf_map.get(term).copied().unwrap_or(1.0);
    term_score_sum += bm25_ctx.bm25_term_score(tf, doc_len, idf, &phrase.field);
}
// Then walk positions only to find the best (smallest) max_gap across phrase matches,
// and apply the divisor once at the end.
```

Side-note: line 2107 linearly scans `posting_list.docs` for the matching `doc_id` even though every other BM25 path here uses `binary_search_by(|p| p.doc_id.cmp(&doc_id))`. Worth aligning for consistency and O(log n) lookup.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 2098 - 2196, In score_phrase_query, the BM25 term_score_sum is recomputed inside the per-position loop (and the inner `for reader in positions_readers` shadows the outer `reader`), causing N× redundant work; move the BM25 accumulation out so you compute term_score_sum once per document (iterate over `query_tokens` and `positions_readers` to compute tf → idf → bm25_term_score, with the same fallback using `field_tokens`), then only iterate positions to find the best/minimum `max_gap` for that document and apply the gap divisor once to the precomputed term_score_sum before updating best_score; also replace the linear scan for the document in the posting_list with the existing binary_search_by pattern used elsewhere to avoid O(n) scans.
♻️ Duplicate comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)
794-807: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

IDF precompute still broken for Match/Phrase queries (regression of prior fix).

Line 797 still passes `""` as `target_field`, and the match arms in `extract_query_terms` (lines 1821, 1824) keep the `if mq.field == target_field` / `if pq.field == target_field` guards. Since no real field is ever named `""`, Match and Phrase terms are never extracted, `idf_map` stays empty for these queries, and `score_match_query` / `score_phrase_query` fall back to the constant `idf = 1.0` at lines 1954, 2179. The end-to-end BM25 IDF the PR advertises is effectively disabled for the most common query types.

This was previously flagged and marked "✅ Addressed", but the current diff has not actually changed the call site or the guards.

🔧 Suggested fix

Either drop the `target_field` parameter and guards so Match/Phrase always tokenize, or reuse the unguarded `get_query_terms` for the IDF precompute:

```diff
-    let query_terms: std::collections::BTreeSet<String> =
-        extract_query_terms(query, "").into_iter().collect();
+    let query_terms: std::collections::BTreeSet<String> =
+        get_query_terms(query).into_iter().collect();
```

And, if `extract_query_terms` has no other callers after this change, remove it (and its `target_field` parameter) to avoid future foot-guns.

Also applies to: 1818-1850
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 794 - 807, The IDF precompute is calling extract_query_terms(query, "") so Match/Phrase terms are never collected due to the target_field guards; change the IDF precompute to use the unguarded get_query_terms(query) (or call extract_query_terms without a field guard by removing the target_field check), so idf_map is populated for Match/Phrase queries used by score_match_query and score_phrase_query; if you switch to get_query_terms and extract_query_terms is no longer used, remove the redundant extract_query_terms signature/guards to avoid future regressions.
🧹 Nitpick comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)
1805-1816: 💤 Low value

Docstring overstates when `bm25_idf` returns 0.

The doc comment says it returns 0 "when df > N (shouldn't happen in practice)", but the actual `.max(0.0)` clamp also fires for any term appearing in more than half the corpus (whenever `(N - df + 0.5) / (df + 0.5) < 1`, i.e. `df > (N - 0.5) / 2`). That's a common, expected case — stopword-like terms in small corpora — and is exactly what the new test at lines 5778–5780 already acknowledges. Worth updating the comment so future readers don't chase a phantom invariant.
.max(0.0)clamp also fires for any term appearing in more than half the corpus (whenever(N - df + 0.5) / (df + 0.5) < 1, i.e.df > (N - 0.5) / 2). That's a common, expected case — stopword-like terms in small corpora — and is exactly what the new test at lines 5778–5780 already acknowledges. Worth updating the comment so future readers don't chase a phantom invariant.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 1805 - 1816, The doc comment for bm25_idf is inaccurate about when the function returns 0; update the docs to reflect that the `.max(0.0)` clamp will produce 0 whenever the computed IDF is negative (e.g., for terms appearing in roughly more than half the corpus, not only the impossible df > N case), while preserving the current behavior that df == 0 returns a default 1.0; edit the comment above fn bm25_idf to state these exact conditions (df == 0 -> returns 1.0; otherwise compute log((N - df + 0.5)/(df + 0.5)) and clamp negatives to 0) so future readers aren’t misled.
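The clamp threshold in that nitpick is easy to check numerically. The function below reimplements the formula as described in this PR for illustration; the values are made up.

```rust
// Same formula as described in the PR: df == 0 returns a neutral 1.0,
// otherwise ln((N - df + 0.5) / (df + 0.5)) clamped at 0.
fn bm25_idf(df: usize, n_docs: usize) -> f32 {
    if df == 0 {
        return 1.0;
    }
    ((n_docs as f32 - df as f32 + 0.5) / (df as f32 + 0.5)).ln().max(0.0)
}

fn main() {
    // N = 100: the raw log reaches zero once df > (100 - 0.5) / 2 = 49.75,
    // i.e. already at df = 50 the clamp is in effect — far from "df > N".
    assert!(bm25_idf(49, 100) > 0.0);
    assert!(bm25_idf(50, 100).abs() < 1e-6);
    assert!(bm25_idf(51, 100).abs() < 1e-6);
}
```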
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 789-807: The IDF calculation mixes corpora: n_docs is taken from
searchable_documents while total_df is summed from positions_readers (populated
only on flush), causing inconsistent IDF after refresh; fix by making N and DF
come from the same source — either (A) compute total_df directly from
searchable_documents by checking/tokenizing each document for presence of each
term (use the same tokenization as extract_query_terms and
per_doc_inverted_index) and keep n_docs = searchable_documents.len().max(1), or
(B) compute both N and DF from positions_readers (set n_docs =
sum(reader.docs.len() for reader in self.positions_readers).max(1)) so
bm25_idf(total_df, n_docs) uses consistent corpora; update the block around
extract_query_terms, idf_map, and bm25_idf accordingly and ensure behavior is
consistent with flush/refresh semantics referenced in flush and refresh.
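Option (B) from the prompt above can be sketched as follows. The reader shape is a simplified stand-in; the real `PositionsReader` API is likely different.

```rust
use std::collections::BTreeMap;

// Simplified stand-in: each segment reader maps term -> posting doc ids.
struct SegmentReader {
    postings: BTreeMap<String, Vec<u64>>,
    num_docs: usize,
}

// Compute N and df from the same source (the flushed segments) so the
// IDF never mixes corpora with different refresh/flush visibility.
fn idf_inputs(readers: &[SegmentReader], term: &str) -> (usize, usize) {
    let n_docs = readers.iter().map(|r| r.num_docs).sum::<usize>().max(1);
    let df: usize = readers
        .iter()
        .filter_map(|r| r.postings.get(term))
        .map(Vec::len)
        .sum();
    (df, n_docs)
}

fn main() {
    let seg = SegmentReader {
        postings: BTreeMap::from([("hello".to_string(), vec![1, 3])]),
        num_docs: 4,
    };
    let (df, n) = idf_inputs(&[seg], "hello");
    assert_eq!((df, n), (2, 4));
    // A term absent from all segments yields df = 0 against the same N.
}
```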
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: af4932e1-65bf-43c3-b1a5-baddc87e2d7f
📒 Files selected for processing (1)
rust/crates/cloudsearch-index/src/lib.rs
- Add BM25 scoring for Prefix and Wildcard queries using per-token scoring
- Use binary match (matches_prefix_query/matches_wildcard_query) as initial filter before BM25 scoring to preserve matching behavior
- Prefix scoring: iterate field tokens, score those starting with the prefix
- Wildcard scoring: iterate field tokens, score those matching the wildcard pattern
- For prefix queries, pre-compute IDF for the full stored value in the idf_map since the prefix check uses the full field value first
When a field token matches the prefix but isn't in idf_map directly, fall back to the full stored value's IDF (pre-computed in search()). This ensures prefix scores reflect actual term rarity rather than defaulting all unmatched tokens to IDF=1.0.
Apply cargo fmt and fix clippy issues:

- Use `or_insert` with entry API for contains_key + insert pattern
- Use `str::to_lowercase` instead of closure
- Add `#[allow(clippy::cast_precision_loss)]` to numeric cast-heavy functions
- Use `u32::try_from(...).unwrap_or(0)` for safe truncation
- Collapse nested if with `&& let` pattern
- Add `#[allow(clippy::too_many_lines)]` on search function
- Use `k.clone()` instead of `k.to_string()`
- Fix trailing comma formatting on match arms
Summary
- `bm25_idf()` for inverse document frequency using the standard Lucene formula: `ln((N - df + 0.5) / (df + 0.5))`
- `compute_avg_field_length()` to compute average field length for field length normalization (b=0.75)
- `score_match_query` now uses the full BM25 formula with term frequency from postings and document frequency from `PositionsReader` across all segments
- `search()` precomputes IDF map and average field length before scoring, then threads them through the scoring chain
- `k1=1.2` (TF saturation parameter) and `b=0.75` (field length normalization) are the standard Lucene defaults

Test plan

- `cargo test --workspace` — 333 tests pass
- `cargo clippy --workspace --all-targets -- -D warnings` — clean

Summary by CodeRabbit
Improvements
Tests