
feat(index): implement BM25 relevance scoring #95

Open
poyrazK wants to merge 10 commits into main from feature/bm25-scoring

Conversation

poyrazK (Owner) commented May 7, 2026

Summary

  • Replace binary token-overlap scoring with BM25 probabilistic relevance scoring for match/term/phrase queries
  • Add bm25_idf() for inverse document frequency using the standard Lucene formula: ln((N - df + 0.5) / (df + 0.5))
  • Add compute_avg_field_length() to compute average field length for field length normalization (b=0.75)
  • score_match_query now uses the full BM25 formula with term frequency from postings and document frequency from PositionsReader across all segments
  • search() precomputes IDF map and average field length before scoring, then threads them through the scoring chain
  • k1=1.2 (TF saturation parameter) and b=0.75 (field length normalization) are the standard Lucene defaults
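The IDF formula named above can be sketched as a standalone Rust function. This is a minimal sketch, not the crate's actual helper: the real `bm25_idf` in `lib.rs` may differ in signature, and the `df == 0` fallback of 1.0 reflects a later commit in this PR.

```rust
/// Lucene-style BM25 IDF: ln((N - df + 0.5) / (df + 0.5)), clamped at 0.
/// Sketch only; parameter order and types are assumptions.
fn bm25_idf(df: usize, n_docs: usize) -> f32 {
    if df == 0 {
        // Terms absent from postings fall back to 1.0 (per a later
        // commit: avoids zero scores after refresh but before flush).
        return 1.0;
    }
    let (n, d) = (n_docs as f32, df as f32);
    ((n - d + 0.5) / (d + 0.5)).ln().max(0.0)
}

fn main() {
    // A rare term in a 1000-doc corpus gets a large IDF...
    println!("{:.3}", bm25_idf(3, 1000));
    // ...while a term appearing in over half the corpus clamps to 0.
    println!("{:.3}", bm25_idf(600, 1000));
}
```

Note the clamp: the raw logarithm goes negative whenever df exceeds roughly half of N, which is exactly the stopword case a later review comment on this PR discusses.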

Test plan

  • cargo test --workspace — 333 tests pass
  • cargo clippy --workspace --all-targets -- -D warnings — clean

Summary by CodeRabbit

  • Improvements

    • Search ranking now uses BM25-style scoring for more relevant results — improved TF/IDF weighting, field-length normalization, phrase scoring using BM25 term contributions, and BM25-aware boolean clause evaluation. Added robust term-frequency aggregation with fallbacks when index postings are incomplete.
  • Tests

    • Updated tests to assert stable hit ordering and non-negative scores consistent with BM25 behavior.

Review Change Stack

Replace binary token-overlap scoring with BM25 (Best Matching 25)
probabilistic scoring for match/term/phrase queries.

- Add bm25_idf() for inverse document frequency computation
- Add compute_avg_field_length() for field length normalization
- Add extract_query_terms() to collect query terms for IDF precomputation
- score_match_query now uses full BM25 formula with TF from postings
  and document frequency from PositionsReader across all segments
- search() precomputes IDF map and avg field length before scoring
- Update match_queries_find_tokens_in_text_fields test to accept
  BM25 scores (non-exact since scoring now considers term frequency
  and document frequency, not just token overlap)
- Fix clippy collapsible_if in compute_avg_field_length
- Add #[allow(clippy::too_many_arguments)] to scoring functions
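The `compute_avg_field_length` helper listed above could look roughly like this. A sketch under stated assumptions: documents modeled as plain string maps and whitespace tokenization, whereas the crate stores structured documents and applies its own tokenizer, and the `.max(1.0)` clamp mentioned in the review is applied at the call site rather than here.

```rust
use std::collections::BTreeMap;

/// Average token count of `field` across all documents that contain it,
/// used for BM25's b = 0.75 length normalization. Sketch only.
fn compute_avg_field_length(docs: &[BTreeMap<String, String>], field: &str) -> f32 {
    let (mut total_tokens, mut doc_count) = (0usize, 0usize);
    for doc in docs {
        if let Some(value) = doc.get(field) {
            total_tokens += value.split_whitespace().count();
            doc_count += 1;
        }
    }
    if doc_count == 0 {
        0.0 // caller clamps with .max(1.0) before dividing
    } else {
        total_tokens as f32 / doc_count as f32
    }
}

fn main() {
    let docs = vec![
        BTreeMap::from([("content".to_string(), "hello world".to_string())]),
        BTreeMap::from([("content".to_string(), "hello brave new world".to_string())]),
    ];
    // Average of a 2-token and a 4-token field.
    println!("{}", compute_avg_field_length(&docs, "content"));
}
```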

coderabbitai Bot commented May 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Search ranking switched to BM25-style scoring: the search entry now builds a BM25 context (k1, b, per-query-term IDF, average field length) and passes it through revised scoring functions so match/term/phrase/boolean queries compute BM25 TF/IDF with field-length normalization.
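The per-term score described here follows the standard BM25 shape. The sketch below shows how the walkthrough's pieces (k1, b, IDF, average field length) combine; the actual code makes this a method on `Bm25Context`, so the free-function form and parameter names here are assumptions.

```rust
/// Standard BM25 per-term contribution: TF saturation via k1 and
/// field-length normalization via b. Sketch, not the PR's exact method.
fn bm25_term_score(tf: u32, doc_len: usize, avg_len: f32, idf: f32) -> f32 {
    let (k1, b) = (1.2_f32, 0.75_f32); // Lucene defaults, as in the PR
    let tf = tf as f32;
    let norm = 1.0 - b + b * (doc_len as f32 / avg_len.max(1.0));
    idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)
}

fn main() {
    // Same term frequency, same IDF: the shorter-than-average field
    // scores higher thanks to length normalization.
    let short = bm25_term_score(2, 5, 10.0, 1.5);
    let long = bm25_term_score(2, 20, 10.0, 1.5);
    println!("{short:.3} vs {long:.3}");
}
```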

Changes

BM25 Search Ranking Enhancement

All changes are in rust/crates/cloudsearch-index/src/lib.rs:

  • Search entry (build Bm25Context): IndexHandle::search computes BM25 constants, aggregates DF across positions_readers, computes average field lengths from stored docs, constructs Bm25Context, and passes it into scoring.
  • BM25 helper types and scoring core: adds Bm25Context, bm25_idf, extract_query_terms, and compute_avg_field_length; updates score_query to accept Bm25Context and routes queries to BM25-aware scoring implementations.
  • Phrase scoring: score_phrase_query verifies proximity, sums per-term BM25 scores (postings TF with a token-count fallback), then applies the existing gap penalty by dividing when gaps exceed the threshold.
  • Boolean query BM25 propagation: score_bool_query threads Bm25Context through clause scoring and averages clause BM25 scores while preserving must/filter/must_not semantics.
  • Test expectation updates: match_queries_find_tokens_in_text_fields now asserts stable hit ordering and non-negative BM25-consistent scores instead of exact fixed-score expectations.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 In tokened fields I hop and see,

I count the rare and common free,
k1 and b hum soft and true,
Scores now sum where gaps once grew,
A rabbit nods — ranks bloom anew.

🚥 Pre-merge checks: ✅ 5 passed

  • Description Check: ✅ Passed (skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the PR title clearly and concisely describes the main change: implementing BM25 relevance scoring in the search index)
  • Docstring Coverage: ✅ Passed (coverage is 100.00%, above the required 80.00% threshold)
  • Linked Issues Check: ✅ Passed (skipped; no linked issues were found for this pull request)
  • Out of Scope Changes Check: ✅ Passed (skipped; no linked issues were found for this pull request)

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 805-806: The BM25 normalization currently computes avg_field_len
using a hardcoded "content" which misnormalizes scores for other fields; in
score_match_query replace the literal "content" with the actual field from the
query (query.field) when calling compute_avg_field_length (and keep the existing
.max(1.0) fallback), so avg_field_len is computed per the field being scored and
normalization is correct; locate compute_avg_field_length and score_match_query
references to update the argument to use query.field (with a safe fallback if
query.field can be empty).
- Around line 794-804: The IDF precompute loop uses extract_query_terms(query,
"") which strips field names and thus fails to extract terms for fielded
Match/Phrase queries, leaving idf_map empty; change the extraction to preserve
the query's field context so term lookup against self.positions_readers works:
call extract_query_terms with the actual field (or a variant that returns
fielded terms) for the incoming Query used by BM25, then compute total_df by
iterating readers and summing pl.docs.len(), compute idf via bm25_idf and insert
into idf_map as before (ensure the functions extract_query_terms, idf_map,
bm25_idf, and positions_readers are used consistently so Match/Phrase terms are
included).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5171ac10-b9ef-484c-9ce3-0775dcfc7673

📥 Commits

Reviewing files that changed from the base of the PR and between 189579c and 0672d7e.

📒 Files selected for processing (1)
  • rust/crates/cloudsearch-index/src/lib.rs

Comment thread rust/crates/cloudsearch-index/src/lib.rs Outdated
poyrazK added 4 commits May 7, 2026 20:09
- Add extract_target_field() helper to get field name from query
- Use actual query field (not hardcoded "content") for avg_field_len
- Deduplicate query_terms via BTreeSet before IDF loop
- Add Bm25Context struct holding idf_map, avg_field_len, k1, b
  to eliminate clippy::too_many_arguments warnings on scoring functions
- Add score_term_query() for non-fuzzy exact term queries using BM25;
  falls back to binary scoring for non-string fields (bool, numeric)
- score_phrase_query() now sums per-term BM25 scores instead of
  returning 1.0 for valid phrases; gap penalty still applies
- Update score_query, score_match_query, score_bool_query to use
  Bm25Context instead of 4 separate parameters
- Collapse two nested if-let chains per clippy::collapsible_if
Terms never seen in the corpus should contribute 0 to the BM25
score, not 1.0. When a term is missing from the precomputed idf_map
it means df=0, and the correct IDF contribution is 0.
Bm25Context now stores avg_field_lens: BTreeMap<String, f32> instead of
a single avg_field_len. Each field in searchable_documents gets its own
average field length computed and stored at search time.

All scoring functions (score_match_query, score_term_query,
score_phrase_query) now pass the query field to bm25_term_score,
which looks up the field-specific avg_field_len via get_avg_field_len.

extract_target_field is still called but result is unused for avg_field_len
computation (it's now computed per-field). Marked with underscore prefix
to suppress warning.
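The commit above can be pictured roughly as follows. Field names and types are assumptions based on the commit message, not the crate's actual definitions; in particular `get_avg_field_len` here folds in the `.max(1.0)` fallback mentioned elsewhere in this PR.

```rust
use std::collections::BTreeMap;

/// Sketch of the context threaded through the scoring chain after
/// this commit: per-field average lengths instead of a single value.
struct Bm25Context {
    k1: f32,
    b: f32,
    idf_map: BTreeMap<String, f32>,
    avg_field_lens: BTreeMap<String, f32>,
}

impl Bm25Context {
    /// Field-specific average length, clamped so length normalization
    /// never divides by zero.
    fn get_avg_field_len(&self, field: &str) -> f32 {
        self.avg_field_lens.get(field).copied().unwrap_or(1.0).max(1.0)
    }
}

fn main() {
    let ctx = Bm25Context {
        k1: 1.2,
        b: 0.75,
        idf_map: BTreeMap::new(),
        avg_field_lens: BTreeMap::from([("title".to_string(), 4.0)]),
    };
    println!("{}", ctx.get_avg_field_len("title"));
    println!("{}", ctx.get_avg_field_len("missing")); // falls back to 1
}
```

Keeping k1, b, the IDF map, and the per-field averages in one struct is also what resolves the clippy::too_many_arguments warnings the earlier commit worked around.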

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)

794-808: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

IDF precompute is still disabled for Match/Phrase queries.

The call extract_query_terms(query, "") at line 798 passes an empty string as target_field. Looking at extract_query_terms (lines 1831-1836), Match and Phrase queries have guards checking mq.field == target_field and pq.field == target_field. Since actual query fields (e.g., "message", "content") never equal "", these branches never match, so no terms are extracted.

This means idf_map remains empty for Match/Phrase query terms, causing IDF to default to 0.0 (line 1958), which makes BM25 scores effectively 0 for all matching documents.

The existing get_query_terms function (lines 2323-2349) extracts terms from Match/Phrase queries unconditionally and should be used instead.

Proposed fix
-        let _target_field = extract_target_field(query);
-        let mut idf_map: std::collections::BTreeMap<String, f32> =
-            std::collections::BTreeMap::new();
-        let query_terms: std::collections::BTreeSet<String> =
-            extract_query_terms(query, "").into_iter().collect();
+        let mut idf_map: std::collections::BTreeMap<String, f32> =
+            std::collections::BTreeMap::new();
+        let query_terms: std::collections::BTreeSet<String> =
+            get_query_terms(query).into_iter().collect();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 794 - 808, The IDF
precompute is using extract_query_terms(query, "") which passes an empty
target_field so Match/Phrase branches never match; replace that call with
get_query_terms(query) (or pass the actual target_field from
extract_target_field(query) into extract_query_terms) so query_terms correctly
contains Match/Phrase terms before computing idf_map with bm25_idf; update the
query_terms initialization used by the idf_map loop (and keep idf_map, bm25_idf,
and positions_readers usage unchanged).
🧹 Nitpick comments (3)
rust/crates/cloudsearch-index/src/lib.rs (3)

1817-1826: 💤 Low value

extract_target_field is unused and can be removed.

This function is called at line 794 but the result is assigned to _target_field which is never used. Per the PR description, this is intentional to suppress warnings, but the function and call should be removed entirely to avoid dead code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 1817 - 1826, The
function extract_target_field is dead code and its sole invocation (where the
result is assigned to _target_field) is intentionally unused; remove the
extract_target_field function definition from lib.rs and delete the
corresponding call that assigns to _target_field so there is no unused function
or unused value; ensure no other references to extract_target_field remain and
run cargo build to confirm no missing imports or references.

809-825: 🏗️ Heavy lift

Per-field average length computed for all fields on every search.

This code iterates over all documents to collect all unique field names, then for each field calls compute_avg_field_length which iterates over all documents again. For N documents and F unique fields, this is O(N × F) work on every search call.

Consider computing avg_field_lens only for fields actually used in the query (extractable via extract_target_field or similar), or caching this computation and updating incrementally on index modifications.

Sketch of optimization
// Only compute for the field(s) used by the query
let query_field = extract_target_field(query);
let mut avg_field_lens: std::collections::BTreeMap<String, f32> =
    std::collections::BTreeMap::new();
avg_field_lens.insert(
    query_field.clone(),
    compute_avg_field_length(&self.searchable_documents, &query_field).max(1.0),
);
// For Bool queries, may need to extract multiple fields
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 809 - 825, The current
search builds avg_field_lens for every field by scanning
self.searchable_documents for all fields and then calling
compute_avg_field_length per field, which is O(N×F); change it to only compute
averages for fields actually referenced by the query (use extract_target_field
or similar to collect target fields from the query/Bool clauses) and build
avg_field_lens just for those fields, or implement an incremental cache updated
on index changes keyed by field name; update the code paths that reference
avg_field_lens (the variable avg_field_lens and calls to
compute_avg_field_length) to use the reduced set (or cached values) and ensure a
.max(1.0) clamp is still applied.

5731-5753: ⚡ Quick win

Test assertions don't validate BM25 scoring behavior.

The tests only assert score >= 0.0 which passes even when BM25 scores are 0 (due to the IDF precompute bug). Consider adding assertions that verify non-zero scores for matching documents, or at least that scores increase with term frequency:

// Example: verify score is meaningfully positive for a match
assert!(hello.hits.hits[0].score.unwrap_or(0.0) > 0.0,
    "BM25 score should be positive for matching documents");

This would help catch regressions in the scoring logic.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 5731 - 5753, The test
currently only asserts scores are >= 0.0 for the MatchQuery results (variables
hello and both), which won’t catch a zero-IDF bug; update the assertions around
handle.search / SearchRequest / MatchQuery to assert that matching documents
have meaningfully positive BM25 scores (e.g., score.unwrap_or(0.0) > 0.0) and
where appropriate assert ordering of scores (e.g., doc-1 score > doc-2 score for
the shorter field length case) so the test fails if BM25 returns zeros or
non-monotonic scores.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 1828-1854: The extract_query_terms function currently skips Match
and Phrase extraction due to the guards using the target_field parameter; remove
the field-equality guards (the "if mq.field == target_field" and "if pq.field ==
target_field" checks) or eliminate the target_field parameter entirely so Match
and Phrase branches always call tokenize; alternatively replace callers to use
get_query_terms (which already extracts across fields) instead of
extract_query_terms. Update the extract_query_terms signature/usages (and any
code invoking it) so Match, Phrase, Term (exact) and Bool branches reliably
return terms regardless of target_field, keeping the Term branch behavior for
fuzziness and the Bool branch recursion.

---

Duplicate comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 794-808: The IDF precompute is using extract_query_terms(query,
"") which passes an empty target_field so Match/Phrase branches never match;
replace that call with get_query_terms(query) (or pass the actual target_field
from extract_target_field(query) into extract_query_terms) so query_terms
correctly contains Match/Phrase terms before computing idf_map with bm25_idf;
update the query_terms initialization used by the idf_map loop (and keep
idf_map, bm25_idf, and positions_readers usage unchanged).

---

Nitpick comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 1817-1826: The function extract_target_field is dead code and its
sole invocation (where the result is assigned to _target_field) is intentionally
unused; remove the extract_target_field function definition from lib.rs and
delete the corresponding call that assigns to _target_field so there is no
unused function or unused value; ensure no other references to
extract_target_field remain and run cargo build to confirm no missing imports or
references.
- Around line 809-825: The current search builds avg_field_lens for every field
by scanning self.searchable_documents for all fields and then calling
compute_avg_field_length per field, which is O(N×F); change it to only compute
averages for fields actually referenced by the query (use extract_target_field
or similar to collect target fields from the query/Bool clauses) and build
avg_field_lens just for those fields, or implement an incremental cache updated
on index changes keyed by field name; update the code paths that reference
avg_field_lens (the variable avg_field_lens and calls to
compute_avg_field_length) to use the reduced set (or cached values) and ensure a
.max(1.0) clamp is still applied.
- Around line 5731-5753: The test currently only asserts scores are >= 0.0 for
the MatchQuery results (variables hello and both), which won’t catch a zero-IDF
bug; update the assertions around handle.search / SearchRequest / MatchQuery to
assert that matching documents have meaningfully positive BM25 scores (e.g.,
score.unwrap_or(0.0) > 0.0) and where appropriate assert ordering of scores
(e.g., doc-1 score > doc-2 score for the shorter field length case) so the test
fails if BM25 returns zeros or non-monotonic scores.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5573ccb2-96f1-4f32-b17a-c2fc53781eaa

📥 Commits

Reviewing files that changed from the base of the PR and between 0bf2005 and 14013f7.

📒 Files selected for processing (1)
  • rust/crates/cloudsearch-index/src/lib.rs

Comment thread rust/crates/cloudsearch-index/src/lib.rs Outdated
poyrazK added 2 commits May 12, 2026 18:57
- Add BM25 scoring for TermsQuery (OR over multiple values), using max
  IDF among matching terms
- Add SearchQuery::Terms to extract_query_terms for proper IDF computation
- Fix bm25_idf(0, n_docs) to return 1.0 instead of 0.0 (avoids zero
  scores when postings are empty after refresh but before flush)
- Change IDF fallback from 0.0 to 1.0 across all scoring functions
- Handle numeric and boolean values in TermsQuery via value equality
- Change phrase gap factor from multiplier to divisor to avoid amplifying
  large BM25 term scores
- Remove unused _target_field variable and extract_target_field function
  (was never used since extract_query_terms is called with "")
- Remove per-token averaging from score_match_query: BM25 already normalizes
  via TF saturation, so dividing by matched token count distorted scores
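The "max IDF among matching terms" behavior for TermsQuery described in the first bullet could be sketched like this. This is a simplified illustration, not the crate's API: the function name, the flat token/value representation, and feeding the result straight back as a score are all assumptions (the real path presumably runs the winning IDF through the full BM25 term formula).

```rust
use std::collections::BTreeMap;

/// Sketch: a terms query is an OR over values; score it by the maximum
/// IDF among the values actually present in the field's tokens, with
/// the PR's IDF fallback of 1.0 for terms missing from the map.
fn score_terms_query(
    field_tokens: &[String],
    values: &[String],
    idf_map: &BTreeMap<String, f32>,
) -> f32 {
    values
        .iter()
        .filter(|v| field_tokens.iter().any(|t| t == *v))
        .map(|v| idf_map.get(v).copied().unwrap_or(1.0))
        .fold(0.0_f32, f32::max)
}

fn main() {
    let tokens: Vec<String> = ["rust", "search"].iter().map(|s| (*s).to_string()).collect();
    let idf = BTreeMap::from([("rust".to_string(), 2.0_f32)]);
    // "rust" matches with IDF 2.0; "go" does not match at all.
    let score = score_terms_query(&tokens, &["rust".to_string(), "go".to_string()], &idf);
    println!("{score}");
}
```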

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)

2098-2196: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

score_phrase_query recomputes BM25 per matching position; hoist it out of the position loop.

The inner BM25 sum at lines 2162–2181 is rebuilt for every first_pos of the first term in this document, but tf, doc_len, idf, and phrase.field are all invariant within the document — only max_gap varies. For a document where the first term appears N times, this is N × (terms × readers) redundant work, plus an extra tokenize(field_str) per fallback path. Worse, the inner for reader in positions_readers on line 2165 shadows the outer reader and re-iterates every segment for every term, every position.

Compute the BM25 term-score sum once per document (outside the position loops), iterate occurrences only to find the smallest max_gap, then apply the gap divisor once.

♻️ Sketch of the refactor
// Compute the doc-level BM25 sum once.
let mut term_score_sum = 0.0f32;
for term in &query_tokens {
    let mut tf = 0u32;
    for r in positions_readers {
        if let Some(pl) = r.get(term)
            && let Ok(idx) = pl.docs.binary_search_by(|p| p.doc_id.cmp(&doc_id))
        {
            tf += pl.docs[idx].term_freq;
        }
    }
    if tf == 0 {
        tf = field_tokens.iter().filter(|t| *t == term).count() as u32;
    }
    if tf == 0 {
        continue;
    }
    let idf = bm25_ctx.idf_map.get(term).copied().unwrap_or(1.0);
    term_score_sum += bm25_ctx.bm25_term_score(tf, doc_len, idf, &phrase.field);
}

// Then walk positions only to find the best (smallest) max_gap across phrase matches,
// and apply the divisor once at the end.

Side-note: line 2107 linearly scans posting_list.docs for the matching doc_id even though every other BM25 path here uses binary_search_by(|p| p.doc_id.cmp(&doc_id)). Worth aligning for consistency and O(log n) lookup.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 2098 - 2196, In
score_phrase_query, the BM25 term_score_sum is recomputed inside the
per-position loop (and the inner `for reader in positions_readers` shadows the
outer `reader`), causing N× redundant work; move the BM25 accumulation out so
you compute term_score_sum once per document (iterate over `query_tokens` and
`positions_readers` to compute tf → idf → bm25_term_score, with the same
fallback using `field_tokens`), then only iterate positions to find the
best/minimum `max_gap` for that document and apply the gap divisor once to the
precomputed term_score_sum before updating best_score; also replace the linear
scan for the document in the posting_list with the existing binary_search_by
pattern used elsewhere to avoid O(n) scans.
♻️ Duplicate comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)

794-807: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

IDF precompute still broken for Match/Phrase queries (regression of prior fix).

Line 797 still passes "" as target_field, and the match arms in extract_query_terms (lines 1821, 1824) keep the if mq.field == target_field / if pq.field == target_field guards. Since no real field is ever named "", Match and Phrase terms are never extracted, idf_map stays empty for these queries, and score_match_query/score_phrase_query fall back to the constant idf = 1.0 at lines 1954, 2179. The end-to-end BM25 IDF the PR advertises is effectively disabled for the most common query types.

This was previously flagged and marked "✅ Addressed", but the current diff has not actually changed the call site or the guards.

🔧 Suggested fix

Either drop the target_field parameter+guards so Match/Phrase always tokenize, or reuse the unguarded get_query_terms for the IDF precompute:

-        let query_terms: std::collections::BTreeSet<String> =
-            extract_query_terms(query, "").into_iter().collect();
+        let query_terms: std::collections::BTreeSet<String> =
+            get_query_terms(query).into_iter().collect();

And, if extract_query_terms has no other callers after this change, remove it (and its target_field parameter) to avoid future foot-guns.

Also applies to: 1818-1850

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 794 - 807, The IDF
precompute is calling extract_query_terms(query, "") so Match/Phrase terms are
never collected due to the target_field guards; change the IDF precompute to use
the unguarded get_query_terms(query) (or call extract_query_terms without a
field guard by removing the target_field check), so idf_map is populated for
Match/Phrase queries used by score_match_query and score_phrase_query; if you
switch to get_query_terms and extract_query_terms is no longer used, remove the
redundant extract_query_terms signature/guards to avoid future regressions.
🧹 Nitpick comments (1)
rust/crates/cloudsearch-index/src/lib.rs (1)

1805-1816: 💤 Low value

Docstring overstates when bm25_idf returns 0.

The doc comment says it returns 0 "when df > N (shouldn't happen in practice)", but the actual .max(0.0) clamp also fires for any term appearing in more than half the corpus (whenever (N - df + 0.5) / (df + 0.5) < 1, i.e. df > (N - 0.5) / 2). That's a common, expected case — stopword-like terms in small corpora — and is exactly what the new test at lines 5778–5780 already acknowledges. Worth updating the comment so future readers don't chase a phantom invariant.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/cloudsearch-index/src/lib.rs` around lines 1805 - 1816, The doc
comment for bm25_idf is inaccurate about when the function returns 0; update the
docs to reflect that the `.max(0.0)` clamp will produce 0 whenever the computed
IDF is negative (e.g., for terms appearing in roughly more than half the corpus,
not only the impossible df > N case), while preserving the current behavior that
df == 0 returns a default 1.0; edit the comment above fn bm25_idf to state these
exact conditions (df == 0 -> returns 1.0; otherwise compute log((N - df +
0.5)/(df + 0.5)) and clamp negatives to 0) so future readers aren’t misled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 789-807: The IDF calculation mixes corpora: n_docs is taken from
searchable_documents while total_df is summed from positions_readers (populated
only on flush), causing inconsistent IDF after refresh; fix by making N and DF
come from the same source — either (A) compute total_df directly from
searchable_documents by checking/tokenizing each document for presence of each
term (use the same tokenization as extract_query_terms and
per_doc_inverted_index) and keep n_docs = searchable_documents.len().max(1), or
(B) compute both N and DF from positions_readers (set n_docs =
sum(reader.docs.len() for reader in self.positions_readers).max(1)) so
bm25_idf(total_df, n_docs) uses consistent corpora; update the block around
extract_query_terms, idf_map, and bm25_idf accordingly and ensure behavior is
consistent with flush/refresh semantics referenced in flush and refresh.

---

Outside diff comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 2098-2196: In score_phrase_query, the BM25 term_score_sum is
recomputed inside the per-position loop (and the inner `for reader in
positions_readers` shadows the outer `reader`), causing N× redundant work; move
the BM25 accumulation out so you compute term_score_sum once per document
(iterate over `query_tokens` and `positions_readers` to compute tf → idf →
bm25_term_score, with the same fallback using `field_tokens`), then only iterate
positions to find the best/minimum `max_gap` for that document and apply the gap
divisor once to the precomputed term_score_sum before updating best_score; also
replace the linear scan for the document in the posting_list with the existing
binary_search_by pattern used elsewhere to avoid O(n) scans.

---

Duplicate comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 794-807: The IDF precompute is calling extract_query_terms(query,
"") so Match/Phrase terms are never collected due to the target_field guards;
change the IDF precompute to use the unguarded get_query_terms(query) (or call
extract_query_terms without a field guard by removing the target_field check),
so idf_map is populated for Match/Phrase queries used by score_match_query and
score_phrase_query; if you switch to get_query_terms and extract_query_terms is
no longer used, remove the redundant extract_query_terms signature/guards to
avoid future regressions.

---

Nitpick comments:
In `@rust/crates/cloudsearch-index/src/lib.rs`:
- Around line 1805-1816: The doc comment for bm25_idf is inaccurate about when
the function returns 0; update the docs to reflect that the `.max(0.0)` clamp
will produce 0 whenever the computed IDF is negative (e.g., for terms appearing
in roughly more than half the corpus, not only the impossible df > N case),
while preserving the current behavior that df == 0 returns a default 1.0; edit
the comment above fn bm25_idf to state these exact conditions (df == 0 ->
returns 1.0; otherwise compute log((N - df + 0.5)/(df + 0.5)) and clamp
negatives to 0) so future readers aren’t misled.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af4932e1-65bf-43c3-b1a5-baddc87e2d7f

📥 Commits

Reviewing files that changed from the base of the PR and between 14013f7 and dca94fc.

📒 Files selected for processing (1)
  • rust/crates/cloudsearch-index/src/lib.rs

Comment thread rust/crates/cloudsearch-index/src/lib.rs
poyrazK added 3 commits May 14, 2026 16:53
- Add BM25 scoring for Prefix and Wildcard queries using per-token scoring
- Use binary match (matches_prefix_query/matches_wildcard_query) as initial
  filter before BM25 scoring to preserve matching behavior
- Prefix scoring: iterate field tokens, score those starting with the prefix
- Wildcard scoring: iterate field tokens, score those matching the wildcard pattern
- For prefix queries, pre-compute IDF for the full stored value in the idf_map
  since the prefix check uses the full field value first
When a field token matches the prefix but isn't in idf_map directly,
fall back to the full stored value's IDF (pre-computed in search()).
This ensures prefix scores reflect actual term rarity rather than
defaulting all unmatched tokens to IDF=1.0.
Apply cargo fmt and fix clippy issues:
- Use `or_insert` with entry API for contains_key + insert pattern
- Use `str::to_lowercase` instead of closure
- Add `#[allow(clippy::cast_precision_loss)]` to numeric cast-heavy functions
- Use `u32::try_from(...).unwrap_or(0)` for safe truncation
- Collapse nested if with `&& let` pattern
- Add `#[allow(clippy::too_many_lines)]` on search function
- Use `k.clone()` instead of `k.to_string()`
- Fix trailing comma formatting on match arms