feat: opt-in hybrid retrieval (BM25 + dense + RRF) #9
Merged
Add the opt-in vector layer for hybrid retrieval: a sqlite-vec backed chunk_vec store with a dim stamp in schema_meta (mixing vector spaces fails loudly), deterministic KNN tie-breaks for replayable retrieval, and an Embedder protocol with a lazy sentence-transformers backend so the base install never pays for torch. sqlite-vec joins the dev extra (tiny wheel) so CI exercises the vec tests; tests use a deterministic FakeEmbedder via the set_embedder_factory hook instead. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
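A minimal sketch of the two ideas in this commit, not the PR's actual code: an `Embedder` protocol with a deterministic fake for tests, and a dimension stamp that refuses a second vector space. The `embed` method name, the `vec_dim` key, and the two-column `schema_meta` layout are assumptions made for illustration.

```python
import hashlib
import sqlite3
from typing import Protocol, Sequence


class Embedder(Protocol):
    """Anything that maps texts to fixed-width vectors (method name is assumed)."""

    dim: int

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class FakeEmbedder:
    """Deterministic stand-in: each vector is derived from a SHA-256 digest."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([digest[i % len(digest)] / 255.0 for i in range(self.dim)])
        return vectors


def check_dim_stamp(conn: sqlite3.Connection, dim: int) -> None:
    """Stamp the dimension on first use; raise if a different one shows up later."""
    # Hypothetical schema_meta layout: one key/value row per setting.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_meta (key TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM schema_meta WHERE key = 'vec_dim'"
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO schema_meta (key, value) VALUES ('vec_dim', ?)", (str(dim),)
        )
    elif int(row[0]) != dim:
        raise ValueError(
            f"chunk_vec is stamped dim={row[0]} but the embedder produces dim={dim}"
        )
```

Because the fake embedder is a pure function of its input, tests built on it give the same vectors on every run without loading torch.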
retrieve() routes on workspace.embedding_model: NULL keeps the exact BM25 path (golden outputs stay byte-identical), set runs both legs and fuses with rank-only Reciprocal Rank Fusion — BM25 scores and vector distances are not comparable, so fusion uses ranks and keeps the BM25 instance on overlap to preserve the real score in traces. Missing deps or chunk_vec degrade to BM25 with a warning instead of failing: a read path must not hard-fail on an optional acceleration. Ties sort by chunk_id so replays stay deterministic; residual query-embedding nondeterminism is documented in the module docstring. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
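The fusion rule described above can be sketched as follows. This is an illustration under assumptions rather than the code in `hybrid.py`: the `Hit` dataclass and the `rrf_fuse` signature are hypothetical, and `k=60` is the value quoted in the PR description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Hit:
    """Hypothetical result record; bm25_score is None for vector-only hits."""

    chunk_id: str
    bm25_score: Optional[float] = None


def rrf_fuse(bm25_hits, vector_hits, k=60, limit=10):
    """Rank-only Reciprocal Rank Fusion over two already-ranked lists."""
    scores = {}
    kept = {}
    # The BM25 leg is processed first, so on overlap setdefault keeps the
    # BM25 instance and its real score survives into the fused result.
    for hits in (bm25_hits, vector_hits):
        for rank, hit in enumerate(hits, start=1):
            scores[hit.chunk_id] = scores.get(hit.chunk_id, 0.0) + 1.0 / (k + rank)
            kept.setdefault(hit.chunk_id, hit)
    # Higher fused score first; equal scores fall back to chunk_id order.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [kept[cid] for cid in ordered[:limit]]
```

Only ranks enter the sum, so the raw BM25 scores and vector distances never need to share a scale.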
When workspace.embedding_model is set, ingest embeds chunk texts before BEGIN IMMEDIATE (model inference must not hold the write lock) and inserts chunk_vec rows in the same transaction as the chunk rows, so corpus and vectors can never diverge. Missing embedding deps fail loudly as IngestError with an install hint: a workspace that promised hybrid retrieval must not silently accumulate unembedded chunks. backfill_embeddings() upgrades pre-existing corpora idempotently, stamping each vector with its chunk's original corpus_version so frozen replay filters stay truthful. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
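The ordering described above (embed first, then take the write lock, then write both rows together) might look roughly like this. It is a sketch under assumptions: the `chunk` table name, the column layout, and the `ingest_chunks` signature are hypothetical, and a plain table stands in for the sqlite-vec virtual table so the example runs without the extension.

```python
import json
import sqlite3


def open_db(path=":memory:"):
    # isolation_level=None disables sqlite3's implicit transactions, so the
    # explicit BEGIN IMMEDIATE below is the only transaction control.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunk ("
        "chunk_id TEXT PRIMARY KEY, text TEXT, corpus_version INTEGER)"
    )
    # Plain table standing in for the sqlite-vec chunk_vec table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunk_vec ("
        "chunk_id TEXT PRIMARY KEY, embedding TEXT, corpus_version INTEGER)"
    )
    return conn


def ingest_chunks(conn, chunks, embed, corpus_version):
    """chunks: list of (chunk_id, text); embed: callable mapping texts to vectors."""
    # Model inference runs before the transaction, so no lock is held while it works.
    vectors = embed([text for _, text in chunks])
    conn.execute("BEGIN IMMEDIATE")
    try:
        for (chunk_id, text), vector in zip(chunks, vectors):
            conn.execute(
                "INSERT INTO chunk (chunk_id, text, corpus_version) VALUES (?, ?, ?)",
                (chunk_id, text, corpus_version),
            )
            conn.execute(
                "INSERT INTO chunk_vec (chunk_id, embedding, corpus_version) "
                "VALUES (?, ?, ?)",
                (chunk_id, json.dumps(vector), corpus_version),
            )
        conn.execute("COMMIT")
    except Exception:
        # Either both tables get the batch or neither does.
        conn.execute("ROLLBACK")
        raise
```

A failure partway through the batch rolls back chunk rows and vector rows together, which is what keeps the two tables from diverging.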
'orc workspace create --embeddings [--embedding-model M]' pins the model into the workspace row (warn-but-create when deps are missing, so intent is recorded before the extra is installed). 'orc workspace embed NAME [--model M]' backfills chunk_vec for existing corpora, setting the model when NULL and refusing a conflicting --model: vectors from different models cannot be mixed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
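The set-when-NULL, refuse-on-conflict rule for `orc workspace embed` can be illustrated with a small helper. The function name and error type are hypothetical, and falling back to `all-MiniLM-L6-v2` when neither side names a model is an assumption based on the default quoted in the PR description.

```python
from typing import Optional

# Default model named in the PR description; using it as the fallback here
# is an assumption about the CLI's behavior.
DEFAULT_MODEL = "all-MiniLM-L6-v2"


def resolve_embedding_model(current: Optional[str], requested: Optional[str]) -> str:
    """Pick the model to backfill with, refusing to mix vector spaces."""
    if current is None:
        # Workspace has no pinned model yet: take the request or the default.
        return requested or DEFAULT_MODEL
    if requested is not None and requested != current:
        raise ValueError(
            f"workspace is pinned to {current!r}; refusing conflicting model {requested!r}"
        )
    return current
```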
search_evidence, research_topic, and verify_claim (evidence mode) now go through retrieve(), recording the method actually used (bm25 or hybrid_rrf) and the union candidate count in the trace. Judgment, binary, decomposed, and arithmetic modes are untouched. Frozen replay warns when the retrieval method differs from the original trace — e.g. hybrid_rrf -> bm25 because embedding deps are absent at replay time — since the chunk pool may then differ despite the pinned corpus_version. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
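The replay-time drift check could be as small as the following sketch. The function name and the message wording are hypothetical; only the two method labels, `bm25` and `hybrid_rrf`, come from the commit message.

```python
import warnings


def check_retrieval_drift(original_method: str, replay_method: str) -> bool:
    """Warn and return True when replay used a different retrieval method."""
    if original_method != replay_method:
        warnings.warn(
            f"retrieval method drift: trace recorded {original_method!r}, "
            f"replay used {replay_method!r}; the chunk pool may differ",
            stacklevel=2,
        )
        return True
    return False
```

Warning instead of raising matches the read-path stance elsewhere in this PR: a replay still completes, but the difference is visible.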
The search table said "BM25 results" even when the trace records hybrid_rrf, and sentence-transformers 5.x renamed get_sentence_embedding_dimension (FutureWarning on first model load). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Context
Both validation studies (adversarial web research + Delphi panel) flagged BM25-only retrieval as below the current baseline for verification recall — paraphrase misses are exactly where hallucinations hide. This adds opt-in hybrid retrieval with a local embedder (no API keys).
Changes
- `Embedder` protocol + lazy sentence-transformers `all-MiniLM-L6-v2` default (384-dim), pluggable for Voyage/OpenAI later; `set_embedder_factory` test hook.
- `embeddings_store.py`: sqlite-vec `chunk_vec` (dim-stamped), corpus_version as vec0 metadata column (KNN-filterable), deterministic tie-breaks, idempotent backfill.
- `hybrid.py`: `vector_search` + rank-only RRF (k=60, overlap keeps real bm25_score, ULID tie-break) + single `retrieve()` entry point; graceful BM25 fallback when deps/table missing.
- Ingest: vectors are written in the same transaction as the chunk rows; fails loudly (`IngestError`) when the model is set but deps are missing.
- CLI: `orc workspace create --embeddings [--embedding-model]`, `orc workspace embed NAME` backfill.
- Traces record `method: hybrid_rrf|bm25`; frozen replay warns on retrieval-method drift.

Opt-in only: workspaces without the flag take the identical BM25 path, so golden tests and benchmarks are unaffected.
Testing
41 new tests (store/RRF math/fallbacks/ingest/replay/CLI) with a deterministic FakeEmbedder, all RED-first; full suite 337 passed, ruff clean. sqlite-vec added to dev extra so CI runs vec tests; sentence-transformers stays out of CI (fake embedder).
🤖 Generated with Claude Code