feat: opt-in hybrid retrieval (BM25 + dense + RRF) by Thormatt · Pull Request #9 · Thormatt/orc

Thormatt · 2026-06-12T16:38:36Z

Context

Both validation studies (adversarial web research + Delphi panel) flagged BM25-only retrieval as below the current baseline for verification recall — paraphrase misses are exactly where hallucinations hide. This adds opt-in hybrid retrieval with a local embedder (no API keys).

Changes

Embedder protocol + lazy sentence-transformers all-MiniLM-L6-v2 default (384-dim), pluggable for Voyage/OpenAI later; set_embedder_factory test hook.
embeddings_store.py: sqlite-vec chunk_vec (dim-stamped), corpus_version as vec0 metadata column (KNN-filterable), deterministic tie-breaks, idempotent backfill.
hybrid.py: vector_search + rank-only RRF (k=60, overlap keeps real bm25_score, ULID tie-break) + single retrieve() entry point; graceful BM25 fallback when deps/table missing.
Ingest embeds chunks atomically with the chunk transaction; fails loud (IngestError) when the model is set but deps missing.
CLI: orc workspace create --embeddings [--embedding-model], orc workspace embed NAME backfill.
All three retrieval call sites wired; trace records method: hybrid_rrf|bm25; frozen replay warns on retrieval-method drift.
Replay determinism: corpus_version pins both legs; stored vectors immutable; residual caveats documented.

Opt-in only: workspaces without the flag take the identical BM25 path — golden tests and benchmarks unaffected.

Testing

41 new tests (store/RRF math/fallbacks/ingest/replay/CLI) with a deterministic FakeEmbedder, all RED-first; full suite 337 passed, ruff clean. sqlite-vec added to dev extra so CI runs vec tests; sentence-transformers stays out of CI (fake embedder).

🤖 Generated with Claude Code

Add the opt-in vector layer for hybrid retrieval: a sqlite-vec-backed chunk_vec store with a dim stamp in schema_meta (mixing vector spaces fails loudly), deterministic KNN tie-breaks for replayable retrieval, and an Embedder protocol with a lazy sentence-transformers backend so the base install never pays for torch. sqlite-vec joins the dev extra (tiny wheel) so CI exercises the vec tests; the tests swap in a deterministic FakeEmbedder via the set_embedder_factory hook instead of loading sentence-transformers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
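To illustrate the commit above, here is a minimal sketch of how an embedder protocol, a deterministic fake, and a factory hook could fit together. The names Embedder, FakeEmbedder, and set_embedder_factory come from this PR, but every signature below, along with get_embedder and check_dim, is an assumption made for illustration and may not match the merged code.

```python
import hashlib
import struct
from typing import Callable, List, Optional, Protocol, Sequence


class Embedder(Protocol):
    """Assumed shape of the protocol: a fixed dim plus a batch embed call."""

    dim: int

    def embed(self, texts: Sequence[str]) -> List[List[float]]: ...


class FakeEmbedder:
    """Deterministic stand-in: vectors are derived from SHA-256 of the text."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        out: List[List[float]] = []
        for text in texts:
            vec: List[float] = []
            counter = 0
            while len(vec) < self.dim:
                digest = hashlib.sha256(f"{counter}:{text}".encode("utf-8")).digest()
                # 32 bytes -> 8 unsigned 32-bit ints, scaled into [0, 1).
                vec.extend(n / 2**32 for n in struct.unpack(">8I", digest))
                counter += 1
            out.append(vec[: self.dim])
        return out


_factory: Optional[Callable[[], Embedder]] = None


def set_embedder_factory(factory: Optional[Callable[[], Embedder]]) -> None:
    """Test hook (hypothetical signature): install or clear the factory."""
    global _factory
    _factory = factory


def get_embedder() -> Embedder:
    """Hypothetical accessor; a real one would lazily build the default model."""
    if _factory is None:
        raise RuntimeError("no embedder factory installed")
    return _factory()


def check_dim(stored_dim: Optional[int], embedder: Embedder) -> int:
    """Return the dim to stamp, failing loudly if vector spaces would mix."""
    if stored_dim is not None and stored_dim != embedder.dim:
        raise ValueError(
            f"store is stamped {stored_dim}-dim but embedder produces {embedder.dim}-dim"
        )
    return embedder.dim
```

Because the fake hashes its input, the same text always yields the same vector, so the store and fusion tests can run without torch installed.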

retrieve() routes on workspace.embedding_model: NULL keeps the exact BM25 path (golden outputs stay byte-identical), set runs both legs and fuses with rank-only Reciprocal Rank Fusion — BM25 scores and vector distances are not comparable, so fusion uses ranks and keeps the BM25 instance on overlap to preserve the real score in traces. Missing deps or chunk_vec degrade to BM25 with a warning instead of failing: a read path must not hard-fail on an optional acceleration. Ties sort by chunk_id so replays stay deterministic; residual query-embedding nondeterminism is documented in the module docstring. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
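The fusion step described above can be sketched as follows. This is a minimal illustration of rank-only Reciprocal Rank Fusion, where each chunk scores the sum of 1 / (k + rank) over the legs that returned it, with k=60 as stated in the PR. The Hit dataclass and the rrf_fuse signature are hypothetical and are not taken from hybrid.py.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    bm25_score: Optional[float] = None  # None for hits seen only by the vector leg


def rrf_fuse(
    bm25_hits: Sequence[Hit],
    vector_hits: Sequence[Hit],
    k: int = 60,
    limit: int = 10,
) -> List[Tuple[Hit, float]]:
    """Fuse two ranked lists by rank only; raw scores never enter the sum."""
    scores: Dict[str, float] = {}
    chosen: Dict[str, Hit] = {}
    # The BM25 leg is processed first, so on overlap setdefault keeps the
    # BM25 instance and the real bm25_score survives into the trace.
    for hits in (bm25_hits, vector_hits):
        for rank, hit in enumerate(hits, start=1):
            scores[hit.chunk_id] = scores.get(hit.chunk_id, 0.0) + 1.0 / (k + rank)
            chosen.setdefault(hit.chunk_id, hit)
    # Highest fused score first; ties break on chunk_id for deterministic replay.
    ordered = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return [(chosen[cid], scores[cid]) for cid in ordered[:limit]]
```

A chunk that appears in both legs collects two reciprocal-rank terms, so it outranks a chunk that tops only one leg.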

When workspace.embedding_model is set, ingest embeds chunk texts before BEGIN IMMEDIATE (model inference must not hold the write lock) and inserts chunk_vec rows in the same transaction as the chunk rows, so corpus and vectors can never diverge. Missing embedding deps fail loudly as IngestError with an install hint: a workspace that promised hybrid retrieval must not silently accumulate unembedded chunks. backfill_embeddings() upgrades pre-existing corpora idempotently, stamping each vector with its chunk's original corpus_version so frozen replay filters stay truthful. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
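The ordering described above (embed first, then take the write lock, then insert both row kinds in one transaction) can be sketched with the standard-library sqlite3 module. This is an illustration under stated assumptions: plain tables stand in for the sqlite-vec vec0 table, and the schema and the make_db, ingest, and backfill functions are hypothetical stand-ins rather than the project's actual code.

```python
import sqlite3
import struct
from typing import Callable, List, Optional, Sequence, Tuple

EmbedFn = Callable[[List[str]], List[List[float]]]


def make_db() -> sqlite3.Connection:
    # isolation_level=None: autocommit mode, so BEGIN/COMMIT are issued by hand.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE chunks (chunk_id TEXT PRIMARY KEY, text TEXT, corpus_version INTEGER)"
    )
    # Plain-table stand-in for the sqlite-vec chunk_vec virtual table.
    conn.execute(
        "CREATE TABLE chunk_vec (chunk_id TEXT PRIMARY KEY, embedding BLOB, corpus_version INTEGER)"
    )
    return conn


def _pack(vec: Sequence[float]) -> bytes:
    return struct.pack(f"{len(vec)}f", *vec)


def ingest(
    conn: sqlite3.Connection,
    chunks: Sequence[Tuple[str, str]],
    corpus_version: int,
    embed: Optional[EmbedFn] = None,
) -> None:
    # Model inference runs before BEGIN IMMEDIATE so it never holds the write lock.
    vectors = embed([text for _, text in chunks]) if embed is not None else None
    conn.execute("BEGIN IMMEDIATE")
    try:
        for i, (chunk_id, text) in enumerate(chunks):
            conn.execute(
                "INSERT INTO chunks VALUES (?, ?, ?)", (chunk_id, text, corpus_version)
            )
            if vectors is not None:
                conn.execute(
                    "INSERT INTO chunk_vec VALUES (?, ?, ?)",
                    (chunk_id, _pack(vectors[i]), corpus_version),
                )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # chunk rows and vector rows vanish together
        raise


def backfill(conn: sqlite3.Connection, embed: EmbedFn) -> int:
    """Embed only chunks that lack a vector; a second call is a no-op."""
    rows = conn.execute(
        "SELECT c.chunk_id, c.text, c.corpus_version FROM chunks c "
        "LEFT JOIN chunk_vec v ON v.chunk_id = c.chunk_id "
        "WHERE v.chunk_id IS NULL ORDER BY c.chunk_id"
    ).fetchall()
    if not rows:
        return 0
    vectors = embed([text for _, text, _ in rows])
    conn.execute("BEGIN IMMEDIATE")
    try:
        for (chunk_id, _, version), vec in zip(rows, vectors):
            # Stamp the chunk's original corpus_version, not the current one.
            conn.execute(
                "INSERT INTO chunk_vec VALUES (?, ?, ?)", (chunk_id, _pack(vec), version)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return len(rows)
```

Selecting only the chunks without a vector is what makes the backfill idempotent in this sketch; the real backfill_embeddings() may achieve that differently.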

'orc workspace create --embeddings [--embedding-model M]' pins the model into the workspace row (warn-but-create when deps are missing, so intent is recorded before the extra is installed). 'orc workspace embed NAME [--model M]' backfills chunk_vec for existing corpora, setting the model when NULL and refusing a conflicting --model: vectors from different models cannot be mixed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
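The model-pinning rule for 'orc workspace embed' can be sketched as a small pure function. The function name, the exception type, and the fallback to all-MiniLM-L6-v2 when neither a stored nor a requested model exists are assumptions for illustration; only the refuse-on-conflict behaviour and the default model name come from the PR text.

```python
from typing import Optional

DEFAULT_MODEL = "all-MiniLM-L6-v2"  # default named in the PR description


class ModelConflictError(ValueError):
    """Raised when --model disagrees with the model already pinned."""


def resolve_embedding_model(stored: Optional[str], requested: Optional[str]) -> str:
    """Decide which model a backfill should use for a workspace.

    stored:    workspace.embedding_model (None when the column is NULL)
    requested: value of --model (None when the flag is omitted)
    """
    if stored is None:
        # Nothing pinned yet: adopt the requested model, else the default (assumed).
        return requested or DEFAULT_MODEL
    if requested is not None and requested != stored:
        # Vectors from different models live in different spaces; refuse to mix.
        raise ModelConflictError(
            f"workspace is pinned to {stored!r}; refusing --model {requested!r}"
        )
    return stored
```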

search_evidence, research_topic, and verify_claim (evidence mode) now go through retrieve(), recording the method actually used (bm25 or hybrid_rrf) and the union candidate count in the trace. Judgment, binary, decomposed, and arithmetic modes are untouched. Frozen replay warns when the retrieval method differs from the original trace — e.g. hybrid_rrf -> bm25 because embedding deps are absent at replay time — since the chunk pool may then differ despite the pinned corpus_version. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
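The replay-time drift check described above can be sketched as a comparison between the method recorded in the original trace and the method actually used on replay. The function name and the choice of RuntimeWarning are hypothetical; the method strings bm25 and hybrid_rrf are the ones the PR records in traces.

```python
import warnings


def check_retrieval_drift(original_method: str, replay_method: str) -> bool:
    """Warn and return True when replay used a different retrieval method."""
    if original_method == replay_method:
        return False
    warnings.warn(
        f"retrieval method drift: {original_method} -> {replay_method}; "
        "the chunk pool may differ despite the pinned corpus_version",
        RuntimeWarning,
        stacklevel=2,
    )
    return True
```

Warning rather than raising matches the PR's stance that a read path should degrade gracefully, while still making the divergence visible to whoever is comparing traces.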

The search table said "BM25 results" even when the trace records hybrid_rrf, and sentence-transformers 5.x renamed get_sentence_embedding_dimension (FutureWarning on first model load). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Thormatt and others added 6 commits June 12, 2026 12:12

Thormatt merged commit 5fb171e into main Jun 12, 2026
3 checks passed

Thormatt merged 6 commits into main from feat/hybrid-retrieval