feat: opt-in hybrid retrieval (BM25 + dense + RRF) #9

Merged
Thormatt merged 6 commits into main from feat/hybrid-retrieval on Jun 12, 2026

@Thormatt

Owner

Context

Both validation studies (adversarial web research + Delphi panel) flagged BM25-only retrieval as falling below the current baseline for verification recall: paraphrase misses are exactly where hallucinations hide. This PR adds opt-in hybrid retrieval with a local embedder (no API keys).

Changes

  • Embedder protocol + lazy sentence-transformers all-MiniLM-L6-v2 default (384-dim), pluggable for Voyage/OpenAI later; set_embedder_factory test hook.
  • embeddings_store.py: sqlite-vec chunk_vec (dim-stamped), corpus_version as vec0 metadata column (KNN-filterable), deterministic tie-breaks, idempotent backfill.
  • hybrid.py: vector_search + rank-only RRF (k=60, overlap keeps real bm25_score, ULID tie-break) + single retrieve() entry point; graceful BM25 fallback when deps/table missing.
  • Ingest embeds chunks atomically with the chunk transaction; fails loud (IngestError) when the model is set but deps missing.
  • CLI: orc workspace create --embeddings [--embedding-model M]; orc workspace embed NAME [--model M] to backfill existing corpora.
  • All three retrieval call sites wired; trace records method: hybrid_rrf|bm25; frozen replay warns on retrieval-method drift.
  • Replay determinism: corpus_version pins both legs; stored vectors immutable; residual caveats documented.

Opt-in only: workspaces without the flag take the identical BM25 path — golden tests and benchmarks unaffected.
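
The routing described above can be sketched roughly as follows. This is an illustration only, not the code in hybrid.py: the `retrieve()` signature, the `VectorLegUnavailable` exception, and the injected `bm25_search`, `vector_search`, and `fuse` callables are all assumptions made for the example.

```python
import warnings
from typing import Callable, List, Optional, Tuple


class VectorLegUnavailable(Exception):
    """Hypothetical signal that embedding deps or the chunk_vec table are missing."""


def retrieve(
    query: str,
    embedding_model: Optional[str],
    bm25_search: Callable[[str], List[str]],
    vector_search: Callable[[str], List[str]],
    fuse: Callable[[List[str], List[str]], List[str]],
) -> Tuple[str, List[str]]:
    """Return (method, hits), where method is 'bm25' or 'hybrid_rrf'."""
    bm25_hits = bm25_search(query)
    if embedding_model is None:
        # Opt-in only: no model on the workspace means the plain BM25 path.
        return "bm25", bm25_hits
    try:
        vector_hits = vector_search(query)
    except VectorLegUnavailable as exc:
        # A read path degrades with a warning instead of hard-failing.
        warnings.warn(
            f"hybrid retrieval unavailable ({exc}); falling back to BM25",
            RuntimeWarning,
            stacklevel=2,
        )
        return "bm25", bm25_hits
    return "hybrid_rrf", fuse(bm25_hits, vector_hits)
```

Returning the method alongside the hits is what would let a trace record the method actually used rather than the one requested.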

Testing

41 new tests (store/RRF math/fallbacks/ingest/replay/CLI) with a deterministic FakeEmbedder, all RED-first; full suite 337 passed, ruff clean. sqlite-vec added to dev extra so CI runs vec tests; sentence-transformers stays out of CI (fake embedder).

🤖 Generated with Claude Code

Thormatt and others added 6 commits June 12, 2026 12:12
Add the opt-in vector layer for hybrid retrieval: a sqlite-vec backed
chunk_vec store with a dim stamp in schema_meta (mixing vector spaces
fails loudly), deterministic KNN tie-breaks for replayable retrieval,
and an Embedder protocol with a lazy sentence-transformers backend so
the base install never pays for torch. sqlite-vec joins the dev extra
(tiny wheel) so CI exercises the vec tests; tests use a deterministic
FakeEmbedder via the set_embedder_factory hook instead.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
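
One way to picture the Embedder protocol, the lazy backend, and the test hook from this commit is the sketch below. The method name `embed`, the `get_embedder` helper, and the exact `set_embedder_factory` signature are assumptions for illustration; only the names `Embedder`, `FakeEmbedder`, `set_embedder_factory`, and the all-MiniLM-L6-v2 default come from the PR.

```python
from __future__ import annotations

import hashlib
from typing import Callable, List, Optional, Protocol, Sequence


class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> List[List[float]]: ...


class SentenceTransformerEmbedder:
    """Imports sentence-transformers only on first use, so the base install stays light."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2") -> None:
        self.model_name = model_name
        self._model = None

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        if self._model is None:
            # Deferred heavy import: pulls in torch only when embeddings are requested.
            from sentence_transformers import SentenceTransformer

            self._model = SentenceTransformer(self.model_name)
        return self._model.encode(list(texts)).tolist()


class FakeEmbedder:
    """Deterministic stand-in for tests: derives a vector from a hash of the text."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([digest[i % len(digest)] / 255.0 for i in range(self.dim)])
        return vectors


_factory: Optional[Callable[[str], Embedder]] = None


def set_embedder_factory(factory: Optional[Callable[[str], Embedder]]) -> None:
    """Test hook: override how embedders are built; pass None to restore the default."""
    global _factory
    _factory = factory


def get_embedder(model_name: str) -> Embedder:
    if _factory is not None:
        return _factory(model_name)
    return SentenceTransformerEmbedder(model_name)
```

In a test, `set_embedder_factory(lambda name: FakeEmbedder(dim=4))` would make every later `get_embedder(...)` call return the fake, so the test run never imports torch.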
retrieve() routes on workspace.embedding_model: NULL keeps the exact
BM25 path (golden outputs stay byte-identical), set runs both legs and
fuses with rank-only Reciprocal Rank Fusion — BM25 scores and vector
distances are not comparable, so fusion uses ranks and keeps the BM25
instance on overlap to preserve the real score in traces. Missing deps
or chunk_vec degrade to BM25 with a warning instead of failing: a read
path must not hard-fail on an optional acceleration. Ties sort by
chunk_id so replays stay deterministic; residual query-embedding
nondeterminism is documented in the module docstring.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
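
A minimal sketch of rank-only Reciprocal Rank Fusion as this commit describes it: each leg contributes 1 / (k + rank) with k = 60, the BM25 instance is kept on overlap, and ties sort by chunk_id. The `Hit` dataclass and the `rrf_fuse` name are hypothetical, not taken from hybrid.py.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

RRF_K = 60  # the constant named in the PR description


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    bm25_score: Optional[float] = None  # None for vector-only hits


def rrf_fuse(
    bm25_hits: List[Hit],
    vector_hits: List[Hit],
    k: int = RRF_K,
    limit: int = 10,
) -> List[Tuple[Hit, float]]:
    """Fuse two ranked lists using ranks only; raw scores never enter the sum."""
    fused: Dict[str, float] = {}
    chosen: Dict[str, Hit] = {}
    # Process the BM25 leg last so its instance (with the real score) wins on overlap.
    for hits in (vector_hits, bm25_hits):
        for rank, hit in enumerate(hits, start=1):
            fused[hit.chunk_id] = fused.get(hit.chunk_id, 0.0) + 1.0 / (k + rank)
            chosen[hit.chunk_id] = hit
    # Higher fused score first; equal scores fall back to chunk_id for determinism.
    ordered = sorted(fused, key=lambda cid: (-fused[cid], cid))
    return [(chosen[cid], fused[cid]) for cid in ordered[:limit]]
```

Because only ranks are summed, a chunk found by both legs outranks a chunk that tops a single leg, without the BM25 score and the vector distance ever being placed on a common scale.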
When workspace.embedding_model is set, ingest embeds chunk texts
before BEGIN IMMEDIATE (model inference must not hold the write lock)
and inserts chunk_vec rows in the same transaction as the chunk rows,
so corpus and vectors can never diverge. Missing embedding deps fail
loudly as IngestError with an install hint: a workspace that promised
hybrid retrieval must not silently accumulate unembedded chunks.
backfill_embeddings() upgrades pre-existing corpora idempotently,
stamping each vector with its chunk's original corpus_version so
frozen replay filters stay truthful.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
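
The ordering in this commit (embed first, then take the write lock, then insert chunk and vector rows together) can be illustrated with the standard-library sqlite3 module. This is a sketch under stated assumptions: a plain `chunk_vec` table with JSON-encoded vectors stands in for the sqlite-vec vec0 table, and `make_demo_db` and `ingest_chunks` are hypothetical names.

```python
import json
import sqlite3
from typing import Callable, List, Sequence, Tuple


def make_demo_db() -> sqlite3.Connection:
    # isolation_level=None gives manual transaction control (explicit BEGIN/COMMIT).
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE chunks (chunk_id TEXT PRIMARY KEY, text TEXT, corpus_version INTEGER)"
    )
    conn.execute(
        "CREATE TABLE chunk_vec (chunk_id TEXT PRIMARY KEY, embedding TEXT, corpus_version INTEGER)"
    )
    return conn


def ingest_chunks(
    conn: sqlite3.Connection,
    chunks: Sequence[Tuple[str, str]],
    embed: Callable[[List[str]], List[List[float]]],
    corpus_version: int,
) -> None:
    # Model inference runs before the transaction so it never holds the write lock.
    vectors = embed([text for _, text in chunks])
    if len(vectors) != len(chunks):
        raise ValueError("embedder returned a different number of vectors than chunks")
    conn.execute("BEGIN IMMEDIATE")
    try:
        for (chunk_id, text), vector in zip(chunks, vectors):
            conn.execute(
                "INSERT INTO chunks (chunk_id, text, corpus_version) VALUES (?, ?, ?)",
                (chunk_id, text, corpus_version),
            )
            conn.execute(
                "INSERT INTO chunk_vec (chunk_id, embedding, corpus_version) VALUES (?, ?, ?)",
                (chunk_id, json.dumps(vector), corpus_version),
            )
        conn.execute("COMMIT")
    except Exception:
        # Either both tables get the batch or neither does.
        conn.execute("ROLLBACK")
        raise
```

If any insert in the batch fails, the rollback removes the chunk rows and the vector rows together, which is the property that keeps corpus and vectors from diverging.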
'orc workspace create --embeddings [--embedding-model M]' pins the
model into the workspace row (warn-but-create when deps are missing,
so intent is recorded before the extra is installed). 'orc workspace
embed NAME [--model M]' backfills chunk_vec for existing corpora,
setting the model when NULL and refusing a conflicting --model:
vectors from different models cannot be mixed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
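
The model-pinning rule for the backfill command (set the model when NULL, refuse a conflicting --model) reduces to a small decision function. The sketch below is hypothetical: `resolve_backfill_model` is not a name from the PR, and raising `ValueError` is an assumption about how the refusal might surface.

```python
from typing import Optional

DEFAULT_MODEL = "all-MiniLM-L6-v2"  # the default embedder named in the PR description


def resolve_backfill_model(
    current: Optional[str],
    requested: Optional[str],
    default: str = DEFAULT_MODEL,
) -> str:
    """Decide which embedding model a backfill should use for a workspace."""
    if current is None:
        # No model pinned yet: take the requested one, else the default.
        return requested if requested is not None else default
    if requested is not None and requested != current:
        # Vectors from different models cannot be mixed in one store.
        raise ValueError(
            f"workspace is pinned to {current!r}; refusing conflicting --model {requested!r}"
        )
    return current
```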
search_evidence, research_topic, and verify_claim (evidence mode) now
go through retrieve(), recording the method actually used (bm25 or
hybrid_rrf) and the union candidate count in the trace. Judgment,
binary, decomposed, and arithmetic modes are untouched. Frozen replay
warns when the retrieval method differs from the original trace —
e.g. hybrid_rrf -> bm25 because embedding deps are absent at replay
time — since the chunk pool may then differ despite the pinned
corpus_version.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
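
The replay drift check might look something like the following sketch. The function name and the use of `warnings.warn` are assumptions; the PR states only that frozen replay warns when the retrieval method differs from the one recorded in the original trace.

```python
import warnings


def check_retrieval_drift(original_method: str, replay_method: str) -> bool:
    """Warn and return True when replay used a different retrieval method than the trace."""
    if original_method == replay_method:
        return False
    warnings.warn(
        f"retrieval method drift: trace recorded {original_method!r}, "
        f"replay used {replay_method!r}; the chunk pool may differ "
        "despite the pinned corpus_version",
        RuntimeWarning,
        stacklevel=2,
    )
    return True
```

A replay that recorded hybrid_rrf but ran on a machine without the embedding deps would fall back to bm25 and trigger this warning.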
The search table said "BM25 results" even when the trace records
hybrid_rrf, and sentence-transformers 5.x renamed
get_sentence_embedding_dimension (FutureWarning on first model load).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@Thormatt Thormatt merged commit 5fb171e into main Jun 12, 2026
3 checks passed