
feat(search): dense semantic search with v5 embedding model#85

Merged: kdroidFilter merged 1 commit into master from feat/v5-dense-search on Jun 26, 2026



@kdroidFilter

Owner

Summary

Adds hybrid lexical + dense semantic search powered by the v5 Hebrew/Aramaic embedding model.

  • SeforimEmbedder — ONNX (v5 int8) query embedder; HebrewV5Normalizer (strip nikud/teamim + final-letter folding) to match training.
  • HybridSearchEngine — fuses BM25 + dense (RRF); VectorSearcher over a fused Lucene index (KnnFloatVectorField, cosine).
  • Generator: BuildVectorIndex + fused text/vector indexing.
  • Packaging: bundle the v5 model next to the DB (PackageArtifacts); DownloadEmbedModel fetches the v5-int8 release.
  • CI: free disk space on the runner before the build.
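The BM25 + dense fusion named above is Reciprocal Rank Fusion (RRF). A minimal sketch of that technique follows, in Python for brevity rather than the project's own language. The function name `rrf_fuse`, the constant `k=60` (a common default in the literature), and the document ids are assumptions for illustration; the PR does not state which constant or tie-breaking rule `HybridSearchEngine` uses.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=10):
    """Fuse ranked lists of doc ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank is 1-based. k=60 is an assumed default, not taken from the PR.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical result lists from the lexical and dense searchers.
bm25_hits = ["d3", "d1", "d7"]
dense_hits = ["d1", "d9", "d3"]

fused = rrf_fuse([bm25_hits, dense_hits])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

RRF only needs ranks, not raw scores, so BM25 scores and cosine similarities never have to be put on a common scale. Documents found by both searchers (here `d1` and `d3`) rise above documents found by only one.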

Model

v5b (full-corpus, final-folded). Fair link eval on identical 2000 pairs: R@1 0.461 / R@10 0.870 / MRR 0.606 vs v4 0.418. int8 quantization ≈ 0 quality loss (cosine 0.999).
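For readers unfamiliar with the metrics quoted above, here is a minimal sketch of how R@k and MRR are conventionally computed from the rank at which each query's linked passage is retrieved. The helper names and the toy `ranks` list are assumptions for illustration; the PR does not describe the evaluation script itself.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose gold passage appears in the top k results.

    `ranks` holds the 1-based rank of the gold passage per query,
    or None when it was not retrieved at all.
    """
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Toy example with four queries (hypothetical data, not the 2000-pair set).
ranks = [1, 3, None, 2]

r_at_1 = recall_at_k(ranks, 1)
r_at_10 = recall_at_k(ranks, 10)
mrr = mean_reciprocal_rank(ranks)
print(r_at_1, r_at_10, round(mrr, 3))
```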

Notes

  • Query and passage vectors share the same v5 weights + normalization (verified: embedder self-cosine ~1, related ≫ unrelated).
  • v4 model/normalizer fully removed.
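The shared normalization (strip nikud/teamim, then fold final letters) can be sketched as below. This is an illustrative Python approximation, not the code of `HebrewV5Normalizer`: the folding direction (final form to base form), the decision to keep punctuation such as maqaf, and the function name `normalize_hebrew` are all assumptions.

```python
import unicodedata

# Final-form letters mapped to their base forms (assumed folding direction).
FINAL_TO_BASE = str.maketrans({
    "\u05DA": "\u05DB",  # final kaf   -> kaf
    "\u05DD": "\u05DE",  # final mem   -> mem
    "\u05DF": "\u05E0",  # final nun   -> nun
    "\u05E3": "\u05E4",  # final pe    -> pe
    "\u05E5": "\u05E6",  # final tsadi -> tsadi
})


def is_hebrew_mark(ch):
    """True for combining marks in U+0591..U+05C7 (teamim and nikud).

    Punctuation in that range (maqaf, paseq, sof pasuq) has a non-Mn
    category and is therefore kept.
    """
    return "\u0591" <= ch <= "\u05C7" and unicodedata.category(ch) == "Mn"


def normalize_hebrew(text):
    stripped = "".join(ch for ch in text if not is_hebrew_mark(ch))
    return stripped.translate(FINAL_TO_BASE)


# "shalom" with vowel points and a final mem.
pointed = "\u05E9\u05B8\u05C1\u05DC\u05D5\u05B9\u05DD"
normalized = normalize_hebrew(pointed)
print(normalized)
```

Whatever the exact rules are, the same function has to run on queries at search time and on passages at indexing time; otherwise the two sides are embedded from differently tokenized text.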

kdroidFilter force-pushed the feat/v5-dense-search branch 8 times, most recently from 2cd5766 to b9e938b (June 26, 2026 05:58)
- Add SeforimEmbedder (ONNX v5 int8) + HebrewV5Normalizer (final-letter folding)
- HybridSearchEngine (BM25 + dense, RRF) + VectorSearcher over a fused Lucene index
- BuildVectorIndex / fused KnnFloatVectorField indexing in the generator
- Bundle + fetch the v5 model (PackageArtifacts, DownloadEmbedModel -> v5-int8)
- CI: free disk space on the runner before the build
kdroidFilter force-pushed the feat/v5-dense-search branch from b9e938b to 8f8da4e (June 26, 2026 06:07)
kdroidFilter merged commit a72cc8d into master (Jun 26, 2026)
1 check passed
kdroidFilter deleted the feat/v5-dense-search branch (June 26, 2026 06:11)
@Y-PLONI

Y-PLONI commented Jun 26, 2026

Contributor

Is a public release planned?

@kdroidFilter

Owner Author

I don't know, maybe. But how would that help you?
As far as I know, Tantivy does not support vector search.

@Y-PLONI

Y-PLONI commented Jun 26, 2026

Contributor

That part is our problem 😉
