feat(loader): Gemini embedder + Kreuzberg PDF extraction in the build CLI #199
Merged
… CLI

The offline loader could only embed with hash/zero and extract plain text, so a Gemini-backed live tenant (which needs gemini-embedding-001 vectors) couldn't be fed from it. Wire the existing GeminiEmbedder + KreuzbergExtractor into the CLI:

- `--embedder gemini` (feature `gemini`): builds GeminiEmbedder, reading the key from --api-key / ESCUREL_GEMINI_API_KEY / GEMINI_API_KEY, with --embed-model (default gemini-embedding-001) and --embed-dim (default 768). The manifest then records model_id=gemini-embedding-001, so `transfer --expect-model gemini-embedding-001` matches a Gemini tenant. "Offline" = no escurel-server / no per-tenant quota; it still calls the Gemini HTTP API in one batched pass.
- Extractor now selects KreuzbergExtractor under the existing `kreuzberg` feature (PDF/DOCX/PPTX/XLSX), falling back to PlainText otherwise, mirroring the server (escurel-server/src/mcp.rs). Previously the CLI hardcoded PlainText, so PDFs extracted to nothing.

Verified live: `escurel-loader build --embedder gemini` over 2 Landtag PDFs → 2 docs, 105 chunks, all dense_vec non-zero, manifest gemini-embedding-001/768/v5. Default + gemini clippy clean; existing loader tests (hash) still pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
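The key lookup described above can be sketched roughly as follows. This is a minimal illustration, not the loader's actual code: `resolve_api_key` and its closure parameter are hypothetical names, and it assumes the three sources are tried in the order the commit message lists them (flag first, then `ESCUREL_GEMINI_API_KEY`, then `GEMINI_API_KEY`), with empty values treated as unset.

```rust
/// Treat empty or whitespace-only values as "not provided".
fn non_empty(s: String) -> Option<String> {
    if s.trim().is_empty() {
        None
    } else {
        Some(s)
    }
}

/// Hypothetical helper (not from the PR): pick the Gemini API key from,
/// in assumed precedence order, the `--api-key` flag value, then
/// ESCUREL_GEMINI_API_KEY, then GEMINI_API_KEY.
/// `env_lookup` is injected so the logic can be exercised without
/// touching the real process environment.
pub fn resolve_api_key<F>(flag: Option<&str>, env_lookup: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    flag.map(str::to_owned)
        .and_then(non_empty)
        .or_else(|| env_lookup("ESCUREL_GEMINI_API_KEY").and_then(non_empty))
        .or_else(|| env_lookup("GEMINI_API_KEY").and_then(non_empty))
}
```

In a real CLI the lookup closure would most likely be `|k| std::env::var(k).ok()`; injecting it keeps the precedence logic easy to test.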
Summary
Wire the existing `GeminiEmbedder` and `KreuzbergExtractor` into the `escurel-loader build` CLI so the offline loader can produce Gemini (`gemini-embedding-001`, dim 768) embeddings and extract PDFs, making it usable to feed a Gemini-backed live tenant (e.g. herkules).

- `--embedder gemini` (feature `gemini`) → `GeminiEmbedder` (key from `--api-key` / `ESCUREL_GEMINI_API_KEY` / `GEMINI_API_KEY`, `--embed-model` / `--embed-dim`). Manifest records `model_id=gemini-embedding-001`, so `transfer --expect-model gemini-embedding-001` matches.
- `KreuzbergExtractor` under the existing `kreuzberg` feature (was hardcoded `PlainTextExtractor` → PDFs extracted to nothing).

Test plan
- `cargo build -p escurel-loader --features "gemini kreuzberg"` ✓; default + `gemini` clippy `-D warnings` clean ✓; existing loader tests (hash) pass ✓.
- `escurel-loader build --embedder gemini` over 2 Landtag PDFs → 2 docs, 105 chunks, all `dense_vec` non-zero, manifest `gemini-embedding-001`/768/schema 5.

🤖 Generated with Claude Code
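The "all `dense_vec` non-zero" check from the live run could be automated along these lines. This is only a sketch under stated assumptions: `Chunk` is a hypothetical stand-in for whatever record the loader actually emits, and `check_dense_vecs` is an illustrative name, not a function from the PR.

```rust
/// Hypothetical stand-in for one embedded chunk; the real loader's
/// record type and field layout are not shown in this PR.
pub struct Chunk {
    pub dense_vec: Vec<f32>,
}

/// Return Err describing the first chunk whose vector has the wrong
/// dimension or is entirely zero (the signature of a zero/placeholder
/// embedder having been used instead of a real one).
pub fn check_dense_vecs(chunks: &[Chunk], expected_dim: usize) -> Result<(), String> {
    for (i, chunk) in chunks.iter().enumerate() {
        let dim = chunk.dense_vec.len();
        if dim != expected_dim {
            return Err(format!("chunk {i}: dim {dim} != expected {expected_dim}"));
        }
        if chunk.dense_vec.iter().all(|x| *x == 0.0) {
            return Err(format!("chunk {i}: all-zero dense_vec"));
        }
    }
    Ok(())
}
```

For the run reported above, the expected dimension would be 768, matching the `--embed-dim` default.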