feat(loader): Gemini embedder + Kreuzberg PDF extraction in the build CLI #199

Merged
jrosskopf merged 1 commit into main from loader/gemini-embedder on Jun 24, 2026

@jrosskopf

Contributor

Summary

Wire the existing GeminiEmbedder and KreuzbergExtractor into the escurel-loader build CLI so that the offline loader can produce Gemini embeddings (gemini-embedding-001, dim 768) and extract text from PDFs. This makes it usable for feeding a Gemini-backed live tenant (e.g. herkules).

  • --embedder gemini (feature gemini) → GeminiEmbedder. The key is read from --api-key, ESCUREL_GEMINI_API_KEY, or GEMINI_API_KEY; model and dimension come from --embed-model and --embed-dim. The manifest records model_id=gemini-embedding-001, so transfer --expect-model gemini-embedding-001 matches.
  • The extractor is now KreuzbergExtractor when the existing kreuzberg feature is enabled. It was previously hardcoded to PlainTextExtractor, so PDFs extracted to nothing.
  • "Offline" still means no escurel-server / no per-tenant quota; it calls the Gemini HTTP API in one batched pass.
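
The key lookup described in the first bullet can be sketched as a small pure function. This is a minimal sketch, not the loader's actual code: the name `resolve_api_key`, the injected `env` closure, the treatment of empty values as unset, and the precedence order (flag first, then ESCUREL_GEMINI_API_KEY, then GEMINI_API_KEY) are all assumptions. The PR only lists the three sources.

```rust
/// Hypothetical helper: return the first non-empty Gemini API key.
///
/// Assumed precedence (the PR lists the sources but not their order):
/// `--api-key`, then `ESCUREL_GEMINI_API_KEY`, then `GEMINI_API_KEY`.
/// `env` is injected so the lookup can be tested without touching the
/// process environment; in a binary it could be `|k| std::env::var(k).ok()`.
pub fn resolve_api_key<F>(flag: Option<&str>, env: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    // Treat empty or whitespace-only values as unset (an assumption).
    fn non_empty(s: String) -> Option<String> {
        if s.trim().is_empty() {
            None
        } else {
            Some(s)
        }
    }

    flag.map(str::to_owned)
        .and_then(non_empty)
        .or_else(|| env("ESCUREL_GEMINI_API_KEY").and_then(non_empty))
        .or_else(|| env("GEMINI_API_KEY").and_then(non_empty))
}
```

Injecting the environment lookup keeps the function deterministic, so the precedence can be tested without mutating process-wide state.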

Test plan

  • cargo build -p escurel-loader --features "gemini kreuzberg" ✓; default + gemini clippy -D warnings clean ✓; existing loader tests (hash) pass ✓.
  • Live: escurel-loader build --embedder gemini over 2 Landtag PDFs → 2 docs, 105 chunks, all dense_vec non-zero, manifest gemini-embedding-001/768/schema 5.
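
The "all dense_vec non-zero" check from the live run could be automated along these lines. This is a sketch under assumptions: the `Chunk` struct and `count_bad_vectors` are hypothetical stand-ins (only the `dense_vec` field name comes from the PR), and "non-zero" is read here as "not every component is zero".

```rust
/// Hypothetical chunk record; only the field named in the PR is modeled.
pub struct Chunk {
    pub dense_vec: Vec<f32>,
}

/// Count chunks whose dense vector is unusable: either its length differs
/// from the expected embedding dimension, or every component is zero.
pub fn count_bad_vectors(chunks: &[Chunk], expected_dim: usize) -> usize {
    chunks
        .iter()
        .filter(|c| {
            c.dense_vec.len() != expected_dim
                || c.dense_vec.iter().all(|&x| x == 0.0)
        })
        .count()
}
```

For the run above, a result of 0 over the 105 chunks with `expected_dim = 768` would correspond to the reported outcome.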

🤖 Generated with Claude Code

… CLI

The offline loader could only embed with hash/zero and extract plain text, so a
Gemini-backed live tenant (which needs gemini-embedding-001 vectors) couldn't be
fed from it. Wire the existing GeminiEmbedder + KreuzbergExtractor into the CLI:

- `--embedder gemini` (feature `gemini`): builds GeminiEmbedder reading the key
  from --api-key / ESCUREL_GEMINI_API_KEY / GEMINI_API_KEY, with --embed-model
  (default gemini-embedding-001) + --embed-dim (default 768). The manifest then
  records model_id=gemini-embedding-001, so `transfer --expect-model
  gemini-embedding-001` matches a Gemini tenant. "Offline" = no escurel-server /
  no per-tenant quota; it still calls the Gemini HTTP API in one batched pass.
- Extractor now selects KreuzbergExtractor under the existing `kreuzberg` feature
  (PDF/DOCX/PPTX/XLSX), falling back to PlainText otherwise — mirroring the
  server (escurel-server/src/mcp.rs). Previously the CLI hardcoded PlainText, so
  PDFs extracted to nothing.
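
The extractor selection in the second item can be modeled as follows. This is an illustrative sketch only: `ExtractorKind`, `select_extractor`, and `handles` are hypothetical stand-ins for the real extractor types, the boolean replaces the compile-time feature check so the example builds on its own, and the plain-text extensions (`txt`, `md`) are assumptions. Only the PDF/DOCX/PPTX/XLSX list comes from the commit message.

```rust
/// Stand-in for the two extractors named in the PR (not the real types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExtractorKind {
    PlainText,
    Kreuzberg,
}

/// Pick the extractor. In the real crate the flag would presumably come
/// from the compile-time feature, e.g. `cfg!(feature = "kreuzberg")`.
pub fn select_extractor(kreuzberg_enabled: bool) -> ExtractorKind {
    if kreuzberg_enabled {
        ExtractorKind::Kreuzberg
    } else {
        ExtractorKind::PlainText
    }
}

/// Whether the given extractor is expected to yield text for an extension.
pub fn handles(kind: ExtractorKind, ext: &str) -> bool {
    let ext = ext.to_ascii_lowercase();
    // Plain-text formats are assumed to work with either extractor.
    let plain = matches!(ext.as_str(), "txt" | "md");
    match kind {
        // Office/PDF formats listed in the commit message for `kreuzberg`.
        ExtractorKind::Kreuzberg => {
            plain || matches!(ext.as_str(), "pdf" | "docx" | "pptx" | "xlsx")
        }
        ExtractorKind::PlainText => plain,
    }
}
```

With the flag off, `handles(select_extractor(false), "pdf")` is false, which mirrors the old behavior where PDFs produced no text.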

Verified live: `escurel-loader build --embedder gemini` over 2 Landtag PDFs →
2 docs, 105 chunks, all dense_vec non-zero, manifest gemini-embedding-001/768/v5.
Default + gemini clippy clean; existing loader tests (hash) still pass.
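
The manifest triple reported above (model / dimension / schema version) is what a consumer would compare against its expectations. A minimal sketch of such a comparison, assuming a hypothetical `Manifest` struct and `check_manifest` function; the real manifest layout and the exact behavior of `transfer --expect-model` are not shown in this PR:

```rust
/// Hypothetical manifest shape; field names other than `model_id` are assumed.
#[derive(Debug, PartialEq, Eq)]
pub struct Manifest {
    pub model_id: String,
    pub dim: usize,
    pub schema_version: u32,
}

/// Reasons a manifest might be rejected in this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ManifestMismatch {
    Model { expected: String, found: String },
    Dim { expected: usize, found: usize },
}

/// Compare the manifest against the expected model id and dimension.
/// The model id is checked first, then the dimension.
pub fn check_manifest(
    manifest: &Manifest,
    expect_model: &str,
    expect_dim: usize,
) -> Result<(), ManifestMismatch> {
    if manifest.model_id != expect_model {
        return Err(ManifestMismatch::Model {
            expected: expect_model.to_owned(),
            found: manifest.model_id.clone(),
        });
    }
    if manifest.dim != expect_dim {
        return Err(ManifestMismatch::Dim {
            expected: expect_dim,
            found: manifest.dim,
        });
    }
    Ok(())
}
```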

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jrosskopf jrosskopf merged commit 68c0e41 into main Jun 24, 2026
1 check passed
@jrosskopf jrosskopf deleted the loader/gemini-embedder branch June 24, 2026 07:23