feat(loader): offline batch loader + DuckDB→DuckDB transfer (carry embeddings, no re-embed) by jrosskopf · Pull Request #198 · DataZooDE/escurel

jrosskopf · 2026-06-23T05:28:05Z

Summary

Offline batch loader + DuckDB→DuckDB transfer that loads a large corpus (e.g.
~20k PDFs) without hitting the live server's per-tenant Embeds quota or paying
to re-embed in production. The expensive step (extract → chunk → embed)
happens once, offline, in a throwaway loader instance; the result transfers
into the live tenant carrying the embeddings as data — prod never re-embeds.

Design note: docs/notes/2026-06-23-batch-loader.md.
Runbook: docs/operations.md → "Bulk-loading a large corpus".

escurel-loader build    --src <dir> --out <loader_dir> --skill <name> --embedder hash
escurel-loader transfer --from <loader_dir> --to <live_data_dir> \
                        --tenant <t> --expect-model hash --on-collision skip

What's in it (7 logical changes, one branch)

Embedder::model_id() — pins the embedding space (Zero/Hash/Candle/Gemini/Reloadable).
Split materialize_document into embed → write_document_blocks(page_id, overlay, chunks, vectors) — lets the transfer supply precomputed vectors (the no-re-embed path).
escurel-loader crate — LoaderBuilder runs the real extract→chunk→embed→materialise pipeline in-process against a standalone DuckDB; writes manifest.json (model_id/dim/schema_version). Adds Indexer::reindex_vectors().
Indexer::merge_from_attached(alias, on_collision) — the DuckDB→DuckDB transfer: drop HNSW → INSERT … BY NAME SELECT (skip/replace/error) → recreate HNSW → refresh FTS. Vectors copy verbatim.
copy_files — blobs + overlay markdown LaneStore→LaneStore, files-first (crash-safe ordering).
transfer fn + escurel-loader operator binary — validates the manifest against the live tenant's --expect-model before touching anything; fails closed on mismatch.
Docs — design note, ops runbook, discovered note.

Notable decisions / findings

The binary lives in escurel-loader, not escurel-cli: the transfer must ATTACH two DuckDB files on disk, which can't go through the gateway HTTP client surface that escurel-cli is built on. escurel-cli stays a pure presentation layer. (Deviation from the plan's "escurel-cli/src/admin.rs" — host-side disk ops are a hard mismatch for the all-HTTP CLI.)
Bug fixed (caught by the idempotent re-transfer test): a transfer into an already-migrated tenant re-ran Migrator::up → "Table pages already exists". Now gates up on fresh = !db_path.exists(), mirroring the server boot recipe.
Gemini Batch-API embedder (~half price) is deferred — --embedder already abstracts the space and the manifest gate makes adding gemini-batch:<model> safe later.

Test plan

All no-mock E2E (real DuckDB files, real FsStore, real ATTACH, real extraction):

crates/escurel-index/tests/write_document_blocks.rs — supplied vectors stored byte-equal; count-mismatch fails closed.
crates/escurel-index/tests/merge_from_attached.rs::{merge_skip_imports_new_pages_keeps_existing_and_copies_vectors_verbatim, merge_replace_overwrites_colliding_pages, merge_error_aborts_on_collision} — target uses a ZeroEmbedder so the byte-identical imported vector is the no-re-embed proof.
crates/escurel-loader/tests/build.rs — loader build materialises docs with non-NULL vectors + manifest.
crates/escurel-loader/tests/copy_files.rs — blobs byte-equal + overlays re-keyed + idempotent.
crates/escurel-loader/tests/transfer.rs::{transfer_carries_embeddings_verbatim_and_is_idempotent, transfer_aborts_on_embedder_model_mismatch} — capstone: build (HashEmbedder, non-zero vectors) → transfer → live dense_vec byte-identical to loader's, instances search-hittable, re-transfer idempotent, model-mismatch aborts before any rows move.
Plus model_id unit tests across escurel-embed.

Four-check gate green locally: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace --all-targets, cargo build --workspace --release.

🤖 Generated with Claude Code

The offline batch loader + DuckDB->DuckDB transfer must refuse to import vectors produced by a different embedding model (mixing embedding spaces silently destroys retrieval). The Embedder trait exposed only dim(); add model_id() so (model_id, dim) identifies the space, recorded in the loader's artifact manifest and validated on transfer. - trait: fn model_id(&self) -> String (owned, so ReloadableEmbedder can delegate through its ArcSwap guard; called rarely). - impls: Zero="zero", Hash="hash", Gemini=its model name, Candle=the repo/model id (new Inner.model_id field threaded through from_local/from_hf_hub), Reloadable delegates to the live inner. - test SlowEmbedder (escurel-index) updated. Red→green: model_id_tests assert stable, distinct ids (+ gemini under feature). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Extract Indexer::write_document_blocks(page_id, overlay, chunks, vectors) — the embed-free write half — and make materialize_document = embed then write_document_blocks. This lets the offline batch loader supply vectors from a separate (e.g. Gemini Batch API) pass, and underpins the no-re-embed guarantee: the write path stores the given vectors verbatim, never consulting the embedder. Validates vectors.len()==chunks.len() and each dim==768, fail-closed. Behaviour of materialize_document is unchanged (existing document tests green). Red→green (no mocks): tests/write_document_blocks.rs — with a ZeroEmbedder indexer, writing distinct non-zero sentinel vectors and reopening the DuckDB file reads them back verbatim (would be zero if it re-embedded); count-mismatch fails closed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

New crate `escurel-loader`: a LoaderBuilder that runs the existing extract→chunk→embed→materialise pipeline (DocumentIngestWorker → Indexer::materialize_document) IN-PROCESS against a throwaway loader instance (its own DuckDB + blob dir), at full speed with no HTTP/quota. Walks a source dir, infers MIME by extension, deposits each file as an inbox blob, and materialises it under the given document skill. instance_id = content sha256 (idempotent + dedup). Rebuilds the HNSW vector index + BM25 FTS once at the end. Writes manifest.json {model_id, dim, schema_version, skill, counts} — the compatibility contract the transfer will validate. Also: Indexer::reindex_vectors() (drop+recreate hnsw_blocks_vec — the bulk-load fast path; search stays correct via the cosine scan), and Migrator::SCHEMA_VERSION (manual schema-version stand-in for the manifest). Red→green (no mocks): tests/build.rs — two text files + HashEmbedder (offline, 768) → 2 instances, chunks with non-NULL dense_vec, manifest hash@768, blobs promoted to canonical, a non-ingestable extension skipped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…beddings The import half of the offline batch loader. `Indexer::merge_from_attached` copies `pages`/`links`/`blocks` from an already-`attach_external`ed read-only DuckDB into the live tenant verbatim — `blocks.dense_vec` rows transfer as data, so the live tenant NEVER re-embeds an offline-built corpus. - `OnCollision::{Skip,Replace,Error}` — Skip (default) is additive + idempotent (re-run resumes); Replace delete-then-inserts colliding `page_id`s; Error aborts the whole merge if any source `page_id` already exists. - Bulk-load fast path: DROP HNSW → INSERT ... BY NAME SELECT → recreate HNSW, then refresh_fts() once (BM25 snapshot). `INSERT ... BY NAME` is column-order robust; alias re-validated via is_valid_attach_alias before splicing. - Skip/Error snapshot the pre-existing page_id set into a TEMP table so pages/links/blocks all filter against the same set (pages inserts first). E2E (crates/escurel-index/tests/merge_from_attached.rs, real DuckDB + attach, no mocks): the target's embedder is a ZeroEmbedder, so the assertion that an imported chunk's dense_vec head reads back as the SOURCE's 0.22 (not 0.0) is the no-re-embed proof. Covers skip-keeps-existing, replace-overwrites, error-aborts-on-collision, and that imported chunks are FTS-searchable post-merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…s-first) The files-first half of a transfer, run BEFORE merge_from_attached copies the DuckDB rows. `copy_files(loader_dir, live, live_tenant)`: - copies every canonical blob (content-addressed put_blob → idempotent id), blobs first so chunk rows never reference a missing blob; - re-keys each overlay markdown (markdown/instances/<skill>/<id>.md) verbatim under the live tenant. Ordering guarantees a crash mid-transfer leaves at worst orphan blobs (which audit_documents/reclaim_orphan_blobs sweep), never rows pointing at absent content. Re-runs are idempotent. Added From<KeyError> to LoaderError. E2E (crates/escurel-loader/tests/copy_files.rs, real FsStores): a 2-doc loader artifact copies into a fresh live tenant — 2 blobs byte-equal across stores, 2 overlays re-keyed under markdown/instances/, and a second run produces the same tallies with no duplicate blobs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… E2E Completes the offline batch loader as an operator-usable tool. `escurel_loader::transfer(from, to, tenant, expect_model, on_collision)`: validate manifest (model_id == expect_model, dim == 768, schema_version == Migrator::SCHEMA_VERSION) → copy_files (blobs + overlays) → attach loader DuckDB read-only → merge_from_attached (rows). Fails closed on any mismatch BEFORE touching the target — mixing embedding spaces silently wrecks retrieval. Uses a ZeroEmbedder placeholder for the live Indexer since the merge never embeds. New thin binary `escurel-loader` (build|transfer subcommands, JSON output). It lives here, not in escurel-cli, on purpose: the transfer must ATTACH two DuckDB files on disk, which cannot go through the gateway HTTP client surface that escurel-cli is built on. escurel-cli stays a pure presentation layer. Bug found + fixed by the idempotent-re-transfer test: a transfer into an ALREADY-migrated live tenant re-ran Migrator::up and hit "Table pages already exists". Now gates `up` on `fresh = !db_path.exists()` and always runs the idempotent load_extensions + ensure_* — mirroring the server boot recipe (config.rs). Capstone E2E (crates/escurel-loader/tests/transfer.rs, real DuckDB + FsStore + attach, no mocks): build a 2-doc HashEmbedder artifact, transfer into a fresh live dir, assert transferred dense_vec is byte-identical to the loader's (the no-re-embed proof — Hash vectors are non-zero), merged instances are search-hittable (HNSW + FTS), re-transfer is idempotent (all collide, counts stable), and a model-mismatch manifest aborts before any rows move. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- docs/notes/2026-06-23-batch-loader.md — the offline-loader + DuckDB→DuckDB transfer concept: why (quota/cost wall), the architecture, the embedder-model compatibility gate, idempotency + files-first crash-safety, and why it's a separate binary rather than an escurel-cli/MCP surface. - docs/operations.md — "Bulk-loading a large corpus" runbook with the build + transfer command sequence and the --expect-model safety note. - docs/notes/discovered/2026-06-23-loader-transfer-hnsw-fts-and-fresh-migrate.md — the three traps: Migrator::up is fresh-only (gate on !db_path.exists()), HNSW drop/recreate around the bulk insert, FTS refresh-once-after; plus the ZeroEmbedder-placeholder trick for proving no-re-embed in tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…icity docs Triaged second-opinion (codex) findings on PR #198. Codex found no SQL-injection hole and no unwrap-on-input; the rest were hardening. Applied the clear-win fixes: - on_collision=error now PREFLIGHTS collisions before copy_files: attach the source read-only up front, count colliding page_ids, and abort with "nothing was copied" BEFORE mutating the live LaneStore. error means "abort if anything collides" — it must not leave files behind. New Indexer::attached_page_collisions (cheap, read-only) backs this. - merge_from_attached: CREATE OR REPLACE TEMP TABLE _existing, so a `_existing` left over from a merge that crashed between CREATE and DROP on the long-lived connection can't wedge every subsequent merge. - Honest docstring: only the row INSERTs are transactional; the HNSW drop/recreate + FTS refresh sit outside it. Documented that both are self-healing (cosine scan works without HNSW; re-run/reindex/rebuild restores them) — row state is never corrupted, only its indexes. New E2E: transfer_error_policy_aborts_on_collision_without_mutating — seed a tenant, then an error-policy transfer aborts with "nothing was copied" and the row state is byte-for-byte unchanged. Deferred (documented follow-ups, not happy-path bugs): persist the live tenant's embedder model_id so the gate is target-trust not operator-asserted --expect-model; read the live DB's actual schema_version rather than only the artifact-vs-binary check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jrosskopf and others added 8 commits June 23, 2026 06:51

jrosskopf merged commit ac89894 into main Jun 23, 2026
1 check passed

jrosskopf deleted the loader/offline-batch branch June 23, 2026 08:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(loader): offline batch loader + DuckDB→DuckDB transfer (carry embeddings, no re-embed)#198

feat(loader): offline batch loader + DuckDB→DuckDB transfer (carry embeddings, no re-embed)#198
jrosskopf merged 8 commits into
mainfrom
loader/offline-batch

jrosskopf commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jrosskopf commented Jun 23, 2026

Summary

What's in it (7 logical changes, one branch)

Notable decisions / findings

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant