Skip to content

feat(loader): offline batch loader + DuckDB→DuckDB transfer (carry embeddings, no re-embed)#198

Merged
jrosskopf merged 8 commits into
mainfrom
loader/offline-batch
Jun 23, 2026
Merged

feat(loader): offline batch loader + DuckDB→DuckDB transfer (carry embeddings, no re-embed)#198
jrosskopf merged 8 commits into
mainfrom
loader/offline-batch

Conversation

@jrosskopf

Copy link
Copy Markdown
Contributor

Summary

Offline batch loader + DuckDB→DuckDB transfer that loads a large corpus (e.g.
~20k PDFs) without hitting the live server's per-tenant Embeds quota or paying
to re-embed in production. The expensive step (extract → chunk → embed)
happens once, offline, in a throwaway loader instance; the result transfers
into the live tenant carrying the embeddings as data — prod never re-embeds.

Design note: docs/notes/2026-06-23-batch-loader.md.
Runbook: docs/operations.md → "Bulk-loading a large corpus".

escurel-loader build    --src <dir> --out <loader_dir> --skill <name> --embedder hash
escurel-loader transfer --from <loader_dir> --to <live_data_dir> \
                        --tenant <t> --expect-model hash --on-collision skip

What's in it (7 logical changes, one branch)

  1. Embedder::model_id() — pins the embedding space (Zero/Hash/Candle/Gemini/Reloadable).
  2. Split materialize_document into embed → write_document_blocks(page_id, overlay, chunks, vectors) — lets the transfer supply precomputed vectors (the no-re-embed path).
  3. escurel-loader crateLoaderBuilder runs the real extract→chunk→embed→materialise pipeline in-process against a standalone DuckDB; writes manifest.json (model_id/dim/schema_version). Adds Indexer::reindex_vectors().
  4. Indexer::merge_from_attached(alias, on_collision) — the DuckDB→DuckDB transfer: drop HNSW → INSERT … BY NAME SELECT (skip/replace/error) → recreate HNSW → refresh FTS. Vectors copy verbatim.
  5. copy_files — blobs + overlay markdown LaneStore→LaneStore, files-first (crash-safe ordering).
  6. transfer fn + escurel-loader operator binary — validates the manifest against the live tenant's --expect-model before touching anything; fails closed on mismatch.
  7. Docs — design note, ops runbook, discovered note.

Notable decisions / findings

  • The binary lives in escurel-loader, not escurel-cli: the transfer must ATTACH two DuckDB files on disk, which can't go through the gateway HTTP client surface that escurel-cli is built on. escurel-cli stays a pure presentation layer. (Deviation from the plan's "escurel-cli/src/admin.rs" — host-side disk ops are a hard mismatch for the all-HTTP CLI.)
  • Bug fixed (caught by the idempotent re-transfer test): a transfer into an already-migrated tenant re-ran Migrator::up → "Table pages already exists". Now gates up on fresh = !db_path.exists(), mirroring the server boot recipe.
  • Gemini Batch-API embedder (~half price) is deferred — --embedder already abstracts the space and the manifest gate makes adding gemini-batch:<model> safe later.

Test plan

All no-mock E2E (real DuckDB files, real FsStore, real ATTACH, real extraction):

  • crates/escurel-index/tests/write_document_blocks.rs — supplied vectors stored byte-equal; count-mismatch fails closed.
  • crates/escurel-index/tests/merge_from_attached.rs::{merge_skip_imports_new_pages_keeps_existing_and_copies_vectors_verbatim, merge_replace_overwrites_colliding_pages, merge_error_aborts_on_collision} — target uses a ZeroEmbedder so the byte-identical imported vector is the no-re-embed proof.
  • crates/escurel-loader/tests/build.rs — loader build materialises docs with non-NULL vectors + manifest.
  • crates/escurel-loader/tests/copy_files.rs — blobs byte-equal + overlays re-keyed + idempotent.
  • crates/escurel-loader/tests/transfer.rs::{transfer_carries_embeddings_verbatim_and_is_idempotent, transfer_aborts_on_embedder_model_mismatch} — capstone: build (HashEmbedder, non-zero vectors) → transfer → live dense_vec byte-identical to loader's, instances search-hittable, re-transfer idempotent, model-mismatch aborts before any rows move.
  • Plus model_id unit tests across escurel-embed.

Four-check gate green locally: cargo fmt --check, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace --all-targets, cargo build --workspace --release.

🤖 Generated with Claude Code

jrosskopf and others added 8 commits June 23, 2026 06:51
The offline batch loader + DuckDB->DuckDB transfer must refuse to import
vectors produced by a different embedding model (mixing embedding spaces
silently destroys retrieval). The Embedder trait exposed only dim(); add
model_id() so (model_id, dim) identifies the space, recorded in the loader's
artifact manifest and validated on transfer.

- trait: fn model_id(&self) -> String (owned, so ReloadableEmbedder can
  delegate through its ArcSwap guard; called rarely).
- impls: Zero="zero", Hash="hash", Gemini=its model name, Candle=the repo/model
  id (new Inner.model_id field threaded through from_local/from_hf_hub),
  Reloadable delegates to the live inner.
- test SlowEmbedder (escurel-index) updated.

Red→green: model_id_tests assert stable, distinct ids (+ gemini under feature).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extract Indexer::write_document_blocks(page_id, overlay, chunks, vectors) — the
embed-free write half — and make materialize_document = embed then
write_document_blocks. This lets the offline batch loader supply vectors from a
separate (e.g. Gemini Batch API) pass, and underpins the no-re-embed guarantee:
the write path stores the given vectors verbatim, never consulting the embedder.
Validates vectors.len()==chunks.len() and each dim==768, fail-closed.

Behaviour of materialize_document is unchanged (existing document tests green).

Red→green (no mocks): tests/write_document_blocks.rs — with a ZeroEmbedder
indexer, writing distinct non-zero sentinel vectors and reopening the DuckDB
file reads them back verbatim (would be zero if it re-embedded); count-mismatch
fails closed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New crate `escurel-loader`: a LoaderBuilder that runs the existing
extract→chunk→embed→materialise pipeline (DocumentIngestWorker →
Indexer::materialize_document) IN-PROCESS against a throwaway loader instance
(its own DuckDB + blob dir), at full speed with no HTTP/quota. Walks a source
dir, infers MIME by extension, deposits each file as an inbox blob, and
materialises it under the given document skill. instance_id = content sha256
(idempotent + dedup). Rebuilds the HNSW vector index + BM25 FTS once at the
end. Writes manifest.json {model_id, dim, schema_version, skill, counts} — the
compatibility contract the transfer will validate.

Also: Indexer::reindex_vectors() (drop+recreate hnsw_blocks_vec — the
bulk-load fast path; search stays correct via the cosine scan), and
Migrator::SCHEMA_VERSION (manual schema-version stand-in for the manifest).

Red→green (no mocks): tests/build.rs — two text files + HashEmbedder (offline,
768) → 2 instances, chunks with non-NULL dense_vec, manifest hash@768, blobs
promoted to canonical, a non-ingestable extension skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…beddings

The import half of the offline batch loader. `Indexer::merge_from_attached`
copies `pages`/`links`/`blocks` from an already-`attach_external`ed read-only
DuckDB into the live tenant verbatim — `blocks.dense_vec` rows transfer as data,
so the live tenant NEVER re-embeds an offline-built corpus.

- `OnCollision::{Skip,Replace,Error}` — Skip (default) is additive + idempotent
  (re-run resumes); Replace delete-then-inserts colliding `page_id`s; Error
  aborts the whole merge if any source `page_id` already exists.
- Bulk-load fast path: DROP HNSW → INSERT ... BY NAME SELECT → recreate HNSW,
  then refresh_fts() once (BM25 snapshot). `INSERT ... BY NAME` is column-order
  robust; alias re-validated via is_valid_attach_alias before splicing.
- Skip/Error snapshot the pre-existing page_id set into a TEMP table so
  pages/links/blocks all filter against the same set (pages inserts first).

E2E (crates/escurel-index/tests/merge_from_attached.rs, real DuckDB + attach,
no mocks): the target's embedder is a ZeroEmbedder, so the assertion that an
imported chunk's dense_vec head reads back as the SOURCE's 0.22 (not 0.0) is the
no-re-embed proof. Covers skip-keeps-existing, replace-overwrites,
error-aborts-on-collision, and that imported chunks are FTS-searchable post-merge.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s-first)

The files-first half of a transfer, run BEFORE merge_from_attached copies the
DuckDB rows. `copy_files(loader_dir, live, live_tenant)`:

- copies every canonical blob (content-addressed put_blob → idempotent id),
  blobs first so chunk rows never reference a missing blob;
- re-keys each overlay markdown (markdown/instances/<skill>/<id>.md) verbatim
  under the live tenant.

Ordering guarantees a crash mid-transfer leaves at worst orphan blobs (which
audit_documents/reclaim_orphan_blobs sweep), never rows pointing at absent
content. Re-runs are idempotent. Added From<KeyError> to LoaderError.

E2E (crates/escurel-loader/tests/copy_files.rs, real FsStores): a 2-doc loader
artifact copies into a fresh live tenant — 2 blobs byte-equal across stores, 2
overlays re-keyed under markdown/instances/, and a second run produces the same
tallies with no duplicate blobs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… E2E

Completes the offline batch loader as an operator-usable tool.

`escurel_loader::transfer(from, to, tenant, expect_model, on_collision)`:
validate manifest (model_id == expect_model, dim == 768, schema_version ==
Migrator::SCHEMA_VERSION) → copy_files (blobs + overlays) → attach loader DuckDB
read-only → merge_from_attached (rows). Fails closed on any mismatch BEFORE
touching the target — mixing embedding spaces silently wrecks retrieval. Uses a
ZeroEmbedder placeholder for the live Indexer since the merge never embeds.

New thin binary `escurel-loader` (build|transfer subcommands, JSON output). It
lives here, not in escurel-cli, on purpose: the transfer must ATTACH two DuckDB
files on disk, which cannot go through the gateway HTTP client surface that
escurel-cli is built on. escurel-cli stays a pure presentation layer.

Bug found + fixed by the idempotent-re-transfer test: a transfer into an
ALREADY-migrated live tenant re-ran Migrator::up and hit "Table pages already
exists". Now gates `up` on `fresh = !db_path.exists()` and always runs the
idempotent load_extensions + ensure_* — mirroring the server boot recipe
(config.rs).

Capstone E2E (crates/escurel-loader/tests/transfer.rs, real DuckDB + FsStore +
attach, no mocks): build a 2-doc HashEmbedder artifact, transfer into a fresh
live dir, assert transferred dense_vec is byte-identical to the loader's (the
no-re-embed proof — Hash vectors are non-zero), merged instances are
search-hittable (HNSW + FTS), re-transfer is idempotent (all collide, counts
stable), and a model-mismatch manifest aborts before any rows move.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- docs/notes/2026-06-23-batch-loader.md — the offline-loader + DuckDB→DuckDB
  transfer concept: why (quota/cost wall), the architecture, the embedder-model
  compatibility gate, idempotency + files-first crash-safety, and why it's a
  separate binary rather than an escurel-cli/MCP surface.
- docs/operations.md — "Bulk-loading a large corpus" runbook with the
  build + transfer command sequence and the --expect-model safety note.
- docs/notes/discovered/2026-06-23-loader-transfer-hnsw-fts-and-fresh-migrate.md
  — the three traps: Migrator::up is fresh-only (gate on !db_path.exists()),
  HNSW drop/recreate around the bulk insert, FTS refresh-once-after; plus the
  ZeroEmbedder-placeholder trick for proving no-re-embed in tests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…icity docs

Triaged second-opinion (codex) findings on PR #198. Codex found no SQL-injection
hole and no unwrap-on-input; the rest were hardening. Applied the clear-win fixes:

- on_collision=error now PREFLIGHTS collisions before copy_files: attach the
  source read-only up front, count colliding page_ids, and abort with "nothing
  was copied" BEFORE mutating the live LaneStore. error means "abort if anything
  collides" — it must not leave files behind. New Indexer::attached_page_collisions
  (cheap, read-only) backs this.
- merge_from_attached: CREATE OR REPLACE TEMP TABLE _existing, so a `_existing`
  left over from a merge that crashed between CREATE and DROP on the long-lived
  connection can't wedge every subsequent merge.
- Honest docstring: only the row INSERTs are transactional; the HNSW
  drop/recreate + FTS refresh sit outside it. Documented that both are
  self-healing (cosine scan works without HNSW; re-run/reindex/rebuild restores
  them) — row state is never corrupted, only its indexes.

New E2E: transfer_error_policy_aborts_on_collision_without_mutating — seed a
tenant, then an error-policy transfer aborts with "nothing was copied" and the
row state is byte-for-byte unchanged.

Deferred (documented follow-ups, not happy-path bugs): persist the live tenant's
embedder model_id so the gate is target-trust not operator-asserted --expect-model;
read the live DB's actual schema_version rather than only the artifact-vs-binary
check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@jrosskopf jrosskopf merged commit ac89894 into main Jun 23, 2026
1 check passed
@jrosskopf jrosskopf deleted the loader/offline-batch branch June 23, 2026 08:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant