feat(loader): offline batch loader + DuckDB→DuckDB transfer (carry embeddings, no re-embed)#198
Merged
Conversation
The offline batch loader + DuckDB->DuckDB transfer must refuse to import vectors produced by a different embedding model (mixing embedding spaces silently destroys retrieval). The Embedder trait exposed only dim(); add model_id() so (model_id, dim) identifies the space, recorded in the loader's artifact manifest and validated on transfer. - trait: fn model_id(&self) -> String (owned, so ReloadableEmbedder can delegate through its ArcSwap guard; called rarely). - impls: Zero="zero", Hash="hash", Gemini=its model name, Candle=the repo/model id (new Inner.model_id field threaded through from_local/from_hf_hub), Reloadable delegates to the live inner. - test SlowEmbedder (escurel-index) updated. Red→green: model_id_tests assert stable, distinct ids (+ gemini under feature). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extract Indexer::write_document_blocks(page_id, overlay, chunks, vectors) — the embed-free write half — and make materialize_document = embed then write_document_blocks. This lets the offline batch loader supply vectors from a separate (e.g. Gemini Batch API) pass, and underpins the no-re-embed guarantee: the write path stores the given vectors verbatim, never consulting the embedder. Validates vectors.len()==chunks.len() and each dim==768, fail-closed. Behaviour of materialize_document is unchanged (existing document tests green). Red→green (no mocks): tests/write_document_blocks.rs — with a ZeroEmbedder indexer, writing distinct non-zero sentinel vectors and reopening the DuckDB file reads them back verbatim (would be zero if it re-embedded); count-mismatch fails closed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
New crate `escurel-loader`: a LoaderBuilder that runs the existing
extract→chunk→embed→materialise pipeline (DocumentIngestWorker →
Indexer::materialize_document) IN-PROCESS against a throwaway loader instance
(its own DuckDB + blob dir), at full speed with no HTTP/quota. Walks a source
dir, infers MIME by extension, deposits each file as an inbox blob, and
materialises it under the given document skill. instance_id = content sha256
(idempotent + dedup). Rebuilds the HNSW vector index + BM25 FTS once at the
end. Writes manifest.json {model_id, dim, schema_version, skill, counts} — the
compatibility contract the transfer will validate.
Also: Indexer::reindex_vectors() (drop+recreate hnsw_blocks_vec — the
bulk-load fast path; search stays correct via the cosine scan), and
Migrator::SCHEMA_VERSION (manual schema-version stand-in for the manifest).
Red→green (no mocks): tests/build.rs — two text files + HashEmbedder (offline,
768) → 2 instances, chunks with non-NULL dense_vec, manifest hash@768, blobs
promoted to canonical, a non-ingestable extension skipped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…beddings
The import half of the offline batch loader. `Indexer::merge_from_attached`
copies `pages`/`links`/`blocks` from an already-`attach_external`ed read-only
DuckDB into the live tenant verbatim — `blocks.dense_vec` rows transfer as data,
so the live tenant NEVER re-embeds an offline-built corpus.
- `OnCollision::{Skip,Replace,Error}` — Skip (default) is additive + idempotent
(re-run resumes); Replace delete-then-inserts colliding `page_id`s; Error
aborts the whole merge if any source `page_id` already exists.
- Bulk-load fast path: DROP HNSW → INSERT ... BY NAME SELECT → recreate HNSW,
then refresh_fts() once (BM25 snapshot). `INSERT ... BY NAME` is column-order
robust; alias re-validated via is_valid_attach_alias before splicing.
- Skip/Error snapshot the pre-existing page_id set into a TEMP table so
pages/links/blocks all filter against the same set (pages inserts first).
E2E (crates/escurel-index/tests/merge_from_attached.rs, real DuckDB + attach,
no mocks): the target's embedder is a ZeroEmbedder, so the assertion that an
imported chunk's dense_vec head reads back as the SOURCE's 0.22 (not 0.0) is the
no-re-embed proof. Covers skip-keeps-existing, replace-overwrites,
error-aborts-on-collision, and that imported chunks are FTS-searchable post-merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…s-first) The files-first half of a transfer, run BEFORE merge_from_attached copies the DuckDB rows. `copy_files(loader_dir, live, live_tenant)`: - copies every canonical blob (content-addressed put_blob → idempotent id), blobs first so chunk rows never reference a missing blob; - re-keys each overlay markdown (markdown/instances/<skill>/<id>.md) verbatim under the live tenant. Ordering guarantees a crash mid-transfer leaves at worst orphan blobs (which audit_documents/reclaim_orphan_blobs sweep), never rows pointing at absent content. Re-runs are idempotent. Added From<KeyError> to LoaderError. E2E (crates/escurel-loader/tests/copy_files.rs, real FsStores): a 2-doc loader artifact copies into a fresh live tenant — 2 blobs byte-equal across stores, 2 overlays re-keyed under markdown/instances/, and a second run produces the same tallies with no duplicate blobs. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… E2E Completes the offline batch loader as an operator-usable tool. `escurel_loader::transfer(from, to, tenant, expect_model, on_collision)`: validate manifest (model_id == expect_model, dim == 768, schema_version == Migrator::SCHEMA_VERSION) → copy_files (blobs + overlays) → attach loader DuckDB read-only → merge_from_attached (rows). Fails closed on any mismatch BEFORE touching the target — mixing embedding spaces silently wrecks retrieval. Uses a ZeroEmbedder placeholder for the live Indexer since the merge never embeds. New thin binary `escurel-loader` (build|transfer subcommands, JSON output). It lives here, not in escurel-cli, on purpose: the transfer must ATTACH two DuckDB files on disk, which cannot go through the gateway HTTP client surface that escurel-cli is built on. escurel-cli stays a pure presentation layer. Bug found + fixed by the idempotent-re-transfer test: a transfer into an ALREADY-migrated live tenant re-ran Migrator::up and hit "Table pages already exists". Now gates `up` on `fresh = !db_path.exists()` and always runs the idempotent load_extensions + ensure_* — mirroring the server boot recipe (config.rs). Capstone E2E (crates/escurel-loader/tests/transfer.rs, real DuckDB + FsStore + attach, no mocks): build a 2-doc HashEmbedder artifact, transfer into a fresh live dir, assert transferred dense_vec is byte-identical to the loader's (the no-re-embed proof — Hash vectors are non-zero), merged instances are search-hittable (HNSW + FTS), re-transfer is idempotent (all collide, counts stable), and a model-mismatch manifest aborts before any rows move. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- docs/notes/2026-06-23-batch-loader.md — the offline-loader + DuckDB→DuckDB transfer concept: why (quota/cost wall), the architecture, the embedder-model compatibility gate, idempotency + files-first crash-safety, and why it's a separate binary rather than an escurel-cli/MCP surface. - docs/operations.md — "Bulk-loading a large corpus" runbook with the build + transfer command sequence and the --expect-model safety note. - docs/notes/discovered/2026-06-23-loader-transfer-hnsw-fts-and-fresh-migrate.md — the three traps: Migrator::up is fresh-only (gate on !db_path.exists()), HNSW drop/recreate around the bulk insert, FTS refresh-once-after; plus the ZeroEmbedder-placeholder trick for proving no-re-embed in tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…icity docs Triaged second-opinion (codex) findings on PR #198. Codex found no SQL-injection hole and no unwrap-on-input; the rest were hardening. Applied the clear-win fixes: - on_collision=error now PREFLIGHTS collisions before copy_files: attach the source read-only up front, count colliding page_ids, and abort with "nothing was copied" BEFORE mutating the live LaneStore. error means "abort if anything collides" — it must not leave files behind. New Indexer::attached_page_collisions (cheap, read-only) backs this. - merge_from_attached: CREATE OR REPLACE TEMP TABLE _existing, so a `_existing` left over from a merge that crashed between CREATE and DROP on the long-lived connection can't wedge every subsequent merge. - Honest docstring: only the row INSERTs are transactional; the HNSW drop/recreate + FTS refresh sit outside it. Documented that both are self-healing (cosine scan works without HNSW; re-run/reindex/rebuild restores them) — row state is never corrupted, only its indexes. New E2E: transfer_error_policy_aborts_on_collision_without_mutating — seed a tenant, then an error-policy transfer aborts with "nothing was copied" and the row state is byte-for-byte unchanged. Deferred (documented follow-ups, not happy-path bugs): persist the live tenant's embedder model_id so the gate is target-trust not operator-asserted --expect-model; read the live DB's actual schema_version rather than only the artifact-vs-binary check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Offline batch loader + DuckDB→DuckDB transfer that loads a large corpus (e.g.
~20k PDFs) without hitting the live server's per-tenant Embeds quota or paying
to re-embed in production. The expensive step (extract → chunk → embed)
happens once, offline, in a throwaway loader instance; the result transfers
into the live tenant carrying the embeddings as data — prod never re-embeds.
Design note:
docs/notes/2026-06-23-batch-loader.md.Runbook:
docs/operations.md→ "Bulk-loading a large corpus".What's in it (7 logical changes, one branch)
Embedder::model_id()— pins the embedding space (Zero/Hash/Candle/Gemini/Reloadable).materialize_documentintoembed → write_document_blocks(page_id, overlay, chunks, vectors)— lets the transfer supply precomputed vectors (the no-re-embed path).escurel-loadercrate —LoaderBuilderruns the real extract→chunk→embed→materialise pipeline in-process against a standalone DuckDB; writesmanifest.json(model_id/dim/schema_version). AddsIndexer::reindex_vectors().Indexer::merge_from_attached(alias, on_collision)— the DuckDB→DuckDB transfer: drop HNSW →INSERT … BY NAME SELECT(skip/replace/error) → recreate HNSW → refresh FTS. Vectors copy verbatim.copy_files— blobs + overlay markdown LaneStore→LaneStore, files-first (crash-safe ordering).transferfn +escurel-loaderoperator binary — validates the manifest against the live tenant's--expect-modelbefore touching anything; fails closed on mismatch.Notable decisions / findings
escurel-loader, notescurel-cli: the transfer mustATTACHtwo DuckDB files on disk, which can't go through the gateway HTTP client surface thatescurel-cliis built on.escurel-clistays a pure presentation layer. (Deviation from the plan's "escurel-cli/src/admin.rs" — host-side disk ops are a hard mismatch for the all-HTTP CLI.)Migrator::up→ "Table pages already exists". Now gatesuponfresh = !db_path.exists(), mirroring the server boot recipe.--embedderalready abstracts the space and the manifest gate makes addinggemini-batch:<model>safe later.Test plan
All no-mock E2E (real DuckDB files, real FsStore, real
ATTACH, real extraction):crates/escurel-index/tests/write_document_blocks.rs— supplied vectors stored byte-equal; count-mismatch fails closed.crates/escurel-index/tests/merge_from_attached.rs::{merge_skip_imports_new_pages_keeps_existing_and_copies_vectors_verbatim, merge_replace_overwrites_colliding_pages, merge_error_aborts_on_collision}— target uses a ZeroEmbedder so the byte-identical imported vector is the no-re-embed proof.crates/escurel-loader/tests/build.rs— loader build materialises docs with non-NULL vectors + manifest.crates/escurel-loader/tests/copy_files.rs— blobs byte-equal + overlays re-keyed + idempotent.crates/escurel-loader/tests/transfer.rs::{transfer_carries_embeddings_verbatim_and_is_idempotent, transfer_aborts_on_embedder_model_mismatch}— capstone: build (HashEmbedder, non-zero vectors) → transfer → livedense_vecbyte-identical to loader's, instances search-hittable, re-transfer idempotent, model-mismatch aborts before any rows move.model_idunit tests acrossescurel-embed.Four-check gate green locally:
cargo fmt --check,cargo clippy --workspace --all-targets -- -D warnings,cargo test --workspace --all-targets,cargo build --workspace --release.🤖 Generated with Claude Code