Land the SetFit/ONNX intent router and data pipeline #1

Merged
danielbusnz merged 10 commits into main from setfit-onnx-export
May 29, 2026
Conversation

@danielbusnz

Owner

Why

main still holds only the project scaffold. Everything that makes routelet an
actual on-device intent router was built on setfit-onnx-export: the trainer,
the ONNX export path, the eval sets, the labeled data, and the distillation data
loop. The branch is eight commits ahead and main is a clean ancestor, so this
graduates that work onto main in one fast-forward.

What

  • SetFit trainer and the ONNX export path (split export, skew-safe preprocess
    shared between train and inference, int8-friendly graph).
  • Dataset growth and calibration: the holdout grows to 200 rows (40 per class),
    and an out-of-distribution tool-routing eval is added.
  • CI with hermetic data and augment tests, lint fixes, and docs on how the model
    was produced.
  • A shared routelet.teacher module: the taxonomy prompt and intent schema now
    live in one place, imported by both the eval baseline and the new labeler so
    they can't drift.
  • The distillation data pipeline: Scripts/pull_samples.py compacts the proxy's
    R2 sample objects into one JSONL, and Scripts/ingest_samples.py labels them
    (the free in-turn Claude label, or the offline teacher for the rest), drops
    pool and eval-set duplicates, and writes data/collected.jsonl.

How

This is a long-lived feature branch catching main up, not a single tidy
concern, so the commits are kept atomic and a rebase-merge preserves the history
for git bisect. The pipeline scripts depend on boto3 and anthropic, which
are installed into the venv as tooling rather than declared in pyproject.toml.
Raw pulled samples land in data/raw/, which is gitignored.

Test plan

  • pytest: 13 passed (2397 subtests).
  • ruff check clean on the changed files.
  • ingest_samples.py --dry-run exercised on fixtures: client-label, teacher
    routing, dedup, eval-leak, and bad-label paths all behave.

train_setfit.py trains the bge-small SetFit model and saves it to
models/setfit so it can be exported, unlike setfit_baseline.py which
trains then discards the model. Holdout accuracy is 0.77 against the
0.60 TF-IDF floor.

export_onnx.py reads the label order from the fitted head, writes
labels.json, runs a PyTorch-vs-ONNX parity check, and prints the graph
I/O. The fused export itself is still blocked: setfit 1.1.3's export_onnx
cannot merge the opset-18 encoder with the opset-13 sklearn head on torch
2.12. Next step is a separate encoder ONNX plus a Rust-side head.

Replace the broken fused SetFit export with a split pipeline: export the
encoder to embedder.onnx (CLS pooling and L2 norm baked in, opset 14 so tract
can load it) and dump the LR head to head.json for the consumer to apply.
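For reviewers wondering what "for the consumer to apply" involves: with the encoder emitting an L2-normalized embedding, the head reduces to a matrix-vector product and a softmax. A minimal sketch in Python, assuming head.json holds `classes`, `coef`, and `intercept` with one weight row per class (these key names are hypothetical, the real schema is not shown in this PR):

```python
import json
import math

def apply_head(head: dict, embedding: list[float], temperature: float = 1.0):
    """Score one embedding with a multinomial logistic-regression head.

    `head` is assumed to look like {"classes": [...], "coef": [[...], ...],
    "intercept": [...]}; these keys are illustrative, not the shipped schema.
    """
    # One logit per class: dot(weights, embedding) + bias.
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(head["coef"], head["intercept"])
    ]
    # Temperature-scaled, numerically stable softmax.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return head["classes"][best], probs[best]

# Toy two-class head with identity weights, purely for illustration.
toy_head = json.loads(
    '{"classes": ["find_action", "integration"],'
    ' "coef": [[1.0, 0.0], [0.0, 1.0]], "intercept": [0.0, 0.0]}'
)
label, prob = apply_head(toy_head, [0.6, 0.8])
```

The Rust consumer would do the same arithmetic; keeping the head as plain JSON avoids having to merge two ONNX graphs with different opsets.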

Add preprocess.py, one intent-independent normalization (secret-keyword,
email, digit-run redaction) applied at both training and inference so there
is no train/serve skew. A conformance fixture in tests/ pins it; an identical
copy lives in the aegis repo.
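To illustrate the kind of normalization described above, here is a rough sketch. The regexes and the `<SECRET>`, `<EMAIL>`, and `<NUM>` placeholders are assumptions for illustration, not the contents of preprocess.py; the point is that one pure function is called on both the training and the inference path.

```python
import re

# Illustrative patterns only; the real preprocess.py may differ.
SECRET_RE = re.compile(
    r"\b(password|token|secret|api[_ ]?key)\b(\s*(?:is\b|:|=)?\s*)(\S+)",
    re.IGNORECASE,
)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGITS_RE = re.compile(r"\d{4,}")

def normalize(text: str) -> str:
    """Redact secrets, emails, and long digit runs, then collapse whitespace."""
    # Replace the token that follows a secret keyword, keeping the keyword.
    text = SECRET_RE.sub(lambda m: f"{m.group(1)}{m.group(2)}<SECRET>", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = DIGITS_RE.sub("<NUM>", text)
    return " ".join(text.split())

example = normalize("email bob@example.com  my password is hunter2 and call 5551234")
```

A conformance fixture for a function like this would typically be a list of input/output pairs that both repositories assert against, plus an idempotence check (`normalize(normalize(x)) == normalize(x)`).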

Add augment.py to generate disfluent voice-style variants into
data/augmented.jsonl, and expand the holdout to 124 balanced rows. Train with
unique sampling, two epochs, and a class-balanced head, then fit a temperature
scalar (written to head.json) for confidence calibration. Holdout 0.69 -> 0.87.
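For context on the calibration step: temperature scaling divides the logits by a single scalar T chosen to minimize negative log-likelihood on held-out data. The sketch below uses a plain grid search on toy logits; how the scalar is actually fitted in this branch is not described here, so treat the search strategy and the data as assumptions.

```python
import math

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / temperature for z in logits]
        peak = max(scaled)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logits_batch, labels, grid=None):
    """Pick the temperature on a coarse grid that minimizes NLL."""
    grid = grid or [0.25 + 0.05 * i for i in range(96)]  # 0.25 .. 5.0
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

# Toy data: the model is equally confident on all three rows but wrong on one,
# so the fitted temperature should soften the probabilities (T > 1).
toy_logits = [[2.0, 0.0], [2.0, 0.0], [2.0, 0.0]]
toy_labels = [0, 0, 1]
fitted_t = fit_temperature(toy_logits, toy_labels)
```

Because T only rescales logits, it changes confidence but never the predicted class, which is why it can be stored next to the head weights without affecting accuracy.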

Add a GitHub Actions workflow that runs ruff and the unittest suite on the
base deps only (no torch/setfit), plus two stdlib-unittest files: data
validity (every training and holdout row is valid JSON with a known intent,
no duplicate training texts, zero overlap between training and the holdout)
and augmenter checks (deterministic, label-preserving, no holdout leakage,
no chaining of integration/find_action). Wrap long lines and rename an
ambiguous variable so ruff passes clean.

Add 76 novel disfluent rows so every intent has 40 eval examples, tightening
the per-class confidence intervals (find_action previously had only 23). Verified zero
duplicates and zero overlap with training. Current model scores 0.89 on it.
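To make the "tighter intervals" point concrete, the sketch below compares a 95% Wilson score interval at 23 and at 40 examples. The choice of the Wilson interval is an assumption, and 0.89 is the overall holdout score reused as a stand-in for a per-class accuracy, so the numbers are illustrative only.

```python
import math

def wilson_interval(p_hat: float, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Same assumed accuracy, two sample sizes: the interval narrows as n grows.
low_23, high_23 = wilson_interval(0.89, 23)
low_40, high_40 = wilson_interval(0.89, 40)
width_23 = high_23 - low_23
width_40 = high_40 - low_40
```

The interval width shrinks roughly with the square root of n, so balancing every class at 40 rows matters most for the classes that started smallest.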

Add docs/model.md: the end-to-end build (data, preprocess, SetFit train,
calibration, eval, ONNX export) with the honest holdout numbers and the
tool-routing exploration that did not pan out (retrieval failed, classify-into-
tools hit 0.63 out-of-distribution, not shipped).

Fix the data-validity tests: data/*.jsonl is gitignored, so skip the data-pool
checks when it is absent (CI) instead of failing. The holdout check still runs.

Add evals/tool_routing_ood.jsonl, the out-of-distribution tool probe.

evaluate.py and the new ingest labeler import SYSTEM and the intent
schema from routelet.teacher, so the eval and the labeler can't drift.

pull_samples.py compacts the proxy's R2 objects into one JSONL.
ingest_samples.py labels them (the free Claude-fallback label or the
offline teacher), drops pool and eval-set duplicates, and writes
data/collected.jsonl.
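The dedup step can be pictured roughly as below. This is a sketch under assumptions: duplicates are compared case- and whitespace-insensitively, rows carry `text` and `intent` fields, and eval-set matches are checked first so a leak is never counted as an ordinary duplicate. The real ingest_samples.py may differ on all three.

```python
def text_key(text: str) -> str:
    """Case- and whitespace-insensitive key (an assumed notion of 'duplicate')."""
    return " ".join(text.lower().split())

def drop_duplicates(samples, pool_texts, eval_texts):
    """Drop rows already in the training pool or in any eval set."""
    pool_keys = {text_key(t) for t in pool_texts}
    eval_keys = {text_key(t) for t in eval_texts}
    kept, dropped = [], {"eval_leak": 0, "pool_dup": 0}
    for sample in samples:
        key = text_key(sample["text"])
        if key in eval_keys:
            dropped["eval_leak"] += 1
        elif key in pool_keys:
            dropped["pool_dup"] += 1
        else:
            pool_keys.add(key)  # also dedupes within the incoming batch
            kept.append(sample)
    return kept, dropped

samples = [
    {"text": "Find the export button", "intent": "find_action"},
    {"text": "find the  export button", "intent": "find_action"},  # batch duplicate
    {"text": "connect my calendar", "intent": "integration"},  # already in pool
    {"text": "um where do I uh change the theme", "intent": "find_action"},  # in eval
]
kept, dropped = drop_duplicates(
    samples,
    pool_texts=["Connect my calendar"],
    eval_texts=["um where do I uh change the theme"],
)
```

Counting the two drop reasons separately makes eval leakage visible in a dry run instead of hiding it among routine duplicates.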

report.py scores TF-IDF, the shipped SetFit model, and Claude Haiku on
the frozen 200-row holdout and writes the model-comparison figure plus
metrics.json. refresh_haiku_baseline.py re-runs the paid Haiku eval and
caches the result so regenerating the report does not re-spend.
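The cache-then-reuse split between the two scripts might look like the following sketch. The function name, the cache file name, and the JSON shape are all hypothetical, and the accuracy value in the demo is a placeholder rather than a measured result.

```python
import json
import tempfile
from pathlib import Path

def cached_eval(cache_path: Path, run_eval, refresh: bool = False) -> dict:
    """Return the cached result unless it is missing or a refresh is forced."""
    if cache_path.exists() and not refresh:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    result = run_eval()  # the expensive (paid) call happens only here
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result

# Demo with a fake evaluator that records how often it is invoked.
calls = []

def fake_paid_eval() -> dict:
    calls.append(1)
    return {"accuracy": 0.5}  # placeholder value, not a real score

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "haiku_baseline.json"
    first = cached_eval(cache, fake_paid_eval)
    second = cached_eval(cache, fake_paid_eval)  # served from the cache file
```

In this arrangement the report script would only ever read the cache, while the refresh script is the one place that passes `refresh=True` and spends API budget.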

Remove the unused data.split(). Move the web-server deps (fastapi,
uvicorn, pydantic) to an optional 'serve' extra since nothing imports
them yet. Move the tool-routing prototypes and their OOD eval into
experiments/, excluded from ruff. Rewrite the README file map and fix
train.py's stale docstring.

danielbusnz merged commit 224b239 into main on May 29, 2026
3 checks passed