Land the SetFit/ONNX intent router and data pipeline by danielbusnz · Pull Request #1 · danielbusnz/routelet

danielbusnz · 2026-05-29T13:55:41Z

Why

main still holds only the project scaffold. Everything that makes routelet an
actual on-device intent router was built on setfit-onnx-export: the trainer,
the ONNX export path, the eval sets, the labeled data, and the distillation data
loop. The branch is eight commits ahead and main is a clean ancestor, so this
graduates that work onto main in one fast-forward.

What

SetFit trainer and the ONNX export path (split export, skew-safe preprocess
shared between train and inference, int8-friendly graph).
Dataset growth and calibration: the holdout grows to 200 rows (40 per class), and
an out-of-distribution tool-routing eval is added.
CI with hermetic data and augment tests, lint fixes, and docs on how the model
was produced.
A shared routelet.teacher module: the taxonomy prompt and intent schema now
live in one place, imported by both the eval baseline and the new labeler so
they can't drift.
The distillation data pipeline: Scripts/pull_samples.py compacts the proxy's
R2 sample objects into one JSONL, and Scripts/ingest_samples.py labels them
(the free in-turn Claude label, or the offline teacher for the rest), drops
pool and eval-set duplicates, and writes data/collected.jsonl.

How

This is a long-lived feature branch catching main up, not a single tidy
concern, so the commits are kept atomic and a rebase-merge preserves the history
for git bisect. The pipeline scripts depend on boto3 and anthropic, which
are installed into the venv as tooling rather than declared in pyproject.toml.
Raw pulled samples land in data/raw/, which is gitignored.

Test plan

pytest: 13 passed (2397 subtests).
ruff check clean on the changed files.
ingest_samples.py --dry-run exercised on fixtures: the client-label, teacher-routing,
dedup, eval-leak, and bad-label paths all behave as expected.

train_setfit.py trains the bge-small SetFit model and saves it to models/setfit so it can be exported, unlike setfit_baseline.py which trains then discards the model. Holdout accuracy is 0.77 against the 0.60 TF-IDF floor. export_onnx.py reads the label order from the fitted head, writes labels.json, runs a PyTorch-vs-ONNX parity check, and prints the graph I/O. The fused export itself is still blocked: setfit 1.1.3's export_onnx cannot merge the opset-18 encoder with the opset-13 sklearn head on torch 2.12. Next step is a separate encoder ONNX plus a Rust-side head.
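To make the "separate encoder ONNX plus a consumer-side head" idea concrete, here is a minimal sketch of what such a head has to compute, written in Python for brevity even though the real one is planned for Rust. It assumes a multinomial logistic-regression head; the argument names (coef, intercept, labels, temperature) and the toy numbers are hypothetical, not the layout of any file in this repo. The temperature divisor is included because a later commit in this PR writes a calibration scalar next to the head weights.

```python
import math


def apply_head(embedding, coef, intercept, labels, temperature=1.0):
    """Score one embedding with a multinomial logistic-regression head.

    coef holds one weight row per class and intercept one bias per class.
    All argument names are hypothetical.
    """
    logits = [
        (sum(w * x for w, x in zip(row, embedding)) + b) / temperature
        for row, b in zip(coef, intercept)
    ]
    # Subtract the largest logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Toy 2-dimensional embedding and a 2-class head, for illustration only.
label, probs = apply_head(
    embedding=[1.0, 0.0],
    coef=[[2.0, 0.0], [0.0, 2.0]],
    intercept=[0.0, 0.0],
    labels=["find_action", "integration"],
)
```

Because the head is only a matrix-vector product and a softmax, keeping it outside the ONNX graph sidesteps the opset mismatch described above, at the cost of the consumer having to apply the label order itself.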

Replace the broken fused SetFit export with a split pipeline: export the encoder to embedder.onnx (CLS pooling and L2 norm baked in, opset 14 so tract can load it) and dump the LR head to head.json for the consumer to apply. Add preprocess.py, one intent-independent normalization (secret-keyword, email, digit-run redaction) applied at both training and inference so there is no train/serve skew. A conformance fixture in tests/ pins it; an identical copy lives in the aegis repo. Add augment.py to generate disfluent voice-style variants into data/augmented.jsonl, and expand the holdout to 124 balanced rows. Train with unique sampling, two epochs, and a class-balanced head, then fit a temperature scalar (written to head.json) for confidence calibration. Holdout 0.69 -> 0.87.
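As a rough illustration of the train/serve skew point, the sketch below shows one normalization function that both the training loader and the inference path would call. The regex patterns and the placeholder tokens (<email>, <secret>, <num>) are assumptions made for this example; the actual rules in preprocess.py are not shown in this PR.

```python
import re

# Hypothetical patterns and placeholder tokens, for illustration only.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
_SECRET = re.compile(
    r"\b(password|passcode|token|api key)\b(\s+is)?\s+\S+", re.IGNORECASE
)
_DIGITS = re.compile(r"\d{4,}")


def normalize(text: str) -> str:
    """Redact emails, values following a secret keyword, and long digit runs."""
    text = _EMAIL.sub("<email>", text)
    # Keep the keyword itself, replace whatever value follows it.
    text = _SECRET.sub(lambda m: f"{m.group(1).lower()} <secret>", text)
    text = _DIGITS.sub("<num>", text)
    # Lowercase and collapse whitespace so equivalent inputs compare equal.
    return " ".join(text.lower().split())


raw = "Email bob@example.com my password is hunter2 and call 5551234"
train_text = normalize(raw)   # what the trainer would see
serve_text = normalize(raw)   # what the runtime would see
```

The function does not look at the intent label, so it can run on unlabeled traffic. A conformance fixture like the one mentioned above would then pin input/output pairs so a second implementation can be checked against the same cases.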

Add a GitHub Actions workflow that runs ruff and the unittest suite on the base deps only (no torch/setfit), plus two stdlib-unittest files: data validity (every training and holdout row is valid JSON with a known intent, no duplicate training texts, zero overlap between training and the holdout) and augmenter checks (deterministic, label-preserving, no holdout leakage, no chaining of integration/find_action). Wrap long lines and rename an ambiguous variable so ruff passes clean.
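For readers who want a feel for the data-validity checks, here is a minimal stdlib-unittest sketch. It uses inline JSONL strings so it stays hermetic; the row schema ({"text", "intent"}), the two-intent set, and the example rows are assumptions for this illustration, not the repo's actual test file.

```python
import json
import unittest

# Hypothetical two-intent taxonomy and row schema ({"text", "intent"}).
INTENTS = {"find_action", "integration"}

TRAIN_JSONL = """\
{"text": "where is the export button", "intent": "find_action"}
{"text": "hook this up to slack", "intent": "integration"}
"""
HOLDOUT_JSONL = """\
{"text": "uh where do i, like, export", "intent": "find_action"}
"""


def load_rows(jsonl: str) -> list[dict]:
    # json.loads raises on a malformed row, which fails the run loudly.
    return [json.loads(line) for line in jsonl.splitlines() if line.strip()]


class DataValidity(unittest.TestCase):
    def setUp(self):
        self.train = load_rows(TRAIN_JSONL)
        self.holdout = load_rows(HOLDOUT_JSONL)

    def test_known_intents(self):
        for row in self.train + self.holdout:
            with self.subTest(text=row["text"]):
                self.assertIn(row["intent"], INTENTS)

    def test_no_duplicate_training_texts(self):
        texts = [row["text"] for row in self.train]
        self.assertEqual(len(texts), len(set(texts)))

    def test_zero_train_holdout_overlap(self):
        train_texts = {row["text"] for row in self.train}
        holdout_texts = {row["text"] for row in self.holdout}
        self.assertEqual(train_texts & holdout_texts, set())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DataValidity)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real test file the last two lines would normally be dropped in favor of discovery via python -m unittest. The per-row subTest is what lets a single test method report each bad row separately.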

Add 76 novel disfluent rows so every intent has 40 eval examples, tightening the per-class confidence intervals (find_action was 23). Verified zero duplicates and zero overlap with training. Current model scores 0.89 on it.
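The effect of going from 23 to 40 examples in a class can be checked with a quick interval calculation. The sketch below uses the Wilson score interval at 95% (z = 1.96) and an illustrative per-class hit rate of roughly 0.9; both the choice of interval and the hit rate are assumptions for this example, not numbers reported in this PR.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Interval width for the same (illustrative) hit rate at both sample sizes.
widths = {}
for n in (23, 40):
    correct = round(0.9 * n)
    low, high = wilson_interval(correct, n)
    widths[n] = high - low
    print(f"n={n}: [{low:.3f}, {high:.3f}] width={widths[n]:.3f}")
```

The width shrinks roughly with the square root of the sample size, which is why bringing the smallest class up to 40 rows matters more than adding rows to classes that were already near that size.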

Add docs/model.md: the end-to-end build (data, preprocess, SetFit train, calibration, eval, ONNX export) with the honest holdout numbers and the tool-routing exploration that did not pan out (retrieval failed, classify-into-tools hit 0.63 out-of-distribution, not shipped). Fix the data-validity tests: data/*.jsonl is gitignored, so skip the data-pool checks when it is absent (CI) instead of failing. The holdout check still runs. Add evals/tool_routing_ood.jsonl, the out-of-distribution tool probe.
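The "skip when the gitignored pool is absent" behavior can be sketched with unittest.skipUnless. A temporary directory stands in for a fresh CI checkout here, and the file names (data/pool.jsonl, evals/holdout.jsonl) are hypothetical; only the pattern is the point.

```python
import json
import tempfile
import unittest
from pathlib import Path

# Stand-in for a fresh CI checkout: the pool file is never created,
# while the committed holdout file is present. Both paths are hypothetical.
REPO = Path(tempfile.mkdtemp())
POOL = REPO / "data" / "pool.jsonl"
HOLDOUT = REPO / "evals" / "holdout.jsonl"
HOLDOUT.parent.mkdir(parents=True)
HOLDOUT.write_text('{"text": "uh find the export button", "intent": "find_action"}\n')


def load_rows(path: Path) -> list[dict]:
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]


class DataFiles(unittest.TestCase):
    @unittest.skipUnless(POOL.exists(), "data pool is gitignored and absent in this checkout")
    def test_pool_rows_parse(self):
        self.assertTrue(load_rows(POOL))

    def test_holdout_rows_parse(self):
        # The holdout is committed, so this check always runs.
        self.assertTrue(load_rows(HOLDOUT))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(DataFiles)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

A skip is reported separately from a pass, so the CI log still shows that the pool checks did not run rather than silently counting them as green.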

evaluate.py and the new ingest labeler import SYSTEM and the intent schema from routelet.teacher, so the eval and the labeler can't drift.

pull_samples.py compacts the proxy's R2 objects into one JSONL. ingest_samples.py labels them (the free Claude-fallback label or the offline teacher), drops pool and eval-set duplicates, and writes data/collected.jsonl.
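The labeling and dedup order described here can be sketched as below. This is a simplified stand-in, not the actual ingest_samples.py: the sample schema, the client_label field name, the two-intent set, and the teacher callable are all assumptions for illustration, and the teacher is stubbed so nothing calls a paid API.

```python
import json

INTENTS = {"find_action", "integration"}  # hypothetical subset of the taxonomy


def ingest(samples, pool_texts, eval_texts, teacher):
    """Label samples, dropping pool duplicates, eval leaks, and bad labels."""
    seen = {t.strip().lower() for t in pool_texts}
    blocked = {t.strip().lower() for t in eval_texts}
    kept = []
    for sample in samples:
        text = sample["text"]
        key = text.strip().lower()
        # Dedup before labeling so the teacher is never asked about a row
        # that would be dropped anyway.
        if key in seen or key in blocked:
            continue
        # Prefer the label that came with the sample; fall back to the teacher.
        label = sample.get("client_label") or teacher(text)
        if label not in INTENTS:
            continue
        seen.add(key)
        kept.append({"text": text, "intent": label})
    return kept


teacher_calls = []


def stub_teacher(text):
    teacher_calls.append(text)
    return "integration"


samples = [
    {"text": "where do I export this", "client_label": "find_action"},
    {"text": "connect my calendar"},
    {"text": "Connect my calendar "},                         # duplicate of the row above
    {"text": "find the settings page"},                       # leaks from the eval set
    {"text": "tell me a joke", "client_label": "chitchat"},   # unknown label
    {"text": "open the billing tab"},                         # already in the pool
]
kept = ingest(
    samples,
    pool_texts=["open the billing tab"],
    eval_texts=["find the settings page"],
    teacher=stub_teacher,
)
collected_jsonl = "\n".join(json.dumps(row) for row in kept)
```

Checking duplicates first is a cost choice as much as a correctness one: in this sketch the stubbed teacher is consulted once for six samples.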

report.py scores TF-IDF, the shipped SetFit model, and Claude Haiku on the frozen 200-row holdout and writes the model-comparison figure plus metrics.json. refresh_haiku_baseline.py re-runs the paid Haiku eval and caches the result so regenerating the report does not re-spend.
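The "cache the paid eval so the report can be regenerated for free" idea reduces to a small read-through cache. The sketch below is an assumption about the shape of such a helper, not the code in refresh_haiku_baseline.py; the function name, cache path, and stubbed result are hypothetical, and the stub stands in for the paid call.

```python
import json
import tempfile
from pathlib import Path


def cached_eval(cache_path: Path, run_eval, refresh: bool = False) -> dict:
    """Return the cached result unless it is missing or a refresh is forced."""
    if cache_path.exists() and not refresh:
        return json.loads(cache_path.read_text())
    result = run_eval()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(result, indent=2))
    return result


calls = []


def fake_paid_eval() -> dict:
    # Stand-in for the paid model call; records each invocation.
    calls.append(1)
    return {"model": "stub", "rows": 200}


cache = Path(tempfile.mkdtemp()) / "baselines" / "haiku.json"
first = cached_eval(cache, fake_paid_eval)                  # runs the eval
second = cached_eval(cache, fake_paid_eval)                 # served from cache
third = cached_eval(cache, fake_paid_eval, refresh=True)    # forced re-run
```

Splitting the refresh into its own script, as this commit does, means the spend happens only when someone asks for it explicitly.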

Remove the unused data.split(). Move the web-server deps (fastapi, uvicorn, pydantic) to an optional 'serve' extra since nothing imports them yet. Move the tool-routing prototypes and their OOD eval into experiments/, excluded from ruff. Rewrite the README file map and fix train.py's stale docstring.

danielbusnz added 10 commits May 27, 2026 09:53

Document the planned performance report

ac4e6f8

Grow holdout to 200 rows, 40 per class

ebb87a6


Extract the teacher prompt into a shared module

9864ecf


Add R2 sample pull and teacher-label ingest

d17f504


danielbusnz merged commit 224b239 into main May 29, 2026
3 checks passed

danielbusnz mentioned this pull request May 29, 2026

Tidy comments, remove serve.py, sharpen teacher prompt #2

Merged

danielbusnz merged 10 commits into main from setfit-onnx-export
