Land the SetFit/ONNX intent router and data pipeline #1
Merged
train_setfit.py trains the bge-small SetFit model and saves it to models/setfit so it can be exported, unlike setfit_baseline.py which trains then discards the model. Holdout accuracy is 0.77 against the 0.60 TF-IDF floor. export_onnx.py reads the label order from the fitted head, writes labels.json, runs a PyTorch-vs-ONNX parity check, and prints the graph I/O. The fused export itself is still blocked: setfit 1.1.3's export_onnx cannot merge the opset-18 encoder with the opset-13 sklearn head on torch 2.12. Next step is a separate encoder ONNX plus a Rust-side head.
Replace the broken fused SetFit export with a split pipeline: export the encoder to embedder.onnx (CLS pooling and L2 norm baked in, opset 14 so tract can load it) and dump the LR head to head.json for the consumer to apply. Add preprocess.py, one intent-independent normalization (secret-keyword, email, digit-run redaction) applied at both training and inference so there is no train/serve skew. A conformance fixture in tests/ pins it; an identical copy lives in the aegis repo. Add augment.py to generate disfluent voice-style variants into data/augmented.jsonl, and expand the holdout to 124 balanced rows. Train with unique sampling, two epochs, and a class-balanced head, then fit a temperature scalar (written to head.json) for confidence calibration. Holdout 0.69 -> 0.87.
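For the consumer side of the split export, applying the LR head with the fitted temperature could be sketched roughly as below. This is a minimal illustration, not the shipped code: the head.json keys (labels, coef, intercept, temperature), the two-dimensional embedding, and the "other" label are all assumptions, since the real file layout is not shown in this PR.

```python
import json
import math

# Hypothetical head.json contents; the real keys and shapes may differ.
HEAD_JSON = """
{
  "labels": ["find_action", "integration", "other"],
  "coef": [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
  "intercept": [0.0, 0.0, 0.0],
  "temperature": 2.0
}
"""


def apply_head(embedding, head):
    """Return (label, confidence) for one L2-normalized embedding."""
    # Logistic-regression logits: one dot product plus bias per class.
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + bias
        for row, bias in zip(head["coef"], head["intercept"])
    ]
    # Temperature scaling divides the logits before the softmax.
    scaled = [z / head["temperature"] for z in logits]
    # Subtract the max for numerical stability.
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return head["labels"][best], probs[best]


head = json.loads(HEAD_JSON)
label, confidence = apply_head([0.8, 0.6], head)
```

Dividing by a temperature above 1 softens the probabilities without changing which class wins, so calibration adjusts the reported confidence while leaving the predicted intent unchanged.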
Add a GitHub Actions workflow that runs ruff and the unittest suite on the base deps only (no torch/setfit), plus two stdlib-unittest files: data validity (every training and holdout row is valid JSON with a known intent, no duplicate training texts, zero overlap between training and the holdout) and augmenter checks (deterministic, label-preserving, no holdout leakage, no chaining of integration/find_action). Wrap long lines and rename an ambiguous variable so ruff passes clean.
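The data-validity checks described above can be sketched with the standard library alone. The row fields ("text", "intent") and the two-intent set are assumptions for illustration; the actual test files and schema are not shown in this PR.

```python
import json
from collections import Counter

# Hypothetical subset of the intent taxonomy.
KNOWN_INTENTS = {"find_action", "integration"}


def parse_rows(lines):
    """Parse JSONL lines, raising ValueError on bad JSON or unknown intents."""
    rows = []
    for number, line in enumerate(lines, start=1):
        row = json.loads(line)  # json.JSONDecodeError is a ValueError
        if row.get("intent") not in KNOWN_INTENTS:
            raise ValueError(f"line {number}: unknown intent {row.get('intent')!r}")
        rows.append(row)
    return rows


def find_problems(train_lines, holdout_lines):
    """Return descriptions of duplicate training texts and holdout leakage."""
    train = [row["text"] for row in parse_rows(train_lines)]
    holdout = {row["text"] for row in parse_rows(holdout_lines)}
    problems = []
    duplicates = sorted(t for t, count in Counter(train).items() if count > 1)
    if duplicates:
        problems.append(f"duplicate training texts: {duplicates}")
    leaked = sorted(set(train) & holdout)
    if leaked:
        problems.append(f"training/holdout overlap: {leaked}")
    return problems
```

Because it needs only json and collections, a check like this can run in CI on the base deps without torch or setfit.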
Add 76 novel disfluent rows so every intent has 40 eval examples, tightening the per-class confidence intervals (find_action was 23). Verified zero duplicates and zero overlap with training. Current model scores 0.89 on it.
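To see why going from 23 to 40 examples tightens a per-class interval, a Wilson score interval can be computed for both sample sizes. The success counts below (20 of 23 and 35 of 40) are illustrative values chosen to have similar accuracy; they are not measured results from this PR, and the Wilson interval is one reasonable choice rather than the method the report necessarily uses.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Illustrative counts only: similar accuracy at the old and new class sizes.
low_23, high_23 = wilson_interval(20, 23)
low_40, high_40 = wilson_interval(35, 40)
width_23 = high_23 - low_23
width_40 = high_40 - low_40
```

At comparable accuracy, the interval for 40 examples is narrower than the one for 23, which is the effect the extra rows are meant to buy.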
Add docs/model.md: the end-to-end build (data, preprocess, SetFit train, calibration, eval, ONNX export) with the honest holdout numbers and the tool-routing exploration that did not pan out (retrieval failed, classify-into-tools hit 0.63 out-of-distribution, not shipped). Fix the data-validity tests: data/*.jsonl is gitignored, so skip the data-pool checks when it is absent (CI) instead of failing. The holdout check still runs. Add evals/tool_routing_ood.jsonl, the out-of-distribution tool probe.
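One way to skip the data-pool checks when the gitignored files are absent is unittest's skipUnless decorator. The sketch below assumes a hypothetical path (data/pool.jsonl) and a hypothetical test name; the real test module may be organized differently.

```python
import unittest
from pathlib import Path

# Hypothetical location of the gitignored training pool.
POOL_PATH = Path("data/pool.jsonl")


class DataPoolTests(unittest.TestCase):
    @unittest.skipUnless(
        POOL_PATH.exists(), "data pool is gitignored and absent in this checkout"
    )
    def test_pool_is_not_empty(self):
        # Only runs where the pool has been generated locally.
        with POOL_PATH.open(encoding="utf-8") as handle:
            self.assertTrue(any(line.strip() for line in handle))


def run_suite():
    """Run the test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DataPoolTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

A skipped test is reported as skipped rather than failed, so CI stays green without the data while a local checkout with the pool still exercises the check.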
evaluate.py and the new ingest labeler import SYSTEM and the intent schema from routelet.teacher, so the eval and the labeler can't drift.
pull_samples.py compacts the proxy's R2 objects into one JSONL. ingest_samples.py labels them (the free Claude-fallback label or the offline teacher), drops pool and eval-set duplicates, and writes data/collected.jsonl.
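The duplicate-dropping step in the ingest can be sketched as a single pass over the candidates. The "text" field and the case- and whitespace-insensitive comparison key are assumptions; the actual script may compare texts exactly or after its own preprocessing.

```python
def dedup_key(text):
    """Hypothetical comparison key: lowercase with collapsed whitespace."""
    return " ".join(text.lower().split())


def drop_duplicates(candidates, pool_texts, eval_texts):
    """Keep candidate rows whose text is not in the pool, the eval sets,
    or an earlier candidate."""
    seen = {dedup_key(t) for t in pool_texts}
    seen |= {dedup_key(t) for t in eval_texts}
    kept = []
    for row in candidates:
        key = dedup_key(row["text"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept
```

Seeding the seen set with the eval texts is what keeps newly collected samples from leaking holdout rows into training data.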
report.py scores TF-IDF, the shipped SetFit model, and Claude Haiku on the frozen 200-row holdout and writes the model-comparison figure plus metrics.json. refresh_haiku_baseline.py re-runs the paid Haiku eval and caches the result so regenerating the report does not re-spend.
Remove the unused data.split(). Move the web-server deps (fastapi, uvicorn, pydantic) to an optional 'serve' extra since nothing imports them yet. Move the tool-routing prototypes and their OOD eval into experiments/, excluded from ruff. Rewrite the README file map and fix train.py's stale docstring.
Why
main still holds only the project scaffold. Everything that makes routelet an actual on-device intent router was built on setfit-onnx-export: the trainer, the ONNX export path, the eval sets, the labeled data, and the distillation data loop. The branch is eight commits ahead and main is a clean ancestor, so this graduates that work onto main in one fast-forward.
What
preprocess shared between train and inference, int8-friendly graph).
an out-of-distribution tool-routing eval.
was produced.
routelet.teacher module: the taxonomy prompt and intent schema now live in one place, imported by both the eval baseline and the new labeler so they can't drift.
Scripts/pull_samples.py compacts the proxy's R2 sample objects into one JSONL, and Scripts/ingest_samples.py labels them (the free in-turn Claude label, or the offline teacher for the rest), drops pool and eval-set duplicates, and writes data/collected.jsonl.
How
This is a long-lived feature branch catching main up, not a single tidy concern, so the commits are kept atomic and a rebase-merge preserves the history for git bisect. The pipeline scripts depend on boto3 and anthropic, which are installed into the venv as tooling rather than declared in pyproject.toml. Raw pulled samples land in data/raw/, which is gitignored.
Test plan
pytest: 13 passed (2397 subtests).
ruff check clean on the changed files.
ingest_samples.py --dry-run exercised on fixtures: client-label, teacher routing, dedup, eval-leak, and bad-label paths all behave.