
Add a reject class for OOD detection #6

Merged
danielbusnz merged 2 commits into main from reject-class-ood
May 29, 2026

@danielbusnz

Owner

Fixes the overconfidence problem the report surfaced: routelet labeled gibberish as a real intent at ~98% confidence, so the confidence gate caught almost no OOD inputs. A sixth class, none, trained on generated OOD data (Scripts/gen_ood.py), lets the model flag junk directly. Results on the hand-written probes: 80% of OOD rejected vs 17% for the confidence gate, 0 false rejects on the holdout, and holdout accuracy unchanged at 0.925. The teacher and eval never emit none; the class is learned only from generated data. The report now shows the reject-class vs confidence-gate comparison. Paired with the aegis wiring (Intent::None -> defer to Claude).
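The difference between the two deferral rules can be sketched as follows. This is a minimal illustration, not routelet's code: the label names, the 0.9 threshold, and the probability vectors are all assumptions made up for the example.

```python
from typing import Sequence

# Hypothetical label set: five real intents plus the new "none" class.
LABELS = ["intent_a", "intent_b", "intent_c", "intent_d", "intent_e", "none"]
NONE_INDEX = LABELS.index("none")


def confidence_gate_defers(probs: Sequence[float], threshold: float = 0.9) -> bool:
    """Old rule: defer when the top probability is below a threshold (assumed 0.9)."""
    return max(probs) < threshold


def reject_class_defers(probs: Sequence[float], none_index: int = NONE_INDEX) -> bool:
    """New rule: defer when the most likely class is "none"."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return top == none_index


# Illustrative numbers only. A five-class model that is overconfident on
# gibberish sails past the gate, so nothing is deferred.
overconfident_on_junk = [0.98, 0.005, 0.005, 0.005, 0.005]
gate_result = confidence_gate_defers(overconfident_on_junk)

# A six-class model that puts most of its mass on "none" defers directly.
junk_with_none = [0.05, 0.02, 0.01, 0.01, 0.01, 0.90]
reject_result = reject_class_defers(junk_with_none)
```

The gate depends on the model being uncertain about junk, which is exactly what the ~98% confidence on gibberish rules out. The reject class only needs the model to recognize junk as its own category.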

A sixth "none" label, trained on generated out-of-distribution data
(Scripts/gen_ood.py), lets routelet flag gibberish instead of confidently
mislabeling it. On the hand-written probes it rejects 80% of OOD vs 17% for
the old confidence gate, with zero false rejects on the holdout and holdout
accuracy unchanged at 0.925. The teacher never emits none.

report.py now contrasts the confidence gate and the reject class on two
measures: OOD inputs caught and real commands wrongly deferred. Drops the
confidence histogram and deferral-tradeoff figures, which described the old
gate mechanism.
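The two measures in that comparison can be computed from per-example deferral flags, as in the sketch below. It is not taken from report.py; the function names and the toy flags are hypothetical and do not reproduce the reported numbers.

```python
from typing import Dict, Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of examples that were deferred."""
    if not flags:
        raise ValueError("need at least one example")
    return sum(flags) / len(flags)


def compare(ood_deferred: Sequence[bool], real_deferred: Sequence[bool]) -> Dict[str, float]:
    """Summarize one mechanism: higher ood_caught is better, lower real_wrongly_deferred is better."""
    return {
        "ood_caught": rate(ood_deferred),
        "real_wrongly_deferred": rate(real_deferred),
    }


# Toy flags: True means the mechanism deferred that input.
summary = compare(
    ood_deferred=[True, True, False, True],  # deferral decisions on OOD probes
    real_deferred=[False, False],            # deferral decisions on holdout commands
)
```

Running `compare` once per mechanism on the same probes and holdout yields the side-by-side figures the report describes.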
@danielbusnz danielbusnz merged commit c3b9a36 into main May 29, 2026
3 checks passed
@danielbusnz danielbusnz deleted the reject-class-ood branch May 29, 2026 18:31