Add a reject class for OOD detection #6
Merged
A sixth "none" label, trained on generated out-of-distribution data (Scripts/gen_ood.py), lets routelet flag gibberish instead of confidently mislabeling it. On the hand-written probes it rejects 80% of OOD vs 17% for the old confidence gate, with zero false rejects on the holdout and holdout accuracy unchanged at 0.925. The teacher never emits none.
report.py now contrasts the confidence gate and the reject class on OOD caught vs real commands wrongly deferred. Drops the confidence histogram and deferral-tradeoff figures, which described the old gate mechanism.
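The two quantities in that comparison can be sketched as simple rates. This is a minimal illustration, not the code in report.py: the names `rate` and `compare` and the toy boolean inputs are assumptions made for the example, and the values are not the PR's results.

```python
from typing import Dict, Iterable


def rate(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def compare(ood_deferred: Iterable[bool],
            holdout_deferred: Iterable[bool]) -> Dict[str, float]:
    """Summarize one mechanism (hypothetical helper).

    ood_deferred:     one flag per OOD probe, True if the mechanism deferred it.
    holdout_deferred: one flag per real command, True if it was wrongly deferred.
    """
    return {
        "ood_caught": rate(ood_deferred),
        "wrongly_deferred": rate(holdout_deferred),
    }


# Toy inputs only: 3 of 4 OOD probes deferred, 0 of 4 real commands deferred.
summary = compare([True, True, True, False], [False, False, False, False])
print(summary)
```

A good mechanism pushes `ood_caught` up while keeping `wrongly_deferred` near zero; running the same helper once per mechanism gives the side-by-side view.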
Fixes the overconfidence problem the report surfaced: routelet labeled gibberish as a real intent at ~98% confidence, so the confidence gate caught almost no OOD. A sixth "none" class, trained on generated OOD (Scripts/gen_ood.py), lets the model flag junk directly. Results on the hand-written probes: 80% OOD rejected vs 17% for the confidence gate, 0 false rejects on the holdout, holdout accuracy unchanged at 0.925. The teacher/eval never emit none; it's learned only from generated data. Report now shows the reject-class vs confidence-gate comparison. Paired with the aegis wiring (Intent::None -> defer to Claude).
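For illustration, the difference between the two deferral rules might look like the sketch below. It is not routelet's implementation: the function names, the 0.9 threshold, the placeholder intent names, and the probability vectors are all assumptions; only the "none" label comes from this PR.

```python
from typing import Sequence

# Placeholder label order (assumption); only "none" is named in the PR.
LABELS = ["intent_a", "intent_b", "intent_c", "intent_d", "intent_e", "none"]
NONE_INDEX = LABELS.index("none")


def gate_defers(probs: Sequence[float], threshold: float = 0.9) -> bool:
    """Confidence gate: defer when the top probability falls below the threshold."""
    return max(probs) < threshold


def reject_defers(probs: Sequence[float], none_index: int = NONE_INDEX) -> bool:
    """Reject class: defer when the predicted (argmax) label is "none"."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best == none_index


# An overconfident five-class model on gibberish: the gate sees ~98% and lets it through.
five_class_on_junk = [0.98, 0.005, 0.005, 0.005, 0.005]
print(gate_defers(five_class_on_junk))

# A six-class model that has learned "none" can flag the same kind of input directly.
six_class_on_junk = [0.05, 0.02, 0.01, 0.01, 0.01, 0.90]
print(reject_defers(six_class_on_junk))
```

The gate depends on the model being uncertain about junk, which is exactly what failed here; the reject class turns "this is junk" into a prediction the model is trained to make.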