Post-1.0. Not for the 1.0 launch. From the WWDC 2026 review — see docs/wwdc-2026-macos-27-opportunities.md (PR #911), opportunity #6.
Why
KeyPath ships an LLM feature (AnthropicConfigRepairService) with no systematic quality measurement. WWDC26's Evaluations framework is built for validating AI features beyond unit tests.
A corpus pairing broken .kbd configs with their expected repairs becomes:
- a regression gate for prompt/model changes in the repair feature, and
- the prerequisite for confidently routing "easy" repairs to the on-device model in the Foundation Models adoption issue — we can't tier routing without measuring per-model quality.
Plan
- Collect a corpus of real broken Kanata configs (from docs/bugs/, support reports, and synthesized syntax errors) with known-good repairs.
- Wire it into the Evaluations framework once we build against the macOS 27 SDK; score the current Claude-based path as the baseline.
- Score the on-device model against the same corpus to set the routing threshold.
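Until the macOS 27 SDK is available, the scoring step in the plan above can be prototyped independently of any framework. The sketch below is a minimal illustration in Python, not the Evaluations framework API and not KeyPath code: `RepairCase`, `pass_rate`, `meets_routing_threshold`, the whitespace-normalized exact-match scoring, and the `max_drop` tolerance are all hypothetical names and choices made for the example. The `close_parens` function is a toy stand-in for a real repair path such as AnthropicConfigRepairService.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class RepairCase:
    """One corpus entry: a broken .kbd config and its known-good repair."""
    name: str
    broken: str
    expected: str


def normalize(config: str) -> str:
    # Collapse whitespace so formatting-only differences are not failures.
    return " ".join(config.split())


def pass_rate(repair: Callable[[str], str], corpus: Iterable[RepairCase]) -> float:
    """Fraction of cases where the repair matches the expected config."""
    cases = list(corpus)
    if not cases:
        raise ValueError("corpus is empty")
    passed = sum(
        normalize(repair(case.broken)) == normalize(case.expected)
        for case in cases
    )
    return passed / len(cases)


def meets_routing_threshold(candidate: float, baseline: float,
                            max_drop: float = 0.05) -> bool:
    # Assumption: a candidate model qualifies for routing if it scores
    # within max_drop of the baseline. The 0.05 default is illustrative.
    return candidate >= baseline - max_drop


def close_parens(config: str) -> str:
    # Toy repair: append any missing closing parentheses.
    missing = config.count("(") - config.count(")")
    return config + ")" * max(missing, 0)


CORPUS = [
    RepairCase("unclosed-defsrc", "(defsrc a b c", "(defsrc a b c)"),
    RepairCase("misspelled-defsrc", "(defsr a b c)", "(defsrc a b c)"),
]

toy_rate = pass_rate(close_parens, CORPUS)
print(f"toy repair pass rate: {toy_rate:.2f}")
```

Exact match is the strictest possible scorer and will undercount repairs that are valid but differ from the reference; a real gate would more likely check that the repaired config parses and preserves the intended mappings. Running the same `pass_rate` call against each repair path gives the per-model numbers that the routing decision needs.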
https://claude.ai/code/session_01Nsiqm39oCwHkbrHytefnGM