Build an eval suite for AI config repair using Apple's Evaluations framework #917

**Post-1.0. Not for the 1.0 launch.** From the WWDC 2026 review — see `docs/wwdc-2026-macos-27-opportunities.md` (PR #911), opportunity #6.

## Why

KeyPath ships an LLM feature (`AnthropicConfigRepairService`) with no systematic quality measurement. WWDC26's **Evaluations framework** is built for validating AI features beyond unit tests.

A corpus that pairs broken `.kbd` configs with their expected repairs becomes:
- a regression gate for prompt/model changes in the repair feature, and
- the **prerequisite for confidently routing "easy" repairs to the on-device model** in the Foundation Models adoption issue — we can't tier routing without measuring per-model quality.
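Both uses reduce to comparing pass rates. As a rough, framework-agnostic sketch (it does not use the Evaluations framework API, and the names `TierScore`, `regression_gate`, `tiers_for_on_device`, the tier labels, and the default tolerance and threshold values are all hypothetical), the gate and the routing decision might look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierScore:
    """Pass counts for one difficulty tier of the corpus (tier labels are hypothetical)."""
    tier: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # An empty tier counts as 0.0 so it can never qualify for routing.
        return self.passed / self.total if self.total else 0.0


def regression_gate(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """True when the candidate pass rate is no more than `tolerance` below the baseline."""
    return candidate >= baseline - tolerance


def tiers_for_on_device(scores: list[TierScore], threshold: float = 0.9) -> list[str]:
    """Tiers where the on-device model's measured pass rate meets the routing threshold."""
    return [s.tier for s in scores if s.pass_rate >= threshold]
```

The tolerance and threshold are placeholders; real values would come out of the baseline measurements described in the plan.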

## Plan

1. Collect a corpus of real broken Kanata configs (from `docs/bugs/`, support reports, and synthesized syntax errors) with known-good repairs.
2. Wire the corpus into the Evaluations framework once the project builds against the macOS 27 SDK, then score the current Claude-based path as the baseline.
3. Score the on-device model against the same corpus to set the routing threshold.
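Until the macOS 27 SDK is available, the corpus shape and scoring loop can be prototyped independently. The sketch below is only an illustration under assumptions: `RepairCase`, `normalize`, `score`, and `balance_parens` are hypothetical names, exact match after whitespace normalization is an assumed pass criterion, and `balance_parens` is a toy stand-in for a repair path, not `AnthropicConfigRepairService` or the on-device model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RepairCase:
    """One corpus entry: a broken config and its known-good repair."""
    name: str
    broken: str
    expected: str


def normalize(config: str) -> str:
    """Collapse whitespace so formatting-only differences do not count as failures."""
    return " ".join(config.split())


def score(repair: Callable[[str], str], corpus: list[RepairCase]) -> float:
    """Fraction of cases where the repaired config matches the expected one."""
    if not corpus:
        return 0.0
    passed = sum(normalize(repair(c.broken)) == normalize(c.expected) for c in corpus)
    return passed / len(corpus)


def balance_parens(config: str) -> str:
    """Toy stand-in repair: append any missing closing parentheses."""
    missing = config.count("(") - config.count(")")
    return config + ")" * max(missing, 0)


# Tiny illustrative corpus; real cases would come from the sources in step 1.
CORPUS = [
    RepairCase("unclosed-defsrc", "(defsrc caps", "(defsrc caps)"),
    RepairCase("unclosed-deflayer", "(deflayer base esc", "(deflayer base esc)"),
    RepairCase("misspelled-key", "(defsrc cpas)", "(defsrc caps)"),
]

if __name__ == "__main__":
    print(f"toy repair pass rate: {score(balance_parens, CORPUS):.2f}")
```

Running the same `score` call once per repair path (Claude-based, then on-device) yields the two numbers that steps 2 and 3 compare. Exact match is deliberately strict; a real harness might instead check that the repaired config parses.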

https://claude.ai/code/session_01Nsiqm39oCwHkbrHytefnGM
