Build an eval suite for AI config repair using Apple's Evaluations framework #917

Description

@malpern

Post-1.0. Not for the 1.0 launch. From the WWDC 2026 review — see docs/wwdc-2026-macos-27-opportunities.md (PR #911), opportunity #6.

Why

KeyPath ships an LLM feature (AnthropicConfigRepairService) with no systematic quality measurement. WWDC26's Evaluations framework is built for validating AI features beyond unit tests.

A corpus of broken .kbd configs, each paired with its expected repair, becomes:

  • a regression gate for prompt/model changes in the repair feature, and
  • the prerequisite for confidently routing "easy" repairs to the on-device model in the Foundation Models adoption issue — we can't tier routing without measuring per-model quality.
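A minimal sketch of what the regression gate could look like, independent of the Evaluations framework (whose API is not assumed here). The function name `regression_gate` and the default `tolerance` of 0.02 are hypothetical. The inputs are pass rates over the corpus, measured before and after a prompt or model change.

```python
def regression_gate(baseline_rate: float, candidate_rate: float,
                    tolerance: float = 0.02) -> bool:
    """Return True when the candidate is at most `tolerance` below the baseline.

    Both rates are fractions in [0, 1]. The tolerance is a hypothetical
    default, not a value taken from the project.
    """
    for rate in (baseline_rate, candidate_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"pass rate out of range: {rate}")
    return candidate_rate >= baseline_rate - tolerance


# A small drop stays within tolerance; a large drop fails the gate.
small_drop_ok = regression_gate(0.90, 0.89)
large_drop_ok = regression_gate(0.90, 0.80)
```

In CI, a failing gate would block the prompt or model change until the drop is explained or the baseline is deliberately re-recorded.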

Plan

  1. Collect a corpus of real broken Kanata configs (from docs/bugs/, support reports, and synthesized syntax errors) with known-good repairs.
  2. Wire it into the Evaluations framework once KeyPath builds against the macOS 27 SDK; score the current Claude-based path as the baseline.
  3. Score the on-device model against the same corpus to set the routing threshold.
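The three steps above can be prototyped before the macOS 27 SDK is available. The sketch below is a framework-independent illustration, not the Evaluations framework API. `RepairCase`, `pass_rate`, both stub repair functions, the three synthesized cases, and the `ROUTING_THRESHOLD` value are all hypothetical. Scoring uses exact match after whitespace normalization, which is a crude stand-in; a real scorer would more likely check that the repaired config parses in Kanata.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RepairCase:
    """One corpus entry: a broken config and its known-good repair."""
    name: str
    broken: str
    expected: str


def normalize(config: str) -> str:
    """Collapse whitespace so formatting differences do not count as failures."""
    return " ".join(config.split())


def pass_rate(cases: list[RepairCase], repair: Callable[[str], str]) -> float:
    """Fraction of cases where the repair output matches the expected config."""
    if not cases:
        raise ValueError("corpus is empty")
    passed = sum(
        1 for case in cases
        if normalize(repair(case.broken)) == normalize(case.expected)
    )
    return passed / len(cases)


# Synthesized syntax errors for illustration only (not taken from docs/bugs/).
CORPUS = [
    RepairCase("missing-paren", "(defsrc a b c", "(defsrc a b c)"),
    RepairCase("extra-paren", "(defsrc a b c))", "(defsrc a b c)"),
    RepairCase("typo-keyword", "(defsrcc a b c)", "(defsrc a b c)"),
]


def stub_cloud_repair(broken: str) -> str:
    """Stand-in for the Claude-based path: fixes the keyword typo and parens."""
    fixed = broken.replace("defsrcc", "defsrc").rstrip(")")
    return fixed + ")"


def stub_on_device_repair(broken: str) -> str:
    """Stand-in for the on-device model: only adds a missing closing paren."""
    return broken if broken.endswith(")") else broken + ")"


ROUTING_THRESHOLD = 0.9  # hypothetical; the real value would come out of step 3

baseline = pass_rate(CORPUS, stub_cloud_repair)       # step 2: baseline score
on_device = pass_rate(CORPUS, stub_on_device_repair)  # step 3: same corpus
route_on_device = on_device >= ROUTING_THRESHOLD
```

Tagging each case with a difficulty or error class would let the same harness report per-class pass rates, which is what deciding which repairs count as "easy" would need.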

https://claude.ai/code/session_01Nsiqm39oCwHkbrHytefnGM
