
v0.18.0 — Agent Eval Benchmark [OSS]

Open
Due by April 21, 2026
Last updated Apr 14, 2026

An agentic eval framework that validates GAIA agent quality, using Claude Code as both user simulator and judge. It is the foundation for the eval→fine-tuning quality flywheel.
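
As a rough illustration of the simulate-then-judge shape, here is a minimal Python sketch; the names `run_episode`, `simulate_user`, and `judge`, and the message format, are assumptions for illustration, not the harness's actual API:

```python
# Hypothetical sketch of one eval episode: a simulated user converses with the
# agent under test, then a judge grades the full transcript against the task.
from dataclasses import dataclass, field
from typing import Callable

Message = dict[str, str]  # e.g. {"role": "user", "content": "..."}

@dataclass
class EvalResult:
    passed: bool
    transcript: list[Message] = field(default_factory=list)
    judge_notes: str = ""

def run_episode(
    task: str,
    agent: Callable[[list[Message]], Message],                       # system under test
    simulate_user: Callable[[str, list[Message]], Message | None],   # user-simulator role
    judge: Callable[[str, list[Message]], tuple[bool, str]],         # judge role
    max_turns: int = 8,
) -> EvalResult:
    """Drive one episode: alternate simulated-user and agent turns, then judge."""
    transcript: list[Message] = []
    for _ in range(max_turns):
        user_msg = simulate_user(task, transcript)
        if user_msg is None:          # simulator decides the task is finished
            break
        transcript.append(user_msg)
        transcript.append(agent(transcript))
    passed, notes = judge(task, transcript)
    return EvalResult(passed=passed, transcript=transcript, judge_notes=notes)
```

In practice the `simulate_user` and `judge` callables would each be backed by a Claude Code session, with the GAIA agent plugged in as the `agent` callable.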

What Ships

Automated eval harness, ground truth test suite, quality metrics dashboard.

Use Cases Enabled

  • Validated agent reliability — every release tested against real workflows before shipping
  • Quality regression detection — catch tool-calling failures, hallucinations, format errors
  • Training data generation — eval results feed directly into v0.19.0 fine-tuning pipeline

Value Proposition

"Agents you can trust — every release is tested against real workflows. When something breaks, it's caught before it reaches you."

The Quality Flywheel

Eval runs → identifies failures → failures become training data (v0.19.0)
→ GRPO fine-tuning improves model → re-eval confirms improvement → repeat
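
A minimal sketch of the eval→training-data handoff, reusing the `EvalResult` dataclass from the harness sketch above; the JSONL record shape and the `collect_failures` name are illustrative assumptions, not the v0.19.0 pipeline's schema:

```python
# Illustrative: write failed episodes as JSONL records for the fine-tuning pipeline.
import json

def collect_failures(results: list[EvalResult], path: str) -> int:
    """Persist failed eval episodes; returns the number of records written."""
    count = 0
    with open(path, "w") as f:
        for r in results:
            if r.passed:
                continue
            record = {
                "messages": r.transcript,      # full conversation to learn from
                "judge_notes": r.judge_notes,  # why the judge failed it
            }
            f.write(json.dumps(record) + "\n")
            count += 1
    return count
```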

Key Deliverables

  • Claude Code-based eval harness (user simulator + judge)
  • Ground truth test suite for core agent capabilities
  • Pass/fail metrics with tool trace analysis (sketched below)
  • See docs/plans/agent-ui-eval-benchmark.md for the full spec
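
A hypothetical shape for a ground-truth case and its tool-trace check; the field names (`expected_tools`, `forbidden_tools`) are assumptions for illustration, and the authoritative format lives in docs/plans/agent-ui-eval-benchmark.md:

```python
# Illustrative ground-truth case format and a pass/fail check over the tool trace.
from dataclasses import dataclass, field

@dataclass
class GroundTruthCase:
    task: str                          # prompt handed to the simulated user
    expected_tools: list[str]          # tools the agent must call
    forbidden_tools: list[str] = field(default_factory=list)

def check_tool_trace(case: GroundTruthCase, tool_calls: list[str]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); reasons list every tool-trace violation."""
    reasons: list[str] = []
    for tool in case.expected_tools:
        if tool not in tool_calls:
            reasons.append(f"missing expected tool call: {tool}")
    for tool in tool_calls:
        if tool in case.forbidden_tools:
            reasons.append(f"forbidden tool call: {tool}")
    return (not reasons, reasons)

# Example: a calendar task that must search before creating an event.
case = GroundTruthCase(
    task="Schedule a meeting with the design team next Tuesday",
    expected_tools=["calendar.search", "calendar.create_event"],
)
ok, reasons = check_tool_trace(case, ["calendar.create_event"])
# ok == False, reasons == ["missing expected tool call: calendar.search"]
```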
