
v0.18.0 — Agent Eval Benchmark [OSS]

Open
Due by April 21, 2026
Last updated Apr 14, 2026

An agentic eval framework that validates GAIA agent quality, using Claude Code as both user simulator and judge. It is the foundation for the eval→fine-tuning quality flywheel.
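
As a rough illustration of the simulate-then-judge shape, here is a minimal Python sketch; the names `run_episode`, `simulate_user`, and `judge`, and the message format, are assumptions for illustration, not the harness's actual API:

```python
# Hypothetical sketch of one eval episode: a simulated user converses with the
# agent under test, then a judge grades the full transcript against the task.
from dataclasses import dataclass, field
from typing import Callable

Message = dict[str, str]  # e.g. {"role": "user", "content": "..."}

@dataclass
class EvalResult:
    passed: bool
    transcript: list[Message] = field(default_factory=list)
    judge_notes: str = ""

def run_episode(
    task: str,
    agent: Callable[[list[Message]], Message],                       # system under test
    simulate_user: Callable[[str, list[Message]], Message | None],   # user-simulator role
    judge: Callable[[str, list[Message]], tuple[bool, str]],         # judge role
    max_turns: int = 8,
) -> EvalResult:
    """Drive one episode: alternate simulated-user and agent turns, then judge."""
    transcript: list[Message] = []
    for _ in range(max_turns):
        user_msg = simulate_user(task, transcript)
        if user_msg is None:          # simulator decides the task is finished
            break
        transcript.append(user_msg)
        transcript.append(agent(transcript))
    passed, notes = judge(task, transcript)
    return EvalResult(passed=passed, transcript=transcript, judge_notes=notes)
```

In practice the `simulate_user` and `judge` callables would each be backed by a Claude Code session, with the GAIA agent plugged in as the `agent` callable.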

What Ships

Automated eval harness, ground truth test suite, quality metrics dashboard.

Use Cases Enabled

  • Validated agent reliability — every release tested against real workflows before shipping
  • Quality regression detection — catch tool-calling failures, hallucinations, format errors
  • Training data generation — eval results feed directly into v0.19.0 fine-tuning pipeline

Value Proposition

"Agents you can trust — every release is tested against real workflows. When something breaks, it's caught before it reaches you."

The Quality Flywheel

Eval runs → identifies failures → failures become training data (v0.19.0)
→ GRPO fine-tuning improves model → re-eval confirms improvement → repeat
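
A minimal sketch of the eval→training-data handoff, reusing the `EvalResult` dataclass from the harness sketch above; the JSONL record shape and the `collect_failures` name are illustrative assumptions, not the v0.19.0 pipeline's schema:

```python
# Illustrative: write failed episodes as JSONL records for the fine-tuning pipeline.
import json

def collect_failures(results: list[EvalResult], path: str) -> int:
    """Persist failed eval episodes; returns the number of records written."""
    count = 0
    with open(path, "w") as f:
        for r in results:
            if r.passed:
                continue
            record = {
                "messages": r.transcript,      # full conversation to learn from
                "judge_notes": r.judge_notes,  # why the judge failed it
            }
            f.write(json.dumps(record) + "\n")
            count += 1
    return count
```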

Key Deliverables

  • Claude Code-based eval harness (user simulator + judge)
  • Ground truth test suite for core agent capabilities
  • Pass/fail metrics with tool trace analysis (sketched below)
  • See docs/plans/agent-ui-eval-benchmark.md for the full spec
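
A hypothetical shape for a ground-truth case and its tool-trace check; the field names (`expected_tools`, `forbidden_tools`) are assumptions for illustration, and the authoritative format lives in docs/plans/agent-ui-eval-benchmark.md:

```python
# Illustrative ground-truth case format and a pass/fail check over the tool trace.
from dataclasses import dataclass, field

@dataclass
class GroundTruthCase:
    task: str                          # prompt handed to the simulated user
    expected_tools: list[str]          # tools the agent must call
    forbidden_tools: list[str] = field(default_factory=list)

def check_tool_trace(case: GroundTruthCase, tool_calls: list[str]) -> tuple[bool, list[str]]:
    """Return (passed, reasons); reasons list every tool-trace violation."""
    reasons: list[str] = []
    for tool in case.expected_tools:
        if tool not in tool_calls:
            reasons.append(f"missing expected tool call: {tool}")
    for tool in tool_calls:
        if tool in case.forbidden_tools:
            reasons.append(f"forbidden tool call: {tool}")
    return (not reasons, reasons)

# Example: a calendar task that must search before creating an event.
case = GroundTruthCase(
    task="Schedule a meeting with the design team next Tuesday",
    expected_tools=["calendar.search", "calendar.create_event"],
)
ok, reasons = check_tool_trace(case, ["calendar.create_event"])
# ok == False, reasons == ["missing expected tool call: calendar.search"]
```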
