
feat(eval): Agent Eval Toolchain — v0.18.0 milestone#779

Open
kovtcharov-amd wants to merge 1 commit into main from kalin/agent-eval-toolchain

Conversation

@kovtcharov-amd
Collaborator

Summary

Ships the complete Agent Eval Toolchain for the v0.18.0 milestone, addressing all open issues in the Agent Eval Benchmark milestone.

What Changed

Removed (legacy eval framework)

  • src/gaia/eval/eval.py (3,336 lines), groundtruth.py, batch_experiment.py, transcript_generator.py, email_generator.py
  • src/gaia/eval/fix_code_testbench/, configs/, webapp/, scripts/
  • CLI commands: gaia groundtruth, gaia report, gaia visualize, gaia create-template, gaia batch-experiment, gaia synthetic-data

Added/Modified

  • runner.py — custom scenario/corpus dirs, tag filtering, custom personas
  • scorecard.py — JUnit XML output (write_junit_xml())
  • cli.py — new flags (--scenario-dir, --corpus-dir, --tag, --output-format), removed legacy commands (~1,900 lines)
  • tests/test_eval.py — 27 test classes with full public API coverage
  • docs/guides/eval.mdx — Getting Started guide (341 lines)
  • docs/guides/eval-scenarios.mdx — Scenario authoring reference (645 lines)
  • docs/guides/eval-ci.mdx — CI/CD integration guide (500 lines)
  • docs/presentations/agent-eval-benchmark.html — 10-slide presentation deck
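For a sense of what the `--output-format junit` path produces, here is a minimal sketch of a JUnit XML writer. This is illustrative only: the actual `write_junit_xml()` signature and result model in `scorecard.py` are not shown in this PR, so the `results` dict shape (`name`, `passed`, `message`) is an assumption.

```python
import xml.etree.ElementTree as ET

def write_junit_sketch(results, path):
    # Illustrative only: the real write_junit_xml() in scorecard.py may
    # take different arguments. Each result here is assumed to be a dict
    # with "name", "passed", and an optional "message".
    failures = sum(1 for r in results if not r["passed"])
    suite = ET.Element(
        "testsuite",
        name="agent-eval",
        tests=str(len(results)),
        failures=str(failures),
    )
    for r in results:
        case = ET.SubElement(suite, "testcase", name=r["name"])
        if not r["passed"]:
            fail = ET.SubElement(case, "failure")
            fail.text = r.get("message", "scenario failed")
    ET.ElementTree(suite).write(path)
```

Most CI systems (GitHub Actions via third-party reporters, Jenkins, GitLab) can ingest a file in this `<testsuite>`/`<testcase>` shape directly.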

Infrastructure

  • .github/workflows/test_eval.yml — removed Node.js/webapp tests, added agent eval help test
  • .github/dependabot.yml — removed eval webapp npm monitoring
  • setup.py, MANIFEST.in — removed deleted packages/includes
  • CLAUDE.md — updated architecture references

Test plan

  • pytest tests/test_eval.py -v — all 27 test classes pass
  • gaia eval agent --help — shows all new flags
  • gaia eval agent --audit-only — runs architecture audit without LLM
  • gaia eval agent --scenario simple_factual_rag — runs single scenario (requires Agent UI backend + Claude API key)
  • No broken imports: python -c "from gaia.eval.runner import AgentEvalRunner; from gaia.eval.scorecard import build_scorecard; print('ok')"
  • Verify docs render at amd-gaia.ai/guides/eval after merge

Closes #670, #671, #672, #673, #573

🤖 Generated with Claude Code

…, docs, tests

Remove 15,879 lines of legacy eval code (eval.py, groundtruth.py,
batch_experiment.py, transcript/email generators, fix_code_testbench,
webapp, configs) and replace with the new agent eval benchmark framework.

Extensibility:
- --scenario-dir for custom scenario directories
- --corpus-dir for custom corpus with manifest merging
- --tag for scenario filtering (OR logic)
- --output-format junit for CI/CD integration (JUnit XML)
- Custom personas (any non-empty string accepted)
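The `--tag` OR logic can be sketched as follows. This is a hypothetical illustration, not the `runner.py` implementation; scenarios are assumed to be dicts carrying a `tags` list.

```python
def filter_by_tags(scenarios, tags):
    # Sketch of the --tag OR logic described above: keep a scenario if it
    # carries at least one of the requested tags. With no tags requested,
    # all scenarios pass through unfiltered.
    if not tags:
        return list(scenarios)
    wanted = set(tags)
    return [s for s in scenarios if wanted & set(s.get("tags", []))]
```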

Documentation:
- docs/guides/eval.mdx — Getting Started guide
- docs/guides/eval-scenarios.mdx — Scenario authoring reference
- docs/guides/eval-ci.mdx — CI/CD integration with GitHub Actions
- docs/reference/cli.mdx — Full eval agent CLI reference
- docs/presentations/agent-eval-benchmark.html — 10-slide HTML deck

Tests:
- 27 test classes covering public API surface
- Extensibility tests (custom dirs, tags, JUnit XML, personas)
- Scorecard, runner, audit, corpus, CLI public API tests
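As a flavor of the test style, a pytest-style class for the tag-filtering behavior might look like the sketch below. The class name and data model are hypothetical; the actual 27 classes in `tests/test_eval.py` may be structured differently.

```python
class TestTagFilteringSketch:
    # Hypothetical pytest-style test in the spirit of tests/test_eval.py;
    # actual class names and the runner API in the PR may differ.
    def test_or_logic_keeps_any_match(self):
        scenarios = [
            {"name": "a", "tags": ["rag"]},
            {"name": "b", "tags": ["tools"]},
        ]
        wanted = {"rag", "memory"}
        kept = [s["name"] for s in scenarios if wanted & set(s["tags"])]
        assert kept == ["a"]
```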

Closes #670, #671, #672, #673, #573

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added labels on Apr 14, 2026: documentation (Documentation changes), dependencies (Dependency updates), devops (DevOps/infrastructure changes), agents (Agent system changes), cli (CLI changes), eval (Evaluation framework changes), tests (Test changes), performance (Performance-critical changes)
kovtcharov added a commit that referenced this pull request Apr 17, 2026
Two-phase local-first email triage agent — MVT (~1.5d CC-assisted) for
v0.20.0, full EmailTriageAgent for v0.23.0. Covers auto-discovery, per-cohort
autonomy, speech-act classification, undo ledger, Slack as first-class output
channel, and an honest §27 catalog of research bets and unvalidated claims.

§22.4 maps outstanding PRs to prerequisite role: #606 / #517 / #495 / #622 /
#779 / #741 / #737. Landing the "minimum set" of #495 + #741 + one of #606 /
#517 M1 collapses most of the missing-infrastructure workarounds before
implementation starts.
github-merge-queue bot pushed a commit that referenced this pull request Apr 18, 2026
## Summary

Adds a two-phase spec for a local-first email triage agent that runs
inference on-device via Lemonade (Ryzen AI NPU/iGPU) — no email content
transits a cloud API. Phase **MVT** ships in ~1.5 days (CC-assisted) by
thin-wrapping existing primitives; **Phase C1** polishes UX for v0.20.0;
**Phase C2** adds scheduled triage, Agent Inbox HITL, and in-tree Gmail
MCP for v0.23.0. Slack is a first-class output channel from day one
(webhook → MCP → interactive buttons across phases).
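At the MVT stage the Slack channel is just an incoming-webhook POST. A minimal payload-builder sketch, assuming a simple `label`/`subject` shape per triaged email (the spec's actual message format is not shown in this PR):

```python
def build_slack_payload(summaries):
    # Illustrative sketch of the MVT-phase webhook output channel: a plain
    # Slack incoming-webhook payload ({"text": ...}), which would be
    # POSTed as JSON with Content-Type: application/json. Field names
    # ("label", "subject") are assumptions, not the spec's schema.
    lines = [f"[{s['label']}] {s['subject']}" for s in summaries]
    return {"text": "Email triage:\n" + "\n".join(lines)}
```

The later phases (Slack MCP, interactive buttons) would replace this plain-text payload with richer message blocks.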

## Key threads

- **MVT ships fast because ~95% of plumbing exists.** §2.5 maps every
required capability to an existing GAIA primitive (`MCPClientMixin`,
`DatabaseMixin`, `RAGSDK`, `TalkSDK`, `SummarizeAgent`, `ApiAgent`,
SSE). Why it matters: scoping the MVT as thin wrappers rather than new
plumbing is what makes the ~1.5d estimate credible.
- **§22.4 catalogs in-flight PRs as prerequisites.** Maps
[#606](#606) (memory v2),
[#517](#517) (autonomy M1/M3/M5),
[#495](#495) (security.py),
[#622](#622) (orchestrator),
[#779](#779) (eval),
[#741](#741) (vault),
[#737](#737) (Slack connector) to
which spec risks each one collapses. Why it matters: the "minimum set to
start MVT safely" is named explicitly — #495 + #741 + one of #606 / #517
M1 — so sequencing is actionable.
- **Memory-PR conflict flagged (§22.4.4).** #606 and #517 M1 overlap on
memory subsystem; §22.4.4 calls out the reconciliation as a prerequisite
decision, not a runtime surprise.
- **§27 "Known Weaknesses, Unvalidated Claims, Decision Debt"** names
the research bets (Custom AI Labels on local 4B, per-relationship voice,
auto-follow-up quality) and unvalidated claims cited in the spec (97.5%
tool-call reliability, GongRzhe archive date, etc.) so C2 isn't treated
as an engineering certainty.
- **Slack integration scoped as an output channel (§12.18).** Webhook at
MVT → Slack MCP at C1 → interactive approve/edit/reject buttons at C2.
Aligned with
[messaging-integrations-plan.mdx](https://github.com/amd/gaia/blob/main/docs/plans/messaging-integrations-plan.mdx)
(#635).

## Test plan

- [ ] Render preview of `docs/plans/email-triage-agent.mdx` via Mintlify
dev or amd-gaia.ai preview — confirm frontmatter, tables, code blocks,
and section numbering (1–28) render cleanly.
- [ ] Verify `docs/docs.json` navigation entry places the page under
*Agent UI* group next to `email-calendar-integration`.
- [ ] Cross-reference check: every `[Link](file.mdx)` target exists
(`email-calendar-integration`, `autonomy-engine`, `security-model`,
`agent-ui`, `setup-wizard`, `messaging-integrations-plan`).
- [ ] Scan §22.4 PR numbers against the current PR queue (`gh pr list
--repo amd/gaia --state open`) to confirm they're still open and the
recommended sequence is feasible.
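The cross-reference check above can be partly automated. A hedged sketch, assuming links are written relative to the docs root (the real docs tree layout and link conventions may differ):

```python
import re
from pathlib import Path

def broken_mdx_links(docs_root, page):
    # Sketch of the cross-reference check: collect [Link](target.mdx)
    # references in a page and report any target missing under docs_root.
    # Anchors (#fragment) and external URLs are not handled here.
    text = (Path(docs_root) / page).read_text(encoding="utf-8")
    targets = re.findall(r"\]\(([^)#\s]+\.mdx)\)", text)
    return [t for t in targets if not (Path(docs_root) / t).exists()]
```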


Development

Successfully merging this pull request may close these issues.

Agent Eval: Documentation for third-party users — scenario authoring, custom scoring, CI/CD integration
