feat(eval): Agent Eval Toolchain — v0.18.0 milestone #779
Open
kovtcharov-amd wants to merge 1 commit into main from
…, docs, tests

Remove 15,879 lines of legacy eval code (eval.py, groundtruth.py, batch_experiment.py, transcript/email generators, fix_code_testbench, webapp, configs) and replace with the new agent eval benchmark framework.

Extensibility:
- --scenario-dir for custom scenario directories
- --corpus-dir for custom corpus with manifest merging
- --tag for scenario filtering (OR logic)
- --output-format junit for CI/CD integration (JUnit XML)
- Custom personas (any non-empty string accepted)

Documentation:
- docs/guides/eval.mdx: Getting Started guide
- docs/guides/eval-scenarios.mdx: Scenario authoring reference
- docs/guides/eval-ci.mdx: CI/CD integration with GitHub Actions
- docs/reference/cli.mdx: Full eval agent CLI reference
- docs/presentations/agent-eval-benchmark.html: 10-slide HTML deck

Tests:
- 27 test classes covering public API surface
- Extensibility tests (custom dirs, tags, JUnit XML, personas)
- Scorecard, runner, audit, corpus, CLI public API tests

Closes #670, #671, #672, #673, #573

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
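The "OR logic" for `--tag` can be illustrated with a minimal sketch. This is not the code in `runner.py`: the `Scenario` class, the `filter_by_tags` helper, the tag values, and every scenario name other than `simple_factual_rag` are hypothetical, and treating an empty tag list as "no filtering" is an assumption. It only shows what matching on any one of the requested tags means.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    # Hypothetical stand-in for a loaded scenario; the real schema may differ.
    name: str
    tags: set[str] = field(default_factory=set)


def filter_by_tags(scenarios, wanted):
    """Keep scenarios that carry at least one of the wanted tags (OR logic).

    Assumption: an empty `wanted` list means no filtering is applied.
    """
    wanted = set(wanted)
    if not wanted:
        return list(scenarios)
    # A non-empty set intersection means at least one tag matched.
    return [s for s in scenarios if s.tags & wanted]


scenarios = [
    Scenario("simple_factual_rag", {"rag", "smoke"}),      # tags are illustrative
    Scenario("multi_turn_followup", {"conversation"}),     # hypothetical scenario
    Scenario("tool_selection", {"tools", "smoke"}),        # hypothetical scenario
]

selected = filter_by_tags(scenarios, ["rag", "tools"])
print([s.name for s in selected])  # ['simple_factual_rag', 'tool_selection']
```

With AND logic the same request would select nothing here, since no scenario carries both `rag` and `tools`; OR logic is the more forgiving choice when tags are used to pick out CI subsets.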
kovtcharov added a commit that referenced this pull request on Apr 17, 2026
Two-phase local-first email triage agent — MVT (~1.5d CC-assisted) for v0.20.0, full EmailTriageAgent for v0.23.0. Covers auto-discovery, per-cohort autonomy, speech-act classification, undo ledger, Slack as first-class output channel, and an honest §27 catalog of research bets and unvalidated claims. §22.4 maps outstanding PRs to prerequisite role: #606 / #517 / #495 / #622 / #779 / #741 / #737. Landing the "minimum set" of #495 + #741 + one of #606 / #517 M1 collapses most of the missing-infrastructure workarounds before implementation starts.
4 tasks
github-merge-queue bot pushed a commit that referenced this pull request on Apr 18, 2026
## Summary

Adds a two-phase spec for a local-first email triage agent that runs inference on-device via Lemonade (Ryzen AI NPU/iGPU), so no email content transits a cloud API. Phase **MVT** ships in ~1.5 days (CC-assisted) by thin-wrapping existing primitives; **Phase C1** polishes UX for v0.20.0; **Phase C2** adds scheduled triage, Agent Inbox HITL, and in-tree Gmail MCP for v0.23.0. Slack is a first-class output channel from day one (webhook → MCP → interactive buttons across phases).

## Key threads

- **MVT ships fast because ~95% of plumbing exists.** §2.5 maps every required capability to an existing GAIA primitive (`MCPClientMixin`, `DatabaseMixin`, `RAGSDK`, `TalkSDK`, `SummarizeAgent`, `ApiAgent`, SSE). Why it matters: scoping the MVT as thin wrappers rather than new plumbing is what makes the ~1.5d estimate credible.
- **§22.4 catalogs in-flight PRs as prerequisites.** Maps [#606](#606) (memory v2), [#517](#517) (autonomy M1/M3/M5), [#495](#495) (security.py), [#622](#622) (orchestrator), [#779](#779) (eval), [#741](#741) (vault), [#737](#737) (Slack connector) to which spec risks each one collapses. Why it matters: the "minimum set to start MVT safely" is named explicitly (#495 + #741 + one of #606 / #517 M1), so sequencing is actionable.
- **Memory-PR conflict flagged (§22.4.4).** #606 and #517 M1 overlap on memory subsystem; §22.4.4 calls out the reconciliation as a prerequisite decision, not a runtime surprise.
- **§27 "Known Weaknesses, Unvalidated Claims, Decision Debt"** names the research bets (Custom AI Labels on local 4B, per-relationship voice, auto-follow-up quality) and unvalidated claims cited in the spec (97.5% tool-call reliability, GongRzhe archive date, etc.) so C2 isn't treated as an engineering certainty.
- **Slack integration scoped as an output channel (§12.18).** Webhook at MVT → Slack MCP at C1 → interactive approve/edit/reject buttons at C2. Aligned with [messaging-integrations-plan.mdx](https://github.com/amd/gaia/blob/main/docs/plans/messaging-integrations-plan.mdx) (#635).

## Test plan

- [ ] Render preview of `docs/plans/email-triage-agent.mdx` via Mintlify dev or amd-gaia.ai preview; confirm frontmatter, tables, code blocks, and section numbering (1–28) render cleanly.
- [ ] Verify `docs/docs.json` navigation entry places the page under *Agent UI* group next to `email-calendar-integration`.
- [ ] Cross-reference check: every `[Link](file.mdx)` target exists (`email-calendar-integration`, `autonomy-engine`, `security-model`, `agent-ui`, `setup-wizard`, `messaging-integrations-plan`).
- [ ] Scan §22.4 PR numbers against the current PR queue (`gh pr list --repo amd/gaia --state open`) to confirm they're still open and the recommended sequence is feasible.
## Summary

Ships the complete Agent Eval Toolchain for the v0.18.0 milestone. This PR addresses all open issues in the Agent Eval Benchmark milestone:

- `--scenario-dir`, `--corpus-dir`, `--tag`, `--output-format junit`, and custom persona support
- … `docs/presentations/`)

## What Changed

### Removed (legacy eval framework)

- `src/gaia/eval/eval.py` (3,336 lines), `groundtruth.py`, `batch_experiment.py`, `transcript_generator.py`, `email_generator.py`
- `src/gaia/eval/fix_code_testbench/`, `configs/`, `webapp/`, `scripts/`
- `gaia groundtruth`, `gaia report`, `gaia visualize`, `gaia create-template`, `gaia batch-experiment`, `gaia synthetic-data`

### Added/Modified

- `runner.py`: custom scenario/corpus dirs, tag filtering, custom personas
- `scorecard.py`: JUnit XML output (`write_junit_xml()`)
- `cli.py`: new flags (`--scenario-dir`, `--corpus-dir`, `--tag`, `--output-format`), removed legacy commands (~1,900 lines)
- `tests/test_eval.py`: 27 test classes with full public API coverage
- `docs/guides/eval.mdx`: Getting Started guide (341 lines)
- `docs/guides/eval-scenarios.mdx`: Scenario authoring reference (645 lines)
- `docs/guides/eval-ci.mdx`: CI/CD integration guide (500 lines)
- `docs/presentations/agent-eval-benchmark.html`: 10-slide presentation deck

### Infrastructure

- `.github/workflows/test_eval.yml`: removed Node.js/webapp tests, added agent eval help test
- `.github/dependabot.yml`: removed eval webapp npm monitoring
- `setup.py`, `MANIFEST.in`: removed deleted packages/includes
- `CLAUDE.md`: updated architecture references

## Test plan

- `pytest tests/test_eval.py -v`: all 27 test classes pass
- `gaia eval agent --help`: shows all new flags
- `gaia eval agent --audit-only`: runs architecture audit without LLM
- `gaia eval agent --scenario simple_factual_rag`: runs single scenario (requires Agent UI backend + Claude API key)
- `python -c "from gaia.eval.runner import AgentEvalRunner; from gaia.eval.scorecard import build_scorecard; print('ok')"`

Closes #670, #671, #672, #673, #573
🤖 Generated with Claude Code
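For readers unfamiliar with what `--output-format junit` gives a CI system, here is a rough sketch of turning per-scenario results into JUnit-style XML using only the standard library. It is not the PR's `write_junit_xml()`, whose signature and output are not shown on this page: the `results_to_junit_xml` helper, the result dictionary keys (`scenario`, `passed`, `reason`), the suite name, and the second scenario name are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def results_to_junit_xml(results, suite_name="agent-eval"):
    """Build a JUnit-style <testsuite> string from per-scenario results.

    Hypothetical helper: each result is assumed to be a dict with the keys
    "scenario" (str), "passed" (bool) and, optionally, "reason" (str).
    """
    failures = sum(1 for r in results if not r["passed"])
    suite = ET.Element(
        "testsuite",
        {"name": suite_name, "tests": str(len(results)), "failures": str(failures)},
    )
    for r in results:
        case = ET.SubElement(
            suite, "testcase", {"classname": suite_name, "name": r["scenario"]}
        )
        if not r["passed"]:
            # CI tools typically surface the <failure> message in their test report.
            reason = r.get("reason", "scenario failed")
            failure = ET.SubElement(case, "failure", {"message": reason})
            failure.text = reason
    return ET.tostring(suite, encoding="unicode")


results = [
    {"scenario": "simple_factual_rag", "passed": True},
    {"scenario": "hypothetical_scenario", "passed": False, "reason": "score below threshold"},
]
xml_report = results_to_junit_xml(results)
print(xml_report)
```

One `<testcase>` per scenario lets a CI dashboard show eval scenarios alongside ordinary unit tests, with a `<failure>` child marking the ones that did not pass.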