
feat(eval): Agent Eval Toolchain — v0.18.0 milestone#779

Open
kovtcharov-amd wants to merge 1 commit into main from kalin/agent-eval-toolchain

Conversation

@kovtcharov-amd
Collaborator

Summary

Ships the complete Agent Eval Toolchain for the v0.18.0 milestone, addressing all open issues in the Agent Eval Benchmark milestone.

What Changed

Removed (legacy eval framework)

  • src/gaia/eval/eval.py (3,336 lines), groundtruth.py, batch_experiment.py, transcript_generator.py, email_generator.py
  • src/gaia/eval/fix_code_testbench/, configs/, webapp/, scripts/
  • CLI commands: gaia groundtruth, gaia report, gaia visualize, gaia create-template, gaia batch-experiment, gaia synthetic-data

Added/Modified

  • runner.py — custom scenario/corpus dirs, tag filtering, custom personas
  • scorecard.py — JUnit XML output (write_junit_xml())
  • cli.py — new flags (--scenario-dir, --corpus-dir, --tag, --output-format), removed legacy commands (~1,900 lines)
  • tests/test_eval.py — 27 test classes with full public API coverage
  • docs/guides/eval.mdx — Getting Started guide (341 lines)
  • docs/guides/eval-scenarios.mdx — Scenario authoring reference (645 lines)
  • docs/guides/eval-ci.mdx — CI/CD integration guide (500 lines)
  • docs/presentations/agent-eval-benchmark.html — 10-slide presentation deck
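For a sense of what the `--output-format junit` path produces, here is a minimal sketch of a JUnit XML writer. This is illustrative only: the actual `write_junit_xml()` signature and result model in `scorecard.py` are not shown in this PR, so the `results` dict shape (`name`, `passed`, `message`) is an assumption.

```python
import xml.etree.ElementTree as ET

def write_junit_sketch(results, path):
    # Illustrative only: the real write_junit_xml() in scorecard.py may
    # take different arguments. Each result here is assumed to be a dict
    # with "name", "passed", and an optional "message".
    failures = sum(1 for r in results if not r["passed"])
    suite = ET.Element(
        "testsuite",
        name="agent-eval",
        tests=str(len(results)),
        failures=str(failures),
    )
    for r in results:
        case = ET.SubElement(suite, "testcase", name=r["name"])
        if not r["passed"]:
            fail = ET.SubElement(case, "failure")
            fail.text = r.get("message", "scenario failed")
    ET.ElementTree(suite).write(path)
```

Most CI systems (GitHub Actions via third-party reporters, Jenkins, GitLab) can ingest a file in this `<testsuite>`/`<testcase>` shape directly.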

Infrastructure

  • .github/workflows/test_eval.yml — removed Node.js/webapp tests, added agent eval help test
  • .github/dependabot.yml — removed eval webapp npm monitoring
  • setup.py, MANIFEST.in — removed deleted packages/includes
  • CLAUDE.md — updated architecture references

Test plan

  • pytest tests/test_eval.py -v — all 27 test classes pass
  • gaia eval agent --help — shows all new flags
  • gaia eval agent --audit-only — runs architecture audit without LLM
  • gaia eval agent --scenario simple_factual_rag — runs single scenario (requires Agent UI backend + Claude API key)
  • No broken imports: python -c "from gaia.eval.runner import AgentEvalRunner; from gaia.eval.scorecard import build_scorecard; print('ok')"
  • Verify docs render at amd-gaia.ai/guides/eval after merge

Closes #670, #671, #672, #673, #573

🤖 Generated with Claude Code

…, docs, tests

Remove 15,879 lines of legacy eval code (eval.py, groundtruth.py,
batch_experiment.py, transcript/email generators, fix_code_testbench,
webapp, configs) and replace with the new agent eval benchmark framework.

Extensibility:
- --scenario-dir for custom scenario directories
- --corpus-dir for custom corpus with manifest merging
- --tag for scenario filtering (OR logic)
- --output-format junit for CI/CD integration (JUnit XML)
- Custom personas (any non-empty string accepted)
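The `--tag` OR logic can be sketched as follows. This is a hypothetical illustration, not the `runner.py` implementation; scenarios are assumed to be dicts carrying a `tags` list.

```python
def filter_by_tags(scenarios, tags):
    # Sketch of the --tag OR logic described above: keep a scenario if it
    # carries at least one of the requested tags. With no tags requested,
    # all scenarios pass through unfiltered.
    if not tags:
        return list(scenarios)
    wanted = set(tags)
    return [s for s in scenarios if wanted & set(s.get("tags", []))]
```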

Documentation:
- docs/guides/eval.mdx — Getting Started guide
- docs/guides/eval-scenarios.mdx — Scenario authoring reference
- docs/guides/eval-ci.mdx — CI/CD integration with GitHub Actions
- docs/reference/cli.mdx — Full eval agent CLI reference
- docs/presentations/agent-eval-benchmark.html — 10-slide HTML deck

Tests:
- 27 test classes covering public API surface
- Extensibility tests (custom dirs, tags, JUnit XML, personas)
- Scorecard, runner, audit, corpus, CLI public API tests
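As a flavor of the test style, a pytest-style class for the tag-filtering behavior might look like the sketch below. The class name and data model are hypothetical; the actual 27 classes in `tests/test_eval.py` may be structured differently.

```python
class TestTagFilteringSketch:
    # Hypothetical pytest-style test in the spirit of tests/test_eval.py;
    # actual class names and the runner API in the PR may differ.
    def test_or_logic_keeps_any_match(self):
        scenarios = [
            {"name": "a", "tags": ["rag"]},
            {"name": "b", "tags": ["tools"]},
        ]
        wanted = {"rag", "memory"}
        kept = [s["name"] for s in scenarios if wanted & set(s["tags"])]
        assert kept == ["a"]
```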

Closes #670, #671, #672, #673, #573

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added labels on Apr 14, 2026: documentation (Documentation changes), dependencies (Dependency updates), devops (DevOps/infrastructure changes), agents (Agent system changes), cli (CLI changes), eval (Evaluation framework changes), tests (Test changes), performance (Performance-critical changes)
kovtcharov added a commit that referenced this pull request Apr 17, 2026
Two-phase local-first email triage agent — MVT (~1.5d CC-assisted) for
v0.20.0, full EmailTriageAgent for v0.23.0. Covers auto-discovery, per-cohort
autonomy, speech-act classification, undo ledger, Slack as first-class output
channel, and an honest §27 catalog of research bets and unvalidated claims.

§22.4 maps outstanding PRs to prerequisite role: #606 / #517 / #495 / #622 /
#779 / #741 / #737. Landing the "minimum set" of #495 + #741 + one of #606 /
#517 M1 collapses most of the missing-infrastructure workarounds before
implementation starts.
github-merge-queue bot pushed a commit that referenced this pull request Apr 18, 2026
## Summary

Adds a two-phase spec for a local-first email triage agent that runs
inference on-device via Lemonade (Ryzen AI NPU/iGPU) — no email content
transits a cloud API. Phase **MVT** ships in ~1.5 days (CC-assisted) by
thin-wrapping existing primitives; **Phase C1** polishes UX for v0.20.0;
**Phase C2** adds scheduled triage, Agent Inbox HITL, and in-tree Gmail
MCP for v0.23.0. Slack is a first-class output channel from day one
(webhook → MCP → interactive buttons across phases).
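At the MVT stage the Slack channel is just an incoming-webhook POST. A minimal payload-builder sketch, assuming a simple `label`/`subject` shape per triaged email (the spec's actual message format is not shown in this PR):

```python
def build_slack_payload(summaries):
    # Illustrative sketch of the MVT-phase webhook output channel: a plain
    # Slack incoming-webhook payload ({"text": ...}), which would be
    # POSTed as JSON with Content-Type: application/json. Field names
    # ("label", "subject") are assumptions, not the spec's schema.
    lines = [f"[{s['label']}] {s['subject']}" for s in summaries]
    return {"text": "Email triage:\n" + "\n".join(lines)}
```

The later phases (Slack MCP, interactive buttons) would replace this plain-text payload with richer message blocks.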

## Key threads

- **MVT ships fast because ~95% of plumbing exists.** §2.5 maps every
required capability to an existing GAIA primitive (`MCPClientMixin`,
`DatabaseMixin`, `RAGSDK`, `TalkSDK`, `SummarizeAgent`, `ApiAgent`,
SSE). Why it matters: scoping the MVT as thin wrappers rather than new
plumbing is what makes the ~1.5d estimate credible.
- **§22.4 catalogs in-flight PRs as prerequisites.** Maps
[#606](#606) (memory v2),
[#517](#517) (autonomy M1/M3/M5),
[#495](#495) (security.py),
[#622](#622) (orchestrator),
[#779](#779) (eval),
[#741](#741) (vault),
[#737](#737) (Slack connector) to
which spec risks each one collapses. Why it matters: the "minimum set to
start MVT safely" is named explicitly — #495 + #741 + one of #606 / #517
M1 — so sequencing is actionable.
- **Memory-PR conflict flagged (§22.4.4).** #606 and #517 M1 overlap on
memory subsystem; §22.4.4 calls out the reconciliation as a prerequisite
decision, not a runtime surprise.
- **§27 "Known Weaknesses, Unvalidated Claims, Decision Debt"** names
the research bets (Custom AI Labels on local 4B, per-relationship voice,
auto-follow-up quality) and unvalidated claims cited in the spec (97.5%
tool-call reliability, GongRzhe archive date, etc.) so C2 isn't treated
as an engineering certainty.
- **Slack integration scoped as an output channel (§12.18).** Webhook at
MVT → Slack MCP at C1 → interactive approve/edit/reject buttons at C2.
Aligned with
[messaging-integrations-plan.mdx](https://github.com/amd/gaia/blob/main/docs/plans/messaging-integrations-plan.mdx)
(#635).

## Test plan

- [ ] Render preview of `docs/plans/email-triage-agent.mdx` via Mintlify
dev or amd-gaia.ai preview — confirm frontmatter, tables, code blocks,
and section numbering (1–28) render cleanly.
- [ ] Verify `docs/docs.json` navigation entry places the page under
*Agent UI* group next to `email-calendar-integration`.
- [ ] Cross-reference check: every `[Link](file.mdx)` target exists
(`email-calendar-integration`, `autonomy-engine`, `security-model`,
`agent-ui`, `setup-wizard`, `messaging-integrations-plan`).
- [ ] Scan §22.4 PR numbers against the current PR queue (`gh pr list
--repo amd/gaia --state open`) to confirm they're still open and the
recommended sequence is feasible.
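The cross-reference check above can be partly automated. A hedged sketch, assuming links are written relative to the docs root (the real docs tree layout and link conventions may differ):

```python
import re
from pathlib import Path

def broken_mdx_links(docs_root, page):
    # Sketch of the cross-reference check: collect [Link](target.mdx)
    # references in a page and report any target missing under docs_root.
    # Anchors (#fragment) and external URLs are not handled here.
    text = (Path(docs_root) / page).read_text(encoding="utf-8")
    targets = re.findall(r"\]\(([^)#\s]+\.mdx)\)", text)
    return [t for t in targets if not (Path(docs_root) / t).exists()]
```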


Development

Successfully merging this pull request may close these issues.

Agent Eval: Documentation for third-party users — scenario authoring, custom scoring, CI/CD integration
