Agents that run the same in evaluation and in production — closing the gap between the two.
The eval ↔ prod gap · Agents · Tiers · Parity · Adapt a new agent · Contributing
Most teams build an agent for production (a CLI, a TUI, an app) and then re-implement or approximate it to evaluate it — a different harness, a different scaffold, different tool plumbing. So the benchmark measures something other than what ships, and the numbers don't transfer.
This repo's premise is the opposite: one agent, used both ways. Every agent here runs in production (an interactive TUI, the Vercel AI SDK, a coding CLI) and as an evaluation harness on BenchFlow over ACP — with no reimplementation. What you benchmark is what you ship. That's the gap we're closing.
And not just coding agents: mini-swe is a coding harness, but the Vercel AI SDK agents are general, tool-using agent frameworks (build any agent), and BenchFlow evaluations span well beyond code. The repo is a home for agents of any kind that you want to both ship and benchmark.
Every agent reaches BenchFlow through one of three paths, by how it's wrapped — each a top-level directory:
| Path | How an agent adapts | Agents in it | Eval on BenchFlow |
|---|---|---|---|
acp/ · ACP over stdio |
a declarative acp/<id>/manifest.toml (registry CLIs that already speak ACP — no code), or a self-contained ACP package |
the 36-agent ACP registry (goose, Qwen Code, Stakpak, GitHub Copilot, GLM, …) + mini-swe (SWE-agent behind opencode's TUI) + mimo (Xiaomi MiMo Code, native mimo acp) |
registry: 13 wired · 14 runnable · 6 native (tiers · AGENTS.md); mini-swe ✅ stable (>74% SWE-bench verified); mimo ✅ free mimo-auto runs headless |
ai-sdk/ · Vercel AI SDK |
a pure-JS server.mjs + register.py wrapping the AI SDK agent |
ToolLoopAgent (acp) and HarnessAgent × {pi, codex, claude-code, mimo, deepagents, opencode} — every official @ai-sdk/harness-* |
acp ✅ parity byte-verified · harness-pi ✅ (file tasks) · codex/claude-code 🧪 (Vercel sandbox) · deepagents/opencode 🧪 scaffolded (routing next step) — ai-sdk/README |
omnigent/ · non-ACP Session |
a per-harness session_factory that shells omnigent run --harness <x> in the sandbox |
Databricks Omnigent meta-harness — all 22 canonical harnesses (pi, claude, codex, cursor, opencode, hermes, openai-agents, goose, qwen, kimi, copilot, antigravity + their -native drivers); the only non-ACP path |
omnigent-pi ✅ reward 1.0 on hello-world and citation-check; the other 21 🧪 listed-not-wired (harness CLI install = next step); needs the session-factory seam — omnigent/README |
Within those paths an agent takes one of two shapes: a few self-contained
packages (the mini-swe-* runtimes, the ai-sdk/* group, omnigent) — a
production runtime + a thin adapter registered via the public register_agent
extension point — or a declarative acp/<id>/manifest.toml agent for an
ACP-registry CLI that already speaks ACP (no adapter code; discovered via the
manifest loader, classified by tier in acp-registry).
Not every agent can be benchmarked the same way. We classify each by how much of a run BenchFlow captures — the bar for adaptation is the floor (BenchFlow can create the experiment and track the run's logs); the tier says how much is captured above it. Full reference in docs/tiers.md; live per-agent table in acp-registry/AGENTS.md.
| Tier | What BenchFlow tracks |
|---|---|
| ✅ wired + 🟦 native | raw-LLM trajectory (gateway proxy) and ACP-trajectory logs — model-enforced, wire-parity-verifiable |
| 🏃 runnable | ACP-trajectory logs only — the model runs on the agent's own/vendor backend, not gateway-enforced (executable, not a faithful model-enforced eval) |
| 📋 catalog / 🔒 vendor-locked / ➖ out-of-scope | not adapted — a wiring to-do (recipe or block recorded), a backend-locked CLI, or a non-single-model agent |
Snapshot v1.0.0: wired 13 · runnable 14 · catalog 1 · native 6 · vendor-locked 1
· out-of-scope 1 (36 total) — AGENTS.md is authoritative.
The point isn't just "it runs in both places" — it's that it behaves the same
in both. For ai-sdk/acp this is verified at the wire level: driven inside
BenchFlow vs. standalone, the upstream model request is byte-identical (same
system+user prompt, tools, sampling params) apart from neutral gateway artifacts;
tool-use, file output, reward, and finish reason match. BenchFlow provides the
environment and captures the trajectory — it does not perturb the agent. The
adaptation-parity skill automates this check;
methodology in docs/parity.md.
Wire parity is the bar for the wired and native tiers (the model is proxied through BenchFlow's gateway, so there's a request to byte-diff); runnable agents run the model on their own backend, so they're held to outcome parity only.
Honesty matters more than a green checkmark. Toy tasks (a single file write) pass easily and prove little; real eval workloads — input files, real toolchains (
pytest, network), skills — expose the gaps. No agent here is "verified" beyond its quickstart; e.g.harness-pipasses hello-world but not the real SkillsBenchcitation-check. We need more tasks, of more variants, run end-to-end. Each package README states plainly what it has and hasn't been run against.
The bar is BenchFlow can create the experiment and track the run's logs; how much it captures sets the tier. Two routes:
- Already speaks ACP (most registry CLIs) — add a declarative
acp/<id>/manifest.tomland classify it inacp-registry; no adapter code. Thecontract/tests validate the manifest. (docs/adaptation.md) - Needs an adapter (a TUI, an SDK agent, a non-ACP harness) — write an ACP
server +
register.py. Scaffold fromai-sdk/acp:python skills/adaptation-parity/scripts/scaffold_ai_sdk_agent.py <name>.
Then verify parity — inside vs. standalone, with the skill's acp_capture.mjs +
parity_diff.py (wired/native; docs/parity.md).
Drive an agent interactively (mini-swe-code):
cd acp/mini-swe-code
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[opencode]"
export ANTHROPIC_API_KEY="<your-key>" # or OPENAI_API_KEY / GEMINI_API_KEY / ...
mkdir -p /tmp/mini-swe-scratch
mini-opencode --attach --cwd /tmp/mini-swe-scratchNote
The agent executes commands locally without confirmation inside --cwd —
point it at a scratch directory.
Benchmark an agent (mini-swe-acp):
pip install "mini-swe-acp @ git+https://github.com/benchflow-ai/agents#subdirectory=acp/mini-swe-acp"import mini_swe_acp # registers "mini-swe" (aliases: mini, minisweagent, mini-swe-agent)
from benchflow import SDK
await SDK().run(task_path="...", agent="mini-swe", model="openai/gpt-4o-mini")Per-agent setup, design notes, and caveats live in each package's README: mini-swe-code · mini-swe-acp · ai-sdk/*. For the ACP-registry agents, the generated acp-registry/AGENTS.md lists every agent's tier, wiring recipe, and known issues.
acp/ all ACP agents — 2 self-contained packages + 38 declarative manifests:
mini-swe-code/ mini-swe-agent distribution + opencode TUI (CLIs: mini, mini-opencode)
mini-swe-acp/ mini-swe-agent as a BenchFlow ACP agent
<id>/manifest.toml declarative registry agents (goose, qwen-code, …) + shims (mimo, …), no server code
acp-registry/ classifies the 36 ACP-registry agents into 6 tiers (catalog.py → AGENTS.md)
ai-sdk/ Vercel AI SDK agents: acp + harness-{pi,codex,claude-code,mimo,deepagents,opencode}
omnigent/ Databricks Omnigent meta-harness: 22 non-ACP (session-factory) agents, one per --harness
contract/ versioned manifest schema + loader + contract tests (validates acp/<id>/manifest.toml)
skills/ adaptation-parity skill — adapt an agent + verify eval/prod parity
docs/ adaptation.md, parity.md, tiers.md, CONTEXT.md, adr/ (0001–0003)
.github/ CI: per-family tests (path-filtered), ruff lint, markdown link check
Add an agent either way: a declarative acp/<id>/manifest.toml + a
catalog.py tier entry (validated by the
contract/ tests), or a self-contained package + a per-package CI workflow (or extend
the ai-sdk matrix). Each package still builds, tests, and ships independently.
Issues and PRs welcome — see CONTRIBUTING.md. High-value now:
- More benchmark tasks, of more variants — input-file / real-toolchain
(
pytest/build) / skill-based — run end-to-end against each agent to find and close eval↔prod behavior gaps. - New agent integrations (any production agent + a thin ACP adapter; scaffold + verify with the adaptation-parity skill).
- Parity reports: same agent, inside-BenchFlow vs. standalone, audited.
| Path | License |
|---|---|
repository root, acp/mini-swe-acp/, acp-registry/, ai-sdk/, omnigent/, contract/, skills/ |
Apache-2.0 |
acp/mini-swe-code/ |
MIT (upstream mini-swe-agent license, kept verbatim) |
acp/<id>/ manifest agents |
each keeps its upstream license — recorded per entry in acp-registry/AGENTS.md |
- mini-swe-agent and the SWE-bench / SWE-agent team. If useful in research, cite their SWE-agent paper.
- Vercel AI SDK — the toolkit behind the
ai-sdkagents. - opencode — the TUI that makes the agent a pleasure to drive.
- Agent Client Protocol — the editor/agent-agnostic protocol the eval shims speak.
- ACP registry — the source of the 36 ACP agents the
acp-registrypackage maps onto BenchFlow.