Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter. Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.
Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access
--skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals /
messages. Cost is reported as 0.0 because Codex does not emit a
cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a
"model not found" shaped error, the adapter retries once without
``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter
returns Error immediately instead of spinning up a container that
can only fail.
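A minimal sketch of the token accounting described above, assuming each JSONL line is an object with a ``type`` field and that ``turn.completed`` events carry a ``usage`` object with ``input_tokens`` and ``output_tokens``. The function name and the exact event fields are illustrative assumptions, not the adapter's actual parser.

```python
import json


def summarize_codex_events(jsonl_text: str) -> dict:
    """Sum token usage across ``turn.completed`` events in a JSONL stream.

    Assumed event shape (hypothetical):
    {"type": "turn.completed", "usage": {"input_tokens": 1, "output_tokens": 2}}
    """
    totals = {"input_tokens": 0, "output_tokens": 0, "turns": 0}
    for raw_line in jsonl_text.splitlines():
        raw_line = raw_line.strip()
        if not raw_line:
            continue
        try:
            event = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise interleaved with events
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}
        totals["input_tokens"] += int(usage.get("input_tokens", 0))
        totals["output_tokens"] += int(usage.get("output_tokens", 0))
        totals["turns"] += 1
    # No cost field is emitted, so cost is reported as 0.0.
    totals["cost"] = 0.0
    return totals
```

Skipping unparseable lines keeps the summary usable even if the CLI prints a banner or warning between events.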
Shared ``_coop`` module:
- ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh`` — pip-installs redis and drops the shell
wrappers; each adapter's setup.sh sources it.
- ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
agnostic.
- ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
``write_file_in_container`` / ``read_file_from_container``,
``rewrite_comm_url_for_container``, ``build_git_setup_command``,
``parse_sent_messages_log``, and ``normalize_patch``.
Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires. Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace). This bit Codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and never hit it.
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module. Full suite: 228
passed, 63 skipped.
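The trailing-newline fix can be illustrated with a short sketch. It assumes the behaviour stated above (one trailing newline, no leading whitespace); the function name ``normalize_patch_sketch`` is hypothetical and the real ``normalize_patch()`` may differ in detail.

```python
def normalize_patch_sketch(text: str) -> str:
    """Return the patch with no leading whitespace and one trailing newline.

    Only trailing newline characters are trimmed, so a final context line
    that consists of a single space survives intact.
    """
    body = text.lstrip().rstrip("\r\n")
    if not body:
        return ""  # an empty patch stays empty
    return body + "\n"
```

By contrast, a plain ``.strip()`` removes the final newline as well, which is the failure mode the fix addresses.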
End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch
(with the trailing-newline fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but 0/2
tests pass: coordination failure (agent1 fetched ``team`` but never
merged, so the stacked patches produce a Python SyntaxError at line 144
of the modified file). Claude on the same task scored 2/2; Codex used
the tools less aggressively on this run.
The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug. Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product. Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:
1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
atomic claim semantics (HSETNX-style — exactly one caller wins on a
race) and an audit log of every mutation. Exposed in the container
as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
/ ``coop-task-list`` shell wrappers.
2. A **lead / member role split**. The first agent is designated
``team-lead`` and gets a system-prompt block instructing them to
break the spec into tasks, assign them via ``coop-task-create
--assign``, watch progress, and integrate. Other agents are
``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
mounted at ``/workspace/shared`` in every container. Free
coordination artifact for design notes, partial diffs, interface
sketches.
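The atomic claim in item 1 can be sketched as follows. This is an illustration under assumptions: the key layout ``cb:<run_id>:task:<task_id>`` and the ``owner`` field are hypothetical (only the ``cb:<run_id>:`` prefix comes from the description above), and ``InMemoryHashes`` is a single-process stand-in for a Redis client so the example runs without a server. A redis-py client exposes ``hsetnx(name, key, value)`` with the same 1/0 return convention.

```python
class InMemoryHashes:
    """Tiny stand-in exposing the two hash commands used below."""

    def __init__(self):
        self._hashes = {}

    def hsetnx(self, name, key, value):
        # Set the field only if it does not exist; 1 on success, 0 otherwise.
        fields = self._hashes.setdefault(name, {})
        if key in fields:
            return 0
        fields[key] = value
        return 1

    def hget(self, name, key):
        return self._hashes.get(name, {}).get(key)


def claim_task(client, run_id: str, task_id: str, agent_id: str) -> bool:
    """Try to claim a task; exactly one caller wins when several race."""
    key = f"cb:{run_id}:task:{task_id}"  # hypothetical key layout
    return bool(client.hsetnx(key, "owner", agent_id))
```

Because the set-if-absent check happens in one command on the server, two agents claiming the same task cannot both succeed, and the loser can simply move on to the next open task.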
Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``. Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
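As a rough sketch of how the metrics named above could be derived from an audit log, assume each log entry is a dict with ``ts`` (seconds), ``op`` (``create``, ``claim`` or ``update``), ``task_id`` and ``agent``, plus an optional ``assignee`` on create and ``status`` on update. That entry shape and the function name are assumptions for illustration only.

```python
from collections import Counter


def compute_team_metrics(log, run_started_at):
    """Replay an audit log in time order and derive coordination metrics."""
    claims = Counter()
    updates = Counter()
    first_claim_ts = None
    tasks = {}  # task_id -> {"owner": ..., "status": ...}

    for entry in sorted(log, key=lambda item: item["ts"]):
        task = tasks.setdefault(
            entry["task_id"], {"owner": None, "status": "open"}
        )
        op = entry["op"]
        if op == "create":
            task["owner"] = entry.get("assignee")
        elif op == "claim":
            claims[entry["agent"]] += 1
            task["owner"] = entry["agent"]
            if first_claim_ts is None:
                first_claim_ts = entry["ts"]
        elif op == "update":
            updates[entry["agent"]] += 1
            task["status"] = entry.get("status", task["status"])

    return {
        "time_to_first_claim_seconds": (
            None if first_claim_ts is None else first_claim_ts - run_started_at
        ),
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for t in tasks.values() if t["status"] == "done"),
        "unowned_at_end": sum(1 for t in tasks.values() if t["owner"] is None),
    }
```

Computing everything from the log after the run keeps the agents' hot path free of bookkeeping and makes the metrics reproducible from saved artifacts.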
Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs. The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.
Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log,
run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
metrics in result, three-agent team)
- 4 misc
Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:
- 5 tasks created (2 by bench-runner, 3 by the lead splitting its
work), all reached ``done``
- time_to_first_claim_seconds=34.2
- claims_per_agent={agent1: 2, agent2: 1}
- updates_per_agent={agent1: 4, agent2: 3}
- scratchpad volume actively used (agent2 wrote its diff to
/workspace/shared/agent2.patch + a summary.md)
- **0/1 pass rate** — both ``patch.txt`` files were empty: the
members wrote diffs to the scratchpad instead of also writing
``/workspace/repo/patch.txt``, and the lead never ran the final
integration step. This is real coordination signal (the prompt
told them to write both places but they followed the scratchpad
half only) — a follow-up will tighten the prompt to make patch.txt
submission the explicit final step.
Future PRs (intentionally out of scope here so this lands at a
reviewable size):
- In-loop auto-refresh for the Python-loop adapters
- MCP long-poll tool to give CLI adapters push-ish inbox semantics
- Typed ``coop-request`` / ``coop-respond`` protocol on top of
messaging (CC's plan_approval_request shape)
- Filesystem mirror of the task list (CC-style ``ls`` artefacts)
Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…resh (#53)
Lands the four follow-ups that were called out as "Out of scope" on the
team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.
1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so
agents can ``ls`` and ``cat`` tasks with their existing tools rather
than going through the ``coop-task-list`` CLI. Layout mirrors Claude
Code's team primitive: one ``<id>.json`` per task, plus ``_index.json``
(cheap ``ls`` target) and ``_log.jsonl`` (audit trail). Triggered on
every ``coop-task-list`` invocation and from the host runner at startup.
Files are written via tempfile+replace so readers never observe a
partial state.
2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
Layered on plain Redis messaging, mirroring CC's
``plan_approval_request`` / ``plan_approval_response`` shape.
``coop-request <peer> <kind> <body>`` returns a request_id (and
optionally blocks via ``--wait N`` for a response).
``coop-respond <request_id> <body>`` writes back; the sender's
``await_response`` uses BLPOP so it actually sleeps instead of
busy-polling. Both events flow into the shared task-log so coordination
metrics include protocol events.
3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC
server that exposes a single ``wait_for_message`` tool backed by BLPOP
on the agent's inbox. Registered automatically: the Claude Code adapter
writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; the
Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make
"watch the inbox" a natural idle behavior for the CLI adapters instead
of a busy-loop on ``coop-recv`` returning empty. This is the closest we
can get to push-style delivery for opaque CLI agent loops.
4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
``TeamPoller`` is a per-agent host-side helper that
``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries,
the same hook as the existing inbox poll. The LLM sees a compact
``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to
every turn so it doesn't need to remember to call ``coop-task-list``.
Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree
change is one branch in ``step()``. The same module also exports
``poll_team_state()`` for in-container use (env-driven variant).
5. **Prompt fix**: the previous team-mode end-to-end had members writing
diffs to ``/workspace/shared/<id>.patch`` only and never to
``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination.
Both lead and member prompts now have an explicit
``### Final submission — REQUIRED`` section that calls out ``patch.txt``
as the only file the bench evaluates and provides the exact
``git diff > patch.txt`` command.
Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table. Previously the column showed 0 because the
function tried ``len(r.get("patch", "").splitlines())`` and team mode
doesn't store the full patch in the agents dict.
Tests: 37 new unit tests
- 8 fs_mirror (atomic writes, stale cleanup, empty index)
- 9 protocol (request roundtrip, await, timeout, audit log)
- 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking,
unknown-tool error, env factory)
- 8 loop_refresh (summary formatting, TeamPoller, env variant)
- 3 prompt (regression: lead+member prompts demand patch.txt)
Full suite: **311 passed**, 63 skipped.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).
All four follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the
same task, so the prompt fix is doing real work.
Stacks on #52.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
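The tempfile+replace write pattern mentioned for the filesystem mirror can be sketched as below. This is a generic illustration of the technique using only the standard library; the helper name ``write_json_atomic`` is hypothetical and is not taken from ``_team/fs_mirror.py``.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload) -> None:
    """Write JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

An agent running ``cat`` on a task file during a refresh therefore never reads a half-written document.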
Resolves conflicts in src/cooperbench/agents/{claude_code,codex}/{adapter.py,setup.sh}.
team-mode's HEAD was already a superset of main for these files: it
carried the pre-squash version of #51 plus the team-mode additions
on top. Took --ours for all four; ruff / format / mypy / pytest all
clean (311 passed, 63 integration skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil pushed a commit that referenced this pull request on May 18, 2026
The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in
pyproject.toml; it just happened to be present in my local venv because
something else pulled it in transitively. GitHub CI installs ``[dev]``
only, so on a clean install pytest collection fails with
``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode
test file.
Adding the dependency explicitly fixes PR #52 (team-mode) CI; once
team-mode merges, PR #55 (team-all-adapters) will also pick it up via
the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Six test modules (tests/agents/_team/test_{fs_mirror,loop_refresh,
mcp_server,protocol,task_list}.py and tests/runner/test_team.py) import
fakeredis to mock Redis for the team-mode task list and messaging.
fakeredis wasn't pinned in dev extras, so CI's `uv pip install -e ".[dev]"`
left it missing and pytest collection failed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lockfile regenerated by `uv sync --extra dev` after pyproject.toml
gained fakeredis. CI uses `uv pip install -e ".[dev]"` so the prior
commit was enough to unblock GH Actions, but `uv sync --frozen` would
have failed on team-mode without this.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a third setting (``team``) alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox, team mode gives them a coordinator role, a shared task list with atomic claim, a scratchpad volume, a typed request/response protocol, and (for CLI adapters) an MCP long-poll tool.
Stacks on #51. Includes the now-merged follow-ups from #53.
What's in team mode
- Shared task list: atomic claim (``HSETNX``-style), owner-only updates, full audit log, namespaced per run. Code: ``_team/task_list.py`` (host) + ``_team/coop_task.py`` (in-container CLI).
- ``coop-task-*`` shell wrappers: ``create`` / ``claim`` / ``update`` / ``list`` exposed at ``/usr/local/bin/``. Code: ``_team/install_snippet.sh``.
- Lead / member role prompts. Code: ``_team/prompt.py`` (used by ``build_team_instruction``).
- Shared scratchpad: ``/workspace/shared/`` in every container. Code: ``_team/runtime.py:scratchpad_mount_args``.
- Filesystem mirror: ``coop-task-list`` (and the host runner) snapshots Redis to ``/workspace/shared/tasks/<id>.json`` + ``_index.json`` + ``_log.jsonl``, ``ls``-able artefacts like CC's on-disk teams. Code: ``_team/fs_mirror.py`` + inlined copy in ``_team/coop_task.py``.
- ``coop-request`` / ``coop-respond`` protocol: ``plan_approval_request`` / ``response`` shape. ``await_response`` uses ``BLPOP``. Events flow into the shared task-log. Code: ``_team/protocol.py`` + ``cmd_request`` / ``cmd_respond`` / ``cmd_pending`` in ``coop_task.py``.
- MCP long-poll tool: ``wait_for_message(timeout_ms)``, backed by ``BLPOP`` on the agent's inbox. Closest we can get to push delivery for opaque CLI loops. Code: ``_team/mcp_server.py``; registered into Claude/Codex configs by their adapters.
- In-loop auto-refresh: ``TeamPoller`` injects a compact ``[Team task list] open: 1, in_progress: 2 …`` user message before every Python-loop step. Code: ``_team/loop_refresh.py`` + hook in ``mini_swe_agent_v2/agents/default.py:step()``.
- Coordination metrics: ``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end`` computed from the audit log and saved in ``result.json``. Code: ``_team/metrics.py`` + ``runner/team.py``.
- ``team/`` log layout: ``agent{fid}.patch`` + ``agent{fid}_traj.json`` + ``conversation.json`` + ``task_log.json`` + ``tasks.json`` + ``result.json``. Code: ``runner/team.py``.
- ``--setting team`` CLI: ``cooperbench eval -n …`` alongside ``solo/`` and ``coop/``. Code: ``cli.py``, ``runner/core.py``, ``eval/runs.py``.
Adapter compatibility
Team mode is fully wired only for ``claude_code`` and ``codex``. The others accept the kwargs (so they don't crash) but coverage drops off:
- ``claude_code``: prompt from ``build_team_instruction``; ``coop-task-*`` in container; ``wait_for_message`` registered in ``.claude.json``
- ``codex``: ``coop-task-*`` in container; ``wait_for_message`` registered in ``config.toml``
- ``mini_swe_agent_v2``: ``TeamPoller`` injected into ``DefaultAgent.step()``
- ``swe_agent``, ``openhands_sdk``: kwargs accepted only
A future PR will close the gaps for the three Python-loop adapters: swap their templated prompt assembly for ``build_team_instruction`` calls, wire their setup steps to install ``coop-task-*`` in the agent container, and add the auto-refresh hook to ``swe_agent`` and ``openhands_sdk`` (mirroring what's already done for v2).
Tests
83 new unit tests in ``tests/agents/_team/`` and ``tests/runner/test_team.py``, all hermetic (use ``fakeredis``).
- ``test_task_list.py`` (18)
- ``test_prompt.py`` (15), including the ``patch.txt`` submission regression
- ``test_runtime.py`` (3)
- ``test_metrics.py`` (4)
- ``test_fs_mirror.py`` (8)
- ``test_protocol.py`` (9)
- ``test_mcp_server.py`` (9)
- ``test_loop_refresh.py`` (8)
- ``runner/test_team.py`` (5)
Full suite: 311 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with ``claude_code --setting team --git``:
- ``_print_single_result`` fixed so the run table displays correctly
- tasks reached ``done``
- ``time_to_first_claim_seconds = 29.9``
- ``claims_per_agent = {agent1: 2, agent2: 1}``
- ``updates_per_agent = {agent1: 5, agent2: 2}``
- ``/workspace/shared/tasks/`` populated with per-task JSON + ``_index.json`` + ``_log.jsonl``
- ``/workspace/shared/agent2.patch`` written
- ``cb-mcp-server.py`` registered in ``.claude.json``
The first team-mode end-to-end (before the prompt fix in #53) scored 0/2 on the same task because members wrote to scratchpad only. The ``### Final submission — REQUIRED`` section in both lead and member prompts closes that loop.
Out of scope (real follow-ups for separate PRs)
- … ``coop-task-list``, which is good enough but not real push.
Test plan
- ``ruff check``, ``ruff format --check``, ``mypy``, ``pytest tests/`` (all green locally)
- ``uv run cooperbench run -a claude_code -m claude-sonnet-4-5 -r <repo> -t <task> -f <f1>,<f2> --setting team --git --backend docker``
- ``uv run cooperbench eval -n <name> --backend docker``
- ``result.json:metrics``, ``task_log.json``, ``/workspace/shared/tasks/_index.json`` inside the team volume
🤖 Generated with Claude Code