runner: add team mode (lead + members + shared task list + scratchpad) by ProKil · Pull Request #52 · cooperbench/CooperBench

ProKil · 2026-05-16T21:46:19Z

Summary

Adds a third setting (team) alongside solo and coop, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox, team mode gives them a coordinator role, a shared task list with atomic claim, a scratchpad volume, a typed request/response protocol, and (for CLI adapters) an MCP long-poll tool.

Stacks on #51. Includes the now-merged follow-ups from #53.

What's in team mode

| Feature | What it is | Where it lives |
| --- | --- | --- |
| Shared task list | Redis-backed (`HSETNX`-style atomic claim), owner-only updates, full audit log, namespaced per run | `_team/task_list.py` (host) + `_team/coop_task.py` (in-container CLI) |
| `coop-task-*` shell wrappers | `create` / `claim` / `update` / `list` exposed at `/usr/local/bin/` | `_team/install_snippet.sh` |
| Lead + member roles | First agent gets a "you are the team-lead, organize via the task list before coding" prompt block; others get a "claim open tasks, report progress" block | `_team/prompt.py` (used by `build_team_instruction`) |
| Shared scratchpad | Docker volume mounted at `/workspace/shared/` in every container | `_team/runtime.py:scratchpad_mount_args` |
| Filesystem mirror of task list | `coop-task-list` (and the host runner) snapshots Redis to `/workspace/shared/tasks/<id>.json` + `_index.json` + `_log.jsonl`, giving `ls`-able artefacts like CC's on-disk teams | `_team/fs_mirror.py` + inlined copy in `_team/coop_task.py` |
| Typed `coop-request` / `coop-respond` protocol | Request/response with ID echo, mirroring CC's `plan_approval_request`/`response` shape. `await_response` uses `BLPOP`. Events flow into the shared task-log. | `_team/protocol.py` + `cmd_request`/`cmd_respond`/`cmd_pending` in `coop_task.py` |
| MCP long-poll server | Stdio JSON-RPC server exposing `wait_for_message(timeout_ms)`, backed by `BLPOP` on the agent's inbox. Closest we can get to push delivery for opaque CLI loops. | `_team/mcp_server.py`; registered into Claude/Codex configs by their adapters |
| In-loop task auto-refresh | `TeamPoller` injects a compact `[Team task list] open: 1, in_progress: 2 …` user message before every Python-loop step | `_team/loop_refresh.py` + hook in `mini_swe_agent_v2/agents/default.py:step()` |
| Coordination metrics | `time_to_first_claim_seconds`, `claims_per_agent`, `updates_per_agent`, `tasks_done`, `unowned_at_end` computed from the audit log and saved in `result.json` | `_team/metrics.py` + `runner/team.py` |
| `team/` log layout | Per-agent `agent{fid}.patch` + `agent{fid}_traj.json` + `conversation.json` + `task_log.json` + `tasks.json` + `result.json` | `runner/team.py` |
| `--setting team` CLI | Discovered by `cooperbench eval -n …` alongside `solo/` and `coop/` | `cli.py`, `runner/core.py`, `eval/runs.py` |
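
To make the "request/response with ID echo" row concrete, here is a minimal sketch of the pattern. It is not the code in `_team/protocol.py`: `InMemoryBus`, the `inbox:` / `response:` key names, and the message fields are hypothetical, and a `queue.Queue` with a timeout stands in for Redis `BLPOP` so the example runs without a server.

```python
import queue
import uuid


class InMemoryBus:
    """Hypothetical stand-in for Redis lists; blocking_pop plays the role of BLPOP."""

    def __init__(self):
        self._queues = {}

    def _queue(self, key):
        return self._queues.setdefault(key, queue.Queue())

    def push(self, key, value):
        self._queue(key).put(value)

    def blocking_pop(self, key, timeout_s):
        # Sleeps until a value arrives or the timeout expires (no busy-polling).
        try:
            return self._queue(key).get(timeout=timeout_s)
        except queue.Empty:
            return None


def send_request(bus, sender, peer, kind, body):
    """Deliver a typed request to the peer's inbox and return its request_id."""
    request_id = uuid.uuid4().hex
    bus.push(
        f"inbox:{peer}",
        {"type": "request", "request_id": request_id,
         "from": sender, "kind": kind, "body": body},
    )
    return request_id


def respond(bus, request_id, body):
    """Write a response that echoes the request_id back to the sender."""
    bus.push(
        f"response:{request_id}",
        {"type": "response", "request_id": request_id, "body": body},
    )


def await_response(bus, request_id, timeout_s):
    """Block until the matching response arrives; None on timeout."""
    return bus.blocking_pop(f"response:{request_id}", timeout_s)
```

Keying the response queue by `request_id` is one way to let the sender wait for exactly its own reply while other traffic keeps flowing through the inbox.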

Adapter compatibility

Team mode is fully wired only for claude_code and codex. The others accept the kwargs (so they don't crash) but coverage drops off:

| Adapter | Accepts team kwargs | Team-mode prompt | In-loop auto-refresh | `coop-task-*` in container | MCP `wait_for_message` |
| --- | --- | --- | --- | --- | --- |
| `claude_code` | ✓ | ✓ via `build_team_instruction` | n/a (opaque loop) | ✓ installed by setup.sh | ✓ registered in `.claude.json` |
| `codex` | ✓ | ✓ | n/a | ✓ | ✓ registered in `config.toml` |
| `mini_swe_agent_v2` | ✓ | ✗ (uses its own yaml templates that don't know about team mode) | ✓ via `TeamPoller` injected into `DefaultAgent.step()` | runs in container so the CLI would work if installed, but its setup.sh isn't team-aware | n/a |
| `swe_agent` | ✓ then `del` | ✗ | ✗ | ✗ | n/a |
| `openhands_sdk` | ✓ then `del` | ✗ | ✗ | ✗ | n/a |

A future PR will close the gaps for the three Python-loop adapters: swap their templated prompt assembly for build_team_instruction calls, wire their setup steps to install coop-task-* in the agent container, and add the auto-refresh hook to swe_agent and openhands_sdk (mirroring what's already done for v2).

Tests

83 new unit tests in tests/agents/_team/ and tests/runner/test_team.py, all hermetic (use fakeredis).

| Module (tests) | Coverage |
| --- | --- |
| `test_task_list.py` (18) | CRUD, atomic claim under contention, owner-only update, ValueError on invalid status, audit log records winning claims only, namespace isolation |
| `test_prompt.py` (15) | lead vs member branches, solo-fallback, single-agent degenerate, git interaction, regression: both prompts demand `patch.txt` |
| `test_runtime.py` (3) | env var assembly, scratchpad mount args, tasks-mirror dir |
| `test_metrics.py` (4) | happy path, unowned-at-end flagging, empty log, multiple claims aggregation |
| `test_fs_mirror.py` (8) | atomic writes, stale cleanup, empty-index case, deep mkdir |
| `test_protocol.py` (9) | request roundtrip, BLPOP await, timeout, audit log |
| `test_mcp_server.py` (9) | initialize, tools/list, tools/call, blocking, unknown-tool, env factory |
| `test_loop_refresh.py` (8) | summary formatting, TeamPoller, env-driven variant |
| `runner/test_team.py` (5) | lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team |
| misc + bumps (4) | |

Full suite: 311 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end on `dottxt_ai_outlines_task/1371 [1,2]` with `claude_code --setting team --git`

5m44s, $1.66 total, both agents Submitted
Patches: agent1=249 lines, agent2=131 lines (cosmetic fix to _print_single_result so this displays correctly)
6 tasks created (2 by runner pre-seed, 4 by lead splitting its work), 4 reached done
time_to_first_claim_seconds = 29.9
claims_per_agent = {agent1: 2, agent2: 1}
updates_per_agent = {agent1: 5, agent2: 2}
/workspace/shared/tasks/ populated with per-task JSON + _index.json + _log.jsonl
/workspace/shared/agent2.patch written
MCP server cb-mcp-server.py registered in .claude.json
Eval: 2/2 features pass (14/14 + 20/20 tests)
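
The coordination metrics listed above are derived from the task-list audit log. The following is a rough sketch of how such an aggregation could work, not the code in `_team/metrics.py`. The log entry shape (`ts`, `event`, `agent`, `task_id`, `status`) is hypothetical, `time_to_first_claim_seconds` is assumed to be measured from the first logged event, and `tasks_done` / `unowned_at_end` are assumed to be counts.

```python
from collections import Counter


def compute_metrics(log):
    """Aggregate a chronological audit log into coordination metrics.

    Each entry is assumed (hypothetically) to look like:
    {"ts": float, "event": "create" | "claim" | "update",
     "agent": str, "task_id": str, "status": optional str}
    """
    claims = Counter()
    updates = Counter()
    owner = {}    # task_id -> owning agent, or None while unclaimed
    status = {}   # task_id -> last known status
    first_claim_ts = None

    for entry in log:
        task_id = entry["task_id"]
        if entry["event"] == "create":
            owner.setdefault(task_id, None)
            status[task_id] = "open"
        elif entry["event"] == "claim":
            claims[entry["agent"]] += 1
            owner[task_id] = entry["agent"]
            if first_claim_ts is None:
                first_claim_ts = entry["ts"]
        elif entry["event"] == "update":
            updates[entry["agent"]] += 1
            if "status" in entry:
                status[task_id] = entry["status"]

    start_ts = log[0]["ts"] if log else None
    return {
        "time_to_first_claim_seconds": (
            None if first_claim_ts is None else first_claim_ts - start_ts
        ),
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sum(1 for o in owner.values() if o is None),
    }
```

Deriving everything from the log after the run keeps the agents' hot path free of metric bookkeeping; the trade-off is that anything not logged cannot be measured.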

The first team-mode end-to-end (before the prompt fix in #53) scored 0/2 on the same task because members wrote to scratchpad only. The ### Final submission — REQUIRED section in both lead and member prompts closes that loop.

Out of scope (real follow-ups for separate PRs)

Full team-mode support for the three Python-loop adapters (matrix above). Each has its own prompt-assembly story and setup-time story; doing all three properly is a bigger diff than I want to bundle here.
Filesystem-watch ("inotify-style") auto-refresh for the task mirror — currently snapshot-on-coop-task-list, which is good enough but not real push.
Adversarial test harnesses — inject deliberately conflicting feature pairs so we can score coordination under stress.

Test plan

ruff check, ruff format --check, mypy, pytest tests/ (all green locally)
uv run cooperbench run -a claude_code -m claude-sonnet-4-5 -r <repo> -t <task> -f <f1>,<f2> --setting team --git --backend docker
uv run cooperbench eval -n <name> --backend docker
Inspect result.json:metrics, task_log.json, /workspace/shared/tasks/_index.json inside the team volume

🤖 Generated with Claude Code

Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the bits that are agent-agnostic (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so the two adapters (and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail.

Shared ``_coop`` module:
- ``coop_msg.py``: Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh``: pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it.
- ``prompt.py``: solo / coop / coop+git prompt assembly, agent-agnostic.
- ``runtime.py``: ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't.

Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass: coordination failure (agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file). Claude on the same task scored 2/2; Codex used the tools less aggressively on this run.

The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. Future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
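
The trailing-newline bug described in the commit message above is easy to reproduce in isolation. The sketch below is one possible reading of "one trailing newline, no leading whitespace"; it is not the repository's ``normalize_patch`` implementation, and returning an empty string for a blank patch is an assumption.

```python
def normalize_patch(text: str) -> str:
    """Return the patch with no leading whitespace and exactly one trailing newline.

    Assumption: a patch that is empty or whitespace-only normalizes to "".
    """
    if not text.strip():
        return ""
    body = text.lstrip()
    # Strip only newlines at the end: a final context line may legitimately
    # consist of a single space, which a plain .strip() would destroy.
    return body.rstrip("\n") + "\n"
```

For example, ``normalize_patch("\n\ndiff --git a/x b/x\n+foo")`` yields ``"diff --git a/x b/x\n+foo\n"``, whereas ``.strip()`` on the same input would leave the last line unterminated.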

Adds a third setting alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox to chat over, team mode adds three load-bearing primitives:

1. A typed **shared task list** (cooperbench.agents._team.TaskListClient) backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with atomic claim semantics (HSETNX-style: exactly one caller wins on a race) and an audit log of every mutation. Exposed in the container as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update`` / ``coop-task-list`` shell wrappers.
2. A **lead / member role split**. The first agent is designated ``team-lead`` and gets a system-prompt block instructing them to break the spec into tasks, assign them via ``coop-task-create --assign``, watch progress, and integrate. Other agents are ``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``) mounted at ``/workspace/shared`` in every container. Free coordination artifact for design notes, partial diffs, interface sketches.

Coordination metrics are computed from the task-list audit log after the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved into ``result.json``. Evaluation is identical to coop (per-agent ``patch.txt`` evaluated per-feature), so no eval changes were needed beyond discovering ``team/`` log directories.

Compatibility: all five existing adapters accept the new ``team_role`` / ``team_id`` / ``task_list_url`` kwargs. The CLI adapters (``claude_code``, ``codex``) wire the team install snippet into their ``setup.sh`` so the ``coop-task-*`` wrappers land at ``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``, ``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking; their in-loop integration with the task list (auto-refresh between steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log, run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team)
- 4 misc

Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in team+git mode:
- 5 tasks created (2 by bench-runner, 3 by the lead splitting its work), all reached ``done``
- time_to_first_claim_seconds=34.2
- claims_per_agent={agent1: 2, agent2: 1}
- updates_per_agent={agent1: 4, agent2: 3}
- scratchpad volume actively used (agent2 wrote its diff to /workspace/shared/agent2.patch + a summary.md)
- **0/1 pass rate**: both ``patch.txt`` files were empty: the members wrote diffs to the scratchpad instead of also writing ``/workspace/repo/patch.txt``, and the lead never ran the final integration step. This is real coordination signal (the prompt told them to write both places but they followed the scratchpad half only); a follow-up will tighten the prompt to make patch.txt submission the explicit final step.

Future PRs (intentionally out of scope here so this lands at a reviewable size):
- In-loop auto-refresh for the Python-loop adapters
- MCP long-poll tool to give CLI adapters push-ish inbox semantics
- Typed ``coop-request`` / ``coop-respond`` protocol on top of messaging (CC's plan_approval_request shape)
- Filesystem mirror of the task list (CC-style ``ls`` artefacts)

Stacks on #51 (Codex adapter) so the diff stays focused on team-mode additions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
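
The "exactly one caller wins on a race" claim semantics in the commit message above can be illustrated without a Redis server. In this sketch, `FakeHashStore` is a hypothetical in-memory stand-in whose `hsetnx` follows the return convention of Redis `HSETNX` (1 if the field was set, 0 if it already existed). The key layout beyond the `cb:<run_id>:` prefix and the `owner` field name are assumptions, not the layout used by `TaskListClient`.

```python
import threading


class FakeHashStore:
    """In-memory stand-in for a Redis hash store (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def hsetnx(self, name, key, value):
        # Set the field only if it does not exist yet; the lock makes the
        # check-and-set atomic, as HSETNX is on a real Redis server.
        with self._lock:
            fields = self._data.setdefault(name, {})
            if key in fields:
                return 0
            fields[key] = value
            return 1

    def hget(self, name, key):
        with self._lock:
            return self._data.get(name, {}).get(key)


def claim_task(store, run_id, task_id, agent_id):
    """Try to claim a task; True only for the single caller that wins."""
    # Key layout after the cb:<run_id>: namespace is an assumption.
    task_key = f"cb:{run_id}:task:{task_id}"
    return store.hsetnx(task_key, "owner", agent_id) == 1
```

Because the check and the write happen in one atomic step, two agents claiming the same task at the same moment cannot both succeed; the loser sees `False` and can move on to another open task.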

…resh (#53)

Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state.
2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events.
3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty; this is the closest we can get to push-style delivery for opaque CLI agent loops.
4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries (the same hook as the existing inbox poll). The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant).
5. **Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command.

Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table; previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict.

Tests: 37 new unit tests
- 8 fs_mirror (atomic writes, stale cleanup, empty index)
- 9 protocol (request roundtrip, await, timeout, audit log)
- 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory)
- 8 loop_refresh (summary formatting, TeamPoller, env variant)
- 3 prompt (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task: the prompt fix is doing real work.

Stacks on #52.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
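
The "tempfile+replace" write strategy mentioned for the filesystem mirror in the commit message above can be sketched as follows. This is an illustration of the technique rather than the code in ``_team/fs_mirror.py``; the function names, the stale-file cleanup rule, and the ``_index.json`` contents (a sorted list of task ids) are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, payload) -> None:
    """Write JSON so that readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one
    # filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def mirror_tasks(root: Path, tasks: dict) -> None:
    """Snapshot tasks to <root>/<id>.json plus _index.json, dropping stale files."""
    root.mkdir(parents=True, exist_ok=True)
    for task_id, task in tasks.items():
        atomic_write_json(root / f"{task_id}.json", task)
    # Remove per-task files for tasks that are no longer in the snapshot.
    for existing in list(root.glob("*.json")):
        if existing.name != "_index.json" and existing.stem not in tasks:
            existing.unlink()
    # Index format (sorted list of ids) is an assumption for illustration.
    atomic_write_json(root / "_index.json", sorted(tasks))
```

Writing to a temporary file and renaming it into place means an agent running ``cat`` on a task file mid-refresh never reads a half-written JSON document.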

Resolves conflicts in src/cooperbench/agents/{claude_code,codex}/{adapter.py,setup.sh}. team-mode's HEAD was already a superset of main for these files: it carried the pre-squash version of #51 plus the team-mode additions on top. Took --ours for all four; ruff / format / mypy / pytest all clean (311 passed, 63 integration skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Six test modules (tests/agents/_team/test_{fs_mirror,loop_refresh,mcp_server,protocol,task_list}.py and tests/runner/test_team.py) import fakeredis to mock Redis for the team-mode task list and messaging. fakeredis wasn't pinned in dev extras, so CI's `uv pip install -e ".[dev]"` left it missing and pytest collection failed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lockfile regenerated by `uv sync --extra dev` after pyproject.toml gained fakeredis. CI uses `uv pip install -e ".[dev]"` so the prior commit was enough to unblock GH Actions, but `uv sync --frozen` would have failed on team-mode without this. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ubuntu and others added 2 commits May 16, 2026 20:42

ProKil mentioned this pull request May 17, 2026: team mode: fs mirror, typed protocol, MCP server, in-loop refresh #53 (Merged, 4 tasks)

ProKil mentioned this pull request May 17, 2026: team mode: wire team prompt + env into the three Python-loop adapters #55 (Open, 4 tasks)

Base automatically changed from codex-adapter to main May 18, 2026 00:16

ProKil requested a review from akhatua2 May 18, 2026 00:17

ProKil and others added 2 commits May 18, 2026 00:24

ProKil wants to merge 6 commits into `main` from `team-mode`
