team mode: fs mirror, typed protocol, MCP server, in-loop refresh by ProKil · Pull Request #53 · cooperbench/CooperBench

ProKil · 2026-05-17T00:54:08Z

Summary

Lands the four follow-ups called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by that PR's end-to-end run.

Filesystem mirror of task list (_team/fs_mirror.py) — snapshots Redis state to /workspace/shared/tasks/<id>.json + _index.json + _log.jsonl so agents can ls/cat tasks with their existing tools. Mirrors CC's team primitive. Atomic per-file writes via tempfile+replace.
Typed coop-request / coop-respond protocol (_team/protocol.py) — request/response with ID echo, mirroring CC's plan_approval_request shape. Sender's await_response uses BLPOP so it actually sleeps. Both events flow into the shared task-log.
MCP long-poll server (_team/mcp_server.py) — stdio JSON-RPC server exposing wait_for_message, backed by BLPOP on the agent's inbox. Registered in ~/.claude.json (Claude Code) and ~/.codex/config.toml (Codex). Closest we can get to push delivery for opaque CLI agent loops.
In-loop task-list auto-refresh (_team/loop_refresh.py) — TeamPoller injects a compact [Team task list] open: 1, in_progress: 2 ... summary as a user message before every mini_swe_agent_v2.step() (same hook as the existing inbox poll). The LLM sees current state without remembering to call coop-task-list.
Prompt fix — the previous team-mode e2e had members writing diffs to scratchpad only, scoring 0/2. Both lead and member prompts now include an explicit ### Final submission — REQUIRED section calling out patch.txt as the only file the bench evaluates.

Cosmetic fix to runner/core._print_single_result so team-mode per-agent dicts (which carry patch_lines: int) render correctly — previously the column showed 0.

Stacks on #52.

Tests

37 new unit tests, all hermetic (fakeredis):

Module	Coverage
`test_fs_mirror.py` (8)	atomic writes, stale cleanup, empty-index case, deep mkdir
`test_protocol.py` (9)	request roundtrip, BLPOP await, timeout, audit log
`test_mcp_server.py` (9)	initialize, tools/list, tools/call, blocking, unknown-tool, env factory
`test_loop_refresh.py` (8)	summary formatting, TeamPoller, env-driven `poll_team_state`
`test_prompt.py` (+3)	regression: lead+member prompts demand `patch.txt`

Full suite: 311 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end

Same task as the team-mode PR (dottxt_ai_outlines_task/1371 [1,2]) with Claude Code in team+git mode:

Run result: both Submitted in 5m44s, $1.66 total. Patches: agent1=249 lines, agent2=131 lines.
Coordination metrics: 6 tasks (2 pre-seed + 4 lead-split), 4 done, time_to_first_claim=29.9s, claims_per_agent={agent1: 2, agent2: 1}, updates_per_agent={agent1: 5, agent2: 2}.
Visible follow-up activity:

/workspace/shared/tasks/ populated with 5 per-task JSON files + _index.json + _log.jsonl
/workspace/shared/agent2.patch written by agent2 (4 KB)
MCP server registered in .claude.json
Task pre-seeds use the new owner field

Eval: 2/2 features pass (14/14 + 20/20 tests).

Compare to the team-mode-only PR (#52) on the same task: 0/2 then, 2/2 now — the prompt fix and finer-grained coordination tools together close the gap.

Files

Path	Purpose
`src/cooperbench/agents/_team/fs_mirror.py`	filesystem snapshot helper (host)
`src/cooperbench/agents/_team/protocol.py`	typed request/response client (host)
`src/cooperbench/agents/_team/mcp_server.py`	stdio MCP server + `wait_for_message` tool
`src/cooperbench/agents/_team/loop_refresh.py`	`TeamPoller` for Python-loop adapters + env-driven variant
`src/cooperbench/agents/_team/coop_task.py`	gained `cmd_request` / `cmd_respond` / `cmd_pending` + auto-mirror call
`src/cooperbench/agents/_team/install_snippet.sh`	adds `coop-request` / `coop-respond` / `coop-pending` wrappers
`src/cooperbench/agents/_team/prompt.py`	tightened with explicit `patch.txt` submission section
`src/cooperbench/agents/_team/runtime.py`	exports `CONTAINER_TASKS_MIRROR_DIR`; sets `CB_TEAM_TASKS_DIR` env var
`src/cooperbench/agents/claude_code/adapter.py`	drops MCP script + writes `.claude.json` for team mode
`src/cooperbench/agents/codex/adapter.py`	drops MCP script + writes `config.toml` for team mode
`src/cooperbench/agents/mini_swe_agent_v2/adapter.py`	constructs `TeamPoller` when team kwargs present
`src/cooperbench/agents/mini_swe_agent_v2/agents/default.py`	`step()` calls `team_poller.poll()`
`src/cooperbench/runner/core.py`	display fix for team-mode `patch_lines`

Test plan

ruff check, ruff format --check, mypy, pytest tests/ (all green locally)
uv run cooperbench run -a claude_code -m claude-sonnet-4-5 -r <repo> -t <task> -f <f1>,<f2> --setting team --git --backend docker
uv run cooperbench eval -n <name> --backend docker
Inspect /workspace/shared/tasks/_index.json, result.json:metrics, .claude.json in container

🤖 Generated with Claude Code

…resh Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run. 1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state. 2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events. 3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty — the closest we can get to push-style delivery for opaque CLI agent loops. 4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries — same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant). 5. **Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command. Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table — previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict. Tests: 37 new unit tests - 8 fs_mirror (atomic writes, stale cleanup, empty index) - 9 protocol (request roundtrip, await, timeout, audit log) - 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory) - 8 loop_refresh (summary formatting, TeamPoller, env variant) - 3 prompt (regression: lead+member prompts demand patch.txt) Full suite: **311 passed**, 63 skipped. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task — the prompt fix is doing real work. Stacks on #52. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ProKil merged commit 379ee8f into team-mode May 17, 2026

ProKil deleted the team-mode-followups branch May 17, 2026 06:06

ProKil mentioned this pull request May 17, 2026

runner: add team mode (lead + members + shared task list + scratchpad) #52

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

team mode: fs mirror, typed protocol, MCP server, in-loop refresh#53

team mode: fs mirror, typed protocol, MCP server, in-loop refresh#53
ProKil merged 1 commit into
team-modefrom
team-mode-followups

ProKil commented May 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ProKil commented May 17, 2026

Summary

Tests

End-to-end

Files

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant