team mode: fs mirror, typed protocol, MCP server, in-loop refresh#53
Merged
Conversation
…resh Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run. 1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state. 2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events. 3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty — the closest we can get to push-style delivery for opaque CLI agent loops. 4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries — same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant). 5. **Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command. Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table — previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict. Tests: 37 new unit tests - 8 fs_mirror (atomic writes, stale cleanup, empty index) - 9 protocol (request roundtrip, await, timeout, audit log) - 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory) - 8 loop_refresh (summary formatting, TeamPoller, env variant) - 3 prompt (regression: lead+member prompts demand patch.txt) Full suite: **311 passed**, 63 skipped. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task — the prompt fix is doing real work. Stacks on #52. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Lands the four follow-ups called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by that PR's end-to-end run.
_team/fs_mirror.py) — snapshots Redis state to/workspace/shared/tasks/<id>.json+_index.json+_log.jsonlso agents canls/cattasks with their existing tools. Mirrors CC's team primitive. Atomic per-file writes via tempfile+replace.coop-request/coop-respondprotocol (_team/protocol.py) — request/response with ID echo, mirroring CC'splan_approval_requestshape. Sender'sawait_responseuses BLPOP so it actually sleeps. Both events flow into the shared task-log._team/mcp_server.py) — stdio JSON-RPC server exposingwait_for_message, backed by BLPOP on the agent's inbox. Registered in~/.claude.json(Claude Code) and~/.codex/config.toml(Codex). Closest we can get to push delivery for opaque CLI agent loops._team/loop_refresh.py) —TeamPollerinjects a compact[Team task list] open: 1, in_progress: 2 ...summary as a user message before everymini_swe_agent_v2.step()(same hook as the existing inbox poll). The LLM sees current state without remembering to callcoop-task-list.### Final submission — REQUIREDsection calling outpatch.txtas the only file the bench evaluates.Cosmetic fix to
runner/core._print_single_resultso team-mode per-agent dicts (which carrypatch_lines: int) render correctly — previously the column showed 0.Stacks on #52.
Tests
37 new unit tests, all hermetic (fakeredis):
test_fs_mirror.py(8)test_protocol.py(9)test_mcp_server.py(9)test_loop_refresh.py(8)poll_team_statetest_prompt.py(+3)patch.txtFull suite: 311 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end
Same task as the team-mode PR (
dottxt_ai_outlines_task/1371 [1,2]) with Claude Code in team+git mode:Run result: both Submitted in 5m44s, $1.66 total. Patches: agent1=249 lines, agent2=131 lines.
Coordination metrics: 6 tasks (2 pre-seed + 4 lead-split), 4 done,
time_to_first_claim=29.9s,claims_per_agent={agent1: 2, agent2: 1},updates_per_agent={agent1: 5, agent2: 2}.Visible follow-up activity:
/workspace/shared/tasks/populated with 5 per-task JSON files +_index.json+_log.jsonl/workspace/shared/agent2.patchwritten by agent2 (4 KB).claude.jsonEval: 2/2 features pass (14/14 + 20/20 tests).
Compare to the team-mode-only PR (#52) on the same task: 0/2 then, 2/2 now — the prompt fix and finer-grained coordination tools together close the gap.
Files
src/cooperbench/agents/_team/fs_mirror.pysrc/cooperbench/agents/_team/protocol.pysrc/cooperbench/agents/_team/mcp_server.pywait_for_messagetoolsrc/cooperbench/agents/_team/loop_refresh.pyTeamPollerfor Python-loop adapters + env-driven variantsrc/cooperbench/agents/_team/coop_task.pycmd_request/cmd_respond/cmd_pending+ auto-mirror callsrc/cooperbench/agents/_team/install_snippet.shcoop-request/coop-respond/coop-pendingwrapperssrc/cooperbench/agents/_team/prompt.pypatch.txtsubmission sectionsrc/cooperbench/agents/_team/runtime.pyCONTAINER_TASKS_MIRROR_DIR; setsCB_TEAM_TASKS_DIRenv varsrc/cooperbench/agents/claude_code/adapter.py.claude.jsonfor team modesrc/cooperbench/agents/codex/adapter.pyconfig.tomlfor team modesrc/cooperbench/agents/mini_swe_agent_v2/adapter.pyTeamPollerwhen team kwargs presentsrc/cooperbench/agents/mini_swe_agent_v2/agents/default.pystep()callsteam_poller.poll()src/cooperbench/runner/core.pypatch_linesTest plan
ruff check,ruff format --check,mypy,pytest tests/(all green locally)uv run cooperbench run -a claude_code -m claude-sonnet-4-5 -r <repo> -t <task> -f <f1>,<f2> --setting team --git --backend dockeruv run cooperbench eval -n <name> --backend docker/workspace/shared/tasks/_index.json,result.json:metrics,.claude.jsonin container🤖 Generated with Claude Code