Skip to content

team mode: fs mirror, typed protocol, MCP server, in-loop refresh#53

Merged
ProKil merged 1 commit into
team-modefrom
team-mode-followups
May 17, 2026
Merged

team mode: fs mirror, typed protocol, MCP server, in-loop refresh#53
ProKil merged 1 commit into
team-modefrom
team-mode-followups

Conversation

@ProKil
Copy link
Copy Markdown
Member

@ProKil ProKil commented May 17, 2026

Summary

Lands the four follow-ups called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by that PR's end-to-end run.

  1. Filesystem mirror of task list (_team/fs_mirror.py) — snapshots Redis state to /workspace/shared/tasks/<id>.json + _index.json + _log.jsonl so agents can ls/cat tasks with their existing tools. Mirrors CC's team primitive. Atomic per-file writes via tempfile+replace.
  2. Typed coop-request / coop-respond protocol (_team/protocol.py) — request/response with ID echo, mirroring CC's plan_approval_request shape. Sender's await_response uses BLPOP so it actually sleeps. Both events flow into the shared task-log.
  3. MCP long-poll server (_team/mcp_server.py) — stdio JSON-RPC server exposing wait_for_message, backed by BLPOP on the agent's inbox. Registered in ~/.claude.json (Claude Code) and ~/.codex/config.toml (Codex). Closest we can get to push delivery for opaque CLI agent loops.
  4. In-loop task-list auto-refresh (_team/loop_refresh.py) — TeamPoller injects a compact [Team task list] open: 1, in_progress: 2 ... summary as a user message before every mini_swe_agent_v2.step() (same hook as the existing inbox poll). The LLM sees current state without remembering to call coop-task-list.
  5. Prompt fix — the previous team-mode e2e had members writing diffs to scratchpad only, scoring 0/2. Both lead and member prompts now include an explicit ### Final submission — REQUIRED section calling out patch.txt as the only file the bench evaluates.

Cosmetic fix to runner/core._print_single_result so team-mode per-agent dicts (which carry patch_lines: int) render correctly — previously the column showed 0.

Stacks on #52.

Tests

37 new unit tests, all hermetic (fakeredis):

Module Coverage
test_fs_mirror.py (8) atomic writes, stale cleanup, empty-index case, deep mkdir
test_protocol.py (9) request roundtrip, BLPOP await, timeout, audit log
test_mcp_server.py (9) initialize, tools/list, tools/call, blocking, unknown-tool, env factory
test_loop_refresh.py (8) summary formatting, TeamPoller, env-driven poll_team_state
test_prompt.py (+3) regression: lead+member prompts demand patch.txt

Full suite: 311 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end

Same task as the team-mode PR (dottxt_ai_outlines_task/1371 [1,2]) with Claude Code in team+git mode:

Run result: both Submitted in 5m44s, $1.66 total. Patches: agent1=249 lines, agent2=131 lines.
Coordination metrics: 6 tasks (2 pre-seed + 4 lead-split), 4 done, time_to_first_claim=29.9s, claims_per_agent={agent1: 2, agent2: 1}, updates_per_agent={agent1: 5, agent2: 2}.
Visible follow-up activity:

  • /workspace/shared/tasks/ populated with 5 per-task JSON files + _index.json + _log.jsonl
  • /workspace/shared/agent2.patch written by agent2 (4 KB)
  • MCP server registered in .claude.json
  • Task pre-seeds use the new owner field

Eval: 2/2 features pass (14/14 + 20/20 tests).

Compare to the team-mode-only PR (#52) on the same task: 0/2 then, 2/2 now — the prompt fix and finer-grained coordination tools together close the gap.

Files

Path Purpose
src/cooperbench/agents/_team/fs_mirror.py filesystem snapshot helper (host)
src/cooperbench/agents/_team/protocol.py typed request/response client (host)
src/cooperbench/agents/_team/mcp_server.py stdio MCP server + wait_for_message tool
src/cooperbench/agents/_team/loop_refresh.py TeamPoller for Python-loop adapters + env-driven variant
src/cooperbench/agents/_team/coop_task.py gained cmd_request / cmd_respond / cmd_pending + auto-mirror call
src/cooperbench/agents/_team/install_snippet.sh adds coop-request / coop-respond / coop-pending wrappers
src/cooperbench/agents/_team/prompt.py tightened with explicit patch.txt submission section
src/cooperbench/agents/_team/runtime.py exports CONTAINER_TASKS_MIRROR_DIR; sets CB_TEAM_TASKS_DIR env var
src/cooperbench/agents/claude_code/adapter.py drops MCP script + writes .claude.json for team mode
src/cooperbench/agents/codex/adapter.py drops MCP script + writes config.toml for team mode
src/cooperbench/agents/mini_swe_agent_v2/adapter.py constructs TeamPoller when team kwargs present
src/cooperbench/agents/mini_swe_agent_v2/agents/default.py step() calls team_poller.poll()
src/cooperbench/runner/core.py display fix for team-mode patch_lines

Test plan

  • ruff check, ruff format --check, mypy, pytest tests/ (all green locally)
  • uv run cooperbench run -a claude_code -m claude-sonnet-4-5 -r <repo> -t <task> -f <f1>,<f2> --setting team --git --backend docker
  • uv run cooperbench eval -n <name> --backend docker
  • Inspect /workspace/shared/tasks/_index.json, result.json:metrics, .claude.json in container

🤖 Generated with Claude Code

…resh

Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
   Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
   so agents can ``ls`` and ``cat`` tasks with their existing tools
   rather than going through the ``coop-task-list`` CLI.  Layout
   mirrors Claude Code's team primitive: one ``<id>.json`` per task,
   plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
   trail).  Triggered on every ``coop-task-list`` invocation and from
   the host runner at startup.  Files written via tempfile+replace so
   readers never observe a partial state.

2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
   Layered on plain Redis messaging, mirroring CC's
   ``plan_approval_request`` / ``plan_approval_response`` shape.
   ``coop-request <peer> <kind> <body>`` returns a request_id (and
   optionally blocks via ``--wait N`` for a response).
   ``coop-respond <request_id> <body>`` writes back; the sender's
   ``await_response`` uses BLPOP so it actually sleeps instead of
   busy-polling.  Both events flow into the shared task-log so
   coordination metrics include protocol events.

3. **MCP long-poll server** (``_team/mcp_server.py``).  Stdio
   JSON-RPC server that exposes a single ``wait_for_message`` tool
   backed by BLPOP on the agent's inbox.  Registered automatically:
   Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
   the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
   The point is to make "watch the inbox" a natural idle behavior for
   the CLI adapters instead of a busy-loop on ``coop-recv`` returning
   empty — the closest we can get to push-style delivery for opaque
   CLI agent loops.

4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
   ``TeamPoller`` is a per-agent host-side helper that
   ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
   queries — same hook as the existing inbox poll.  The LLM sees a
   compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
   prepended to every turn so it doesn't need to remember to call
   ``coop-task-list``.  Plumbed via ``agent.team_poller`` so the
   ``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
   The same module also exports ``poll_team_state()`` for in-container
   use (env-driven variant).

5. **Prompt fix**: the previous team-mode end-to-end had members
   writing diffs to ``/workspace/shared/<id>.patch`` only and never to
   ``/workspace/repo/patch.txt``, scoring 0/2 despite great
   coordination.  Both lead and member prompts now have an explicit
   ``### Final submission — REQUIRED`` section that calls out
   ``patch.txt`` as the only file the bench evaluates and provides
   the exact ``git diff > patch.txt`` command.

Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.

Tests: 37 new unit tests
  -  8 fs_mirror     (atomic writes, stale cleanup, empty index)
  -  9 protocol      (request roundtrip, await, timeout, audit log)
  -  9 mcp_server    (initialize, tools/list, tools/call,
                      timeout, blocking, unknown-tool error,
                      env factory)
  -  8 loop_refresh  (summary formatting, TeamPoller, env variant)
  -  3 prompt        (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).  All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``.  Previous run scored 0/2 on the
same task — the prompt fix is doing real work.

Stacks on #52.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ProKil ProKil merged commit 379ee8f into team-mode May 17, 2026
@ProKil ProKil deleted the team-mode-followups branch May 17, 2026 06:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant