runner: add team mode (lead + members + shared task list + scratchpad)#52

Open
ProKil wants to merge 6 commits into main from team-mode

@ProKil ProKil commented May 16, 2026

Summary

Adds a third setting (team) alongside solo and coop, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox, team mode gives them a coordinator role, a shared task list with atomic claim, a scratchpad volume, a typed request/response protocol, and (for CLI adapters) an MCP long-poll tool.

Stacks on #51. Includes the now-merged follow-ups from #53.

What's in team mode

| Feature | What it is | Where it lives |
| --- | --- | --- |
| Shared task list | Redis-backed (HSETNX-style atomic claim), owner-only updates, full audit log, namespaced per run | _team/task_list.py (host) + _team/coop_task.py (in-container CLI) |
| coop-task-* shell wrappers | create / claim / update / list exposed at /usr/local/bin/ | _team/install_snippet.sh |
| Lead + member roles | First agent gets a "you are the team-lead, organize via the task list before coding" prompt block; others get a "claim open tasks, report progress" block | _team/prompt.py (used by build_team_instruction) |
| Shared scratchpad | Docker volume mounted at /workspace/shared/ in every container | _team/runtime.py:scratchpad_mount_args |
| Filesystem mirror of task list | coop-task-list (and the host runner) snapshots Redis to /workspace/shared/tasks/<id>.json + _index.json + _log.jsonl; ls-able artefacts like CC's on-disk teams | _team/fs_mirror.py + inlined copy in _team/coop_task.py |
| Typed coop-request / coop-respond protocol | Request/response with ID echo, mirroring CC's plan_approval_request/response shape. await_response uses BLPOP. Events flow into the shared task-log. | _team/protocol.py + cmd_request/cmd_respond/cmd_pending in coop_task.py |
| MCP long-poll server | Stdio JSON-RPC server exposing wait_for_message(timeout_ms), backed by BLPOP on the agent's inbox. Closest we can get to push delivery for opaque CLI loops. | _team/mcp_server.py; registered into Claude/Codex configs by their adapters |
| In-loop task auto-refresh | TeamPoller injects a compact [Team task list] open: 1, in_progress: 2 … user message before every Python-loop step | _team/loop_refresh.py + hook in mini_swe_agent_v2/agents/default.py:step() |
| Coordination metrics | time_to_first_claim_seconds, claims_per_agent, updates_per_agent, tasks_done, unowned_at_end computed from the audit log and saved in result.json | _team/metrics.py + runner/team.py |
| team/ log layout | Per-agent agent{fid}.patch + agent{fid}_traj.json + conversation.json + task_log.json + tasks.json + result.json | runner/team.py |
| --setting team CLI | Discovered by cooperbench eval -n … alongside solo/ and coop/ | cli.py, runner/core.py, eval/runs.py |
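
The HSETNX-style atomic claim listed above can be illustrated with a small sketch. This is not the code in _team/task_list.py: the key layout, the `owner` field, and the `claim_task` / `race` helpers are assumptions made for illustration, and `InMemoryHash` is a stand-in for any client exposing `hsetnx(name, key, value)` the way redis-py does.

```python
from __future__ import annotations

import threading


class InMemoryHash:
    """Stand-in for a Redis client; only hsetnx and hget are modelled."""

    def __init__(self) -> None:
        self._data: dict[str, dict[str, str]] = {}
        self._lock = threading.Lock()

    def hsetnx(self, name: str, key: str, value: str) -> int:
        # Like Redis HSETNX: set the field only if it does not exist yet.
        with self._lock:
            fields = self._data.setdefault(name, {})
            if key in fields:
                return 0
            fields[key] = value
            return 1

    def hget(self, name: str, key: str) -> str | None:
        with self._lock:
            return self._data.get(name, {}).get(key)


def claim_task(client, run_id: str, task_id: str, agent_id: str) -> bool:
    """Return True only for the single caller that wins the claim."""
    key = f"cb:{run_id}:task:{task_id}"  # hypothetical key layout
    return client.hsetnx(key, "owner", agent_id) == 1


def race(client, run_id: str, task_id: str, agents: list[str]) -> list[str]:
    """Let several agents race for one task and return whoever won."""
    winners: list[str] = []

    def attempt(agent: str) -> None:
        if claim_task(client, run_id, task_id, agent):
            winners.append(agent)

    threads = [threading.Thread(target=attempt, args=(a,)) for a in agents]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return winners
```

Because the existence check and the write happen in one step, a losing agent learns immediately that the task is taken and can move on to the next open one.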

Adapter compatibility

Team mode is fully wired only for claude_code and codex. The others accept the kwargs (so they don't crash) but coverage drops off:

| Adapter | Accepts team kwargs | Team-mode prompt | In-loop auto-refresh | coop-task-* in container | MCP wait_for_message |
| --- | --- | --- | --- | --- | --- |
| claude_code | ✓ | ✓ via build_team_instruction | n/a (opaque loop) | ✓ installed by setup.sh | ✓ registered in .claude.json |
| codex | ✓ | ✓ | n/a | ✓ | ✓ registered in config.toml |
| mini_swe_agent_v2 | ✓ | ✗ (uses its own yaml templates that don't know about team mode) | ✓ via TeamPoller injected into DefaultAgent.step() | ✗ (runs in container so the CLI would work if installed, but its setup.sh isn't team-aware) | n/a |
| swe_agent | ✓ then del | ✗ | ✗ | ✗ | n/a |
| openhands_sdk | ✓ then del | ✗ | ✗ | ✗ | n/a |

A future PR will close the gaps for the three Python-loop adapters: swap their templated prompt assembly for build_team_instruction calls, wire their setup steps to install coop-task-* in the agent container, and add the auto-refresh hook to swe_agent and openhands_sdk (mirroring what's already done for mini_swe_agent_v2).

Tests

83 new unit tests in tests/agents/_team/ and tests/runner/test_team.py, all hermetic (use fakeredis).

| Module | Coverage |
| --- | --- |
| test_task_list.py (18) | CRUD, atomic claim under contention, owner-only update, ValueError on invalid status, audit log records winning claims only, namespace isolation |
| test_prompt.py (15) | lead vs member branches, solo-fallback, single-agent degenerate, git interaction, regression: both prompts demand patch.txt |
| test_runtime.py (3) | env var assembly, scratchpad mount args, tasks-mirror dir |
| test_metrics.py (4) | happy path, unowned-at-end flagging, empty log, multiple claims aggregation |
| test_fs_mirror.py (8) | atomic writes, stale cleanup, empty-index case, deep mkdir |
| test_protocol.py (9) | request roundtrip, BLPOP await, timeout, audit log |
| test_mcp_server.py (9) | initialize, tools/list, tools/call, blocking, unknown-tool, env factory |
| test_loop_refresh.py (8) | summary formatting, TeamPoller, env-driven variant |
| runner/test_team.py (5) | lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team |
| misc + bumps | 4 |

Full suite: 311 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code --setting team --git

  • 5m44s, $1.66 total, both agents Submitted
  • Patches: agent1=249 lines, agent2=131 lines (cosmetic fix to _print_single_result so this displays correctly)
  • 6 tasks created (2 by runner pre-seed, 4 by lead splitting its work), 4 reached done
  • time_to_first_claim_seconds = 29.9
  • claims_per_agent = {agent1: 2, agent2: 1}
  • updates_per_agent = {agent1: 5, agent2: 2}
  • /workspace/shared/tasks/ populated with per-task JSON + _index.json + _log.jsonl
  • /workspace/shared/agent2.patch written
  • MCP server cb-mcp-server.py registered in .claude.json
  • Eval: 2/2 features pass (14/14 + 20/20 tests)

The first team-mode end-to-end run (before the prompt fix in #53) scored 0/2 on the same task because members wrote to the scratchpad only. The ### Final submission — REQUIRED section in both lead and member prompts closes that loop.

Out of scope (real follow-ups for separate PRs)

  • Full team-mode support for the three Python-loop adapters (matrix above). Each has its own prompt-assembly story and setup-time story; doing all three properly is a bigger diff than I want to bundle here.
  • Filesystem-watch ("inotify-style") auto-refresh for the task mirror — currently snapshot-on-coop-task-list, which is good enough but not real push.
  • Adversarial test harnesses — inject deliberately conflicting feature pairs so we can score coordination under stress.

Test plan

  • ruff check, ruff format --check, mypy, pytest tests/ (all green locally)
  • uv run cooperbench run -a claude_code -m claude-sonnet-4-5 -r <repo> -t <task> -f <f1>,<f2> --setting team --git --backend docker
  • uv run cooperbench eval -n <name> --backend docker
  • Inspect result.json:metrics, task_log.json, /workspace/shared/tasks/_index.json inside the team volume

🤖 Generated with Claude Code

Ubuntu and others added 2 commits May 16, 2026 20:42
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter.  Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:

  - Invokes ``codex exec --json --sandbox danger-full-access
    --skip-git-repo-check --model <id>``.
  - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
    inside the container so the CLI authenticates without prompts.
  - Parses Codex's JSONL event stream for status / token totals /
    messages.  Cost is reported as 0.0 because Codex does not emit a
    cost field; tokens are summed across ``turn.completed`` events.
  - Model fallback: if Codex rejects ``--model gpt-5.5`` with a
    "model not found" shaped error, the adapter retries once without
    ``--model`` and lets Codex pick its default.
  - Preflight credential check: if OPENAI_API_KEY is unset the adapter
    returns Error immediately instead of spinning up a container that
    can only fail.
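
As a rough illustration of the token accounting described above, the sketch below sums usage across ``turn.completed`` events in a JSONL stream. The field names (``type``, ``usage``, ``input_tokens``, ``output_tokens``) and the helper name ``sum_turn_tokens`` are assumptions about the stream's shape, not a copy of the adapter's parser.

```python
import json


def sum_turn_tokens(jsonl_text: str) -> dict[str, int]:
    """Sum token usage over every ``turn.completed`` event in a JSONL stream."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise interleaved with events
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}  # assumed location of the counters
        for key in totals:
            totals[key] += int(usage.get(key, 0) or 0)
    return totals
```

Skipping unparseable lines rather than raising keeps one stray log line from discarding the totals for an otherwise successful run.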

Shared ``_coop`` module:

  - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
    installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
    ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
  - ``install_snippet.sh`` — pip-installs redis and drops the shell
    wrappers; each adapter's setup.sh sources it.
  - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
    agnostic.
  - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
    ``write_file_in_container`` / ``read_file_from_container``,
    ``rewrite_comm_url_for_container``, ``build_git_setup_command``,
    ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires.  Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace).  This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.

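A minimal sketch of the normalization described in the bug fix above, assuming only what the message states (one trailing newline, no leading whitespace). The function name matches ``normalize_patch()``, but the body is a guess at the behavior, not the shipped implementation.

```python
def normalize_patch(patch: str) -> str:
    """Drop leading whitespace and end the patch with exactly one newline."""
    body = patch.lstrip()
    if not body.strip():
        return ""  # nothing but whitespace: treat as an empty patch
    # Strip only newlines at the end: a final context line that is a single
    # space is meaningful to ``git apply`` and must survive.
    return body.rstrip("\n") + "\n"
```

Unlike a plain ``.strip()``, this keeps the terminating newline that ``git apply`` expects on the last line of the diff.
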
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module.  Full suite: 228
passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:

  - codex solo f1:           Submitted, 1 turn, 365k input tokens,
                             184-line patch (with the trailing-newline
                             fix it applies cleanly)
  - codex coop+git f1,f2:    both Submitted, both patches applied but
                             0/2 tests pass — coordination failure
                             (agent1 fetched ``team`` but never merged,
                             so the stacked patches produce a Python
                             SyntaxError at line 144 of the modified
                             file).  Claude on the same task scored
                             2/2; Codex used the tools less aggressively
                             on this run.

The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug.  A future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product.  Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:

  1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
     backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
     atomic claim semantics (HSETNX-style — exactly one caller wins on a
     race) and an audit log of every mutation.  Exposed in the container
     as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
     / ``coop-task-list`` shell wrappers.

  2. A **lead / member role split**.  The first agent is designated
     ``team-lead`` and gets a system-prompt block instructing them to
     break the spec into tasks, assign them via ``coop-task-create
     --assign``, watch progress, and integrate.  Other agents are
     ``member`` and look for open tasks to claim.

  3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
     mounted at ``/workspace/shared`` in every container.  A free-form
     coordination artifact for design notes, partial diffs, and
     interface sketches.
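
The scratchpad mount from item 3 could be wired into a ``docker run`` invocation roughly as follows. The volume name and mount point come from the text above; the function signatures and the ``docker_run_command`` helper are hypothetical and only show how a named volume is passed with ``-v``.

```python
def scratchpad_mount_args(run_id: str, mount_point: str = "/workspace/shared") -> list[str]:
    """Build the ``-v`` arguments that attach the per-run scratchpad volume."""
    volume = f"cb-team-{run_id}"
    return ["-v", f"{volume}:{mount_point}"]


def docker_run_command(image: str, run_id: str, command: list[str]) -> list[str]:
    """Assemble a ``docker run`` argv with the shared scratchpad attached."""
    return ["docker", "run", "--rm", *scratchpad_mount_args(run_id), image, *command]
```

Every container started with the same ``run_id`` sees the same files, since Docker resolves the name to a single named volume.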

Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``.  Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
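
The post-run metrics could be derived from the audit log along these lines. The entry shape (``ts``, ``op``, ``agent``, ``task_id``, optional ``status`` and ``assignee``) and the choice to report ``unowned_at_end`` as a list of task ids are assumptions for this sketch; only the metric names come from the text above.

```python
from __future__ import annotations

from collections import Counter


def compute_metrics(log: list[dict], run_start: float) -> dict:
    """Replay an audit log in time order and summarize coordination."""
    claims: Counter = Counter()
    updates: Counter = Counter()
    status: dict[str, str] = {}
    owner: dict[str, str | None] = {}
    first_claim = None

    for entry in sorted(log, key=lambda e: e["ts"]):
        task_id, op = entry["task_id"], entry["op"]
        if op == "create":
            status[task_id] = "open"
            owner[task_id] = entry.get("assignee")  # may be pre-assigned
        elif op == "claim":
            claims[entry["agent"]] += 1
            owner[task_id] = entry["agent"]
            if first_claim is None:
                first_claim = entry["ts"]
        elif op == "update":
            updates[entry["agent"]] += 1
            if "status" in entry:
                status[task_id] = entry["status"]

    return {
        "time_to_first_claim_seconds": (
            None if first_claim is None else round(first_claim - run_start, 1)
        ),
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sorted(t for t, o in owner.items() if o is None),
    }
```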

Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs.  The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``.  The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
  - 18 task_list (CRUD, atomic claim, owner-only update, audit log,
    run isolation)
  - 12 prompt (lead vs member branches, solo fallback, git interaction)
  -  3 runtime (env assembly, scratchpad mount args)
  -  4 metrics (happy path, unowned-at-end, empty log, multiple claims)
  -  5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
    metrics in result, three-agent team)
  -  4 misc

Full suite: 274 passed, 63 skipped.  Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:

  - 5 tasks created (2 by bench-runner, 3 by the lead splitting its
    work), all reached ``done``
  - time_to_first_claim_seconds=34.2
  - claims_per_agent={agent1: 2, agent2: 1}
  - updates_per_agent={agent1: 4, agent2: 3}
  - scratchpad volume actively used (agent2 wrote its diff to
    /workspace/shared/agent2.patch + a summary.md)
  - **0/1 pass rate** — both ``patch.txt`` files were empty: the
    members wrote diffs to the scratchpad instead of also writing
    ``/workspace/repo/patch.txt``, and the lead never ran the final
    integration step.  This is real coordination signal (the prompt
    told them to write both places but they followed the scratchpad
    half only) — a follow-up will tighten the prompt to make patch.txt
    submission the explicit final step.

Future PRs (intentionally out of scope here so this lands at a
reviewable size):

  - In-loop auto-refresh for the Python-loop adapters
  - MCP long-poll tool to give CLI adapters push-ish inbox semantics
  - Typed ``coop-request`` / ``coop-respond`` protocol on top of
    messaging (CC's plan_approval_request shape)
  - Filesystem mirror of the task list (CC-style ``ls`` artefacts)

Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…resh (#53)

Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
   Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
   so agents can ``ls`` and ``cat`` tasks with their existing tools
   rather than going through the ``coop-task-list`` CLI.  Layout
   mirrors Claude Code's team primitive: one ``<id>.json`` per task,
   plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
   trail).  Triggered on every ``coop-task-list`` invocation and from
   the host runner at startup.  Files written via tempfile+replace so
   readers never observe a partial state.

2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
   Layered on plain Redis messaging, mirroring CC's
   ``plan_approval_request`` / ``plan_approval_response`` shape.
   ``coop-request <peer> <kind> <body>`` returns a request_id (and
   optionally blocks via ``--wait N`` for a response).
   ``coop-respond <request_id> <body>`` writes back; the sender's
   ``await_response`` uses BLPOP so it actually sleeps instead of
   busy-polling.  Both events flow into the shared task-log so
   coordination metrics include protocol events.

3. **MCP long-poll server** (``_team/mcp_server.py``).  Stdio
   JSON-RPC server that exposes a single ``wait_for_message`` tool
   backed by BLPOP on the agent's inbox.  Registered automatically:
   Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
   the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
   The point is to make "watch the inbox" a natural idle behavior for
   the CLI adapters instead of a busy-loop on ``coop-recv`` returning
   empty — the closest we can get to push-style delivery for opaque
   CLI agent loops.

4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
   ``TeamPoller`` is a per-agent host-side helper that
   ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
   queries — same hook as the existing inbox poll.  The LLM sees a
   compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
   prepended to every turn so it doesn't need to remember to call
   ``coop-task-list``.  Plumbed via ``agent.team_poller`` so the
   ``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
   The same module also exports ``poll_team_state()`` for in-container
   use (env-driven variant).

5. **Prompt fix**: the previous team-mode end-to-end had members
   writing diffs to ``/workspace/shared/<id>.patch`` only and never to
   ``/workspace/repo/patch.txt``, scoring 0/2 despite great
   coordination.  Both lead and member prompts now have an explicit
   ``### Final submission — REQUIRED`` section that calls out
   ``patch.txt`` as the only file the bench evaluates and provides
   the exact ``git diff > patch.txt`` command.
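
The tempfile+replace write from item 1 can be sketched as below. The helper names and the content of ``_index.json`` (a sorted list of task ids) are assumptions; the technique itself is the standard pattern of writing to a temporary file in the same directory and then calling ``os.replace``.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload) -> None:
    """Write JSON so that readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one
    # filesystem, where the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=f".{path.name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def mirror_tasks(tasks: dict, root: Path) -> None:
    """Snapshot tasks to <id>.json files plus an index, removing stale files."""
    for task_id, task in tasks.items():
        write_json_atomic(root / f"{task_id}.json", task)
    write_json_atomic(root / "_index.json", sorted(tasks))
    for existing in root.glob("*.json"):
        if existing.name != "_index.json" and existing.stem not in tasks:
            existing.unlink()  # task no longer exists in the source of truth
```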

Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.
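
The display fix amounts to preferring the precomputed count when it exists. A sketch under the assumption that per-agent results are plain dicts; ``patch_line_count`` is a hypothetical helper name, not the function in ``runner/core``.

```python
def patch_line_count(result: dict) -> int:
    """Prefer a stored ``patch_lines`` count, else count lines in ``patch``."""
    if "patch_lines" in result:
        return int(result["patch_lines"])
    return len((result.get("patch") or "").splitlines())
```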

Tests: 37 new unit tests
  -  8 fs_mirror     (atomic writes, stale cleanup, empty index)
  -  9 protocol      (request roundtrip, await, timeout, audit log)
  -  9 mcp_server    (initialize, tools/list, tools/call,
                      timeout, blocking, unknown-tool error,
                      env factory)
  -  8 loop_refresh  (summary formatting, TeamPoller, env variant)
  -  3 prompt        (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).  All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``.  Previous run scored 0/2 on the
same task — the prompt fix is doing real work.

Stacks on #52.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Base automatically changed from codex-adapter to main May 18, 2026 00:16
@ProKil ProKil requested a review from akhatua2 May 18, 2026 00:17
Resolves conflicts in src/cooperbench/agents/{claude_code,codex}/{adapter.py,setup.sh}.
team-mode's HEAD was already a superset of main for these files: it
carried the pre-squash version of #51 plus the team-mode additions
on top. Took --ours for all four; ruff / format / mypy / pytest all
clean (311 passed, 63 integration skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil pushed a commit that referenced this pull request May 18, 2026
The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.

GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file.  Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil and others added 2 commits May 18, 2026 00:24
Six test modules (tests/agents/_team/test_{fs_mirror,loop_refresh,
mcp_server,protocol,task_list}.py and tests/runner/test_team.py) import
fakeredis to mock Redis for the team-mode task list and messaging.
fakeredis wasn't pinned in dev extras, so CI's `uv pip install -e ".[dev]"`
left it missing and pytest collection failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lockfile regenerated by `uv sync --extra dev` after pyproject.toml
gained fakeredis. CI uses `uv pip install -e ".[dev]"` so the prior
commit was enough to unblock GH Actions, but `uv sync --frozen`
would have failed on team-mode without this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>