agents/codex: add Codex adapter; lift shared coop bits into _coop by ProKil · Pull Request #51 · cooperbench/CooperBench

ProKil · 2026-05-16T20:41:13Z

Summary

New codex agent adapter wrapping the official @openai/codex CLI; supports solo, coop (Redis messaging), and coop+git (shared team remote).
Lifted the agent-agnostic coop primitives (Redis messaging helper, prompt blocks, git remote setup, container env helpers) into cooperbench.agents._coop so the Claude Code adapter and the new Codex adapter share them.
Bug fix surfaced by Codex's solo run: .strip() on patch.txt was eating the trailing newline that git apply requires. Replaced with a normalize_patch() helper used by both adapters.
24 new unit tests (Codex parser + adapter); full suite: 228 passed, 63 skipped.
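
For reviewers who want to see the trailing-newline fix concretely, here is a minimal sketch of what such a helper might look like. The name `normalize_patch_sketch` is hypothetical and the body is an assumption based only on the description "one trailing newline, no leading whitespace"; the real `normalize_patch()` in `runtime.py` may differ.

```python
def normalize_patch_sketch(text: str) -> str:
    """Illustrative only: drop leading whitespace, end with exactly one newline.

    Assumption: an all-whitespace patch is treated as empty and returned as "".
    """
    # Leading blank lines or spaces before "diff --git" carry no meaning.
    body = text.lstrip()
    # Strip only trailing newlines, not all whitespace: a unified diff may end
    # with a context line that is a single space, and that line must survive.
    body = body.rstrip("\n")
    if not body:
        return ""
    # git apply expects the final line of the patch to be newline-terminated.
    return body + "\n"
```

Unlike `.strip()`, this keeps the final newline and any trailing single-space context line intact.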

Stacks on #50.

How Codex differs from Claude

CLI: codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id> -- "<prompt>"
Auth: writes ${CODEX_HOME}/auth.json with {"OPENAI_API_KEY": "..."} inside the container; only OPENAI_API_KEY is supported (no OAuth fallback today).
Cost: Codex doesn't emit a total_cost_usd field — AgentResult.cost is reported as 0.0 and the caller can compute from input_tokens / output_tokens / cache_read_tokens.
Steps: Codex emits one turn.completed event per session (not per LLM call), so steps=1 on a successful run is normal. We do not currently re-derive a per-call step count from item.completed events.
Model fallback: if Codex rejects --model gpt-5.5 with a "model not found" shaped error, the adapter retries once without --model so Codex falls back to its default.
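
The CLI invocation and the retry-once model fallback described above can be sketched roughly as follows. The flags are the ones listed in this PR; the function names, the injected `run` callable, and the substring check for a "model not found" shaped error are all hypothetical and stand in for whatever the adapter actually does.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical runner type: takes argv, returns (exit_code, combined_output).
Runner = Callable[[List[str]], Tuple[int, str]]


def build_codex_argv(prompt: str, model: Optional[str]) -> List[str]:
    """Build the codex exec command line, omitting --model when model is None."""
    argv = [
        "codex", "exec", "--json",
        "--sandbox", "danger-full-access",
        "--skip-git-repo-check",
    ]
    if model:
        argv += ["--model", model]
    # "--" separates options from the prompt, as in the PR description.
    return argv + ["--", prompt]


def looks_like_model_not_found(output: str) -> bool:
    """Assumed heuristic; the adapter's real matching rule is not shown here."""
    lowered = output.lower()
    return "model" in lowered and "not found" in lowered


def run_with_model_fallback(run: Runner, prompt: str, model: Optional[str]) -> Tuple[int, str]:
    """Run once with --model; on a model-not-found failure, retry once without it."""
    exit_code, output = run(build_codex_argv(prompt, model))
    if exit_code != 0 and model and looks_like_model_not_found(output):
        exit_code, output = run(build_codex_argv(prompt, None))
    return exit_code, output
```

Injecting the runner keeps the retry logic testable without a container or a real Codex install.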

Shared `_coop` module

coop_msg.py — same Redis-backed messaging helper as before.
install_snippet.sh — pip install redis + creates the coop-* shell wrappers; sourced by each adapter's setup.sh.
prompt.py — build_instruction(task, agents, agent_id, git_enabled) produces the same prompt for any CLI agent.
runtime.py — ContainerEnv protocol, build_environment, container file IO, URL rewriting, git setup command, normalize_patch().
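
To illustrate the idea of agent-agnostic prompt assembly, here is a small sketch that mirrors the `build_instruction(task, agents, agent_id, git_enabled)` signature. The prompt wording and the rule that git instructions only appear when there are peers are assumptions made for illustration, not the contents of `prompt.py`.

```python
from typing import List, Sequence


def build_instruction_sketch(task: str, agents: Sequence[str], agent_id: str, git_enabled: bool) -> str:
    """Illustrative only: assemble solo / coop / coop+git prompt blocks."""
    blocks: List[str] = [task.strip()]
    peers = [name for name in agents if name != agent_id]
    if peers:
        # Coop block: only added when at least one other agent exists.
        blocks.append(
            f"You are {agent_id}, working alongside: {', '.join(peers)}. "
            "Coordinate with coop-send and coop-recv."
        )
        if git_enabled:
            # Coop+git block: hypothetical wording for the shared `team` remote.
            blocks.append(
                "A shared git remote named team is configured; "
                "fetch and merge it before you submit."
            )
    return "\n\n".join(blocks)
```

Because the function only deals in strings, either CLI adapter could pass the result straight to its agent as the prompt.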

End-to-end runs

| Run | Pass rate | Notes |
| --- | --- | --- |
| codex solo `-f 1` | n/a (eval needs pair) | Submitted, 1 turn, 365k input / 5.6k output tokens, 184-line patch |
| codex coop `-f 1,2 --git` | 0 / 2 | Coordination failure: agent1 fetched `team` but never ran `git merge`; the stacked patches produce `SyntaxError: unmatched ')'` at line 144 of the modified file. agent2 merged once. Claude on the same task scored 2/2; Codex used the git+coop tools less aggressively. |

The 0/2 is the kind of coordination failure CooperBench is designed to surface; the plumbing (messaging, git remote, fetch/merge tools) all worked. Iteration on the prompt or hard-enforcement of the merge step would be follow-up PRs.

Test plan

ruff check, ruff format --check, mypy, pytest tests/ (all green locally — 228 passed, 63 integration skipped).
uv run cooperbench run -a codex -m gpt-5.5 -r <repo> -t <task> -f <f1>,<f2> --setting coop --git --backend docker (requires OPENAI_API_KEY).
uv run cooperbench eval -n <name> --backend docker.

🤖 Generated with Claude Code

Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the bits that are agent-agnostic (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so the two adapters (and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail.

Shared ``_coop`` module:
- ``coop_msg.py``: Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh``: pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it.
- ``prompt.py``: solo / coop / coop+git prompt assembly, agent-agnostic.
- ``runtime.py``: ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit Codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't.

Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass. This is a coordination failure: agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file. Claude on the same task scored 2/2; Codex used the tools less aggressively on this run.

The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. Future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
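
The token accounting described in the commit message (summing usage across ``turn.completed`` events in the JSONL stream, with one step per event) might look roughly like this. The function name is hypothetical, and the ``usage`` payload layout, including the ``cached_input_tokens`` field name, is an assumption about Codex's event format rather than something this PR shows.

```python
import json
from typing import Dict, Iterable


def sum_turn_usage(jsonl_lines: Iterable[str]) -> Dict[str, int]:
    """Sum token usage across turn.completed events in a JSONL event stream."""
    totals = {"steps": 0, "input_tokens": 0, "output_tokens": 0, "cache_read_tokens": 0}
    for raw in jsonl_lines:
        line = raw.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise interleaved with events
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        # Assumed payload shape: {"type": "turn.completed", "usage": {...}}.
        usage = event.get("usage") or {}
        totals["steps"] += 1
        totals["input_tokens"] += int(usage.get("input_tokens", 0))
        totals["output_tokens"] += int(usage.get("output_tokens", 0))
        # Assumed field name for cached prompt tokens in the event payload.
        totals["cache_read_tokens"] += int(usage.get("cached_input_tokens", 0))
    return totals
```

Since no cost field is emitted, a caller would compute a dollar figure from these totals using its own price table.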

akhatua2

This looks great. Thanks for adding codex support!

Is there a way to add a --wait flag for comm, i.e., pause my session until I hear back from the other agent?

In Codex/Claude, could it perhaps be a prompt instruction?

akhatua2 · 2026-05-16T22:41:44Z

+    return {}
+
+
+def _strip_provider_prefix(model_name: str) -> str:


Do we really need this function? Seems like a simple utils cleanup.

ProKil · 2026-05-17

Not really, but how can we make all evals use the same API?

ProKil · 2026-05-18T00:15:15Z

@akhatua2 thanks — reasonable ask, and I want to push back gently on the "prompt instruction" framing: a polling loop in prompt-land is brittle for weaker models and burns step budget (one poll = one step, and LimitsExceeded is real — see #54). The right fix is a real blocking primitive: coop-recv --wait --timeout 60 backed by Redis BLPOP, composable with existing send via &&.

Filed #56 with the full design — 60s default timeout, FIFO-from-anyone semantics (no correlation IDs yet), peer-shutdown sentinel so waiters wake when someone submits, plus the adapter-bash-timeout interaction we need to check before merging.

Keeping it out of this PR to keep the codex-adapter scope tight; we can land #56 as a follow-up that touches _coop/coop_msg.py and the per-adapter prompts.
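
As a rough illustration of the blocking primitive proposed above, the sketch below wraps Redis BLPOP behind a small function. The function name, the inbox key format, and the decision to return `None` on timeout are hypothetical; the actual design lives in #56 and may differ (for example, it also plans a peer-shutdown sentinel, which is omitted here). The `client` argument is expected to behave like a redis-py client, whose `blpop` returns a `(key, value)` tuple or `None` when the timeout expires.

```python
from typing import Any, Optional


def recv_wait(client: Any, inbox_key: str, timeout_s: int = 60) -> Optional[str]:
    """Block until a message arrives on inbox_key or timeout_s elapses.

    Returns the message text, or None if nothing arrived in time.
    """
    # BLPOP pops from the head of the list, so messages come out in FIFO
    # order when senders append with RPUSH (an assumption about the sender).
    item = client.blpop(inbox_key, timeout=timeout_s)
    if item is None:
        return None
    _key, payload = item
    # redis-py returns bytes unless decode_responses=True was set.
    return payload.decode("utf-8") if isinstance(payload, bytes) else payload
```

One blocking call replaces a polling loop, so a wait costs the agent a single tool invocation regardless of how long the peer takes to answer.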

Resolves conflicts in src/cooperbench/agents/{claude_code,codex}/{adapter.py,setup.sh}. team-mode's HEAD was already a superset of main for these files: it carried the pre-squash version of #51 plus the team-mode additions on top. Took --ours for all four; ruff / format / mypy / pytest all clean (311 passed, 63 integration skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ProKil requested a review from akhatua2 May 16, 2026 20:41

ProKil force-pushed the codex-adapter branch from 4f0410e to 694883a Compare May 16, 2026 20:42

ProKil mentioned this pull request May 16, 2026

runner: add team mode (lead + members + shared task list + scratchpad) #52

Open

4 tasks

akhatua2 approved these changes May 16, 2026

ProKil mentioned this pull request May 18, 2026

comm: add coop-recv --wait for blocking inbox reads #56

Open

ProKil merged commit 93594db into main May 18, 2026
3 checks passed

ProKil deleted the codex-adapter branch May 18, 2026 00:16

ProKil mentioned this pull request May 18, 2026

team mode: wire team prompt + env into the three Python-loop adapters #55

Open

4 tasks

ProKil merged 1 commit into main from codex-adapter
