agents/codex: add Codex adapter; lift shared coop bits into _coop #51
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter. Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.
Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access
--skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals /
messages. Cost is reported as 0.0 because Codex does not emit a
cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a
"model not found" shaped error, the adapter retries once without
``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter
returns Error immediately instead of spinning up a container that
can only fail.
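
The token accounting described above can be sketched roughly as follows. This is a minimal illustration, not the adapter's actual parser: the ``usage`` object and its ``input_tokens`` / ``output_tokens`` / ``cached_input_tokens`` keys are assumptions about the event payload, and ``Usage`` and ``sum_turn_usage`` are hypothetical names.

```python
import json
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical container for per-run totals."""
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    steps: int = 0
    cost: float = 0.0  # stays 0.0: the stream carries no cost field


def sum_turn_usage(jsonl_text: str) -> Usage:
    """Sum token usage across ``turn.completed`` events in a JSONL stream."""
    totals = Usage()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise interleaved in stdout
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        # Assumed payload shape: {"type": "turn.completed", "usage": {...}}
        usage = event.get("usage") or {}
        totals.input_tokens += int(usage.get("input_tokens", 0))
        totals.output_tokens += int(usage.get("output_tokens", 0))
        totals.cache_read_tokens += int(usage.get("cached_input_tokens", 0))
        totals.steps += 1
    return totals
```

Skipping unparseable lines rather than raising keeps one stray log line from discarding the totals for an otherwise successful run.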
Shared ``_coop`` module:
- ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh`` — pip-installs redis and drops the shell
wrappers; each adapter's setup.sh sources it.
- ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
agnostic.
- ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
``write_file_in_container`` / ``read_file_from_container``,
``rewrite_comm_url_for_container``, ``build_git_setup_command``,
``parse_sent_messages_log``, and ``normalize_patch``.
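
To make the "one inbox per agent" model concrete, here is a toy in-memory stand-in for the messaging semantics. It is not ``coop_msg.py`` and does not talk to Redis. The class name, method names, and message shape are hypothetical, and the sketch assumes that receiving drains the whole inbox while peeking leaves it intact, which the description above does not confirm.

```python
from collections import defaultdict, deque


class InMemoryInboxes:
    """Toy stand-in for a Redis-backed store: one FIFO inbox per agent."""

    def __init__(self, agents):
        self.agents = list(agents)
        self._inboxes = defaultdict(deque)

    def send(self, sender, recipient, text):
        """Append a message to the recipient's inbox."""
        if recipient not in self.agents:
            raise ValueError(f"unknown agent: {recipient}")
        self._inboxes[recipient].append({"from": sender, "text": text})

    def broadcast(self, sender, text):
        """Deliver the same message to every agent except the sender."""
        for agent in self.agents:
            if agent != sender:
                self.send(sender, agent, text)

    def peek(self, agent):
        """Return pending messages without consuming them."""
        return list(self._inboxes[agent])

    def recv(self, agent):
        """Return pending messages in arrival order and empty the inbox."""
        inbox = self._inboxes[agent]
        messages = list(inbox)
        inbox.clear()
        return messages
```

Keeping a separate queue per recipient means one agent's reads never consume another agent's messages.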
Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires. Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace). This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module. Full suite: 228
passed, 63 skipped.
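
For reference, the trailing-newline fix described above can be sketched as below. This is an illustration of the stated contract (one trailing newline, no leading whitespace), not the body of the real ``normalize_patch()``. Returning an empty string for a whitespace-only patch, and stripping only trailing newlines rather than all trailing whitespace, are assumptions of this sketch.

```python
def normalize_patch(patch: str) -> str:
    """Return the patch with no leading whitespace and one trailing newline."""
    body = patch.lstrip()
    if not body.strip():
        return ""  # assumption: an empty diff stays empty
    # Strip only newlines at the end. A unified diff encodes a blank context
    # line as a single space, so a bare .rstrip() could delete hunk content.
    return body.rstrip("\n") + "\n"
```

The function is idempotent, so applying it to an already normalized patch returns the same text.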
End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens,
184-line patch (with the trailing-newline
fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but
0/2 tests pass — coordination failure
(agent1 fetched ``team`` but never merged,
so the stacked patches produce a Python
SyntaxError at line 144 of the modified
file). Claude on the same task scored
2/2; Codex used the tools less aggressively
on this run.
The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug. Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
akhatua2 left a comment
This looks great. Thanks for adding codex support!
Is there a way to add the --wait flag for comm? i.e., pause my session until I hear back from the other agent?
In codex/claude it could be a prompt instruction maybe?
        return {}


    def _strip_provider_prefix(model_name: str) -> str:
Do we really need this function? Seems like a simple utils cleanup.
Not really, but how can we make all evals use the same api?
@akhatua2 thanks, reasonable ask, and I want to push back gently on the "prompt instruction" framing: a polling loop in prompt-land is brittle for weaker models and burns step budget (one poll = one step, and
Filed #56 with the full design: 60s default timeout, FIFO-from-anyone semantics (no correlation IDs yet), a peer-shutdown sentinel so waiters wake when someone submits, plus the adapter-bash-timeout interaction we need to check before merging. Keeping it out of this PR to keep the codex-adapter scope tight; we can land #56 as a follow-up that touches
Resolves conflicts in src/cooperbench/agents/{claude_code,codex}/{adapter.py,setup.sh}.
team-mode's HEAD was already a superset of main for these files: it
carried the pre-squash version of #51 plus the team-mode additions
on top. Took --ours for all four; ruff / format / mypy / pytest all
clean (311 passed, 63 integration skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Adds a `codex` agent adapter wrapping the official `@openai/codex` CLI; supports solo, coop (Redis messaging), and coop+git (shared `team` remote).
- Moves the agent-agnostic pieces into `cooperbench.agents._coop` so the Claude Code adapter and the new Codex adapter share them.
- Bug fix: `.strip()` on `patch.txt` was eating the trailing newline that `git apply` requires. Replaced with a `normalize_patch()` helper used by both adapters.

Stacks on #50.
How Codex differs from Claude
- Invokes `codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id> -- "<prompt>"`.
- Writes `${CODEX_HOME}/auth.json` with `{"OPENAI_API_KEY": "..."}` inside the container; only `OPENAI_API_KEY` is supported (no OAuth fallback today).
- Codex emits no `total_cost_usd` field, so `AgentResult.cost` is reported as `0.0` and the caller can compute cost from `input_tokens` / `output_tokens` / `cache_read_tokens`.
- Codex emits one `turn.completed` event per session (not per LLM call), so `steps=1` on a successful run is normal. We do not currently re-derive a per-call step count from `item.completed` events.
- Model fallback: if Codex rejects `--model gpt-5.5` with a "model not found" shaped error, the adapter retries once without `--model` so Codex falls back to its default.

Shared `_coop` module
- `coop_msg.py`: same Redis-backed messaging helper as before.
- `install_snippet.sh`: `pip install redis` plus creation of the `coop-*` shell wrappers; sourced by each adapter's `setup.sh`.
- `prompt.py`: `build_instruction(task, agents, agent_id, git_enabled)` produces the same prompt for any CLI agent.
- `runtime.py`: `ContainerEnv` protocol, `build_environment`, container file IO, URL rewriting, git setup command, `normalize_patch()`.

End-to-end runs
- `-f 1` (solo): Submitted, 1 turn, 365k input tokens, 184-line patch; applies cleanly with the trailing-newline fix.
- `-f 1,2 --git` (coop+git): both agents Submitted and both patches applied, but 0/2 tests pass. agent1 fetched `team` but never `git merge`d; stacked patches produce `SyntaxError: unmatched ')'` at line 144 of the modified file. agent2 merged once. Claude on the same task scored 2/2; Codex used the git+coop tools less aggressively.

The 0/2 is the kind of coordination failure CooperBench is designed to surface; the plumbing (messaging, git remote, fetch/merge tools) all worked. Iteration on the prompt or hard enforcement of the merge step would be follow-up PRs.
Test plan
- `ruff check`, `ruff format --check`, `mypy`, `pytest tests/` (all green locally: 228 passed, 63 integration skipped).
- `uv run cooperbench run -a codex -m gpt-5.5 -r <repo> -t <task> -f <f1>,<f2> --setting coop --git --backend docker` (requires `OPENAI_API_KEY`).
- `uv run cooperbench eval -n <name> --backend docker`.

🤖 Generated with Claude Code