agents/codex: add Codex adapter; lift shared coop bits into _coop #51

Merged: ProKil merged 1 commit into main from codex-adapter on May 18, 2026

@ProKil ProKil commented May 16, 2026

Summary

  • New codex agent adapter wrapping the official @openai/codex CLI; supports solo, coop (Redis messaging), and coop+git (shared team remote).
  • Lifted the agent-agnostic coop primitives (Redis messaging helper, prompt blocks, git remote setup, container env helpers) into cooperbench.agents._coop so the Claude Code adapter and the new Codex adapter share them.
  • Bug fix surfaced by Codex's solo run: .strip() on patch.txt was eating the trailing newline that git apply requires. Replaced with a normalize_patch() helper used by both adapters.
  • 24 new unit tests (Codex parser + adapter); full suite: 228 passed, 63 skipped.
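
To make the trailing-newline fix concrete, here is a minimal sketch of what such a helper could look like. It is not the code from this PR: the name `normalize_patch_sketch` is hypothetical, and returning an empty string for an empty patch is an assumption. Only newlines are trimmed from the end, because a unified diff can legitimately end with a context line that is a single space.

```python
def normalize_patch_sketch(patch: str) -> str:
    """Return the patch with no leading whitespace and exactly one trailing newline.

    Hypothetical stand-in for normalize_patch(); behavior on empty input is assumed.
    """
    # Drop leading whitespace, then drop only trailing newlines (not spaces),
    # so a final " " context line survives.
    body = patch.lstrip().rstrip("\n")
    if not body:
        return ""
    # git apply expects the last line of the patch to be newline-terminated.
    return body + "\n"


raw = "\n  diff --git a/x.py b/x.py\n+print('hi')\n\n\n"
cleaned = normalize_patch_sketch(raw)
stripped = raw.strip()  # the old behavior: loses the final newline
```

With the old `.strip()` call, `stripped` ends without a newline, which is the shape that triggered the "corrupt patch" error described above.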

Stacks on #50.

How Codex differs from Claude

  • CLI: codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id> -- "<prompt>"
  • Auth: writes ${CODEX_HOME}/auth.json with {"OPENAI_API_KEY": "..."} inside the container; only OPENAI_API_KEY is supported (no OAuth fallback today).
  • Cost: Codex doesn't emit a total_cost_usd field — AgentResult.cost is reported as 0.0 and the caller can compute from input_tokens / output_tokens / cache_read_tokens.
  • Steps: Codex emits one turn.completed event per session (not per LLM call), so steps=1 on a successful run is normal. We do not currently re-derive a per-call step count from item.completed events.
  • Model fallback: if Codex rejects --model gpt-5.5 with a "model not found" shaped error, the adapter retries once without --model so Codex falls back to its default.
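
For illustration, the invocation and the single retry described in the list above could be sketched as follows. The flags are the ones listed in this PR; the helper names, the `run` callable, and the substring check for a "model not found" shaped error are assumptions made for the sketch, not the adapter's actual code.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical runner: takes an argv list, returns (exit_code, combined_output).
Runner = Callable[[List[str]], Tuple[int, str]]


def build_codex_command(prompt: str, model: Optional[str]) -> List[str]:
    """Assemble the `codex exec` argv from the flags listed in this PR."""
    cmd = [
        "codex", "exec", "--json",
        "--sandbox", "danger-full-access",
        "--skip-git-repo-check",
    ]
    if model is not None:
        cmd += ["--model", model]
    # `--` separates the options from the prompt argument.
    return cmd + ["--", prompt]


def looks_like_model_not_found(output: str) -> bool:
    """Heuristic only; the real error text emitted by Codex may differ."""
    text = output.lower()
    return "model" in text and "not found" in text


def run_with_model_fallback(
    run: Runner, prompt: str, model: Optional[str]
) -> Tuple[int, str]:
    """Run once with --model; on a model-not-found error, retry once without it."""
    code, output = run(build_codex_command(prompt, model))
    if code != 0 and model is not None and looks_like_model_not_found(output):
        code, output = run(build_codex_command(prompt, None))
    return code, output
```

Passing the runner in as a callable keeps the retry logic testable without a container; in practice it would wrap whatever executes commands inside the task's Docker container.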

Shared _coop module

  • coop_msg.py: same Redis-backed messaging helper as before.
  • install_snippet.sh: runs pip install redis and creates the coop-* shell wrappers; sourced by each adapter's setup.sh.
  • prompt.py: build_instruction(task, agents, agent_id, git_enabled) produces the same prompt for any CLI agent.
  • runtime.py: ContainerEnv protocol, build_environment, container file IO, URL rewriting, git setup command, normalize_patch().

End-to-end runs

| Run | Pass rate | Notes |
| --- | --- | --- |
| codex solo -f 1 | n/a (eval needs pair) | Submitted, 1 turn, 365k input / 5.6k output tokens, 184-line patch |
| codex coop -f 1,2 --git | 0 / 2 | Coordination failure: agent1 fetched team but never git merged; stacked patches produce SyntaxError: unmatched ')' at line 144 of the modified file. agent2 merged once. Claude on the same task scored 2/2; Codex used the git+coop tools less aggressively. |

The 0/2 is the kind of coordination failure CooperBench is designed to surface; the plumbing (messaging, git remote, fetch/merge tools) all worked. Iteration on the prompt or hard-enforcement of the merge step would be follow-up PRs.

Test plan

  • ruff check, ruff format --check, mypy, pytest tests/ (all green locally — 228 passed, 63 integration skipped).
  • uv run cooperbench run -a codex -m gpt-5.5 -r <repo> -t <task> -f <f1>,<f2> --setting coop --git --backend docker (requires OPENAI_API_KEY).
  • uv run cooperbench eval -n <name> --backend docker.

🤖 Generated with Claude Code

@ProKil ProKil requested a review from akhatua2 May 16, 2026 20:41
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter.  Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:

  - Invokes ``codex exec --json --sandbox danger-full-access
    --skip-git-repo-check --model <id>``.
  - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
    inside the container so the CLI authenticates without prompts.
  - Parses Codex's JSONL event stream for status / token totals /
    messages.  Cost is reported as 0.0 because Codex does not emit a
    cost field; tokens are summed across ``turn.completed`` events.
  - Model fallback: if Codex rejects ``--model gpt-5.5`` with a
    "model not found" shaped error, the adapter retries once without
    ``--model`` and lets Codex pick its default.
  - Preflight credential check: if OPENAI_API_KEY is unset the adapter
    returns Error immediately instead of spinning up a container that
    can only fail.
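
As a rough sketch of the event-stream parsing described in the highlights above: the snippet below sums token counts across ``turn.completed`` events and counts those events as steps. The usage key names (``input_tokens``, ``cached_input_tokens``, ``output_tokens``) and the function name are assumptions for illustration; the adapter's real parser may read different fields.

```python
import json
from typing import Dict


def summarize_codex_events(jsonl_text: str) -> Dict[str, int]:
    """Sum token usage over turn.completed events in a JSONL stream.

    Field names under "usage" are assumed, not confirmed by this PR.
    """
    totals = {"input_tokens": 0, "cached_input_tokens": 0, "output_tokens": 0}
    steps = 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            # Tolerate stray non-JSON lines (e.g. log noise) in the stream.
            continue
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        steps += 1
        usage = event.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key, 0) or 0)
    return {"steps": steps, **totals}


sample = "\n".join([
    '{"type": "thread.started"}',
    "not json",
    '{"type": "turn.completed", "usage": {"input_tokens": 100, "cached_input_tokens": 40, "output_tokens": 7}}',
    '{"type": "turn.completed", "usage": {"input_tokens": 5, "output_tokens": 3}}',
])
summary = summarize_codex_events(sample)
```

Because cost is reported as 0.0, a caller that needs a dollar figure would multiply these totals by its own per-token prices.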

Shared ``_coop`` module:

  - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
    installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
    ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
  - ``install_snippet.sh`` — pip-installs redis and drops the shell
    wrappers; each adapter's setup.sh sources it.
  - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
    agnostic.
  - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
    ``write_file_in_container`` / ``read_file_from_container``,
    ``rewrite_comm_url_for_container``, ``build_git_setup_command``,
    ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires.  Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace).  This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.

Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module.  Full suite: 228
passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:

  - codex solo f1:           Submitted, 1 turn, 365k input tokens,
                             184-line patch (with the trailing-newline
                             fix it applies cleanly)
  - codex coop+git f1,f2:    both Submitted, both patches applied but
                             0/2 tests pass — coordination failure
                             (agent1 fetched ``team`` but never merged,
                             so the stacked patches produce a Python
                             SyntaxError at line 144 of the modified
                             file).  Claude on the same task scored
                             2/2; Codex used the tools less aggressively
                             on this run.

The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug.  Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@akhatua2 akhatua2 left a comment

This looks great. Thanks for adding codex support!

Is there a way to add a --wait flag for comm, i.e. pause my session until I hear back from the other agent?

In codex/claude it could be a prompt instruction maybe?

return {}


def _strip_provider_prefix(model_name: str) -> str:
Collaborator

Do we really need this function? Seems like a simple utils cleanup.

Member Author

Not really, but how can we make all evals use the same api?

ProKil commented May 18, 2026

@akhatua2 thanks — reasonable ask, and I want to push back gently on the "prompt instruction" framing: a polling loop in prompt-land is brittle for weaker models and burns step budget (one poll = one step, and LimitsExceeded is real — see #54). The right fix is a real blocking primitive: coop-recv --wait --timeout 60 backed by Redis BLPOP, composable with existing send via &&.

Filed #56 with the full design — 60s default timeout, FIFO-from-anyone semantics (no correlation IDs yet), peer-shutdown sentinel so waiters wake when someone submits, plus the adapter-bash-timeout interaction we need to check before merging.

Keeping it out of this PR to keep the codex-adapter scope tight; we can land #56 as a follow-up that touches _coop/coop_msg.py and the per-adapter prompts.
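
To illustrate the shape of the blocking primitive (not the design filed in #56), a wait-with-timeout receive on top of BLPOP could look roughly like this. The inbox key format and the function name are hypothetical; `client` is assumed to be a redis-py style client whose `blpop(keys, timeout)` returns a `(key, value)` pair, or `None` on timeout.

```python
from typing import Any, Optional


def recv_wait(client: Any, agent_id: str, timeout: int = 60) -> Optional[str]:
    """Block until a message arrives in this agent's inbox, or the timeout expires.

    Returns the message text, or None if nothing arrived in time.
    """
    # Hypothetical key layout; the real coop_msg.py may name inboxes differently.
    key = f"coop:inbox:{agent_id}"
    item = client.blpop([key], timeout=timeout)
    if item is None:
        return None
    _key, raw = item
    # redis-py returns bytes unless decode_responses=True was set on the client.
    return raw.decode("utf-8") if isinstance(raw, bytes) else str(raw)


class FakeClient:
    """In-memory stand-in for a Redis client, used only to exercise the sketch."""

    def __init__(self, items):
        self.items = list(items)

    def blpop(self, keys, timeout=0):
        return self.items.pop(0) if self.items else None


got = recv_wait(FakeClient([(b"coop:inbox:agent1", b"hello")]), "agent1")
timed_out = recv_wait(FakeClient([]), "agent1", timeout=1)
```

A single blocking call like this costs one step regardless of how long the peer takes, which is the advantage over polling from the prompt.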

@ProKil ProKil merged commit 93594db into main May 18, 2026
3 checks passed
@ProKil ProKil deleted the codex-adapter branch May 18, 2026 00:16
ProKil pushed a commit that referenced this pull request May 18, 2026
Resolves conflicts in src/cooperbench/agents/{claude_code,codex}/{adapter.py,setup.sh}.
team-mode's HEAD was already a superset of main for these files: it
carried the pre-squash version of #51 plus the team-mode additions
on top. Took --ours for all four; ruff / format / mypy / pytest all
clean (311 passed, 63 integration skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>