feat(adversarial-review): make self-collect path multi-turn so it can actually work#328
feat(adversarial-review): make self-collect path multi-turn so it can actually work#328kingdoooo wants to merge 22 commits into
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review findings on the queue-driven harness: - close() now splices its own binDir out of PATH so handles can be closed in any order - Collapse redundant saveState calls in the queue-driven branch - Document the requests getter's per-access I/O cost Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review finding: the soft-error test only checked truthiness of result.error. The fixture queues a deterministic message, so match against it explicitly to catch regressions where the error gets swallowed or replaced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gation Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rtion The argument-hint assertion in commands.test.mjs hard-coded a contiguous match of "[--scope ...] [focus ...]". Splitting that into separate matches lets the new flag sit between the two without breaking the test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two transport-error returns in runAppServerInvestigation were emitting
{ code: 1, kind: "error" } as `status`, while the happy path and
runAppServerTurn always return a number. Downstream payload.codex.status
inherited this object on error paths but a number elsewhere. Normalizing
the error returns to `status: 1` makes the public shape consistent and
removes the need for the defensive typeof check at the executeReviewRun
seam.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a unit test that runs both recon turns to convergence, then queues a transport error on the finalize turn. Verifies that the investigation metadata (turnCount=2, truncated=false) is preserved even when finalize fails, and that the numeric status contract holds. Also tightens the existing phase-1 hard-error assertion with the status check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the truncation banner to tell the user how to raise the cap. The banner only fires when investigation hit the turn cap, which is exactly when the flag is actionable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6992596928
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const parsedTurns = Number.parseInt(rawMaxTurns, 10); | ||
| if (!Number.isFinite(parsedTurns) || parsedTurns <= 0) { |
There was a problem hiding this comment.
Reject non-integer max investigation turn values
The new --max-investigation-turns validation uses Number.parseInt, which silently accepts malformed values like 1.5 (becomes 1) or 10abc (becomes 10) even though the flag contract says this must be a positive integer. This can hide user typos and run significantly fewer or different investigation turns than requested, degrading review depth without any error signal. Please validate the full token (for example with a strict integer regex or Number(...) plus an integer check) before accepting it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in d222542. The flag no longer uses `parseInt`; it validates the full token with a strict regex `/^[1-9][0-9]*$/` and throws on malformed input. Tests cover `1.5`, `10abc`, `1e2`, leading whitespace, `0`, negatives, and empty string — all rejected. (The same strict validation is applied to the new `--turn-idle-timeout` flag.)
Number.parseInt silently salvages "1.5" as 1, "10abc" as 10, " 5" as 5, and "1e2" via Number() as 100 — all of which violate the flag contract of "positive integer literal" and would silently change the depth of investigation without any error signal. Replace the parseInt-based check with a strict regex /^[1-9][0-9]*$/ so typos throw instead of degrading review depth. Reported by bot review on PR openai#328. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… message The legacy convergence check required `phase: "final_answer"` on the agent message, but recon turns run with outputSchema=null and the app-server does not always tag messages with that phase. Production runs hit the cap with the model emitting "My investigation is complete" on every turn — visible evidence the loop was ignoring the contract the investigate prompt teaches the model. Treat any 0-command turn that emits an agent message as convergence; this matches the prompt and unblocks the runaway-loop scenario. Test fixture: make finalAnswer.phase overridable so we can simulate the real-world case where recon agent messages lack the final_answer tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production observation: even with a clean recon convergence, the model
sometimes emits a tool-call stub (e.g. {"cmd": "wc -l README.md"})
during the finalize turn instead of the structured review JSON. The
companion's parser then surfaces "Validation error: Missing string
verdict" with the raw stub as the final message — a confusing failure
for the user when the investigation produced real findings.
Fix in two layers:
1. Tighten the finalize prompt: explicitly forbid running commands or
emitting anything other than the schema-conforming JSON.
2. Detect the contract violation in the orchestrator: if the finalize
turn's commandExecutions is non-empty, retry once with a STRICT
FINALIZE preamble that reiterates the no-tool-call rule. After two
attempts, surface whatever the model produced so the parser can
still flag malformed output instead of looping forever.
Test fixes:
- Adapt `respects maxInvestigationTurns` so the 4th queued entry is the
pure-JSON finalize turn (previous version queued 4 cmd-only entries
before finalize, accidentally relying on the old leniency).
- Add coverage for the new retry path: success-on-second-attempt and
give-up-after-two-attempts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a7a90c7d73
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| for (const cmd of finalizeState.commandExecutions) { | ||
| aggregatedCommandExecutions.push(cmd); | ||
| } |
There was a problem hiding this comment.
Prevent duplicate finalize command aggregation
When both finalize attempts run commands (the retry-exhausted path), the second attempt is appended once inside the retry loop and then appended again unconditionally after the loop, so commandExecutions/fileChanges contain duplicates of the last finalize turn. This skews downstream telemetry/audit consumers (e.g., command counts and changed-file traces) exactly in the contract-violation scenario this code is trying to report, so the final append should avoid re-adding entries already recorded in-loop.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in d2157c2. The in-loop aggregation now lives inside the `finalizeAttempts < MAX_FINALIZE_ATTEMPTS` retry branch, so only superseded attempts are folded in there; the final attempt is aggregated exactly once after the loop. Previously the last attempt was appended both in-loop and post-loop, duplicating its command/file traces on the retry-exhausted path. The retry-exhausted test now asserts `commandExecutions` equals `["wc -l a", "wc -l b"]` — each attempt recorded once.
Production observation: a 2-file untracked review of two large markdown
docs (~100KB total) hit the inline-diff path, embedded both files into
a single-turn schema-pinned prompt, and the model responded with a
tool-call stub `{"cmd": "wc -l ..."}` instead of the review JSON. The
companion's parser then surfaced "Validation error: Missing string
verdict" — the bug PR openai#328 thought it had fixed, except inline-diff is
a separate code path from self-collect and was never touched.
Two layers:
1. Lower the inline cap from 2 files to 1. The inline path embeds full
file contents into a single turn with outputSchema pinned and has no
recovery if the model wants to investigate before producing a verdict.
Anything larger than 1 file falls through to the two-phase
self-collect path which tolerates exploratory recon turns.
2. Tighten the inline prompt: explicitly forbid running shell commands
or emitting tool-call stubs, with the `{"cmd": "..."}` anti-pattern
named so the model cannot rationalize emitting one.
Tests:
- New git.test.mjs case locks 2 files → self-collect.
- New investigation.test.mjs cases lock the prompt-contract for both
the inline and finalize prompts (no shell commands, no tool-call
stubs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 324836e24b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| if (turnState.error) { | ||
| return { | ||
| status: buildResultStatus(turnState), |
There was a problem hiding this comment.
Return non-zero status when investigation turn reports error
When an investigation turn emits a soft failure (turnState.error is set), this branch still derives status from buildResultStatus(turnState), which returns 0 whenever the turn ended as completed. In practice that makes /codex:adversarial-review exit successfully even though no valid review JSON was produced (only parseError), so CI/automation can treat a failed review run as success. This error path should force a non-zero status whenever turnState.error is present.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in d222542. `buildResultStatus` now returns 1 whenever `turnState.error` is set, so all four call sites (review, turn, recon soft-error branch, finalize) report a non-zero status when an error was observed — even if the turn ended `completed`. Covered by the soft-error test asserting `result.status === 1`.
When captureTurn records an error notification but the app-server still emits turn/completed with status="completed", buildResultStatus was returning 0. That made /codex:adversarial-review exit 0 even though no valid review JSON was produced (only parseError), so CI/automation could treat a failed run as success. Honor turnState.error in buildResultStatus so all four call sites (review, turn, investigation soft-error branch, finalize) report a non-zero status whenever an error was observed. Strengthen the existing soft-error test to assert result.status === 1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re: P1 — Return non-zero status when investigation turn reports errorAddressed in function buildResultStatus(turnState) {
+ if (turnState.error) {
+ return 1;
+ }
return turnState.finalTurn?.status === "completed" ? 0 : 1;
}End-to-end chain confirmed: Regression test (
|
…alls Three correctness fixes to the multi-turn self-collect investigation path, all surfaced by an unstable upstream connection (Reconnecting... storms): 1. converge-before-abort: the app-server multiplexes transient retry notices onto the same `error` notification channel as fatal failures. The recon loop aborted on ANY `turnState.error`, so a reconnect that recovered (and produced an agent message + completed turn) skipped the finalize turn and handed raw investigation prose to the JSON parser. Now abort only when the turn produced no usable output (`turnState.error && !turnRecovered`). 2. turn-level idle timeout: `client.request()` only settles on a response or a full socket close, so a half-dead reconnecting link left `turn/start` pending forever (observed as a permanent hang at "Investigation turn N"). captureTurn now races an idle watchdog against both the start request and the completion promise, re-armed on every progress notification so a slow-but-healthy turn is never killed. On timeout it fires a best-effort, non-blocking turn/interrupt and rejects. Default 180s, configurable via the new `--turn-idle-timeout SECONDS` flag. 3. failed run not parsed as JSON: when a turn fails (transport error/timeout) the leftover finalMessage is often just the model's opening prose line. parseStructuredOutput then surfaced a misleading "invalid JSON" parse error instead of the real failure reason. executeReviewRun now parses first and only reports failure when the run errored AND produced no usable structured output; renderReviewResult gained a "Codex could not complete the review" branch that states the real reason. A status-0 run returning malformed JSON still renders the existing parse-error path. Tests: queue-driven fake-codex gains queueTurnHang(); new coverage for the idle-timeout abort, the slow-but-healthy re-arm guard, transient-reconnect recovery at both recon and finalize boundaries, transport-failure rendering, the malformed-JSON regression guard, and --turn-idle-timeout validation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 510e750018
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| status: result.status, | ||
| failureMessage: result.error?.message ?? result.stderr | ||
| }); | ||
| const parsed = (runErrored && !structured.parsed) |
There was a problem hiding this comment.
Normalize recovered structured reviews to success
When a finalize turn emits valid JSON after a transient reconnect/error notification, this new branch correctly avoids marking the payload as failed, but the returned execution still uses exitStatus: result.status below. Because buildResultStatus now returns 1 whenever result.error is set, the foreground command exits non-zero and background jobs are recorded as failed even though payload.result contains a valid verdict and payload.failed is false. In the recovered-with-valid-structured-output case handled here, compute and return an effective success status instead of propagating the stale transient error status.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 7c68363. `executeReviewRun` now computes an effective exit status: `const exitStatus = (!payload.failed && parsed.parsed) ? 0 : result.status;`. When the finalize turn recovered and produced a valid verdict, the run exits 0 instead of propagating the stale transient error status, so the foreground command succeeds and background jobs are recorded as completed. A genuinely failed run still propagates its non-zero status. The finalize-recovery e2e test now asserts `result.status === 0`.
…d verdict Follow-up to the failed-run gate: executeReviewRun returned `exitStatus: result.status`, but buildResultStatus returns 1 whenever a transient reconnect set `result.error` — even when the finalize turn recovered and produced a valid structured verdict (payload.failed=false, payload.result populated). The foreground command then exited non-zero and background jobs were recorded as failed despite a successful review. Compute an effective exit status: 0 when we have a usable verdict and the run was not flagged failed, otherwise propagate the non-zero status. Extends the finalize-recovery e2e test to assert exit 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7c68363aa6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| This is a single-turn review. Do NOT run any shell commands or tool calls — the file contents you need are already embedded in the repository_context block below. Do not emit a tool-use stub like `{"cmd": "..."}` instead of the review; that is not the schema. | ||
| Return only valid JSON matching the provided schema. Your entire output must be that JSON — no prose before or after, no shell commands, no tool-call payloads. |
There was a problem hiding this comment.
Route skipped single-file reviews to self-collect
When the working tree contains exactly one untracked file whose contents are skipped (for example >24 KiB, binary, or unreadable), collectReviewContext can still choose the inline path because its byte check only measures git diffs and the file count is now within the one-file cap; the prompt then tells Codex that the needed contents are embedded and forbids any shell commands. In that scenario the review can only see a (skipped: ...) marker and must approve or guess without inspecting the actual change, so skipped untracked single-file reviews should fall through to self-collect or otherwise allow read-only inspection.
Useful? React with 👍 / 👎.
When both finalize attempts ran commands (the retry-exhausted path), the last attempt's commandExecutions/fileChanges were appended in-loop AND again unconditionally after the loop, duplicating its command/file traces in exactly the contract-violation scenario this code reports. Move the in-loop aggregation inside the retry branch so only superseded attempts are folded in there; the final attempt is aggregated once after the loop. Extends the retry-exhausted test to assert each attempt's commands are recorded exactly once. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d2157c2ca8
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| result = await runAppServerTurn(context.repoRoot, { | ||
| prompt, | ||
| model: request.model, | ||
| sandbox: "read-only", | ||
| outputSchema: readOutputSchema(REVIEW_SCHEMA), | ||
| onProgress: request.onProgress |
There was a problem hiding this comment.
Wire idle timeout through inline reviews
For small single-file adversarial reviews that stay on the inline path, the newly advertised --turn-idle-timeout is never passed to runAppServerTurn here, and that helper still calls captureTurn with only onProgress. In this scenario a half-dead turn/start can still hang indefinitely despite the command docs promising the timeout applies when any turn stalls, so the same timeout needs to be forwarded and honored for the inline path as well as self-collect.
Useful? React with 👍 / 👎.
Summary
The self-collect path for
/codex:adversarial-reviewhas been broken sinceit was introduced: a single turn pinned
outputSchema, which forced everyagent emission to validate against the final-output schema, blocking the
shell investigation the prompt asked Codex to perform. Restructure the
path as a two-phase flow on the same Codex thread — a free-form
investigation phase with no schema, followed by a schema-enforced
finalization turn — so the contract is internally consistent and
large-diff reviews actually produce structured output. Codex's read-only
sandbox already permits the shell commands the investigation needs; no
new tool registration is required.
What changed
runAppServerInvestigationinlib/codex.mjsorchestrates a recon loop (no schema, free-form tool use) until convergence (finalAnswerSeen && commandCount === 0) or the turn cap (default 10), then runs a single finalize turn with the review schema set.executeReviewRunbranches oncontext.inputMode === "self-collect"to call the new orchestrator. Inline-diff payloads stay byte-identical.--max-investigation-turns Nwith positive-integer validation.Investigation truncated at N turns; findings may be shallow. Use --max-investigation-turns to raise the cap.on all three render paths (parse-error, validation-error, success) when the cap was hit.runAppServerInvestigationalways returns numericstatussopayload.codex.statusshape stays consistent withrunAppServerTurn.Test plan
npm test— 109/109 passing