fix(agentic): spool Codex JSONL output to disk to cap parent RSS (#1646)#1654
Conversation
62ac9d1 to
551921d
Compare
Codex `exec --json` can emit tens-to-hundreds of MiB of NDJSON. The openai path captured it via `capture_output=True` + `communicate()` (the whole decoded transcript resident on the parent heap), then `.strip()` and `.splitlines()` copied it again — so peak parent RSS reached ~3-4x the transcript, crossed container memory limits, and the process was SIGKILLed (-9), surfaced by the cloud as OOM. Spool the Codex subprocess stdout/stderr to temp files and parse the NDJSON incrementally from disk with a bounded per-line read. openai only; every other provider keeps the existing in-RAM path unchanged. On a 128 MiB transcript, peak RSS drops from ~776 MiB to ~259 MiB with identical parse output. - `_subprocess_run_spooled`: `communicate()` with stdout/stderr bound to `tempfile.TemporaryFile` (preserves the timeout + process-group kill of `_subprocess_run`); closes both files on any error. `_subprocess_run` itself is unchanged so its ~20 small-output callers are unaffected. - `_parse_codex_jsonl`: one shared incremental NDJSON parser for both the spooled and in-RAM paths; preserves the result/agent_message/session.end/turn.completed selection and the usage+model merge. - `_iter_spooled_lines`: reads each line with a bounded `readline(cap)` (16 MiB) so the parent never materializes more than ~cap at a time. An oversize line is classified by its event TYPE in the already-read prefix (probing only the first few KiB): a giant tool-output blob is drained and dropped without being decoded, while a retained-type line (result/agent_message/session.end/turn.completed or any error/failed event) is read in full and yielded — never silently dropped — up to a larger hard ceiling (64 MiB) beyond which it too is dropped rather than risk OOM. - `PDD_AGENT_SPOOL_DIR`: optional disk-backed spool dir (system temp may be tmpfs/RAM-backed). A full spool fails the run with a clear error rather than a silent OOM; when set but not a writable directory, PDD warns once and falls back instead of silently ignoring it. Also update the generating prompt (so a future `pdd sync` preserves the contract instead of regenerating it away), document `PDD_AGENT_SPOOL_DIR` in the README, and refresh the .pdd fingerprint + run report so the module reads in-sync. Regression tests: a peak-RSS test (large vs tiny transcript in fresh interpreters) that fails on the pre-fix path; spooled-path parse/error/blank tests; oversize-line tests covering blob-drop, retained-event keep, the hard ceiling, and the per-line read bound; and spool-dir resolution/warning tests. The openai tests in test_e2e_issue_557 / test_time_reasoning_effort_env / test_agentic_common_issue_813 now also mock the spooled runner. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
551921d to
7baee43
Compare
|
Staging1 verification complete for @Serhan-Asad. What was verified:
Cloud Run staging verifier:
This specifically exercises the Cloud Run worker image path that previously buffered large Codex JSONL output and validates that the spooled parser keeps parent RSS bounded. |
gltanaka
left a comment
There was a problem hiding this comment.
Reviewed for necessity and merge readiness.
The fix is needed: #1646 is a real failure mode. The current OpenAI/Codex path captures the full codex exec --json stdout as a decoded string and then parses it with whole-transcript operations (strip() / splitlines()), so large JSONL transcripts can amplify parent RSS enough to OOM. The spooling direction and the focused tests look reasonable; I also ran the new targeted Codex spooling tests locally and they passed.
Do not merge this PR as-is because it is currently not mergeable with main. GitHub reports mergeable=CONFLICTING, and a local merge simulation against current origin/main conflicts in:
README.mdpdd/prompts/agentic_common_python.prompt
Required changes before merge:
- Rebase or merge the latest
maininto this branch and resolve the conflicts. - Preserve both contracts while resolving: the new Codex/OpenAI spooling behavior from this PR and the newer agentic routing/effort behavior already on
main. - Refresh any generated
.pddmetadata/run report files that are expected to stay in sync after the prompt/code resolution. - Rerun the targeted spooling tests plus the normal required validation (
make test/ lint, or the project-equivalent CI set).
After that update, the spooling change is worth re-reviewing for merge.
|
Addressed @gltanaka's merge-readiness review. What changed:
Local validation:
Staging verification from earlier still applies: staging1 Cloud Run job with a fake 256 MiB Codex JSONL transcript completed with Remote CI on head
Note: I also attempted the broad local The PR is now |
gltanaka
left a comment
There was a problem hiding this comment.
The updated branch resolves the earlier mergeability blocker. I do not think a real Codex/cloud E2E should be required for this PR: the risk being fixed is PDD's parent-process stdout buffering and JSONL parsing behavior, and the fake-Codex subprocess tests exercise that actual local runner/parser path deterministically without auth, provider cost, or cloud flake. CI is green and the focused spooling/RSS tests pass locally.
gltanaka
left a comment
There was a problem hiding this comment.
Re-approving after updating against current main. I still do not think a real Codex/cloud E2E should be required: the failure is in PDD's local subprocess buffering/parsing path, and the fake-Codex subprocess plus RSS regression tests exercise that deterministically without provider auth/cost/flakiness.
gltanaka
left a comment
There was a problem hiding this comment.
Re-approving after the latest update against main. I still do not think a real Codex/cloud E2E should be required for this local subprocess buffering/parser fix; the deterministic fake-Codex subprocess and RSS regression coverage target the actual failure surface.
gltanaka
left a comment
There was a problem hiding this comment.
Re-approving after updating against the latest main commit. True external Codex/cloud E2E is still not required for this deterministic local subprocess-buffering fix.
gltanaka
left a comment
There was a problem hiding this comment.
Re-approving after another update against current main. No true external Codex/cloud E2E required for this local subprocess buffering/parser fix.
Problem
Closes #1646.
pdd/agentic_common.py::_run_with_provider()runs the Codex CLI (provider == "openai") with_subprocess_run(..., capture_output=True, text=True)→Popen(stdout=PIPE, stderr=PIPE)+communicate(), holding the entirecodex exec --jsonNDJSON transcript in the parent heap as a decodedstr. The openai parse path then made two more full copies —result.stdout.strip()and.splitlines(). Codex transcripts for long agentic runs reach tens-to-hundreds of MiB, so peak parent RSS hit ~3–4× the transcript, crossed container memory limits, and the process wasSIGKILLed (exit_code=-9) — surfaced by the cloud wrapper as OOM.Fix
Spool the Codex subprocess stdout/stderr to temp files and parse the NDJSON incrementally from disk, keeping only bounded head/tail snippets plus the few retained events in memory.
openai-only; every other provider (anthropic/google/opencode/agy) keeps the existing in-RAM path unchanged._subprocess_run_spooled—communicate()withstdout/stderrbound totempfile.TemporaryFileinstead of pipes. Preserves the timeout +os.killpgprocess-group kill of_subprocess_run, and closes both temp files on any error._subprocess_runitself is untouched, so its ~20 small-output callers are unaffected._parse_codex_jsonl— one shared incremental NDJSON parser used by both the spooled and in-RAM paths; preserves the existingresult/item.completed→agent_message/session.end/turn.completedselection and theusage+modelmerge._iter_spooled_lines— streams decoded lines; an oversize line (>16 MiB) is skipped only when it carries none of the retained-event markers, so a genuinely largeagent_message/result/error event can never be silently dropped (markers matched case-insensitively to mirror the error scanner).PDD_AGENT_SPOOL_DIR— optional disk-backed spool dir. The system temp dir may be tmpfs/RAM-backed in some containers; pointing this at real disk also relieves cgroup memory.Proof (peak RSS, identical 128 MiB Codex transcript, same parse result)
~67% reduction; peak no longer scales with transcript size.
Tests
test_openai_codex_spool_caps_peak_rss— the Codex JSONL agent output buffering can SIGKILL PDD parent #1646 reproduction: runs the same workload twice in fresh interpreters (tiny vs ≥128 MiB) and asserts the large-vs-tiny RSS delta stays small. Fails on the pre-fix path (delta ~473 MiB), passes here (~0).test_openai_codex_jsonl_spooled_parses_large_transcript/…_error_extracted/…_blank_stdout— spooled-path parse/error/blank behavior through a real fake-codex binary.test_iter_spooled_lines_skips_oversize_without_decoding/…_keeps_oversize_retained_event(parametrized over agent_message/result/error/failed incl. uppercase type) — the oversize-line guard never drops a retained event and never decodes a dropped one.test_e2e_issue_557_codex_ndjson.py,test_time_reasoning_effort_env.py, andtest_agentic_common_issue_813_…updated to also mock_subprocess_run_spooled(they drive the openai path through_run_with_provider/run_agentic_task).Validation commands run
Risk / tradeoff
This trades parent RAM for disk: the spool needs free space in the spool dir. A full spool fails the run with a clear error rather than a silent OOM — strictly better behavior for the memory-constrained path this fixes, and tunable via
PDD_AGENT_SPOOL_DIR. Stream-from-pipe (no temp files) was considered but rejected for this fix as higher-regression (owning stdin/stdout/stderr drains, timeout, and process-group kill manually); disk spooling keeps the provencommunicate()semantics.🤖 Generated with Claude Code