Skip to content

fix(agentic): spool Codex JSONL output to disk to cap parent RSS (#1646)#1654

Merged
gltanaka merged 7 commits into
mainfrom
fix/1646-codex-jsonl-spooling
Jun 19, 2026
Merged

fix(agentic): spool Codex JSONL output to disk to cap parent RSS (#1646)#1654
gltanaka merged 7 commits into
mainfrom
fix/1646-codex-jsonl-spooling

Conversation

@Serhan-Asad

Copy link
Copy Markdown
Collaborator

Problem

Closes #1646.

pdd/agentic_common.py::_run_with_provider() runs the Codex CLI (provider == "openai") with _subprocess_run(..., capture_output=True, text=True)Popen(stdout=PIPE, stderr=PIPE) + communicate(), holding the entire codex exec --json NDJSON transcript in the parent heap as a decoded str. The openai parse path then made two more full copies — result.stdout.strip() and .splitlines(). Codex transcripts for long agentic runs reach tens-to-hundreds of MiB, so peak parent RSS hit ~3–4× the transcript, crossed container memory limits, and the process was SIGKILLed (exit_code=-9) — surfaced by the cloud wrapper as OOM.

Fix

Spool the Codex subprocess stdout/stderr to temp files and parse the NDJSON incrementally from disk, keeping only bounded head/tail snippets plus the few retained events in memory. openai-only; every other provider (anthropic/google/opencode/agy) keeps the existing in-RAM path unchanged.

  • _subprocess_run_spooledcommunicate() with stdout/stderr bound to tempfile.TemporaryFile instead of pipes. Preserves the timeout + os.killpg process-group kill of _subprocess_run, and closes both temp files on any error. _subprocess_run itself is untouched, so its ~20 small-output callers are unaffected.
  • _parse_codex_jsonl — one shared incremental NDJSON parser used by both the spooled and in-RAM paths; preserves the existing result / item.completedagent_message / session.end / turn.completed selection and the usage+model merge.
  • _iter_spooled_lines — streams decoded lines; an oversize line (>16 MiB) is skipped only when it carries none of the retained-event markers, so a genuinely large agent_message/result/error event can never be silently dropped (markers matched case-insensitively to mirror the error scanner).
  • PDD_AGENT_SPOOL_DIR — optional disk-backed spool dir. The system temp dir may be tmpfs/RAM-backed in some containers; pointing this at real disk also relieves cgroup memory.

Proof (peak RSS, identical 128 MiB Codex transcript, same parse result)

peak RSS result
before (origin/main, in-RAM) 776 MiB ok, "Final answer."
after (this PR, spooled) 259 MiB ok, "Final answer."

~67% reduction; peak no longer scales with transcript size.

Tests

  • test_openai_codex_spool_caps_peak_rss — the Codex JSONL agent output buffering can SIGKILL PDD parent #1646 reproduction: runs the same workload twice in fresh interpreters (tiny vs ≥128 MiB) and asserts the large-vs-tiny RSS delta stays small. Fails on the pre-fix path (delta ~473 MiB), passes here (~0).
  • test_openai_codex_jsonl_spooled_parses_large_transcript / …_error_extracted / …_blank_stdout — spooled-path parse/error/blank behavior through a real fake-codex binary.
  • test_iter_spooled_lines_skips_oversize_without_decoding / …_keeps_oversize_retained_event (parametrized over agent_message/result/error/failed incl. uppercase type) — the oversize-line guard never drops a retained event and never decodes a dropped one.
  • Cross-file: openai tests in test_e2e_issue_557_codex_ndjson.py, test_time_reasoning_effort_env.py, and test_agentic_common_issue_813_… updated to also mock _subprocess_run_spooled (they drive the openai path through _run_with_provider/run_agentic_task).

Validation commands run

# core + the cross-file mock updates (all green)
PDD_PATH=<repo> PYTHONPATH=<repo> pytest tests/test_agentic_common.py \
  tests/test_e2e_issue_557_codex_ndjson.py \
  tests/test_agentic_common_issue_813_anthropic_api_key_oauth_shadow.py \
  tests/test_time_reasoning_effort_env.py -p no:randomly -q
# -> 552 passed

Risk / tradeoff

This trades parent RAM for disk: the spool needs free space in the spool dir. A full spool fails the run with a clear error rather than a silent OOM — strictly better behavior for the memory-constrained path this fixes, and tunable via PDD_AGENT_SPOOL_DIR. Stream-from-pipe (no temp files) was considered but rejected for this fix as higher-regression (owning stdin/stdout/stderr drains, timeout, and process-group kill manually); disk spooling keeps the proven communicate() semantics.

🤖 Generated with Claude Code

@Serhan-Asad Serhan-Asad force-pushed the fix/1646-codex-jsonl-spooling branch from 62ac9d1 to 551921d Compare June 19, 2026 00:19
Codex `exec --json` can emit tens-to-hundreds of MiB of NDJSON. The
openai path captured it via `capture_output=True` + `communicate()`
(the whole decoded transcript resident on the parent heap), then
`.strip()` and `.splitlines()` copied it again — so peak parent RSS
reached ~3-4x the transcript, crossed container memory limits, and the
process was SIGKILLed (-9), surfaced by the cloud as OOM.

Spool the Codex subprocess stdout/stderr to temp files and parse the
NDJSON incrementally from disk with a bounded per-line read. openai only;
every other provider keeps the existing in-RAM path unchanged. On a
128 MiB transcript, peak RSS drops from ~776 MiB to ~259 MiB with
identical parse output.

- `_subprocess_run_spooled`: `communicate()` with stdout/stderr bound to
  `tempfile.TemporaryFile` (preserves the timeout + process-group kill of
  `_subprocess_run`); closes both files on any error. `_subprocess_run`
  itself is unchanged so its ~20 small-output callers are unaffected.
- `_parse_codex_jsonl`: one shared incremental NDJSON parser for both the
  spooled and in-RAM paths; preserves the
  result/agent_message/session.end/turn.completed selection and the
  usage+model merge.
- `_iter_spooled_lines`: reads each line with a bounded `readline(cap)`
  (16 MiB) so the parent never materializes more than ~cap at a time. An
  oversize line is classified by its event TYPE in the already-read prefix
  (probing only the first few KiB): a giant tool-output blob is drained and
  dropped without being decoded, while a retained-type line
  (result/agent_message/session.end/turn.completed or any error/failed
  event) is read in full and yielded — never silently dropped — up to a
  larger hard ceiling (64 MiB) beyond which it too is dropped rather than
  risk OOM.
- `PDD_AGENT_SPOOL_DIR`: optional disk-backed spool dir (system temp may be
  tmpfs/RAM-backed). A full spool fails the run with a clear error rather
  than a silent OOM; when set but not a writable directory, PDD warns once
  and falls back instead of silently ignoring it.

Also update the generating prompt (so a future `pdd sync` preserves the
contract instead of regenerating it away), document `PDD_AGENT_SPOOL_DIR`
in the README, and refresh the .pdd fingerprint + run report so the
module reads in-sync.

Regression tests: a peak-RSS test (large vs tiny transcript in fresh
interpreters) that fails on the pre-fix path; spooled-path
parse/error/blank tests; oversize-line tests covering blob-drop,
retained-event keep, the hard ceiling, and the per-line read bound; and
spool-dir resolution/warning tests. The openai tests in test_e2e_issue_557
/ test_time_reasoning_effort_env / test_agentic_common_issue_813 now also
mock the spooled runner.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Serhan-Asad Serhan-Asad force-pushed the fix/1646-codex-jsonl-spooling branch from 551921d to 7baee43 Compare June 19, 2026 00:50
@Serhan-Asad

Copy link
Copy Markdown
Collaborator Author

Staging1 verification complete for @Serhan-Asad.

What was verified:

  • PR head: 7baee43b61a1556c5446d904b4ca2ec69f1da66d
  • PR CI is green: Unit Tests, Public CLI Regression, Package Preprocess Smoke, CodeQL, auto-heal/heal all succeeded.
  • Focused local verification already passed on the PR checkout:
    • oversized retained Codex event probe preserved the final agent_message
    • focused suite: 9 passed, 472 deselected
    • broader agentic/Codex suite: 553 passed
  • Deployed only to staging1 project prompt-driven-development-stg; no prod deploy was run.
  • Staging image tag: us-docker.pkg.dev/prompt-driven-development-stg/pdd/worker:pr1654-7baee43b
  • Staging worker/executor are using that image.
  • Staging worker env includes PDD_AGENT_SPOOL_DIR=/home/appuser/pdd-agent-spool.
  • Staging health checks passed:
    • worker /health: {"status":"healthy"}
    • webhook /health: {"status":"healthy","webhook_subscriptions":"ok"}

Cloud Run staging verifier:

  • One-off job: pdd-pr1654-spool-verify
  • Execution: pdd-pr1654-spool-verify-ghrlb
  • Memory limit: 1Gi
  • Fake Codex JSONL transcript: 256 MiB
  • Verified installed PDD SHA from /opt/pdd-repo: 7baee43b61a1556c5446d904b4ca2ec69f1da66d
  • Result: PR1654_VERIFY_PASS
  • Parser preserved final text and model: PR1654 staging verifier final answer, gpt-5
  • Max RSS stayed flat: 214.36 MiB -> 220.48 MiB

This specifically exercises the Cloud Run worker image path that previously buffered large Codex JSONL output and validates that the spooled parser keeps parent RSS bounded.

@gltanaka gltanaka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed for necessity and merge readiness.

The fix is needed: #1646 is a real failure mode. The current OpenAI/Codex path captures the full codex exec --json stdout as a decoded string and then parses it with whole-transcript operations (strip() / splitlines()), so large JSONL transcripts can amplify parent RSS enough to OOM. The spooling direction and the focused tests look reasonable; I also ran the new targeted Codex spooling tests locally and they passed.

Do not merge this PR as-is because it is currently not mergeable with main. GitHub reports mergeable=CONFLICTING, and a local merge simulation against current origin/main conflicts in:

  • README.md
  • pdd/prompts/agentic_common_python.prompt

Required changes before merge:

  • Rebase or merge the latest main into this branch and resolve the conflicts.
  • Preserve both contracts while resolving: the new Codex/OpenAI spooling behavior from this PR and the newer agentic routing/effort behavior already on main.
  • Refresh any generated .pdd metadata/run report files that are expected to stay in sync after the prompt/code resolution.
  • Rerun the targeted spooling tests plus the normal required validation (make test / lint, or the project-equivalent CI set).

After that update, the spooling change is worth re-reviewing for merge.

@Serhan-Asad

Copy link
Copy Markdown
Collaborator Author

Addressed @gltanaka's merge-readiness review.

What changed:

  • Merged latest main into fix/1646-codex-jsonl-spooling.
  • Resolved the README.md conflict and kept both PDD_AGENT_SPOOL_DIR and main's PDD_ROUTING_POLICY docs.
  • Resolved pdd/prompts/agentic_common_python.prompt by preserving main's newer routing/effort contract and this PR's Codex JSONL spooling contract.
  • Refreshed the .pdd metadata/run report for agentic_common_python.
  • Confirmed git diff --cached --check before commit and no conflict markers remain in the touched prompt/docs.

Local validation:

  • tests/test_agentic_common.py -k 'spool or spooled or model_reasoning_effort or reasoning_effort or effort' -p no:randomly -q -> 36 passed
  • tests/test_agentic_common.py tests/test_e2e_issue_557_codex_ndjson.py tests/test_time_reasoning_effort_env.py tests/test_agentic_common_issue_813_anthropic_api_key_oauth_shadow.py -p no:randomly -q -> 571 passed
  • tests/test_routing_policy.py tests/test_llm_invoke_task_routing.py tests/test_agentic_multishot.py -p no:randomly -q -> 81 passed
  • tests/test_agentic_common.py tests/test_agentic_common_issue_813_anthropic_api_key_oauth_shadow.py tests/test_agentic_common_worktree.py -p no:randomly -q -> 613 passed

Staging verification from earlier still applies: staging1 Cloud Run job with a fake 256 MiB Codex JSONL transcript completed with PR1654_VERIFY_PASS; RSS stayed bounded at about 214.36 MiB -> 220.48 MiB. Details: #1654 (comment)

Remote CI on head 766bb94954f51c6cc796813dc6abf0dc0e8883e8 is now green:

  • Run Unit Tests
  • Public CLI Regression
  • Package Preprocess Smoke
  • CodeQL / Analyze actions / JS-TS / Python
  • auto-heal / heal

Note: I also attempted the broad local make test path, but it enables live LLM/local environment paths and was not a clean local gate in my machine environment. The required GitHub CI set is passing on the pushed head.

The PR is now MERGEABLE; the only remaining block is the existing CHANGES_REQUESTED review state. @gltanaka please re-review when you have a chance.

@gltanaka gltanaka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The updated branch resolves the earlier mergeability blocker. I do not think a real Codex/cloud E2E should be required for this PR: the risk being fixed is PDD's parent-process stdout buffering and JSONL parsing behavior, and the fake-Codex subprocess tests exercise that actual local runner/parser path deterministically without auth, provider cost, or cloud flake. CI is green and the focused spooling/RSS tests pass locally.

@gltanaka gltanaka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving after updating against current main. I still do not think a real Codex/cloud E2E should be required: the failure is in PDD's local subprocess buffering/parsing path, and the fake-Codex subprocess plus RSS regression tests exercise that deterministically without provider auth/cost/flakiness.

@gltanaka gltanaka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving after the latest update against main. I still do not think a real Codex/cloud E2E should be required for this local subprocess buffering/parser fix; the deterministic fake-Codex subprocess and RSS regression coverage target the actual failure surface.

@gltanaka gltanaka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving after updating against the latest main commit. True external Codex/cloud E2E is still not required for this deterministic local subprocess-buffering fix.

@gltanaka gltanaka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving after another update against current main. No true external Codex/cloud E2E required for this local subprocess buffering/parser fix.

@gltanaka gltanaka merged commit b96e973 into main Jun 19, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Codex JSONL agent output buffering can SIGKILL PDD parent

2 participants