Skip to content

Auto-recover one-session sync from legitimate test churn#1661

Merged
gltanaka merged 6 commits into
mainfrom
fix/issue-2208-test-churn-autorecover
Jun 19, 2026
Merged

Auto-recover one-session sync from legitimate test churn#1661
gltanaka merged 6 commits into
mainfrom
fix/issue-2208-test-churn-autorecover

Conversation

@Serhan-Asad

Copy link
Copy Markdown
Collaborator

Problem

pdd sync (one-session path) hard-fails when an LLM produces a large but legitimate test rewrite. The test-churn gate blocks purely on textual line churn (difflib ratio > PDD_TEST_CHURN_THRESHOLD, default 0.40) and runs only after the agentic session already reported success. When a prompt/code change is substantial, the regenerated test is ~100% different yet correct; the repair-directive retry cannot reduce genuine churn, so after the bounded retries the gate hard-fails an otherwise-successful result and forces manual intervention.

Observed in promptdriven/pdd_cloud#2163 (PR #2166): module hackathon_organizer_event_detail_page (typescriptreact) failed repeatedly with Test churn threshold exceeded, churn ratio: 1, burning ~$8–10 per run before aborting.

Fix

On churn exhaustion in one-session sync only (identical-signature short-circuit or the _MAX_TEST_CHURN_ATTEMPTS cap), accept the rewrite instead of hard-failing — but only when it preserves coverage. For every guarded pre-existing test file (canonical + alt-path snapshot), the candidate must:

  • keep the pre-session test-case count AND the assertion count,
  • carry ≥ 1 real assertion,
  • not be deleted,
  • be in a measurable language (Python via ast; TS/JS via a single-pass comment/string/regex/template scanner).

Anything else — unknown language, deletion, shrinking, vacuous (assertion-free) rewrite, unparseable source — keeps the strict gate unchanged, so the coverage-destruction protection (#1012 / #1015) is intact. On accept, the agent's candidate bytes (reverted by the retry's restore) are reapplied to disk; if any reapply write fails, acceptance is refused (no false success). A PDD_TEST_CHURN_ACCEPTED: marker + warning is emitted. The parallel sync_orchestration.py churn loop is intentionally left strict.

Kill switch: PDD_DISABLE_TEST_CHURN_AUTOACCEPT=1 forces the strict gate.

Why coverage counts (not textual churn)

Textual line churn is a poor proxy: a coverage-preserving reorganization scores ~1.0. Counting test cases + assertions distinguishes a legitimate modernization (kept/added coverage) from silent coverage destruction (dropped cases/assertions), which is exactly what the gate exists to catch.

Safety / what still hard-fails

  • Rewrites that drop test cases or assertions (even if the file grows longer).
  • Rewrites that gut assertions to zero.
  • Test-file deletions (canonical or alt-path).
  • Unmeasurable languages and unparseable Python (stay strict).
  • Reapply write failures (hard-fail, no marker).

Cross-repo

This is the CLI root-cause fix for promptdriven/pdd_cloud#2208. The cloud picks it up by bumping its pinned PDD_SOURCE_SHA; it can optionally surface the PDD_TEST_CHURN_ACCEPTED marker to tell users a test was substantially (but safely) rewritten.

Testing

  • tests/test_one_session_sync.py: +19 tests — accept on coverage-preserving rewrite (Python canonical, alt-path multi-file, TSX end-to-end); reject on dropped cases / gutted assertions / deletion / env-disable / write-failure / partial-write-failure; _count_test_units unit coverage (AST comment/string immunity, unparseable→None, JS method-call exclusion, chained modifiers, URL-in-string, template ${} interpolation, regex-literal exclusion).
  • Full tests/test_one_session_sync.py (104), test_code_generator_main.py churn gate (30), test_sync_orchestration.py churn (7), test_sync_main.py (95) all green.

Residual risk

TS/JS counting is a best-effort scanner (perfect counting needs a JS parser). Remaining false-accept paths are third-order and require adversarially-crafted input (e.g. an assertion dropped while a same-count expect( is hidden after a return-position regex), which a non-adversarial LLM rewrite does not produce. Measurement is symmetric, the result is human-reviewed in a PR with CI as the pass/fail backstop, and the env kill-switch disables the behavior entirely.

🤖 Generated with Claude Code

The one-session test-churn gate blocks purely on textual line churn
(difflib ratio > threshold) and runs only after the agentic session
already succeeded. When a prompt/code change is substantial the
regenerated test is ~100% different yet correct; the repair-directive
retry cannot reduce genuine churn, so sync hard-fails an otherwise
successful result and forces manual intervention.

On churn exhaustion in one-session sync, accept the rewrite instead of
hard-failing IFF it preserves coverage: every guarded pre-existing test
file (canonical + alt-path) keeps at least as many test cases AND
assertions, carries >=1 real assertion, deletes nothing, and is in a
language we can measure (Python via ast; TS/JS via a single-pass
comment/string/regex/template-aware scanner). Unknown languages,
deletions, shrinking rewrites, and assertion-free rewrites keep the
strict gate, so the coverage-destruction protection is unchanged. The
agent's candidate bytes are reapplied (the retry restore reverts them);
if any reapply write fails, the partial reapply is rolled back to
pre-session bytes and the strict gate hard-fails. A
PDD_TEST_CHURN_ACCEPTED marker is emitted; PDD_DISABLE_TEST_CHURN_AUTOACCEPT
forces the strict gate. The sync_orchestration parallel churn loop is
left strict.

Keep the one_session_sync prompt + README in sync with the new behavior
so a future regen does not revert the fix.

Fixes promptdriven/pdd_cloud#2208

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Serhan-Asad Serhan-Asad force-pushed the fix/issue-2208-test-churn-autorecover branch from 5efb7ac to 405660f Compare June 19, 2026 00:49
@Serhan-Asad

Copy link
Copy Markdown
Collaborator Author

Independent verification — done in an isolated worktree

Ran a full verification pass on this branch (separate worktree + venv, dedicated checkout at the PR head). Summary: the change is sound, narrowly scoped, and behaves exactly as described — including a live agentic run.

Deterministic verification

Check Result
CI / rebase All checks green (incl. the ~17-min unit job); base has no commits touching one_session_sync.py since branch point, so no rebase conflict
Full suite 755 / 755 pass across test_one_session_sync.py, test_code_generator_main.py (churn gate), test_sync_orchestration.py (churn), test_sync_main.py
Tests are load-bearing Neutralizing _accept_churn_rewrite_on_exhaustionreturn False failed exactly the 5 accept/rollback tests; every reject / strict-gate / kill-switch / counter test still passed. The new tests genuinely exercise the new code, and the strict gate remains the fallback.
Adversarial counter probe 17/17 hand-checked ground-truth cases pass; _candidate_preserves_coverage correctly rejects dropped-case, dropped-assertion, gutted (zero-assertion), deleted, and unmeasurable-language rewrites
Real-world data The production counter on a real 1604-line .tsx test from the motivating scenario returns (239 cases, 523 assertions) — the case count matches a naive grep exactly, confirming the scanner tracks real, well-formatted TSX accurately
Strict path preserved The parallel sync_orchestration churn loop is not in the diff and stays strict by construction; the existing exhaustion test still hard-fails (its fixtures are assertion-free, which the new post_asserts >= 1 guard rejects)

Live agentic end-to-end

Drove the real run_one_session_sync with a live Claude agent (real claude -p, staging OAuth credential) generating the test rewrite through the same run_agentic_task seam the unit tests stub. A low churn threshold forced the churn path so the real accept/reject behavior could be observed on genuine agent-generated TSX:

  • Auto-accept ON: the agent produced a coverage-preserving high-churn rewrite (ratio 0.95, cases/asserts 3→3). Result: success, PDD_TEST_CHURN_ACCEPTED marker emitted, PDD_PHASE: synced, agent's rewrite kept on disk. ✅
  • Kill-switch ON (PDD_DISABLE_TEST_CHURN_AUTOACCEPT=1): same rewrite → raises TestChurnError, file restored to pre-session bytes, PDD_PHASE: conflict. ✅

Residual risk (already documented in the PR)

The TS/JS scanner under-counts when an expect(/it( shares a line after a </tag> or division / (the regex heuristic consumes the line tail). This can in principle false-accept a coverage drop, but only with pathological single-line input that a non-adversarial LLM rewrite does not produce — real multi-line TSX counts correctly (confirmed on the real file above). Bounded by the kill-switch, downstream pdd verify / CI, and human review. Acceptable as documented; not a blocker.

Verdict: verified, no regressions found, merge-ready.

🤖 Generated with Claude Code

@gltanaka gltanaka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not merge as-is.

The underlying issue is real: one-session sync can hard-fail after a legitimate high-churn test rewrite, and the retry loop alone may not converge. But the TS/JS coverage-preserving gate is not safe enough to convert that hard-fail into success yet.

Blocker: the TSX counter can undercount baseline assertions in a common React-test shape where a JSX closing tag is followed by assertions on the same line. Repro on this branch:

from pdd.one_session_sync import _count_test_units, _candidate_preserves_coverage

pre = "it(\"x\", () => { render(<Provider><Widget /></Provider>); expect(a).toBe(1); expect(b).toBe(2); });\n"
post = "it(\"x\", () => { expect(a).toBe(1); });\n"

print(_count_test_units(pre, "tests/widget.test.tsx"))   # (1, 0)
print(_count_test_units(post, "tests/widget.test.tsx"))  # (1, 1)
print(_candidate_preserves_coverage(pre, post, "tests/widget.test.tsx"))  # True

_strip_js_noncode treats the slash in </Provider> as a regex literal opener and strips the rest of the line, so the existing assertions after the JSX closing tag disappear from the baseline count. A candidate with fewer real assertions can then pass _candidate_preserves_coverage and be auto-accepted. This is not just an adversarial case; compact render(...); expect(...) TSX tests are normal, and TSX is the motivating language for this PR.

Required changes before merge:

  • Make the TS/JS counter JSX-aware enough to handle closing tags and same-line assertions, including render(<Provider><Widget /></Provider>); expect(...).
  • Add a regression test at the _candidate_preserves_coverage level showing that dropping assertions from that TSX shape is rejected.
  • If robust TSX counting is too much for this PR, keep TS/JS on the strict churn gate and limit auto-accept to languages with reliable measurement.

Until then, this can silently accept coverage loss while reporting PDD_PHASE: synced, which is exactly the class of failure the churn gate exists to prevent.

The TS/JS coverage counter treated the slash in a JSX closing tag
(</Provider>) and a self-closing tag (<Widget count={n} />) as a regex
opener, so _strip_js_noncode stripped the rest of the line and dropped
any assertions written after the tag on the SAME line. That undercounted
the baseline (e.g. render(<Provider><Widget /></Provider>); expect(a);
expect(b); measured 0 assertions) and could auto-accept a rewrite that
actually drops assertions from a compact React test — the exact coverage
loss the churn gate exists to prevent.

Fix: exclude the JSX tag-boundary chars < and > from the regex-opener set
and never treat /> as a regex opener, so same-line assertions after a
closing or self-closing tag are counted. Regex-literal stripping
(/expect(/, getByText(/loading/i), toMatch(/foo/), split(/,/)) is
preserved.

Adds regression tests at both the counter and _candidate_preserves_coverage
levels for the closing-tag and self-closing-with-expr-attr shapes,
including that dropping assertions from that TSX shape is now rejected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Serhan-Asad

Copy link
Copy Markdown
Collaborator Author

Thanks — reproduced your exact case and fixed it at the counter. You were right that this is a normal compact-React shape, not just adversarial.

Root cause: _strip_js_noncode treated the slash in a JSX closing tag (</Provider>) as a regex-literal opener — < was in the regex-opener char set — so it stripped the rest of the line and the assertions written after the tag on the same line vanished from the baseline count. The self-closing form (<Widget count={n} />) had the same problem via the } before />.

Fix (pushed in 11be289c):

  • Removed the JSX tag-boundary chars < and > from _REGEX_PREV_CHARS (a real regex literal essentially never follows a </> comparison in test code; JSX tags constantly do).
  • Added an explicit branch so a / immediately followed by > (self-closing />) is kept as code, regardless of the preceding char.
  • Regex-literal stripping is preserved for the genuine cases: /expect(/, getByText(/loading/i), toMatch(/foo/), split(/,/) all still count correctly.

Your repro now:

pre  = 'it("x", () => { render(<Provider><Widget /></Provider>); expect(a).toBe(1); expect(b).toBe(2); });\n'
post = 'it("x", () => { expect(a).toBe(1); });\n'
_count_test_units(pre,  "tests/widget.test.tsx")   # (1, 2)   <- was (1, 0)
_count_test_units(post, "tests/widget.test.tsx")   # (1, 1)
_candidate_preserves_coverage(pre, post, "tests/widget.test.tsx")  # False  <- was True

Regression tests added (both levels you asked for):

  • TestCountTestUnits::test_tsx_closing_tag_does_not_swallow_same_line_assertions and ::test_tsx_self_closing_tag_with_expr_attr_keeps_assertions — counter-level, for </Provider> and <Widget count={n} /> shapes.
  • TestTestChurnAcceptOnExhaustion::test_tsx_dropping_assertions_after_closing_tag_is_rejected_candidate_preserves_coverage-level, asserting that dropping assertions from that exact TSX shape is now rejected (and a genuine coverage-preserving rewrite of the same shape is accepted).

Verification: full test_one_session_sync + test_code_generator_main + test_sync_orchestration + test_sync_main = 758 passed; the regex-protection tests still pass; and a live agentic run (real claude rewriting a TSX test) accepts a coverage-preserving rewrite with the PDD_TEST_CHURN_ACCEPTED marker while the kill-switch path still hard-fails. The narrow residual is now limited to genuinely pathological input (e.g. a regex literal containing the literal text expect( placed right after a </>), well outside what a real LLM rewrite produces.

@gltanaka gltanaka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The update resolves my previous blocker. I re-ran the exact TSX false-accept repro and it now counts the baseline as (1, 2), the candidate as (1, 1), and _candidate_preserves_coverage(...) returns False as required. The new regression tests cover both the counter and decision levels, including the compact render(<Provider><Widget /></Provider>); expect(...) shape.

Local verification on the updated PR:

  • conda run -n pdd pytest tests/test_one_session_sync.py -q -> 108 passed
  • conda run -n pdd pytest tests/test_code_generator_main.py tests/test_sync_orchestration.py tests/test_sync_main.py -q -> 650 passed

This still relies on a coverage-count heuristic rather than semantic equivalence, but the remaining risk is bounded: it only runs after one-session churn recovery is exhausted, refuses deletions/shrinking/unmeasurable rewrites, emits PDD_TEST_CHURN_ACCEPTED, and has PDD_DISABLE_TEST_CHURN_AUTOACCEPT as a strict-gate kill switch. Merge makes sense now.

@gltanaka gltanaka merged commit 033f32a into main Jun 19, 2026
9 checks passed
@gltanaka gltanaka deleted the fix/issue-2208-test-churn-autorecover branch June 19, 2026 05:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants