Auto-recover one-session sync from legitimate test churn by Serhan-Asad · Pull Request #1661 · promptdriven/pdd

Serhan-Asad · 2026-06-19T00:16:03Z

Problem

pdd sync (one-session path) hard-fails when an LLM produces a large but legitimate test rewrite. The test-churn gate blocks purely on textual line churn (difflib ratio > PDD_TEST_CHURN_THRESHOLD, default 0.40) and runs only after the agentic session already reported success. When a prompt/code change is substantial, the regenerated test is ~100% different yet correct; the repair-directive retry cannot reduce genuine churn, so after the bounded retries the gate hard-fails an otherwise-successful result and forces manual intervention.

Observed in promptdriven/pdd_cloud#2163 (PR #2166): module hackathon_organizer_event_detail_page (typescriptreact) failed repeatedly with Test churn threshold exceeded, churn ratio: 1, burning ~$8–10 per run before aborting.

Fix

On churn exhaustion in one-session sync only (identical-signature short-circuit or the _MAX_TEST_CHURN_ATTEMPTS cap), accept the rewrite instead of hard-failing — but only when it preserves coverage. For every guarded pre-existing test file (canonical + alt-path snapshot), the candidate must:

keep ≥ the pre-session test-case count AND ≥ the assertion count,
carry ≥ 1 real assertion,
not be deleted,
be in a measurable language (Python via ast; TS/JS via a single-pass comment/string/regex/template scanner).

Anything else — unknown language, deletion, shrinking, vacuous (assertion-free) rewrite, unparseable source — keeps the strict gate unchanged, so the coverage-destruction protection (#1012 / #1015) is intact. On accept, the agent's candidate bytes (reverted by the retry's restore) are reapplied to disk; if any reapply write fails, acceptance is refused (no false success). A PDD_TEST_CHURN_ACCEPTED: marker + warning is emitted. The parallel sync_orchestration.py churn loop is intentionally left strict.

Kill switch: PDD_DISABLE_TEST_CHURN_AUTOACCEPT=1 forces the strict gate.

Why coverage counts (not textual churn)

Textual line churn is a poor proxy: a coverage-preserving reorganization scores ~1.0. Counting test cases + assertions distinguishes a legitimate modernization (kept/added coverage) from silent coverage destruction (dropped cases/assertions), which is exactly what the gate exists to catch.

Safety / what still hard-fails

Rewrites that drop test cases or assertions (even if the file grows longer).
Rewrites that gut assertions to zero.
Test-file deletions (canonical or alt-path).
Unmeasurable languages and unparseable Python (stay strict).
Reapply write failures (hard-fail, no marker).

Cross-repo

This is the CLI root-cause fix for promptdriven/pdd_cloud#2208. The cloud picks it up by bumping its pinned PDD_SOURCE_SHA; it can optionally surface the PDD_TEST_CHURN_ACCEPTED marker to tell users a test was substantially (but safely) rewritten.

Testing

tests/test_one_session_sync.py: +19 tests — accept on coverage-preserving rewrite (Python canonical, alt-path multi-file, TSX end-to-end); reject on dropped cases / gutted assertions / deletion / env-disable / write-failure / partial-write-failure; _count_test_units unit coverage (AST comment/string immunity, unparseable→None, JS method-call exclusion, chained modifiers, URL-in-string, template ${} interpolation, regex-literal exclusion).
Full tests/test_one_session_sync.py (104), test_code_generator_main.py churn gate (30), test_sync_orchestration.py churn (7), test_sync_main.py (95) all green.

Residual risk

TS/JS counting is a best-effort scanner (perfect counting needs a JS parser). Remaining false-accept paths are third-order and require adversarially-crafted input (e.g. an assertion dropped while a same-count expect( is hidden after a return-position regex), which a non-adversarial LLM rewrite does not produce. Measurement is symmetric, the result is human-reviewed in a PR with CI as the pass/fail backstop, and the env kill-switch disables the behavior entirely.

🤖 Generated with Claude Code

The one-session test-churn gate blocks purely on textual line churn (difflib ratio > threshold) and runs only after the agentic session already succeeded. When a prompt/code change is substantial the regenerated test is ~100% different yet correct; the repair-directive retry cannot reduce genuine churn, so sync hard-fails an otherwise successful result and forces manual intervention. On churn exhaustion in one-session sync, accept the rewrite instead of hard-failing IFF it preserves coverage: every guarded pre-existing test file (canonical + alt-path) keeps at least as many test cases AND assertions, carries >=1 real assertion, deletes nothing, and is in a language we can measure (Python via ast; TS/JS via a single-pass comment/string/regex/template-aware scanner). Unknown languages, deletions, shrinking rewrites, and assertion-free rewrites keep the strict gate, so the coverage-destruction protection is unchanged. The agent's candidate bytes are reapplied (the retry restore reverts them); if any reapply write fails, the partial reapply is rolled back to pre-session bytes and the strict gate hard-fails. A PDD_TEST_CHURN_ACCEPTED marker is emitted; PDD_DISABLE_TEST_CHURN_AUTOACCEPT forces the strict gate. The sync_orchestration parallel churn loop is left strict. Keep the one_session_sync prompt + README in sync with the new behavior so a future regen does not revert the fix. Fixes promptdriven/pdd_cloud#2208 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Serhan-Asad · 2026-06-19T01:27:22Z

Independent verification — done in an isolated worktree

Ran a full verification pass on this branch (separate worktree + venv, dedicated checkout at the PR head). Summary: the change is sound, narrowly scoped, and behaves exactly as described — including a live agentic run.

Deterministic verification

Check	Result
CI / rebase	All checks green (incl. the ~17-min unit job); base has no commits touching `one_session_sync.py` since branch point, so no rebase conflict
Full suite	755 / 755 pass across `test_one_session_sync.py`, `test_code_generator_main.py` (churn gate), `test_sync_orchestration.py` (churn), `test_sync_main.py`
Tests are load-bearing	Neutralizing `_accept_churn_rewrite_on_exhaustion` → `return False` failed exactly the 5 accept/rollback tests; every reject / strict-gate / kill-switch / counter test still passed. The new tests genuinely exercise the new code, and the strict gate remains the fallback.
Adversarial counter probe	17/17 hand-checked ground-truth cases pass; `_candidate_preserves_coverage` correctly rejects dropped-case, dropped-assertion, gutted (zero-assertion), deleted, and unmeasurable-language rewrites
Real-world data	The production counter on a real 1604-line `.tsx` test from the motivating scenario returns (239 cases, 523 assertions) — the case count matches a naive grep exactly, confirming the scanner tracks real, well-formatted TSX accurately
Strict path preserved	The parallel `sync_orchestration` churn loop is not in the diff and stays strict by construction; the existing exhaustion test still hard-fails (its fixtures are assertion-free, which the new `post_asserts >= 1` guard rejects)

Live agentic end-to-end

Drove the real run_one_session_sync with a live Claude agent (real claude -p, staging OAuth credential) generating the test rewrite through the same run_agentic_task seam the unit tests stub. A low churn threshold forced the churn path so the real accept/reject behavior could be observed on genuine agent-generated TSX:

Auto-accept ON: the agent produced a coverage-preserving high-churn rewrite (ratio 0.95, cases/asserts 3→3). Result: success, PDD_TEST_CHURN_ACCEPTED marker emitted, PDD_PHASE: synced, agent's rewrite kept on disk. ✅
Kill-switch ON (PDD_DISABLE_TEST_CHURN_AUTOACCEPT=1): same rewrite → raises TestChurnError, file restored to pre-session bytes, PDD_PHASE: conflict. ✅

Residual risk (already documented in the PR)

The TS/JS scanner under-counts when an expect(/it( shares a line after a </tag> or division / (the regex heuristic consumes the line tail). This can in principle false-accept a coverage drop, but only with pathological single-line input that a non-adversarial LLM rewrite does not produce — real multi-line TSX counts correctly (confirmed on the real file above). Bounded by the kill-switch, downstream pdd verify / CI, and human review. Acceptable as documented; not a blocker.

Verdict: verified, no regressions found, merge-ready.

🤖 Generated with Claude Code

gltanaka

Do not merge as-is.

The underlying issue is real: one-session sync can hard-fail after a legitimate high-churn test rewrite, and the retry loop alone may not converge. But the TS/JS coverage-preserving gate is not safe enough to convert that hard-fail into success yet.

Blocker: the TSX counter can undercount baseline assertions in a common React-test shape where a JSX closing tag is followed by assertions on the same line. Repro on this branch:

from pdd.one_session_sync import _count_test_units, _candidate_preserves_coverage

pre = "it(\"x\", () => { render(<Provider><Widget /></Provider>); expect(a).toBe(1); expect(b).toBe(2); });\n"
post = "it(\"x\", () => { expect(a).toBe(1); });\n"

print(_count_test_units(pre, "tests/widget.test.tsx"))   # (1, 0)
print(_count_test_units(post, "tests/widget.test.tsx"))  # (1, 1)
print(_candidate_preserves_coverage(pre, post, "tests/widget.test.tsx"))  # True

_strip_js_noncode treats the slash in </Provider> as a regex literal opener and strips the rest of the line, so the existing assertions after the JSX closing tag disappear from the baseline count. A candidate with fewer real assertions can then pass _candidate_preserves_coverage and be auto-accepted. This is not just an adversarial case; compact render(...); expect(...) TSX tests are normal, and TSX is the motivating language for this PR.

Required changes before merge:

Make the TS/JS counter JSX-aware enough to handle closing tags and same-line assertions, including render(<Provider><Widget /></Provider>); expect(...).
Add a regression test at the _candidate_preserves_coverage level showing that dropping assertions from that TSX shape is rejected.
If robust TSX counting is too much for this PR, keep TS/JS on the strict churn gate and limit auto-accept to languages with reliable measurement.

Until then, this can silently accept coverage loss while reporting PDD_PHASE: synced, which is exactly the class of failure the churn gate exists to prevent.

The TS/JS coverage counter treated the slash in a JSX closing tag (</Provider>) and a self-closing tag (<Widget count={n} />) as a regex opener, so _strip_js_noncode stripped the rest of the line and dropped any assertions written after the tag on the SAME line. That undercounted the baseline (e.g. render(<Provider><Widget /></Provider>); expect(a); expect(b); measured 0 assertions) and could auto-accept a rewrite that actually drops assertions from a compact React test — the exact coverage loss the churn gate exists to prevent. Fix: exclude the JSX tag-boundary chars < and > from the regex-opener set and never treat /> as a regex opener, so same-line assertions after a closing or self-closing tag are counted. Regex-literal stripping (/expect(/, getByText(/loading/i), toMatch(/foo/), split(/,/)) is preserved. Adds regression tests at both the counter and _candidate_preserves_coverage levels for the closing-tag and self-closing-with-expr-attr shapes, including that dropping assertions from that TSX shape is now rejected. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Serhan-Asad · 2026-06-19T02:41:23Z

Thanks — reproduced your exact case and fixed it at the counter. You were right that this is a normal compact-React shape, not just adversarial.

Root cause: _strip_js_noncode treated the slash in a JSX closing tag (</Provider>) as a regex-literal opener — < was in the regex-opener char set — so it stripped the rest of the line and the assertions written after the tag on the same line vanished from the baseline count. The self-closing form (<Widget count={n} />) had the same problem via the } before />.

Fix (pushed in 11be289c):

Removed the JSX tag-boundary chars < and > from _REGEX_PREV_CHARS (a real regex literal essentially never follows a </> comparison in test code; JSX tags constantly do).
Added an explicit branch so a / immediately followed by > (self-closing />) is kept as code, regardless of the preceding char.
Regex-literal stripping is preserved for the genuine cases: /expect(/, getByText(/loading/i), toMatch(/foo/), split(/,/) all still count correctly.

Your repro now:

pre  = 'it("x", () => { render(<Provider><Widget /></Provider>); expect(a).toBe(1); expect(b).toBe(2); });\n'
post = 'it("x", () => { expect(a).toBe(1); });\n'
_count_test_units(pre,  "tests/widget.test.tsx")   # (1, 2)   <- was (1, 0)
_count_test_units(post, "tests/widget.test.tsx")   # (1, 1)
_candidate_preserves_coverage(pre, post, "tests/widget.test.tsx")  # False  <- was True

Regression tests added (both levels you asked for):

TestCountTestUnits::test_tsx_closing_tag_does_not_swallow_same_line_assertions and ::test_tsx_self_closing_tag_with_expr_attr_keeps_assertions — counter-level, for </Provider> and <Widget count={n} /> shapes.
TestTestChurnAcceptOnExhaustion::test_tsx_dropping_assertions_after_closing_tag_is_rejected — _candidate_preserves_coverage-level, asserting that dropping assertions from that exact TSX shape is now rejected (and a genuine coverage-preserving rewrite of the same shape is accepted).

Verification: full test_one_session_sync + test_code_generator_main + test_sync_orchestration + test_sync_main = 758 passed; the regex-protection tests still pass; and a live agentic run (real claude rewriting a TSX test) accepts a coverage-preserving rewrite with the PDD_TEST_CHURN_ACCEPTED marker while the kill-switch path still hard-fails. The narrow residual is now limited to genuinely pathological input (e.g. a regex literal containing the literal text expect( placed right after a </>), well outside what a real LLM rewrite produces.

gltanaka

The update resolves my previous blocker. I re-ran the exact TSX false-accept repro and it now counts the baseline as (1, 2), the candidate as (1, 1), and _candidate_preserves_coverage(...) returns False as required. The new regression tests cover both the counter and decision levels, including the compact render(<Provider><Widget /></Provider>); expect(...) shape.

Local verification on the updated PR:

conda run -n pdd pytest tests/test_one_session_sync.py -q -> 108 passed
conda run -n pdd pytest tests/test_code_generator_main.py tests/test_sync_orchestration.py tests/test_sync_main.py -q -> 650 passed

This still relies on a coverage-count heuristic rather than semantic equivalence, but the remaining risk is bounded: it only runs after one-session churn recovery is exhausted, refuses deletions/shrinking/unmeasurable rewrites, emits PDD_TEST_CHURN_ACCEPTED, and has PDD_DISABLE_TEST_CHURN_AUTOACCEPT as a strict-gate kill switch. Merge makes sense now.

Serhan-Asad force-pushed the fix/issue-2208-test-churn-autorecover branch from 5efb7ac to 405660f Compare June 19, 2026 00:49

Merge branch 'main' into fix/issue-2208-test-churn-autorecover

53ad3b1

gltanaka requested changes Jun 19, 2026

View reviewed changes

Merge branch 'main' into fix/issue-2208-test-churn-autorecover

9612573

gltanaka approved these changes Jun 19, 2026

View reviewed changes

gltanaka added 2 commits June 18, 2026 22:17

Merge branch 'main' into fix/issue-2208-test-churn-autorecover

4d38bfb

Merge branch 'main' into fix/issue-2208-test-churn-autorecover

54d2423

gltanaka merged commit 033f32a into main Jun 19, 2026
9 checks passed

gltanaka deleted the fix/issue-2208-test-churn-autorecover branch June 19, 2026 05:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Auto-recover one-session sync from legitimate test churn#1661

Auto-recover one-session sync from legitimate test churn#1661
gltanaka merged 6 commits into
mainfrom
fix/issue-2208-test-churn-autorecover

Serhan-Asad commented Jun 19, 2026

Uh oh!

Serhan-Asad commented Jun 19, 2026

Uh oh!

gltanaka left a comment

Uh oh!

Serhan-Asad commented Jun 19, 2026

Uh oh!

gltanaka left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Serhan-Asad commented Jun 19, 2026

Problem

Fix

Why coverage counts (not textual churn)

Safety / what still hard-fails

Cross-repo

Testing

Residual risk

Uh oh!

Serhan-Asad commented Jun 19, 2026

Independent verification — done in an isolated worktree

Deterministic verification

Live agentic end-to-end

Residual risk (already documented in the PR)

Uh oh!

gltanaka left a comment

Choose a reason for hiding this comment

Uh oh!

Serhan-Asad commented Jun 19, 2026

Uh oh!

gltanaka left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants