Auto-recover one-session sync from legitimate test churn#1661
Conversation
The one-session test-churn gate blocks purely on textual line churn (difflib ratio > threshold) and runs only after the agentic session already succeeded. When a prompt/code change is substantial the regenerated test is ~100% different yet correct; the repair-directive retry cannot reduce genuine churn, so sync hard-fails an otherwise successful result and forces manual intervention. On churn exhaustion in one-session sync, accept the rewrite instead of hard-failing IFF it preserves coverage: every guarded pre-existing test file (canonical + alt-path) keeps at least as many test cases AND assertions, carries >=1 real assertion, deletes nothing, and is in a language we can measure (Python via ast; TS/JS via a single-pass comment/string/regex/template-aware scanner). Unknown languages, deletions, shrinking rewrites, and assertion-free rewrites keep the strict gate, so the coverage-destruction protection is unchanged. The agent's candidate bytes are reapplied (the retry restore reverts them); if any reapply write fails, the partial reapply is rolled back to pre-session bytes and the strict gate hard-fails. A PDD_TEST_CHURN_ACCEPTED marker is emitted; PDD_DISABLE_TEST_CHURN_AUTOACCEPT forces the strict gate. The sync_orchestration parallel churn loop is left strict. Keep the one_session_sync prompt + README in sync with the new behavior so a future regen does not revert the fix. Fixes promptdriven/pdd_cloud#2208 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
5efb7ac to
405660f
Compare
Independent verification — done in an isolated worktreeRan a full verification pass on this branch (separate worktree + venv, dedicated checkout at the PR head). Summary: the change is sound, narrowly scoped, and behaves exactly as described — including a live agentic run. Deterministic verification
Live agentic end-to-endDrove the real
Residual risk (already documented in the PR)The TS/JS scanner under-counts when an Verdict: verified, no regressions found, merge-ready. 🤖 Generated with Claude Code |
gltanaka
left a comment
There was a problem hiding this comment.
Do not merge as-is.
The underlying issue is real: one-session sync can hard-fail after a legitimate high-churn test rewrite, and the retry loop alone may not converge. But the TS/JS coverage-preserving gate is not safe enough to convert that hard-fail into success yet.
Blocker: the TSX counter can undercount baseline assertions in a common React-test shape where a JSX closing tag is followed by assertions on the same line. Repro on this branch:
from pdd.one_session_sync import _count_test_units, _candidate_preserves_coverage
pre = "it(\"x\", () => { render(<Provider><Widget /></Provider>); expect(a).toBe(1); expect(b).toBe(2); });\n"
post = "it(\"x\", () => { expect(a).toBe(1); });\n"
print(_count_test_units(pre, "tests/widget.test.tsx")) # (1, 0)
print(_count_test_units(post, "tests/widget.test.tsx")) # (1, 1)
print(_candidate_preserves_coverage(pre, post, "tests/widget.test.tsx")) # True_strip_js_noncode treats the slash in </Provider> as a regex literal opener and strips the rest of the line, so the existing assertions after the JSX closing tag disappear from the baseline count. A candidate with fewer real assertions can then pass _candidate_preserves_coverage and be auto-accepted. This is not just an adversarial case; compact render(...); expect(...) TSX tests are normal, and TSX is the motivating language for this PR.
Required changes before merge:
- Make the TS/JS counter JSX-aware enough to handle closing tags and same-line assertions, including
render(<Provider><Widget /></Provider>); expect(...). - Add a regression test at the
_candidate_preserves_coveragelevel showing that dropping assertions from that TSX shape is rejected. - If robust TSX counting is too much for this PR, keep TS/JS on the strict churn gate and limit auto-accept to languages with reliable measurement.
Until then, this can silently accept coverage loss while reporting PDD_PHASE: synced, which is exactly the class of failure the churn gate exists to prevent.
The TS/JS coverage counter treated the slash in a JSX closing tag
(</Provider>) and a self-closing tag (<Widget count={n} />) as a regex
opener, so _strip_js_noncode stripped the rest of the line and dropped
any assertions written after the tag on the SAME line. That undercounted
the baseline (e.g. render(<Provider><Widget /></Provider>); expect(a);
expect(b); measured 0 assertions) and could auto-accept a rewrite that
actually drops assertions from a compact React test — the exact coverage
loss the churn gate exists to prevent.
Fix: exclude the JSX tag-boundary chars < and > from the regex-opener set
and never treat /> as a regex opener, so same-line assertions after a
closing or self-closing tag are counted. Regex-literal stripping
(/expect(/, getByText(/loading/i), toMatch(/foo/), split(/,/)) is
preserved.
Adds regression tests at both the counter and _candidate_preserves_coverage
levels for the closing-tag and self-closing-with-expr-attr shapes,
including that dropping assertions from that TSX shape is now rejected.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks — reproduced your exact case and fixed it at the counter. You were right that this is a normal compact-React shape, not just adversarial. Root cause: Fix (pushed in
Your repro now: pre = 'it("x", () => { render(<Provider><Widget /></Provider>); expect(a).toBe(1); expect(b).toBe(2); });\n'
post = 'it("x", () => { expect(a).toBe(1); });\n'
_count_test_units(pre, "tests/widget.test.tsx") # (1, 2) <- was (1, 0)
_count_test_units(post, "tests/widget.test.tsx") # (1, 1)
_candidate_preserves_coverage(pre, post, "tests/widget.test.tsx") # False <- was TrueRegression tests added (both levels you asked for):
Verification: full |
gltanaka
left a comment
There was a problem hiding this comment.
The update resolves my previous blocker. I re-ran the exact TSX false-accept repro and it now counts the baseline as (1, 2), the candidate as (1, 1), and _candidate_preserves_coverage(...) returns False as required. The new regression tests cover both the counter and decision levels, including the compact render(<Provider><Widget /></Provider>); expect(...) shape.
Local verification on the updated PR:
conda run -n pdd pytest tests/test_one_session_sync.py -q-> 108 passedconda run -n pdd pytest tests/test_code_generator_main.py tests/test_sync_orchestration.py tests/test_sync_main.py -q-> 650 passed
This still relies on a coverage-count heuristic rather than semantic equivalence, but the remaining risk is bounded: it only runs after one-session churn recovery is exhausted, refuses deletions/shrinking/unmeasurable rewrites, emits PDD_TEST_CHURN_ACCEPTED, and has PDD_DISABLE_TEST_CHURN_AUTOACCEPT as a strict-gate kill switch. Merge makes sense now.
Problem
pdd sync(one-session path) hard-fails when an LLM produces a large but legitimate test rewrite. The test-churn gate blocks purely on textual line churn (difflibratio >PDD_TEST_CHURN_THRESHOLD, default 0.40) and runs only after the agentic session already reported success. When a prompt/code change is substantial, the regenerated test is ~100% different yet correct; the repair-directive retry cannot reduce genuine churn, so after the bounded retries the gate hard-fails an otherwise-successful result and forces manual intervention.Observed in
promptdriven/pdd_cloud#2163(PR #2166): modulehackathon_organizer_event_detail_page(typescriptreact) failed repeatedly withTest churn threshold exceeded,churn ratio: 1, burning ~$8–10 per run before aborting.Fix
On churn exhaustion in one-session sync only (identical-signature short-circuit or the
_MAX_TEST_CHURN_ATTEMPTScap), accept the rewrite instead of hard-failing — but only when it preserves coverage. For every guarded pre-existing test file (canonical + alt-path snapshot), the candidate must:ast; TS/JS via a single-pass comment/string/regex/template scanner).Anything else — unknown language, deletion, shrinking, vacuous (assertion-free) rewrite, unparseable source — keeps the strict gate unchanged, so the coverage-destruction protection (#1012 / #1015) is intact. On accept, the agent's candidate bytes (reverted by the retry's restore) are reapplied to disk; if any reapply write fails, acceptance is refused (no false success). A
PDD_TEST_CHURN_ACCEPTED:marker + warning is emitted. The parallelsync_orchestration.pychurn loop is intentionally left strict.Kill switch:
PDD_DISABLE_TEST_CHURN_AUTOACCEPT=1forces the strict gate.Why coverage counts (not textual churn)
Textual line churn is a poor proxy: a coverage-preserving reorganization scores ~1.0. Counting test cases + assertions distinguishes a legitimate modernization (kept/added coverage) from silent coverage destruction (dropped cases/assertions), which is exactly what the gate exists to catch.
Safety / what still hard-fails
Cross-repo
This is the CLI root-cause fix for
promptdriven/pdd_cloud#2208. The cloud picks it up by bumping its pinnedPDD_SOURCE_SHA; it can optionally surface thePDD_TEST_CHURN_ACCEPTEDmarker to tell users a test was substantially (but safely) rewritten.Testing
tests/test_one_session_sync.py: +19 tests — accept on coverage-preserving rewrite (Python canonical, alt-path multi-file, TSX end-to-end); reject on dropped cases / gutted assertions / deletion / env-disable / write-failure / partial-write-failure;_count_test_unitsunit coverage (AST comment/string immunity, unparseable→None, JS method-call exclusion, chained modifiers, URL-in-string, template${}interpolation, regex-literal exclusion).tests/test_one_session_sync.py(104),test_code_generator_main.pychurn gate (30),test_sync_orchestration.pychurn (7),test_sync_main.py(95) all green.Residual risk
TS/JS counting is a best-effort scanner (perfect counting needs a JS parser). Remaining false-accept paths are third-order and require adversarially-crafted input (e.g. an assertion dropped while a same-count
expect(is hidden after areturn-position regex), which a non-adversarial LLM rewrite does not produce. Measurement is symmetric, the result is human-reviewed in a PR with CI as the pass/fail backstop, and the env kill-switch disables the behavior entirely.🤖 Generated with Claude Code