Skip to content

feat(#528): remove 19 trivial error gates across 6 workflows#535

Draft
PolyphonyRequiem wants to merge 1 commit into
mainfrom
squad/528-error-gate-migration
Draft

feat(#528): remove 19 trivial error gates across 6 workflows#535
PolyphonyRequiem wants to merge 1 commit into
mainfrom
squad/528-error-gate-migration

Conversation

@PolyphonyRequiem

@PolyphonyRequiem PolyphonyRequiem commented May 28, 2026

Copy link
Copy Markdown
Owner

Summary

Closes #528.

Removes all 19 trivial human_gate error-interrupt nodes catalogued in docs/projects/on-error-migration-inventory.md. Each gate was a pure deterministic error handler that interrupted live workflows to ask an operator to retry or abort. After this PR every such error routes automatically.

Why not on_error: syntax?

conductor v0.1.18 does NOT ship on_error: in route entries -- it is an unmerged RFC. The migration uses direct routing today, with TODO comments marking each site for retrofit once conductor Phase 1 lands (AB#3257).

Routing decisions (all 19 gates)

Gate File Old options New routing
poll_error_gate ado-pr.yaml retry / abort abort_run
poll_error_gate github-pr.yaml retry / abort abort_run
classify_error_gate restack-remedy.yaml retry / skip $end (auto-skip)
squash_coverage_error_gate implement-merge-group.yaml retry / abort abort_run
root_router_error_gate implement-merge-group.yaml retry / abort abort_run
workflow_error_gate actionable.yaml retry / abandon abort_run / workflow_abandoned
root_resolver_error_gate plan-level.yaml pure abort abort_run
type_loader_error_gate plan-level.yaml pure abort abort_run
ancestor_chain_error_gate plan-level.yaml retry / abort abort_run
state_detector_error_gate plan-level.yaml retry / abort abort_run (x2 routes)
write_plan_error_gate plan-level.yaml retry / abort abort_run
ensure_plan_branch_error_gate plan-level.yaml retry / abort abort_run
commit_and_push_error_gate plan-level.yaml retry / abort abort_run
open_plan_pr_error_gate plan-level.yaml retry / abort abort_run
poll_error_gate plan-level.yaml retry / abort abort_run
merge_error_gate plan-level.yaml retry / abort abort_run
seeder_error_gate plan-level.yaml retry / continue / abort child_router (auto-continue)
open_plan_pr_ado_error_gate plan-level.yaml retry / abort abort_run
merge_plan_pr_ado_error_gate plan-level.yaml retry / abort abort_run

Notes: evidence_reviewer block + merge_evidence_pr false route to workflow_abandoned (item not satisfied). seeder auto-continues to child_router (was: human continue option).

Behavioral changes

  • Retry capability removed from 14 retry+abort gates. Transient failures abort instead of pausing for human retry. Operators must re-trigger manually until AB#3257 lands.
  • seeder_error_gate: was prompt-to-continue; now auto-continues.
  • classify_error_gate: was prompt-to-skip-or-retry; now auto-skips.

Validation

  • PASS: ado-pr, github-pr, restack-remedy, implement-merge-group, actionable
  • FAIL: plan-level -- pre-existing circular sub-workflow self-reference (plan_one_child for_each references plan-level.yaml recursively); identical on main branch

Coordination

Touches github-pr.yaml and plan-level.yaml (also in Sibelius PR #529 squad/529-yaml-patches-twig-sync). Merge sequentially; resolve conflicts before second merge.


Phase 2 follow-up: #536

Once conductor's on_error: ships (microsoft/conductor#229), the 19 direct-routing TODO sites get retrofitted to proper on_error: blocks, restoring retry, recovery, and escalation behavior for the 14 gates that lost retry capability. Tracked in #536 - Phase 2: on_error: retrofit. Full scope doc: .squad/decisions/inbox/wagner-528-phase-2-scope-2026-05-28T22-38-04Z.md. Phase 2 is a child of epics #522 (short-term wins) and #521 (self-contained orchestration).

Migrates all 19 human_gate error-interrupt nodes identified in
docs/projects/on-error-migration-inventory.md to direct conductor routing.

conductor v0.1.18 does not yet ship on_error: in routes (RFC unmerged),
so the migration is implemented as direct re-routing today, with TODO
comments marking each site for retrofit once Phase 1 lands (AB#3257).

### Files changed

| File                        | Gates removed |
|-----------------------------|---------------|
| ado-pr.yaml                 | 1             |
| github-pr.yaml              | 1             |
| restack-remedy.yaml         | 1             |
| implement-merge-group.yaml  | 2             |
| actionable.yaml             | 1             |
| plan-level.yaml             | 13            |

### Routing decisions

- 16 retry+abort gates → abort_run (retry capability removed; operator must
  re-trigger the workflow on transient failures until AB#3257 lands)
- 2 pure-abort gates (root_resolver, type_loader) → abort_run (no change
  in effective behaviour)
- state_detector_error_gate (2 routes) → abort_run (was: abort-only gate)
- seeder_error_gate → child_router (auto-continue; was: continue option)
- classify_error_gate (restack-remedy) → \ (auto-skip; was: skip option)
- evidence_reviewer block + catch-all (actionable) → workflow_abandoned
- merge_evidence_pr false + catch-all (actionable) → workflow_abandoned

### Validation

- ado-pr, github-pr, restack-remedy, implement-merge-group, actionable: PASS
- plan-level: pre-existing circular sub-workflow self-reference (unchanged
  from main; plan_one_child for_each references plan-level.yaml recursively)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@PolyphonyRequiem PolyphonyRequiem left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mission Keeper review — three disposition decisions (gates not covered by Wagner's authority)

This is not a formal approve/request-changes — Daniel holds the merge call. Below are my dispositions on the three gates that required judgment beyond Wagner's individual authority.


Gate 1 — seeder_error_gatechild_router (auto-continue)

File: plan-level.yaml

Request change

Wagner picked the "Continue" option from the original gate as the automatic default. But the "Continue" option in the human gate still required explicit operator acknowledgment of the per-child error log. Auto-continue removes that acknowledgment entirely — per-child failures are silently swallowed.

The gate's own inline comment warned: "rather than continuing into child_router with a half-seeded tree (which silently terminates the root blocked)." Auto-continue is exactly the scenario the gate was designed to prevent.

Blast radius: Root enters child_router with a partially-seeded ADO tree. Missing children never surface in polyphony state next-ready. Root silently blocks with zero diagnostics.

Correct routing: abort_run — consistent with all other infrastructure errors in this PR. The seeder is idempotent; the operator re-triggers after fixing the cause.

- to: abort_run
  # TODO(AB#3257): replace with on_error: abort when conductor phase-1 lands
  when: "{{ seeder.output.error_count is defined and seeder.output.error_count > 0 }}"

Gate 2 — classify_error_gate$end (auto-skip)

File: restack-remedy.yaml

⚠️ Approve with note

Skip was already a first-class operator option in the original gate. restack-remedy is a utility helper — stale branches staying stale is bounded harm (no state corruption, no ADO disposition changes, retriable on next trigger). Disposition is defensible.

Flag for phase-2 (AB#3257): When on_error: lands, this site should emit a structured warning (error_code, root_id) rather than silently terminating. Currently the operator has zero signal that classify failed.


Gate 3 — evidence_reviewer block + merge_evidence_pr false → workflow_abandoned

File: actionable.yaml

Request change (partial)

Wagner's infrastructure/content split (infrastructure errors → abort_run; content/quality signals → workflow_abandoned) is a sound principle. The application is partly wrong.

Sub-path Wagner's routing My call Reason
evidence_reviewer.decision == 'block' workflow_abandoned ✅ Keep Reviewer quality judgment — abandonment correct
evidence_reviewer catch-all (unknown/missing decision) workflow_abandoned ❌ → abort_run Conductor invariant violation, not quality; shouldn't mark item abandoned in ADO
merge_evidence_pr.merged == false workflow_abandoned ❌ → abort_run Merge failure can be transient (network, conflict); workflow_abandoned commits irreversible ADO state transition
merge_evidence_pr catch-all workflow_abandoned ❌ → abort_run Infrastructure anomaly, not quality disposition

Key distinction: workflow_abandoned sets the ADO work item to abandoned — the operator must manually reopen it before re-triggering. abort_run just stops conductor; the item stays active; re-trigger is one command. For transient causes, workflow_abandoned is materially more expensive.


Escalations for Daniel

  1. Gate 1 — seeder partial-seed semantics: My call is abort_run on any error_count > 0. If you have evidence that partial-seed auto-continue is safe in practice, you can override. But the gate's own comment explicitly warned against it and I need sign-off to downgrade.

  2. Gate 3 — workflow_abandoned ADO side-effects: My ❌ on merge_evidence_pr.merged == false assumes workflow_abandoned commits a state transition in the ADO work item system. If it's conductor-only with no ADO write, the blast radius is lower and this could be ⚠️. Please confirm.


Wagner: for Gates 1 and 3, I'm not pushing to your branch — per boundaries I document what needs to change and you implement. Inbox record: .squad/decisions/inbox/beethoven-535-gate-dispositions-2026-05-28T15-37-47Z.md

PolyphonyRequiem pushed a commit that referenced this pull request May 31, 2026
Completed Scribe round 2026-05-28T16:43-14Z:

DECISIONS ARCHIVE & MERGE
- Pre-check: decisions.md = 208KB, inbox = 7 files
- 7-day cutoff applied (no entries before 2026-05-21; all current)
- Merged 7 inbox files to decisions.md:
  * Bach ADRs: domain-signal-envelope, gate-compression-pattern, polyphony-verb-error-boundary
  * Bach handoff: platespinner-gap-work
  * Beethoven: PR #535 gate disposition re-frame
  * Mahler: dogfood conductor pin to v0.1.17
  * Wagner: emit/type:notification schema findings
  * Copilot directives: vocabulary (domain-signal + emit:) + upstream PR policy
- Deleted inbox files post-merge

ORCHESTRATION LOGS (5 entries)
- 2026-05-28T23-43-14Z-wagner.md: Smoke-tested emit dogfood. PASSED.
- 2026-05-28T23-43-14Z-bach.md: 3 ADRs + platespinner handoff.
- 2026-05-28T23-43-14Z-mahler.md: Dogfood pinned v0.1.17.
- 2026-05-28T23-43-14Z-beethoven.md: PR #535 questions re-framed.
- 2026-05-28T23-43-14Z-coordinator-directive.md: User directives captured.

SESSION LOG
- 2026-05-28T23-43-14Z-bach-adrs-and-platespinner-handoff.md

HISTORY UPDATES
- Bach, Beethoven, Mahler, Wagner: 1-line inbox round summaries appended
- No history files >= 15KB; no summarization needed

HEALTH REPORT
- decisions.md before: 208KB → after: 227KB (+19KB inbox merge)
- Inbox: 7 files → 0 files
- History files updated: 4 (bach, beethoven, mahler, wagner)
- Max history size: 8.8KB (bach); threshold for summarization: 15KB

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate 19 trivial error gates to conductor on_error: (AB#3257)

1 participant