feat(#528): remove 19 trivial error gates across 6 workflows#535
feat(#528): remove 19 trivial error gates across 6 workflows#535PolyphonyRequiem wants to merge 1 commit into
Conversation
Migrates all 19 human_gate error-interrupt nodes identified in docs/projects/on-error-migration-inventory.md to direct conductor routing. conductor v0.1.18 does not yet ship on_error: in routes (RFC unmerged), so the migration is implemented as direct re-routing today, with TODO comments marking each site for retrofit once Phase 1 lands (AB#3257). ### Files changed | File | Gates removed | |-----------------------------|---------------| | ado-pr.yaml | 1 | | github-pr.yaml | 1 | | restack-remedy.yaml | 1 | | implement-merge-group.yaml | 2 | | actionable.yaml | 1 | | plan-level.yaml | 13 | ### Routing decisions - 16 retry+abort gates → abort_run (retry capability removed; operator must re-trigger the workflow on transient failures until AB#3257 lands) - 2 pure-abort gates (root_resolver, type_loader) → abort_run (no change in effective behaviour) - state_detector_error_gate (2 routes) → abort_run (was: abort-only gate) - seeder_error_gate → child_router (auto-continue; was: continue option) - classify_error_gate (restack-remedy) → \ (auto-skip; was: skip option) - evidence_reviewer block + catch-all (actionable) → workflow_abandoned - merge_evidence_pr false + catch-all (actionable) → workflow_abandoned ### Validation - ado-pr, github-pr, restack-remedy, implement-merge-group, actionable: PASS - plan-level: pre-existing circular sub-workflow self-reference (unchanged from main; plan_one_child for_each references plan-level.yaml recursively) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PolyphonyRequiem
left a comment
There was a problem hiding this comment.
Mission Keeper review — three disposition decisions (gates not covered by Wagner's authority)
This is not a formal approve/request-changes — Daniel holds the merge call. Below are my dispositions on the three gates that required judgment beyond Wagner's individual authority.
Gate 1 — seeder_error_gate → child_router (auto-continue)
File: plan-level.yaml
❌ Request change
Wagner picked the "Continue" option from the original gate as the automatic default. But the "Continue" option in the human gate still required explicit operator acknowledgment of the per-child error log. Auto-continue removes that acknowledgment entirely — per-child failures are silently swallowed.
The gate's own inline comment warned: "rather than continuing into child_router with a half-seeded tree (which silently terminates the root blocked)." Auto-continue is exactly the scenario the gate was designed to prevent.
Blast radius: Root enters child_router with a partially-seeded ADO tree. Missing children never surface in polyphony state next-ready. Root silently blocks with zero diagnostics.
Correct routing: abort_run — consistent with all other infrastructure errors in this PR. The seeder is idempotent; the operator re-triggers after fixing the cause.
- to: abort_run
# TODO(AB#3257): replace with on_error: abort when conductor phase-1 lands
when: "{{ seeder.output.error_count is defined and seeder.output.error_count > 0 }}"Gate 2 — classify_error_gate → $end (auto-skip)
File: restack-remedy.yaml
Skip was already a first-class operator option in the original gate. restack-remedy is a utility helper — stale branches staying stale is bounded harm (no state corruption, no ADO disposition changes, retriable on next trigger). Disposition is defensible.
Flag for phase-2 (AB#3257): When on_error: lands, this site should emit a structured warning (error_code, root_id) rather than silently terminating. Currently the operator has zero signal that classify failed.
Gate 3 — evidence_reviewer block + merge_evidence_pr false → workflow_abandoned
File: actionable.yaml
❌ Request change (partial)
Wagner's infrastructure/content split (infrastructure errors → abort_run; content/quality signals → workflow_abandoned) is a sound principle. The application is partly wrong.
| Sub-path | Wagner's routing | My call | Reason |
|---|---|---|---|
evidence_reviewer.decision == 'block' |
workflow_abandoned |
✅ Keep | Reviewer quality judgment — abandonment correct |
evidence_reviewer catch-all (unknown/missing decision) |
workflow_abandoned |
❌ → abort_run |
Conductor invariant violation, not quality; shouldn't mark item abandoned in ADO |
merge_evidence_pr.merged == false |
workflow_abandoned |
❌ → abort_run |
Merge failure can be transient (network, conflict); workflow_abandoned commits irreversible ADO state transition |
merge_evidence_pr catch-all |
workflow_abandoned |
❌ → abort_run |
Infrastructure anomaly, not quality disposition |
Key distinction: workflow_abandoned sets the ADO work item to abandoned — the operator must manually reopen it before re-triggering. abort_run just stops conductor; the item stays active; re-trigger is one command. For transient causes, workflow_abandoned is materially more expensive.
Escalations for Daniel
-
Gate 1 — seeder partial-seed semantics: My call is
abort_runon anyerror_count > 0. If you have evidence that partial-seed auto-continue is safe in practice, you can override. But the gate's own comment explicitly warned against it and I need sign-off to downgrade. -
Gate 3 —
workflow_abandonedADO side-effects: My ❌ onmerge_evidence_pr.merged == falseassumesworkflow_abandonedcommits a state transition in the ADO work item system. If it's conductor-only with no ADO write, the blast radius is lower and this could be⚠️ . Please confirm.
Wagner: for Gates 1 and 3, I'm not pushing to your branch — per boundaries I document what needs to change and you implement. Inbox record: .squad/decisions/inbox/beethoven-535-gate-dispositions-2026-05-28T15-37-47Z.md
Completed Scribe round 2026-05-28T16:43-14Z: DECISIONS ARCHIVE & MERGE - Pre-check: decisions.md = 208KB, inbox = 7 files - 7-day cutoff applied (no entries before 2026-05-21; all current) - Merged 7 inbox files to decisions.md: * Bach ADRs: domain-signal-envelope, gate-compression-pattern, polyphony-verb-error-boundary * Bach handoff: platespinner-gap-work * Beethoven: PR #535 gate disposition re-frame * Mahler: dogfood conductor pin to v0.1.17 * Wagner: emit/type:notification schema findings * Copilot directives: vocabulary (domain-signal + emit:) + upstream PR policy - Deleted inbox files post-merge ORCHESTRATION LOGS (5 entries) - 2026-05-28T23-43-14Z-wagner.md: Smoke-tested emit dogfood. PASSED. - 2026-05-28T23-43-14Z-bach.md: 3 ADRs + platespinner handoff. - 2026-05-28T23-43-14Z-mahler.md: Dogfood pinned v0.1.17. - 2026-05-28T23-43-14Z-beethoven.md: PR #535 questions re-framed. - 2026-05-28T23-43-14Z-coordinator-directive.md: User directives captured. SESSION LOG - 2026-05-28T23-43-14Z-bach-adrs-and-platespinner-handoff.md HISTORY UPDATES - Bach, Beethoven, Mahler, Wagner: 1-line inbox round summaries appended - No history files >= 15KB; no summarization needed HEALTH REPORT - decisions.md before: 208KB → after: 227KB (+19KB inbox merge) - Inbox: 7 files → 0 files - History files updated: 4 (bach, beethoven, mahler, wagner) - Max history size: 8.8KB (bach); threshold for summarization: 15KB Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Closes #528.
Removes all 19 trivial
human_gateerror-interrupt nodes catalogued indocs/projects/on-error-migration-inventory.md. Each gate was a pure deterministic error handler that interrupted live workflows to ask an operator to retry or abort. After this PR every such error routes automatically.Why not
on_error:syntax?conductor v0.1.18 does NOT ship
on_error:in route entries -- it is an unmerged RFC. The migration uses direct routing today, with TODO comments marking each site for retrofit once conductor Phase 1 lands (AB#3257).Routing decisions (all 19 gates)
Notes: evidence_reviewer block + merge_evidence_pr false route to workflow_abandoned (item not satisfied). seeder auto-continues to child_router (was: human continue option).
Behavioral changes
Validation
Coordination
Touches github-pr.yaml and plan-level.yaml (also in Sibelius PR #529 squad/529-yaml-patches-twig-sync). Merge sequentially; resolve conflicts before second merge.
Phase 2 follow-up: #536
Once conductor's
on_error:ships (microsoft/conductor#229), the 19 direct-routing TODO sites get retrofitted to properon_error:blocks, restoring retry, recovery, and escalation behavior for the 14 gates that lost retry capability. Tracked in #536 - Phase 2: on_error: retrofit. Full scope doc:.squad/decisions/inbox/wagner-528-phase-2-scope-2026-05-28T22-38-04Z.md. Phase 2 is a child of epics #522 (short-term wins) and #521 (self-contained orchestration).