Skip to content

fix(engine): re-evaluate steps before concluding a deadlock#632

Open
rclsilver wants to merge 1 commit into
masterfrom
fix/engine-any-dependency-transitive-prune-deadlock
Open

fix(engine): re-evaluate steps before concluding a deadlock#632
rclsilver wants to merge 1 commit into
masterfrom
fix/engine-any-dependency-transitive-prune-deadlock

Conversation

@rclsilver

Copy link
Copy Markdown
Contributor

Summary

Fixes a BLOCKED_DEADLOCK that can occur when a step has an :ANY dependency on a step
that is pruned (or otherwise reaches a final state) before its own transitive ancestors
settle. The dependent step gets stuck in TODO forever even though it is actually runnable,
and the resolution ends in BLOCKED_DEADLOCK — a terminal state that is never retried.

Root cause

An :ANY dependency is considered met only when the parent step and its entire transitive
ancestor tree
are in a final state — this is what parentDepsAreFinal() enforces
(introduced in 9290d3c to prevent an :ANY dependent from running while an upstream step is
in an error state).

However, the engine's incremental candidate selection only re-evaluates the direct
dependents
of a modified step (via StepTreeIndex). These two mechanisms are inconsistent:
parentDepsAreFinal() creates a transitive dependency, but re-evaluation is only direct.

So when an :ANY parent reaches a final state early — typically pruned by cascade from a
sibling branch — while another branch of a join/fan-in ancestor is still pending:

  1. The :ANY dependent is evaluated once, at the moment its direct parent becomes final.
  2. parentDepsAreFinal() returns false because a transitive ancestor is still TODO, so the
    dependent stays TODO.
  3. When those ancestors later finish, only their direct dependents are reconsidered — never
    the :ANY dependent, whose only direct parent already settled and is no longer "modified".
  4. The loop drains, a TODO step remains, and the resolution is declared BLOCKED_DEADLOCK.

Minimal example

trigger → DONE (prunes coldStep)
coldStep → PRUNE
└─ cascades: joinStep → deployStep (both PRUNE)

slowDep → DONE → lateStep → DONE (sibling branch of joinStep, finishes late)

joinStep depends on [coldStep, lateStep]
deployStep depends on [joinStep]
unlockStep depends on [deployStep:ANY]

deployStep is pruned (cascade from coldStep) while lateStep is still pending, so when
unlockStep is evaluated parentDepsAreFinal() fails. By the time lateStep finishes,
unlockStep is never reconsidered → deadlock, even though unlockStep should run.

Fix

Make deadlock detection sound: before concluding BLOCKED_DEADLOCK, perform one full
re-evaluation of all steps (empty modifiedSteps ⇒ every step is a candidate), mirroring the
existing recheckWaiting mechanism and exactly what a fresh resolution run already does at
startup. If a step is genuinely runnable, the loop resumes; otherwise the deadlock is real and
is declared as before.

  • recheckDeadlock is reset to true whenever a step settles, so chains of :ANY dependents
    are covered. The number of rescans is bounded by the step count, so there is no infinite loop.
  • The rescan is gated on the resolution status actually being BLOCKED_DEADLOCK (no
    error/waiting/crashed step), so error paths are unaffected.
  • parentDepsAreFinal() is left untouched — it still correctly blocks :ANY dependents
    whose ancestors are in an error state.

Tests

Added TestAnyDependencyTransitivePrune (+ anyDependencyTransitivePrune.yaml) which models
the graph above deterministically: an early-pruned branch, a join step, a late-finishing
sibling branch, and an :ANY dependent.

  • Without the fix: unlockStep = TODO, resolution = BLOCKED_DEADLOCK (reproduces the bug).
  • With the fix: unlockStep = DONE, resolution = DONE.
  • Full ./engine/... suite passes, including TestIndirectDependencies and TestAnyDependency
    (no regression); go vet clean.

Notes

This prevents future deadlocks of this kind. A resolution already stuck in
BLOCKED_DEADLOCK still needs a manual resume/run — its steps will become eligible on the next
pass now that the dependency tree is fully final.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an engine deadlock edge case where steps with :ANY dependencies could remain stuck in TODO (and the resolution incorrectly end in BLOCKED_DEADLOCK) when their direct parent settles before its transitive ancestors have fully settled. The fix makes deadlock detection sound by performing a one-time full re-evaluation of step eligibility before finalizing a deadlock outcome.

Changes:

  • Add a recheckDeadlock mechanism in resolve() to force a full step eligibility rescan before concluding BLOCKED_DEADLOCK.
  • Add a deterministic regression test template modeling the transitive-ancestor/early-prune race with an :ANY dependent.
  • Add TestAnyDependencyTransitivePrune to validate the resolution completes successfully (and does not deadlock).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
engine/engine.go Adds a guarded full rescan of all steps when BLOCKED_DEADLOCK is detected, to catch missed :ANY transitive-ancestor eligibility changes.
engine/engine_test.go Adds a regression test asserting the :ANY dependent runs to completion and the resolution ends DONE.
engine/templates_tests/anyDependencyTransitivePrune.yaml Adds a new template reproducing the transitive prune race scenario deterministically.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread engine/engine.go Outdated
An ANY dependency depends transitively on its whole ancestor tree through
parentDepsAreFinal(), but incremental step selection only reconsiders the
direct dependents of a modified step (StepTreeIndex). When an ANY parent
reaches a final state (e.g. pruned by cascade) before its own transitive
ancestors do, the dependent is evaluated once, found ineligible, and never
reconsidered once those ancestors settle. The resolution then ends in
BLOCKED_DEADLOCK even though the step is actually runnable.

Before declaring a deadlock, re-evaluate every step (full candidate scan),
mirroring recheckWaiting and what a fresh resolution run already does. If a
step is genuinely runnable, resume the loop; otherwise the deadlock stands.
parentDepsAreFinal() is left untouched so it still blocks ANY dependents
whose ancestors are in an error state (see TestIndirectDependencies).

Add TestAnyDependencyTransitivePrune reproducing the race deterministically.

Signed-off-by: Thomas Bétrancourt <thomas@betrancourt.net>
@rclsilver rclsilver force-pushed the fix/engine-any-dependency-transitive-prune-deadlock branch from dd0478a to 435c504 Compare June 10, 2026 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants