fix(engine): re-evaluate steps before concluding a deadlock#632
Open
rclsilver wants to merge 1 commit into
Open
fix(engine): re-evaluate steps before concluding a deadlock#632rclsilver wants to merge 1 commit into
rclsilver wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes an engine deadlock edge case where steps with :ANY dependencies could remain stuck in TODO (and the resolution incorrectly end in BLOCKED_DEADLOCK) when their direct parent settles before its transitive ancestors have fully settled. The fix makes deadlock detection sound by performing a one-time full re-evaluation of step eligibility before finalizing a deadlock outcome.
Changes:
- Add a
recheckDeadlockmechanism inresolve()to force a full step eligibility rescan before concludingBLOCKED_DEADLOCK. - Add a deterministic regression test template modeling the transitive-ancestor/early-prune race with an
:ANYdependent. - Add
TestAnyDependencyTransitivePruneto validate the resolution completes successfully (and does not deadlock).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| engine/engine.go | Adds a guarded full rescan of all steps when BLOCKED_DEADLOCK is detected, to catch missed :ANY transitive-ancestor eligibility changes. |
| engine/engine_test.go | Adds a regression test asserting the :ANY dependent runs to completion and the resolution ends DONE. |
| engine/templates_tests/anyDependencyTransitivePrune.yaml | Adds a new template reproducing the transitive prune race scenario deterministically. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
An ANY dependency depends transitively on its whole ancestor tree through parentDepsAreFinal(), but incremental step selection only reconsiders the direct dependents of a modified step (StepTreeIndex). When an ANY parent reaches a final state (e.g. pruned by cascade) before its own transitive ancestors do, the dependent is evaluated once, found ineligible, and never reconsidered once those ancestors settle. The resolution then ends in BLOCKED_DEADLOCK even though the step is actually runnable. Before declaring a deadlock, re-evaluate every step (full candidate scan), mirroring recheckWaiting and what a fresh resolution run already does. If a step is genuinely runnable, resume the loop; otherwise the deadlock stands. parentDepsAreFinal() is left untouched so it still blocks ANY dependents whose ancestors are in an error state (see TestIndirectDependencies). Add TestAnyDependencyTransitivePrune reproducing the race deterministically. Signed-off-by: Thomas Bétrancourt <thomas@betrancourt.net>
dd0478a to
435c504
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a
BLOCKED_DEADLOCKthat can occur when a step has an:ANYdependency on a stepthat is pruned (or otherwise reaches a final state) before its own transitive ancestors
settle. The dependent step gets stuck in
TODOforever even though it is actually runnable,and the resolution ends in
BLOCKED_DEADLOCK— a terminal state that is never retried.Root cause
An
:ANYdependency is considered met only when the parent step and its entire transitiveancestor tree are in a final state — this is what
parentDepsAreFinal()enforces(introduced in
9290d3cto prevent an:ANYdependent from running while an upstream step isin an error state).
However, the engine's incremental candidate selection only re-evaluates the direct
dependents of a modified step (via
StepTreeIndex). These two mechanisms are inconsistent:parentDepsAreFinal()creates a transitive dependency, but re-evaluation is only direct.So when an
:ANYparent reaches a final state early — typically pruned by cascade from asibling branch — while another branch of a join/fan-in ancestor is still pending:
:ANYdependent is evaluated once, at the moment its direct parent becomes final.parentDepsAreFinal()returnsfalsebecause a transitive ancestor is stillTODO, so thedependent stays
TODO.the
:ANYdependent, whose only direct parent already settled and is no longer "modified".TODOstep remains, and the resolution is declaredBLOCKED_DEADLOCK.Minimal example
trigger → DONE (prunes coldStep)
coldStep → PRUNE
└─ cascades: joinStep → deployStep (both PRUNE)
slowDep → DONE → lateStep → DONE (sibling branch of joinStep, finishes late)
joinStep depends on [coldStep, lateStep]
deployStep depends on [joinStep]
unlockStep depends on [deployStep:ANY]
deployStepis pruned (cascade fromcoldStep) whilelateStepis still pending, so whenunlockStepis evaluatedparentDepsAreFinal()fails. By the timelateStepfinishes,unlockStepis never reconsidered → deadlock, even thoughunlockStepshould run.Fix
Make deadlock detection sound: before concluding
BLOCKED_DEADLOCK, perform one fullre-evaluation of all steps (empty
modifiedSteps⇒ every step is a candidate), mirroring theexisting
recheckWaitingmechanism and exactly what a fresh resolution run already does atstartup. If a step is genuinely runnable, the loop resumes; otherwise the deadlock is real and
is declared as before.
recheckDeadlockis reset totruewhenever a step settles, so chains of:ANYdependentsare covered. The number of rescans is bounded by the step count, so there is no infinite loop.
BLOCKED_DEADLOCK(noerror/waiting/crashed step), so error paths are unaffected.
parentDepsAreFinal()is left untouched — it still correctly blocks:ANYdependentswhose ancestors are in an error state.
Tests
Added
TestAnyDependencyTransitivePrune(+anyDependencyTransitivePrune.yaml) which modelsthe graph above deterministically: an early-pruned branch, a join step, a late-finishing
sibling branch, and an
:ANYdependent.unlockStep=TODO, resolution =BLOCKED_DEADLOCK(reproduces the bug).unlockStep=DONE, resolution =DONE../engine/...suite passes, includingTestIndirectDependenciesandTestAnyDependency(no regression);
go vetclean.Notes
This prevents future deadlocks of this kind. A resolution already stuck in
BLOCKED_DEADLOCKstill needs a manual resume/run — its steps will become eligible on the nextpass now that the dependency tree is fully final.