
fix(runner): bound pollUntilDone for wedged runs#19

Merged
enixCode merged 1 commit into main from fix/poll-deadline on Jun 2, 2026

@enixCode enixCode commented Jun 2, 2026

Problem

pollUntilDone looped forever (`while (true)`) for as long as a run stayed `running`. A detached run that never finalized (the light-runner detached-extract hang, fixed in light-runner 0.16.2) wedged the whole DAG, because the node poll never returned.

Fix (belt-and-suspenders on the light-process side)

The root cause is fixed in light-runner 0.16.2. This adds a safety net so a stuck run fails the node instead of hanging the orchestrator:

  • bounded node (timeout > 0): deadline = timeout + 5 min grace. The grace covers image pull, extraction and teardown, which all run outside the container's own timeout window. Past the deadline, pollUntilDone throws; the existing executeNode catch turns it into a failed node result (no process crash).
  • timeout 0 (opt-out): keeps polling indefinitely, unchanged.
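
To make the two cases above concrete, here is a minimal sketch of a bounded poll loop. It is not the code from this PR: the `RunState` and `PollOptions` shapes, the `getState` callback, the poll interval, and the injectable `now`/`sleep` helpers are all hypothetical, introduced only so the example is self-contained. Only the names `pollUntilDone` and `executeNode`, the 5 min grace, and the "timeout 0 means no deadline" rule come from the description.

```typescript
// Hypothetical shapes: the real light-process types are not shown in this PR.
type RunState = "running" | "succeeded" | "failed";

interface PollOptions {
  timeoutMs: number;                     // node.timeout in ms; 0 opts out of the deadline
  getState: () => Promise<RunState>;     // assumed status lookup for the run
  intervalMs?: number;                   // assumed poll interval
  now?: () => number;                    // injectable clock (assumption, eases testing)
  sleep?: (ms: number) => Promise<void>; // injectable delay (assumption, eases testing)
}

// 5 min grace for image pull, extraction and teardown.
const GRACE_MS = 5 * 60 * 1000;

async function pollUntilDone(opts: PollOptions): Promise<RunState> {
  const now = opts.now ?? Date.now;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  const intervalMs = opts.intervalMs ?? 1000;

  // Bounded node: deadline = timeout + grace. timeout 0: poll indefinitely.
  const deadline = opts.timeoutMs > 0 ? now() + opts.timeoutMs + GRACE_MS : Infinity;

  while (true) {
    const state = await opts.getState();
    if (state !== "running") return state;
    if (now() >= deadline) {
      throw new Error(
        `run still running past deadline (${opts.timeoutMs} ms timeout + ${GRACE_MS} ms grace)`
      );
    }
    await sleep(intervalMs);
  }
}

// Mirrors the described executeNode catch: a throw becomes a failed node
// result instead of crashing the process.
async function executeNode(opts: PollOptions): Promise<{ ok: boolean; error?: string }> {
  try {
    const state = await pollUntilDone(opts);
    return { ok: state === "succeeded" };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Injecting the clock and the delay is a design choice of this sketch only; it lets a test simulate a wedged run without actually waiting for the deadline.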

node.timeout is in milliseconds end-to-end (it is passed straight through to light-run, which caps it at 3.6M ms = 1 h), so the deadline math is done in ms.
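
As a worked example of the unit arithmetic, the sketch below computes the deadline for a given start time. The helper name `deadlineAt` and its parameters are hypothetical; only the 5 min grace, the 3.6M ms figure and the timeout 0 opt-out are taken from the description.

```typescript
// Hypothetical helper: everything is in milliseconds, matching node.timeout.
function deadlineAt(startMs: number, nodeTimeoutMs: number): number {
  const graceMs = 5 * 60 * 1000; // 5 min = 300,000 ms
  if (nodeTimeoutMs === 0) return Infinity; // opt-out: no deadline
  return startMs + nodeTimeoutMs + graceMs;
}

// A 30 s node timeout gives 30,000 + 300,000 = 330,000 ms after start.
const shortNode = deadlineAt(0, 30000);

// At the 1 h figure quoted for light-run: 3,600,000 + 300,000 = 3,900,000 ms (65 min).
const cappedNode = deadlineAt(0, 60 * 60 * 1000);
```

If the timeout were mistaken for seconds, the same 30,000 value would push the deadline out by hours rather than seconds, which is why keeping every term in ms matters here.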

build with cc

pollUntilDone looped forever while a run stayed `running`. A detached run that
never finalized (the light-runner detached-extract hang, fixed in light-runner
0.16.2) wedged the whole DAG: the node poll never returned. Belt-and-suspenders
on the light-process side so a stuck run fails the node instead of hanging.

- bounded node (timeout > 0): deadline = timeout + 5 min grace (image pull,
  extraction and teardown all run outside the container's own timeout window);
  past it, throw, which the executeNode catch turns into a failed node result
- timeout 0 (opt-out): keep polling indefinitely, unchanged

build with cc
@enixCode enixCode merged commit 03bc5e0 into main Jun 2, 2026
5 checks passed