Guard orchestrator/session-driven task_start requeues from hold without breaking explicit human restarts #231

@fujiwaranosai850

Summary

Issue #227 fixed one stale-snapshot / heartbeat-health race, but the remaining PhysioLink #74 loop is still reproducible through a different path.

The remaining workflow.requeue events are coming from explicit task_start calls issued by the orchestrator session, not from heartbeat label repair.

Root cause found

I traced the live workflow.requeue events for PhysioLink issue #74 back to the main PhysioLink Telegram orchestrator session transcript:

Session file:

  • ~/.openclaw/agents/devclaw/sessions/7f4d6a98-186a-4ece-a800-31573a5a89f6-topic-1.jsonl

Confirmed task_start calls for issue #74:

  • 2026-05-07T01:28:52.215Z -> task_start(issueId=74, level=senior) from Planning -> To Do
  • 2026-05-08T02:58:37.127Z -> task_start(issueId=74, level=junior) from Refining -> To Do
  • 2026-05-08T03:22:49.663Z -> task_start(issueId=74, level=medior) from Refining -> To Do

The 02:58 and 03:22 cases are the remaining loop path.

Example transcript excerpt for the 02:58 requeue:

  • user: queue it back for the developer.
  • assistant: I’m sending #74 back to the developer queue for a conflict-resolution pass.
  • tool call: task_start { channelId: "-1003718418486", messageThreadId: 1, issueId: 74, level: "junior" }
  • tool result: from: "Refining", to: "To Do", transitioned: true

Example transcript excerpt for the 03:22 requeue:

Important additional finding

There is also a watcher/cron instruction in that same session lane that explicitly tells the orchestrator to requeue #74 from hold when it decides the issue is ready:

If the issue is ready to be requeued for developer work, requeue it with task_start(issueId=74, level=medior).

So the still-live loop family is not heartbeat health repair. It is the shared explicit restart machinery exposed through orchestrator/session behavior, including possible automation/watcher flows, and it records a normalized workflow.requeue event with source: "system".

Why this needs its own fix

We still need normal intentional human restarts from hold to work.

What appears to be missing is a guardrail that distinguishes:

  • explicit, operator-intended restart from hold
  • orchestrator/session automation or ambiguous assistant follow-through that should not silently restart a blocked issue
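
As a rough illustration of that distinction, a guard in front of the restart path might look like the sketch below. Every name in it (RestartRequest, Initiator, operatorConfirmed, guardRestartFromHold) is hypothetical and not taken from the codebase, and it assumes that "Refining" is the only hold state that matters here.

```typescript
// Hypothetical sketch: none of these names come from the actual codebase.
type Initiator = "operator" | "orchestrator" | "cron";

interface RestartRequest {
  issueId: number;
  fromState: string;          // workflow state the issue is in right now
  initiator: Initiator;       // who is asking for the restart
  operatorConfirmed: boolean; // true only if a human explicitly asked for this restart
}

type GuardDecision =
  | { allowed: true }
  | { allowed: false; reason: string };

// Assumption: "Refining" is the only hold state relevant to this issue.
const HOLD_STATES: readonly string[] = ["Refining"];

function isHoldState(state: string): boolean {
  return HOLD_STATES.indexOf(state) !== -1;
}

function guardRestartFromHold(req: RestartRequest): GuardDecision {
  // Restarts from non-hold states are outside the scope of this guard.
  if (!isHoldState(req.fromState)) {
    return { allowed: true };
  }
  // A direct operator call, or any call the caller marks as operator-confirmed, passes.
  if (req.initiator === "operator" || req.operatorConfirmed) {
    return { allowed: true };
  }
  // Everything else (watcher, cron, unprompted assistant follow-through) is refused.
  return {
    allowed: false,
    reason: `issue #${req.issueId} is on hold in ${req.fromState}; ` +
      `${req.initiator} restart needs operator confirmation`,
  };
}
```

Under this sketch, the 02:58 case (a user saying "queue it back for the developer") would pass only if the orchestrator sets operatorConfirmed for that turn, while the watcher/cron instruction would be refused. How operatorConfirmed gets established reliably is the open design question.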

Scope

Investigate and fix the remaining explicit requeue path around:

  • lib/tools/tasks/task-start.ts
  • any shared restart / queue-transition helpers used by orchestrator or cron/session automation
  • any watcher, intervention, or operator-follow-through paths that can invoke task_start on a hold-state issue after a blocked worker result
  • event/source semantics for workflow.requeue so live diagnostics can distinguish manual restart vs automated/system restart
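
For the last scope item, one possible shape for a more descriptive source field is sketched below. Only workflow.requeue and the existing source: "system" value come from the findings above; the "operator" and "automation" values, the other field names, and the CallContext flags are assumptions made for illustration.

```typescript
// Hypothetical event shape; everything except `type` and `source: "system"` is assumed.
type RequeueSource = "operator" | "automation" | "system";

interface RequeueEvent {
  type: "workflow.requeue";
  issueId: number;
  from: string;
  to: string;
  source: RequeueSource;
  timestamp: string; // ISO 8601
}

interface CallContext {
  scheduled: boolean;            // call came from a watcher/cron instruction
  humanRequestedInTurn: boolean; // a human message in the same turn asked for the restart
}

// Derive the source from call context instead of stamping every requeue as "system".
function classifyRequeueSource(ctx: CallContext): RequeueSource {
  if (ctx.scheduled) return "automation";
  if (ctx.humanRequestedInTurn) return "operator";
  return "system";
}

// Diagnostic helper: tally requeue events per source.
function countBySource(events: RequeueEvent[]): Record<RequeueSource, number> {
  const counts: Record<RequeueSource, number> = { operator: 0, automation: 0, system: 0 };
  for (const e of events) {
    counts[e.source] += 1;
  }
  return counts;
}
```

With a split like this, a live query over workflow.requeue events could show at a glance whether a loop is being driven by people or by automation.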

Acceptance criteria

  • Identify the exact code and behavior path that allows orchestrator/session automation to move a held issue from Refining -> To Do via task_start after a blocked worker result.
  • Prevent unintended/automatic requeues from that path.
  • Preserve normal explicit human restart behavior for held issues.
  • Add regression coverage for the explicit workflow.requeue after hold case, not just stale heartbeat snapshot behavior.
  • Verify the fix against the PhysioLink #74 scenario ("Stop poisoned dev/review loops and enforce recovery when PR branches become structurally invalid") in the live environment.
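
The regression case in the fourth criterion could be shaped roughly as follows. This is a self-contained in-memory model, not the real task_start: the state names other than "Refining" and "To Do", the Caller type, and the helpers reportBlocked, taskStart and check are all hypothetical stand-ins.

```typescript
// Hypothetical in-memory model for a regression test; not the real task_start.
type State = "To Do" | "Doing" | "Refining";
type Caller = "human" | "automation";

interface Issue { id: number; state: State; }
interface LoggedEvent { type: "workflow.requeue"; issueId: number; source: Caller; }

const log: LoggedEvent[] = [];

// A blocked worker result parks the issue on hold.
function reportBlocked(issue: Issue): void {
  issue.state = "Refining";
}

// Guarded restart: only a human caller may move an issue out of the hold state.
function taskStart(issue: Issue, caller: Caller): { transitioned: boolean } {
  if (issue.state === "Refining" && caller !== "human") {
    return { transitioned: false };
  }
  issue.state = "To Do";
  log.push({ type: "workflow.requeue", issueId: issue.id, source: caller });
  return { transitioned: true };
}

function check(cond: boolean, msg: string): void {
  if (!cond) throw new Error(msg);
}

// Scenario: blocked result, then an automated restart attempt, then a human restart.
const issue: Issue = { id: 74, state: "Doing" };
reportBlocked(issue);

const automated = taskStart(issue, "automation");
check(!automated.transitioned, "automation must not restart a held issue");
check(issue.state === "Refining", "held issue must stay on hold");
check(log.length === 0, "no workflow.requeue may be recorded for a refused restart");

const manual = taskStart(issue, "human");
check(manual.transitioned, "human restart from hold must still work");
check(log.length === 1 && log[0].source === "human", "exactly one requeue, attributed to the human");
```

The point of the shape is that both halves are asserted in one test: the refused automated restart and the preserved human restart.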
