fix(core): auto-terminate sessions when agent exits but runtime stays alive (closes #1933, #1966)#2041
Conversation
… alive When the agent process exits but tmux stays alive (keep-alive shell), resolveProbeDecision treated runtime=alive + process=dead as a signal disagreement, ran the detecting cycle, and parked the session at stuck/probe_failure forever. The native activity signal + process probe both agree the agent is gone, so terminate directly instead. Closes AgentWrapper#1933, AgentWrapper#1966. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Greptile SummaryThis PR fixes a long-standing bug where sessions would get permanently parked in
Confidence Score: 5/5Safe to merge — the change is tightly scoped to a single new guard in resolveProbeDecision, the spawning-session protection is correctly preserved via the processProbe.state=dead requirement, and the previously flagged processProbe.failed concern is addressed. The new early-return branch has a narrow, well-defined trigger (native activity signal + process probe must independently agree the agent exited), the processProbe.failed guard prevents probe-error false-positives, and the processProbe.state=dead gate keeps spawning sessions unaffected. All three new unit tests target distinct branches of the decision, the integration test correction is accurate, and the deriveLegacyStatus mapping is straightforward. No unguarded paths or unintended side-effects were found. No files require special attention.
|
| Filename | Overview |
|---|---|
| packages/core/src/lifecycle-status-decisions.ts | Core fix: new early-return branch in resolveProbeDecision terminates sessions on activity=exited + processProbe=dead; processProbe.failed guard prevents premature termination on probe errors. |
| packages/core/src/lifecycle-state.ts | Adds agent_process_exited → killed mapping to deriveLegacyStatus; agent_process_exited was already in CanonicalSessionReason, just not handled in the terminated switch. |
| packages/core/src/tests/lifecycle-status-decisions.test.ts | New resolveProbeDecision test suite covers the terminate-on-exit path, signal-disagreement bypass, and the fallback-to-detecting case when no exit signal is present. |
| packages/core/src/tests/lifecycle-manager.test.ts | Corrects a mislabeled integration test that previously expected detecting but should have expected killed; now matches the fixed behaviour. |
| packages/core/src/tests/lifecycle-state.test.ts | Adds a deriveLegacyStatus assertion for terminated+agent_process_exited → killed; straightforward and correct. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[resolveProbeDecision called] --> B{activitySignal.state=valid\nactivity=exited\nprocessProbe.state=dead\n!processProbe.failed?}
B -- Yes --> C[Terminate: agent_process_exited\nstatus: killed]
B -- No --> D{runtimeProbe.failed OR\nprocessProbe.failed?}
D -- Yes --> E[createDetectingDecision\nprobe_failed evidence]
D -- No --> F{Signal disagreement?\nruntime≠process states OR\nruntime=dead+recent activity}
F -- Yes --> G[createDetectingDecision\nsignal_disagreement evidence]
F -- No --> H{runtime=dead\nprocess=unknown\ncanProbeRuntimeIdentity?}
H -- Yes --> I[createDetectingDecision\nruntime_dead evidence]
H -- No --> J{runtime=dead\nprocess=dead\n!recentActivityLiveness?}
J -- Yes --> K[Terminate: runtime_lost\nstatus: killed]
J -- No --> L[return null]
Reviews (2): Last reviewed commit: "fix(core): guard agent-exit termination ..." | Re-trigger Greptile
Align the new terminate-on-exit branch with the rest of resolveProbeDecision by requiring processProbe.failed === false before acting on its state, so a probe that defaults to dead on error can't terminate a live session. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
When an agent process exits but its tmux runtime stays alive (the keep-alive shell pattern),
resolveProbeDecisiontreatedruntime=alive+process=deadas a signal disagreement, ran the detecting cycle, and escalated the session tostuck/probe_failure. Every subsequent poll hit the same branch, so the session stayedstuckforever — lingering on the dashboard sidebar with an orange dot that no toggle could clear. Closes #1933 and #1966 (same root cause, different trigger).Root cause
packages/core/src/lifecycle-status-decisions.ts— theruntimeProbe.state === "alive" && processProbe.state === "dead"arm ofresolveProbeDecisionrouted intocreateDetectingDecision. But when the agent's native activity signal definitively reportsexited(and the process probe agrees), the runtime being alive is irrelevant — the agent is the source of truth. There was no path promotingstuck→terminatedonce all signals agreed the agent was gone.Fix
Add a branch at the top of
resolveProbeDecision: whenactivitySignal.state === "valid"+activity === "exited"andprocessProbe.state === "dead", terminate directly with reasonagent_process_exited(mapped to legacy statuskilledinderiveLegacyStatus). TheprocessProbe=deadrequirement keeps spawning sessions safe — their process probe staysunknownuntil the agent starts (#1035), so they are never killed prematurely.packages/core/src/lifecycle-status-decisions.ts— new terminate-on-exit branchpackages/core/src/lifecycle-state.ts— mapagent_process_exited→killedTest plan
resolveProbeDecision: terminates onactivity=exited, doesn't enter signal_disagreement, still treatsruntime=alive/process=deadas disagreement when the agent has not exitedderiveLegacyStatustest:terminated+agent_process_exited→killeddetecting) to expectkilledpnpm --filter @aoagents/ao-core test— 1326 passedpnpm --filter @aoagents/ao-core typecheck,pnpm lint(0 errors), fullpnpm build🤖 Generated with Claude Code