fix(core): auto-terminate sessions when agent exits but runtime stays alive (closes #1933, #1966) #2041

Open
harshitsinghbhandari wants to merge 3 commits into AgentWrapper:main from harshitsinghbhandari:feat/1933

@harshitsinghbhandari
Contributor

Summary

When an agent process exits but its tmux runtime stays alive (the keep-alive shell pattern), resolveProbeDecision treated runtime=alive + process=dead as a signal disagreement, ran the detecting cycle, and escalated the session to stuck/probe_failure. Every subsequent poll hit the same branch, so the session stayed stuck forever — lingering on the dashboard sidebar with an orange dot that no toggle could clear. Closes #1933 and #1966 (same root cause, different trigger).

Root cause

packages/core/src/lifecycle-status-decisions.ts: the runtimeProbe.state === "alive" && processProbe.state === "dead" arm of resolveProbeDecision routed into createDetectingDecision. But when the agent's native activity signal definitively reports exited (and the process probe agrees), the runtime being alive is irrelevant, because the agent is the source of truth. There was no path promoting stuck → terminated once all signals agreed the agent was gone.

Fix

Add a branch at the top of resolveProbeDecision: when activitySignal.state === "valid" + activity === "exited" and processProbe.state === "dead", terminate directly with reason agent_process_exited (mapped to legacy status killed in deriveLegacyStatus). The processProbe=dead requirement keeps spawning sessions safe — their process probe stays unknown until the agent starts (#1035), so they are never killed prematurely.

  • packages/core/src/lifecycle-status-decisions.ts: new terminate-on-exit branch
  • packages/core/src/lifecycle-state.ts: map agent_process_exited → killed
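
The branch described in the Fix section can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `ProbeResult`, `ActivitySignal`, and `TerminateDecision` shapes and the `terminateOnAgentExit` helper name are assumptions introduced here. Only the field names and values quoted in the description (`activitySignal.state`, `activity`, `processProbe.state`, `agent_process_exited`) come from the PR text.

```typescript
// Hypothetical shapes for illustration; the real types in packages/core may differ.
type ProbeState = "alive" | "dead" | "unknown";

interface ProbeResult {
  state: ProbeState;
}

interface ActivitySignal {
  state: string; // only the value "valid" is named in the PR description
  activity?: string; // e.g. "exited"
}

interface TerminateDecision {
  status: "terminated";
  reason: "agent_process_exited";
}

// Hypothetical helper: returns a terminate decision only when the native
// activity signal and the process probe both say the agent is gone.
// The runtime probe is deliberately not consulted.
function terminateOnAgentExit(
  activitySignal: ActivitySignal,
  processProbe: ProbeResult,
): TerminateDecision | null {
  const agentReportedExit =
    activitySignal.state === "valid" && activitySignal.activity === "exited";

  // Spawning sessions keep processProbe.state === "unknown", so they never match.
  if (agentReportedExit && processProbe.state === "dead") {
    return { status: "terminated", reason: "agent_process_exited" };
  }
  return null; // fall through to the existing probe logic
}
```

In this sketch a `null` result stands for "no early decision", so the remaining arms of the decision function would still run for sessions whose agent has not reported an exit.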

Test plan

  • New unit tests for resolveProbeDecision: terminates on activity=exited, doesn't enter signal_disagreement, still treats runtime=alive/process=dead as disagreement when the agent has not exited
  • New deriveLegacyStatus test: terminated + agent_process_exited → killed
  • Updated the mislabeled existing test ("detects killed state when getActivityState returns exited" was asserting detecting) to expect killed
  • Verified spawning sessions are unaffected (the regression test for #1035, "fix: race condition marks newly spawned sessions as 'killed' before tmux initializes", is still green)
  • pnpm --filter @aoagents/ao-core test — 1326 passed
  • pnpm --filter @aoagents/ao-core typecheck, pnpm lint (0 errors), full pnpm build
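
The deriveLegacyStatus item in the test plan amounts to a single mapping check, which might look something like the sketch below. The function name `legacyStatusForTerminated` and its signature are hypothetical; the real deriveLegacyStatus takes a richer lifecycle object and handles other reasons that are omitted here.

```typescript
// Hypothetical, reduced stand-in for the terminated switch in deriveLegacyStatus.
// Only the mapping named in this PR is modelled; other reasons are left out.
type LegacyStatus = "killed";

function legacyStatusForTerminated(reason: string): LegacyStatus | undefined {
  switch (reason) {
    case "agent_process_exited": // mapping added by this PR
      return "killed";
    default:
      return undefined; // reasons not covered by this sketch
  }
}

// Tiny assertion helper so the check reads like the unit test it mirrors.
function expectEqual<T>(actual: T, expected: T, label: string): void {
  if (actual !== expected) {
    throw new Error(`${label}: expected ${String(expected)}, got ${String(actual)}`);
  }
}

expectEqual(
  legacyStatusForTerminated("agent_process_exited"),
  "killed",
  "terminated + agent_process_exited",
);
```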

🤖 Generated with Claude Code

harshitsinghbhandari and others added 2 commits May 23, 2026 16:01
… alive

When the agent process exits but tmux stays alive (keep-alive shell),
resolveProbeDecision treated runtime=alive + process=dead as a signal
disagreement, ran the detecting cycle, and parked the session at
stuck/probe_failure forever. The native activity signal + process probe
both agree the agent is gone, so terminate directly instead.

Closes AgentWrapper#1933, AgentWrapper#1966.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented May 23, 2026

Greptile Summary

This PR fixes a long-standing bug where sessions would get permanently parked in stuck/probe_failure state when an agent process exits but the tmux keep-alive shell keeps the runtime alive. The root cause was resolveProbeDecision treating runtime=alive + process=dead as a signal disagreement, escalating to stuck with no path out.

  • lifecycle-status-decisions.ts: Adds an early-return branch that fires when the native activity signal (valid + exited) and the process probe (dead + !failed) both confirm the agent has exited, terminating with reason agent_process_exited regardless of the runtime probe state. The processProbe.state === "dead" guard correctly excludes spawning sessions whose process probe stays unknown until the agent starts.
  • lifecycle-state.ts: Extends the terminated switch in deriveLegacyStatus to map agent_process_exited → killed, closing the loop on legacy status derivation.
  • Tests: Existing mislabeled test corrected from detecting to killed; new unit tests cover the terminate path, the no-signal-disagreement invariant, and the runtime_lost fallback when no exit signal is present.

Confidence Score: 5/5

Safe to merge — the change is tightly scoped to a single new guard in resolveProbeDecision, the spawning-session protection is correctly preserved via the processProbe.state=dead requirement, and the previously flagged processProbe.failed concern is addressed.

The new early-return branch has a narrow, well-defined trigger (native activity signal + process probe must independently agree the agent exited), the processProbe.failed guard prevents probe-error false-positives, and the processProbe.state=dead gate keeps spawning sessions unaffected. All three new unit tests target distinct branches of the decision, the integration test correction is accurate, and the deriveLegacyStatus mapping is straightforward. No unguarded paths or unintended side-effects were found.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/core/src/lifecycle-status-decisions.ts | Core fix: new early-return branch in resolveProbeDecision terminates sessions on activity=exited + processProbe=dead; processProbe.failed guard prevents premature termination on probe errors. |
| packages/core/src/lifecycle-state.ts | Adds agent_process_exited → killed mapping to deriveLegacyStatus; agent_process_exited was already in CanonicalSessionReason, just not handled in the terminated switch. |
| packages/core/src/tests/lifecycle-status-decisions.test.ts | New resolveProbeDecision test suite covers the terminate-on-exit path, signal-disagreement bypass, and the fallback-to-detecting case when no exit signal is present. |
| packages/core/src/tests/lifecycle-manager.test.ts | Corrects a mislabeled integration test that previously expected detecting but should have expected killed; now matches the fixed behaviour. |
| packages/core/src/tests/lifecycle-state.test.ts | Adds a deriveLegacyStatus assertion for terminated+agent_process_exited → killed; straightforward and correct. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[resolveProbeDecision called] --> B{activitySignal.state=valid\nactivity=exited\nprocessProbe.state=dead\n!processProbe.failed?}
    B -- Yes --> C[Terminate: agent_process_exited\nstatus: killed]
    B -- No --> D{runtimeProbe.failed OR\nprocessProbe.failed?}
    D -- Yes --> E[createDetectingDecision\nprobe_failed evidence]
    D -- No --> F{Signal disagreement?\nruntime≠process states OR\nruntime=dead+recent activity}
    F -- Yes --> G[createDetectingDecision\nsignal_disagreement evidence]
    F -- No --> H{runtime=dead\nprocess=unknown\ncanProbeRuntimeIdentity?}
    H -- Yes --> I[createDetectingDecision\nruntime_dead evidence]
    H -- No --> J{runtime=dead\nprocess=dead\n!recentActivityLiveness?}
    J -- Yes --> K[Terminate: runtime_lost\nstatus: killed]
    J -- No --> L[return null]
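
Read top to bottom, the flowchart describes an ordered chain of guards. A rough TypeScript rendering of that order is given below as a sketch only. The `ProbeInputs` and `Outcome` shapes and the `decideSketch` name are invented for illustration, and "signal disagreement" is assumed to mean that one probe reports alive while the other reports dead (an unknown state is treated as not disagreeing so that the later branches stay reachable). The real resolveProbeDecision may differ in all of these details.

```typescript
// Hypothetical types; not the actual signatures from packages/core.
type ProbeState = "alive" | "dead" | "unknown";
interface Probe { state: ProbeState; failed: boolean }

interface ProbeInputs {
  runtimeProbe: Probe;
  processProbe: Probe;
  activitySignal: { state: string; activity?: string };
  recentActivityLiveness: boolean;
  canProbeRuntimeIdentity: boolean;
}

type Outcome =
  | { kind: "terminate"; reason: "agent_process_exited" | "runtime_lost" }
  | { kind: "detecting"; evidence: "probe_failed" | "signal_disagreement" | "runtime_dead" }
  | null;

function decideSketch(input: ProbeInputs): Outcome {
  const { runtimeProbe: rt, processProbe: proc, activitySignal: act } = input;

  // 1. New branch: activity signal and a non-failed process probe agree on exit.
  if (act.state === "valid" && act.activity === "exited" && proc.state === "dead" && !proc.failed) {
    return { kind: "terminate", reason: "agent_process_exited" };
  }
  // 2. Any failed probe goes to the detecting cycle.
  if (rt.failed || proc.failed) {
    return { kind: "detecting", evidence: "probe_failed" };
  }
  // 3. Disagreement (assumed: alive vs dead), or a dead runtime with recent activity.
  const probesConflict =
    (rt.state === "alive" && proc.state === "dead") ||
    (rt.state === "dead" && proc.state === "alive");
  if (probesConflict || (rt.state === "dead" && input.recentActivityLiveness)) {
    return { kind: "detecting", evidence: "signal_disagreement" };
  }
  // 4. Dead runtime, unknown process, and the runtime identity can be probed.
  if (rt.state === "dead" && proc.state === "unknown" && input.canProbeRuntimeIdentity) {
    return { kind: "detecting", evidence: "runtime_dead" };
  }
  // 5. Everything is dead and nothing suggests recent liveness.
  if (rt.state === "dead" && proc.state === "dead" && !input.recentActivityLiveness) {
    return { kind: "terminate", reason: "runtime_lost" };
  }
  return null;
}
```

Under these assumptions, the keep-alive shell case (runtime alive, process dead, activity exited) is resolved by the first guard and never reaches the disagreement check in step 3.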

Reviews (2): Last reviewed commit: "fix(core): guard agent-exit termination ..."

Comment thread packages/core/src/lifecycle-status-decisions.ts
Align the new terminate-on-exit branch with the rest of resolveProbeDecision
by requiring processProbe.failed === false before acting on its state, so a
probe that defaults to dead on error can't terminate a live session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
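
To illustrate why the processProbe.failed check matters, here is a small sketch of a probe wrapper that falls back to dead when the underlying check throws. Both `probeProcess` and `agentExitConfirmed` are hypothetical names written for this example; whether the real probe defaults to dead on error is taken from the commit message above, not verified against the code.

```typescript
interface ProcessProbe {
  state: "alive" | "dead" | "unknown";
  failed: boolean;
}

// Hypothetical probe wrapper: if the liveness check itself errors,
// it reports "dead" but flags the result as failed.
function probeProcess(check: () => boolean): ProcessProbe {
  try {
    return { state: check() ? "alive" : "dead", failed: false };
  } catch {
    return { state: "dead", failed: true };
  }
}

// Guard in the spirit of the commit: only trust "dead" from a probe that did not fail.
function agentExitConfirmed(activityExited: boolean, probe: ProcessProbe): boolean {
  return activityExited && probe.failed === false && probe.state === "dead";
}
```

With this guard, a probe that errored (state dead, failed true) does not confirm the exit, while a clean probe reporting dead does.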
Development

Successfully merging this pull request may close these issues.

bug(core): stuck session with exited runtime never auto-promotes to terminated — lingers in sidebar indefinitely

1 participant