fix(core): auto-terminate sessions when agent exits but runtime stays alive (closes #1933, #1966) by harshitsinghbhandari · Pull Request #2041 · AgentWrapper/agent-orchestrator

harshitsinghbhandari · 2026-05-23T10:32:55Z

Summary

When an agent process exits but its tmux runtime stays alive (the keep-alive shell pattern), resolveProbeDecision treated runtime=alive + process=dead as a signal disagreement, ran the detecting cycle, and escalated the session to stuck/probe_failure. Every subsequent poll hit the same branch, so the session stayed stuck forever — lingering on the dashboard sidebar with an orange dot that no toggle could clear. Closes #1933 and #1966 (same root cause, different trigger).

Root cause

packages/core/src/lifecycle-status-decisions.ts — the runtimeProbe.state === "alive" && processProbe.state === "dead" arm of resolveProbeDecision routed into createDetectingDecision. But when the agent's native activity signal definitively reports exited (and the process probe agrees), the runtime being alive is irrelevant — the agent is the source of truth. There was no path promoting stuck → terminated once all signals agreed the agent was gone.

Fix

Add a branch at the top of resolveProbeDecision: when activitySignal.state === "valid" + activity === "exited" and processProbe.state === "dead", terminate directly with reason agent_process_exited (mapped to legacy status killed in deriveLegacyStatus). The processProbe=dead requirement keeps spawning sessions safe — their process probe stays unknown until the agent starts (#1035), so they are never killed prematurely.

packages/core/src/lifecycle-status-decisions.ts — new terminate-on-exit branch
packages/core/src/lifecycle-state.ts — map agent_process_exited → killed
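
The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the type shapes, the helper name `terminateOnAgentExit`, and the non-`valid` / non-`exited` union members are assumptions made for the example. The `failed` check reflects the follow-up guard mentioned in the review summary.

```typescript
// Hypothetical, simplified shapes; the real types in packages/core may differ.
type ProbeState = "alive" | "dead" | "unknown";

interface ProbeResult {
  state: ProbeState;
  failed: boolean; // true when the probe itself errored
}

interface ActivitySignal {
  // Only "valid" appears in the PR text; the other members are placeholders.
  state: "valid" | "stale" | "missing";
  // Only "exited" appears in the PR text; the other members are placeholders.
  activity?: "active" | "idle" | "exited";
}

interface TerminateDecision {
  kind: "terminate";
  reason: "agent_process_exited";
}

// Returns a terminate decision only when both agent-level signals agree
// that the agent is gone. The runtime probe is deliberately not consulted.
function terminateOnAgentExit(
  activitySignal: ActivitySignal,
  processProbe: ProbeResult,
): TerminateDecision | null {
  const agentReportedExit =
    activitySignal.state === "valid" && activitySignal.activity === "exited";
  const processConfirmedDead =
    processProbe.state === "dead" && !processProbe.failed;

  if (agentReportedExit && processConfirmedDead) {
    return { kind: "terminate", reason: "agent_process_exited" };
  }
  return null; // fall through to the existing probe logic
}
```

Under these assumptions, a spawning session (process probe still `unknown`) returns `null` and continues through the existing branches, which is the property the #1035 regression test protects.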

Test plan

New unit tests for resolveProbeDecision: terminates on activity=exited, doesn't enter signal_disagreement, still treats runtime=alive/process=dead as disagreement when the agent has not exited
New deriveLegacyStatus test: terminated + agent_process_exited → killed
Updated the mislabeled existing test ("detects killed state when getActivityState returns exited" was asserting detecting) to expect killed
Verified spawning sessions are unaffected (the #1035 regression test, "fix: race condition marks newly spawned sessions as 'killed' before tmux initializes", is still green)
pnpm --filter @aoagents/ao-core test — 1326 passed
pnpm --filter @aoagents/ao-core typecheck, pnpm lint (0 errors), full pnpm build

🤖 Generated with Claude Code

… alive

When the agent process exits but tmux stays alive (keep-alive shell), resolveProbeDecision treated runtime=alive + process=dead as a signal disagreement, ran the detecting cycle, and parked the session at stuck/probe_failure forever. The native activity signal + process probe both agree the agent is gone, so terminate directly instead.

Closes AgentWrapper#1933, AgentWrapper#1966.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps · 2026-05-23T10:36:53Z

Greptile Summary

This PR fixes a long-standing bug where sessions would get permanently parked in stuck/probe_failure state when an agent process exits but the tmux keep-alive shell keeps the runtime alive. The root cause was resolveProbeDecision treating runtime=alive + process=dead as a signal disagreement, escalating to stuck with no path out.

lifecycle-status-decisions.ts: Adds an early-return branch that fires when the native activity signal (valid + exited) and the process probe (dead + !failed) both confirm the agent has exited, terminating with reason agent_process_exited regardless of the runtime probe state. The processProbe.state === "dead" guard correctly excludes spawning sessions whose process probe stays unknown until the agent starts.
lifecycle-state.ts: Extends the terminated switch in deriveLegacyStatus to map agent_process_exited → killed, closing the loop on legacy status derivation.
Tests: Existing mislabeled test corrected from detecting to killed; new unit tests cover the terminate path, the no-signal-disagreement invariant, and the runtime_lost fallback when no exit signal is present.
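
As a rough illustration of the legacy-status mapping mentioned above, the sketch below maps a terminated reason to a legacy status. The function name `legacyStatusForTerminated` and the `"terminated"` default are assumptions for the example; only the `agent_process_exited` → `killed` and `runtime_lost` → `killed` mappings come from this PR page.

```typescript
// Hypothetical stand-in for the terminated arm of deriveLegacyStatus.
type LegacyStatus = "killed" | "terminated"; // "terminated" is a placeholder default

function legacyStatusForTerminated(reason: string): LegacyStatus {
  switch (reason) {
    case "agent_process_exited": // mapping added by this PR
    case "runtime_lost": // shown as "status: killed" in the review flowchart
      return "killed";
    default:
      // Assumption: the real switch handles other reasons; they are not listed here.
      return "terminated";
  }
}
```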

Confidence Score: 5/5

Safe to merge — the change is tightly scoped to a single new guard in resolveProbeDecision, the spawning-session protection is correctly preserved via the processProbe.state=dead requirement, and the previously flagged processProbe.failed concern is addressed.

The new early-return branch has a narrow, well-defined trigger (native activity signal + process probe must independently agree the agent exited), the processProbe.failed guard prevents probe-error false-positives, and the processProbe.state=dead gate keeps spawning sessions unaffected. All three new unit tests target distinct branches of the decision, the integration test correction is accurate, and the deriveLegacyStatus mapping is straightforward. No unguarded paths or unintended side-effects were found.

No files require special attention.

Important Files Changed

Filename	Overview
packages/core/src/lifecycle-status-decisions.ts	Core fix: new early-return branch in resolveProbeDecision terminates sessions on activity=exited + processProbe=dead; processProbe.failed guard prevents premature termination on probe errors.
packages/core/src/lifecycle-state.ts	Adds agent_process_exited → killed mapping to deriveLegacyStatus; agent_process_exited was already in CanonicalSessionReason, just not handled in the terminated switch.
packages/core/src/tests/lifecycle-status-decisions.test.ts	New resolveProbeDecision test suite covers the terminate-on-exit path, signal-disagreement bypass, and the fallback-to-detecting case when no exit signal is present.
packages/core/src/tests/lifecycle-manager.test.ts	Corrects a mislabeled integration test that previously expected detecting but should have expected killed; now matches the fixed behaviour.
packages/core/src/tests/lifecycle-state.test.ts	Adds a deriveLegacyStatus assertion for terminated+agent_process_exited → killed; straightforward and correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[resolveProbeDecision called] --> B{activitySignal.state=valid\nactivity=exited\nprocessProbe.state=dead\n!processProbe.failed?}
    B -- Yes --> C[Terminate: agent_process_exited\nstatus: killed]
    B -- No --> D{runtimeProbe.failed OR\nprocessProbe.failed?}
    D -- Yes --> E[createDetectingDecision\nprobe_failed evidence]
    D -- No --> F{Signal disagreement?\nruntime≠process states OR\nruntime=dead+recent activity}
    F -- Yes --> G[createDetectingDecision\nsignal_disagreement evidence]
    F -- No --> H{runtime=dead\nprocess=unknown\ncanProbeRuntimeIdentity?}
    H -- Yes --> I[createDetectingDecision\nruntime_dead evidence]
    H -- No --> J{runtime=dead\nprocess=dead\n!recentActivityLiveness?}
    J -- Yes --> K[Terminate: runtime_lost\nstatus: killed]
    J -- No --> L[return null]
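
The decision order in the flowchart above could be expressed in code roughly as below. This is a sketch under stated assumptions: the `Inputs` shape, the string outcomes, and the function name `decide` are invented for illustration, and "signal disagreement" is interpreted here as one probe reporting `alive` while the other reports `dead` (or a dead runtime with recent activity), which may not match the real condition exactly.

```typescript
type ProbeState = "alive" | "dead" | "unknown";

interface Probe {
  state: ProbeState;
  failed: boolean;
}

// Hypothetical flattened inputs; the real function takes richer objects.
interface Inputs {
  activityValid: boolean;
  activity: "active" | "idle" | "exited" | null;
  runtimeProbe: Probe;
  processProbe: Probe;
  canProbeRuntimeIdentity: boolean;
  recentActivityLiveness: boolean;
}

type Outcome =
  | "terminate:agent_process_exited"
  | "detecting:probe_failed"
  | "detecting:signal_disagreement"
  | "detecting:runtime_dead"
  | "terminate:runtime_lost"
  | null;

function decide(i: Inputs): Outcome {
  const { runtimeProbe, processProbe } = i;

  // 1. New early return: both agent-level signals agree the agent exited.
  if (
    i.activityValid &&
    i.activity === "exited" &&
    processProbe.state === "dead" &&
    !processProbe.failed
  ) {
    return "terminate:agent_process_exited";
  }

  // 2. Any probe error leads to the detecting cycle.
  if (runtimeProbe.failed || processProbe.failed) {
    return "detecting:probe_failed";
  }

  // 3. Disagreement (assumed meaning: opposite alive/dead states,
  //    or a dead runtime that still shows recent activity).
  const opposite =
    (runtimeProbe.state === "alive" && processProbe.state === "dead") ||
    (runtimeProbe.state === "dead" && processProbe.state === "alive");
  if (opposite || (runtimeProbe.state === "dead" && i.recentActivityLiveness)) {
    return "detecting:signal_disagreement";
  }

  // 4. Dead runtime, process not yet known.
  if (
    runtimeProbe.state === "dead" &&
    processProbe.state === "unknown" &&
    i.canProbeRuntimeIdentity
  ) {
    return "detecting:runtime_dead";
  }

  // 5. Both dead with no recent liveness.
  if (
    runtimeProbe.state === "dead" &&
    processProbe.state === "dead" &&
    !i.recentActivityLiveness
  ) {
    return "terminate:runtime_lost";
  }

  return null;
}
```

In this sketch the keep-alive shell case (runtime `alive`, process `dead`, activity `exited`) is caught by step 1 and never reaches the disagreement check in step 3, which is the behaviour the PR aims for.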

Reviews (2): Last reviewed commit: "fix(core): guard agent-exit termination ..."

Align the new terminate-on-exit branch with the rest of resolveProbeDecision by requiring processProbe.failed === false before acting on its state, so a probe that defaults to dead on error can't terminate a live session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

harshitsinghbhandari and others added 2 commits May 23, 2026 16:01

chore: add changeset for agent-exit auto-termination fix

faec509

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps Bot reviewed May 23, 2026


Comment thread packages/core/src/lifecycle-status-decisions.ts

harshitsinghbhandari requested a review from yyovil May 23, 2026 12:39

harshitsinghbhandari wants to merge 3 commits into AgentWrapper:main from harshitsinghbhandari:feat/1933
