Skip to content

Reclaim idle sessions to fix session-capacity overshoot#277

Merged
cmpatino merged 2 commits into
mainfrom
reclaim-idle-sessions
Jun 1, 2026
Merged

Reclaim idle sessions to fix session-capacity overshoot#277
cmpatino merged 2 commits into
mainfrom
reclaim-idle-sessions

Conversation

@cmpatino
Copy link
Copy Markdown
Collaborator

@cmpatino cmpatino commented Jun 1, 2026

Users hit "Server is at capacity (379/200)" and couldn't create sessions: idle/forgotten runtimes never left the global pool, the frontend reactivated every historical session on load, and concurrent creates could overshoot the cap.

Backend:

  • Idle-session reaper (2h): releases the sandbox + RAM of inactive sessions and evicts them from the live pool, leaving them resumable from Mongo (persisted resumable, never "ended"). Teardown uses asyncio.wait so a shutdown cancel can't be swallowed; submit() enqueues under the lock and refuses sessions being reaped. Reaping is skipped when persistence is disabled and aborts if the resumable snapshot (incl. the message bulk_write, via save_snapshot strict mode) can't be durably written.
  • Global create-slot reservation closes the check-then-create race that let concurrent creates exceed MAX_SESSIONS.
  • last_active_at drives the reaper, stamped on genuine activity only.
  • Restore injects a sandbox-reset note so the agent knows its sandbox was wiped, skipped when an approval is pending (would orphan the tool results).

Frontend:

  • Only the active or backend-processing session reactivates its runtime on load and tab refocus; idle sessions render from cache, so opening the app no longer refills the pool. isProcessing is cleared on all terminal and approval paths.

Tests: reaper happy path/spared cases, submit-reap race, turn-finish restamp, global reservation race, persistence-disabled and write-failure aborts, shutdown cancellation propagation, restore-note inject/skip.

Users hit "Server is at capacity (379/200)" and couldn't create sessions:
idle/forgotten runtimes never left the global pool, the frontend reactivated
every historical session on load, and concurrent creates could overshoot the cap.

Backend:
- Idle-session reaper (2h): releases the sandbox + RAM of inactive sessions and
  evicts them from the live pool, leaving them resumable from Mongo (persisted
  resumable, never "ended"). Teardown uses asyncio.wait so a shutdown cancel
  can't be swallowed; submit() enqueues under the lock and refuses sessions being
  reaped. Reaping is skipped when persistence is disabled and aborts if the
  resumable snapshot (incl. the message bulk_write, via save_snapshot strict
  mode) can't be durably written.
- Global create-slot reservation closes the check-then-create race that let
  concurrent creates exceed MAX_SESSIONS.
- last_active_at drives the reaper, stamped on genuine activity only.
- Restore injects a sandbox-reset note so the agent knows its sandbox was wiped,
  skipped when an approval is pending (would orphan the tool results).

Frontend:
- Only the active or backend-processing session reactivates its runtime on load
  and tab refocus; idle sessions render from cache, so opening the app no longer
  refills the pool. isProcessing is cleared on all terminal and approval paths.

Tests: reaper happy path/spared cases, submit-reap race, turn-finish restamp,
global reservation race, persistence-disabled and write-failure aborts, shutdown
cancellation propagation, restore-note inject/skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Claude finished @cmpatino's task in 5m 46s —— View job


Claude's Review

1 P0, 3 P1.

Solid PR — the design is coherent (snapshot-then-pop, raise_on_error to gate eviction on durable persistence, reservation counter to close the create race, frontend gating on shouldReactivate). One race in the eviction/restore interplay is the one real blocker.


P0 — Old task's _run_session finally clobbers a freshly-restored session

backend/session_manager.py:1225-1232:

async with self._lock:
    if session_id in self.sessions:
        self.sessions[session_id].is_active = False
        await self.persist_session_snapshot(
            self.sessions[session_id],
            runtime_state="ended",
            status="ended",
        )

The finally looks up by session_id (string) instead of the closure-captured agent_session (instance). With the reaper now popping session_id before the old task even reaches this block, there is a real window where a concurrent ensure_session_loaded re-inserts a new wrapper under the same id, and the old finally then mutates the new instance.

Concrete interleaving:

  1. _reap_one acquires the lock, persists resumable snapshot, pops session_id, releases the lock, then task.cancel() + asyncio.wait({task}, timeout=30) (session_manager.py:1081-1126).
  2. The cancelled task enters its finally and parks inside await self._cleanup_sandbox(session) (session_manager.py:1212) — an HF Spaces delete_repo call that retries on transient failures, so it can run for seconds.
  3. During that window the user clicks the session in the sidebar → ensure_session_loaded sees session_id not in self.sessions, loads from Mongo, builds a new AgentSession, and _start_agent_session inserts it (session_manager.py:368-372). Nothing here checks is_reaping on the popped instance, and is_reaping lives on the old wrapper anyway.
  4. Old task's finally finally acquires the lock, finds session_id in self.sessions (it's the new wrapper), sets is_active=False on it, and writes status="ended" to Mongo.
  5. The next submit against the freshly-restored session is rejected at session_manager.py:1248-1254 (not agent_session.is_active). The user sees a session they just opened that won't accept input, and Mongo shows it as ended.

Fix: identity-check on the local reference. The same agent_session is already in scope from session_manager.py:1154, so:

async with self._lock:
    if self.sessions.get(session_id) is agent_session:
        agent_session.is_active = False
        await self.persist_session_snapshot(
            agent_session,
            runtime_state="ended",
            status="ended",
        )

This isn't pre-existing — before the reaper, the only way to exit _run_session was the session genuinely ending, so session_id was uniquely associated with this task. The reaper introduces id-recycling; the finally needs to be id-recycle-safe.

Fix this →


P1 — Reaper holds the manager lock across a Mongo write

backend/session_manager.py:1099-1105:

async with self._lock:
    ...
    agent_session.is_reaping = True
    try:
        await self.persist_session_snapshot(
            agent_session,
            runtime_state="idle",
            status="active",
            raise_on_error=True,
        )

persist_session_snapshot does a Mongo metadata update + bulk_write inside the critical section. Every other manager op contends on the same lock (submit, create_session, ensure_session_loaded, delete_session, …). With multiple candidates per sweep, each candidate's network round-trip serializes all session traffic.

Setting is_reaping=True is itself the gate against the submit race (and the reaper's own re-check), so consider releasing the lock around the persist:

async with self._lock:
    # re-check, set is_reaping
agent_session.is_reaping = True  # snapshot identity preserved
try:
    await self.persist_session_snapshot(..., raise_on_error=True)
except Exception:
    async with self._lock:
        agent_session.is_reaping = False
    return False
async with self._lock:
    # final re-check then pop

Not a correctness issue, but turns the reaper from "imperceptible background job" into "stalls all traffic for the length of a Mongo bulk_write per candidate."


P1 — Test gap: no coverage for the post-pop / restore window

tests/unit/test_session_reaper.py exercises submit-before-pop (test_reap_aborts_when_message_enqueued_first), submit-during-reaping (test_submit_rejected_while_reaping), and shutdown cancellation. The realistic gap — _cleanup_sandbox is slow, user reopens during it — isn't covered, which is why the P0 above can ship undetected. A test that stubs _cleanup_sandbox to block, drives _reap_one to pop, runs ensure_session_loaded concurrently (or directly mutates self.sessions[session_id] to a fresh wrapper), then unblocks the cleanup and asserts the new wrapper keeps is_active=True would lock this in.


P1 — setSessionProcessing mirroring is incomplete

The hook comments promise that sessionStore.isProcessing mirrors agentStore.isProcessing because the former feeds shouldReactivate (useAgentChat.ts:48). Several call sites only update one of the two:

  • useAgentChat.ts:471-486 (hydration result handler) — sets/clears isProcessing via updateSession only
  • useAgentChat.ts:687 and :712 (visibility re-hydrate + poll cleanup) — updateSession only
  • useAgentChat.ts:788 (approveTools resume) — updateSession only
  • useAgentChat.ts:807 (stop) — updateSession only
  • SessionChat.tsx:88 (handleSendMessage) — updateSession only

Every one of those runs on the active tab, so today the drift is invisible (isActive=true already pins shouldReactivate=true). But the partialize at sessionStore.ts:248-251 resets isProcessing on cold load and the side-channels are now load-bearing for that flag's correctness — pulling them out of sync invites a regression as soon as anything in the background relies on sessionStore.isProcessing. Either drop both sites behind a single helper, or derive one from the other.


  • Read repo conventions
  • Read changed files
  • Trace reaper / submit / restore code paths
  • Check test coverage
  • Post findings
    • Branch: reclaim-idle-sessions

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 1, 2026

Claude finished @cmpatino's task in 2m 32s —— View job


No blocking issues.

Re-review of 5b40506 ("Address review comments"). All four findings from the prior pass are addressed:

  • P0 (_run_session finally clobbers freshly-restored session) — fixed via identity check at backend/session_manager.py:1244 (self.sessions.get(session_id) is agent_session), and the locked mutate/persist now uses agent_session directly.
  • P1 (reaper holds lock across Mongo write) — addressed by splitting _reap_one (backend/session_manager.py:1082-1132): persist now happens outside the lock with is_reaping=True as the submit gate, then a second locked re-check re-validates idle conditions before popping. Failure path resets is_reaping under the lock only if self.sessions.get(session_id) is agent_session.
  • P1 (test gap) — tests/unit/test_session_reaper.py:466-510 adds test_old_reaped_task_does_not_end_freshly_restored_session, which blocks _cleanup_sandbox, swaps in a fresh wrapper, then unblocks and asserts the fresh wrapper keeps is_active=True and is not persisted as ended.
  • P1 (setSessionProcessing mirroring) — unified behind setProcessingState in frontend/src/hooks/useAgentChat.ts:56-65; all 9 call sites flagged previously now route through it (including SessionChat.tsx:88 via prop), and the hydration handler at useAgentChat.ts:481-486 now clears isProcessing in both the pending-approvals and no-op branches, closing the partialize cold-load drift.

Verification

Traced the new reaper window (post-is_reaping=True, pre-persist, lock released):

  • submit is rejected (session_manager.py:1269).
  • delete_session can still pop concurrently; on re-check the identity comparison fails and the reap aborts (session_manager.py:1121-1122).
  • _run_session's natural exit can't fire during an idle window — session.is_running is only flipped from inside the agent loop or shutdown paths, and an idle session is parked in the submission_queue.get timeout loop. The reaper-triggered cancellation only runs after the pop, so the finally's identity check correctly no-ops.

No new P0 surfaced.
• Branch: reclaim-idle-sessions

@cmpatino cmpatino merged commit 6cf406a into main Jun 1, 2026
3 checks passed
@cmpatino cmpatino deleted the reclaim-idle-sessions branch June 1, 2026 09:12
cmpatino added a commit that referenced this pull request Jun 1, 2026
Adopt PR #277's reviewed reclaim-idle-sessions implementation (the reaper
persists the resumable snapshot outside the manager lock then re-checks under
it; the useAgentChat setProcessingState refactor; test_session_reaper) while
preserving this branch's premium-billing changes (premium_user_billed plumbing
in session_manager, and the ClaudeCapDialog / CLAUDE_QUOTA_EXHAUSTED removal in
useAgentChat).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant