
fix(spawn): clear stale psmux session registration before re-create #78

Merged
darw007d merged 2 commits into main from claude/lc/spawn-stale-session-fix on Jun 18, 2026


@darw007d

Collaborator

Problem

A psmux (native-Windows multiplexer) server that dies uncleanly (the operator closed the owning console, the process was killed, or it crashed) leaves the session name registered with no live server behind it. new-session then becomes a silent no-op (rc=0, creates nothing), and the name can never come back. That is the exact failure that strands a peer's cell: "the session got killed and won't return." Diagnosed end-to-end on workstation-lc on 2026-06-18 (psmux v3.3.5).

Real tmux on Linux/macOS does not hit this: new-session is synchronous and a dead session leaves no blocking registration.

Fix

_launch_via_tmux's create path now goes through a new _tmux_create_session(), which:

  1. runs kill-session first to clear any stale (server-less) registration, which is a harmless no-op when the name is truly absent;
  2. creates the session, then verifies it via has-session and retries (up to 6×, 0.4s apart), because psmux clears the stale lock asynchronously (~1s observed) and new-session's exit code is 0 even on the silent no-op, so the rc cannot be trusted.

On real tmux this is a harmless no-op: the create succeeds on the first attempt and the verify loop breaks immediately.
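
For readers who want the shape of the kill → create → verify → retry sequence, here is a minimal sketch. It is not the code from this PR: the function name `create_session_verified`, its parameters, and the injectable `run`/`sleep` hooks are hypothetical, and it assumes each retry re-issues new-session (the description does not spell that out).

```python
import subprocess
import time
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def _run(argv: Sequence[str]) -> int:
    """Run a multiplexer command and return its exit code (non-zero does not raise)."""
    return subprocess.run(list(argv), capture_output=True, check=False).returncode


def create_session_verified(
    name: str,
    command: Sequence[str],
    *,
    tmux: str = "tmux",          # assumption: psmux is invoked through the same CLI shape
    attempts: int = 6,
    delay: float = 0.4,
    run: Runner = _run,          # hypothetical hook so the flow can be tested without tmux
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once has-session confirms the session exists, else False."""
    target = f"={name}"  # exact-match target prefix

    # Step 1: clear any stale registration. The exit code is ignored because
    # a truly absent name makes this fail harmlessly.
    run([tmux, "kill-session", "-t", target])

    # Step 2: create, then verify. new-session's exit code is ignored on
    # purpose: it can be 0 even when nothing was created.
    for attempt in range(attempts):
        run([tmux, "new-session", "-d", "-s", name, *command])
        if run([tmux, "has-session", "-t", target]) == 0:
            return True
        if attempt < attempts - 1:
            sleep(delay)  # give the asynchronous stale-lock cleanup time to finish
    return False
```

On real tmux the first has-session check succeeds, so the loop exits without sleeping. A `False` return corresponds to the "never materialises" case, which the caller has to handle.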

Tests

  • Harness made stateful (once a session is created, has-session reports it as existing) and time.sleep no-op'd.
  • Dropped the obsolete CalledProcessError failure case (check=True removed).
  • Added coverage for the stale-clear + async-retry path and the never-materialises fall-through.
  • tests/test_spawn_tmux_session.py: 28 passed; adjacent test_spawn_command.py + test_claude_tmux_template.py: 61 passed (run with PYTHONPATH=src).
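
To illustrate what "stateful" means for the harness, the sketch below shows one way a fake `subprocess.run` could remember created sessions and imitate the silent no-op. The `FakeMux` class and its `noop_creates` knob are hypothetical and are not taken from the test file.

```python
import subprocess


class FakeMux:
    """Stateful stand-in for subprocess.run that imitates a tmux/psmux CLI."""

    def __init__(self, noop_creates: int = 0):
        self.live = set()                 # session names that currently "exist"
        self.noop_creates = noop_creates  # new-session calls to ignore silently
        self.calls = []                   # recorded argv lists, for assertions

    def __call__(self, argv, **kwargs):
        argv = list(argv)
        self.calls.append(argv)
        sub, rc = argv[1], 0
        if sub == "new-session":
            name = argv[argv.index("-s") + 1]
            if self.noop_creates > 0:
                self.noop_creates -= 1    # stale lock: rc stays 0, nothing is created
            else:
                self.live.add(name)
        elif sub in ("has-session", "kill-session"):
            name = argv[argv.index("-t") + 1].removeprefix("=")
            rc = 0 if name in self.live else 1
            if sub == "kill-session":
                self.live.discard(name)
        return subprocess.CompletedProcess(argv, rc, stdout=b"", stderr=b"")
```

A test could install an instance with `unittest.mock.patch("subprocess.run", FakeMux(noop_creates=2))` and patch `time.sleep` as well, so the retry path runs instantly. Whether the real harness patches at that level is an assumption here.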

Note for reviewers

_tmux_has_session and the new kill-session use the tmux =name exact-match prefix. Confirmed working on psmux empirically (normal spawns + has-session function on the Windows box today; the stranding bug is the stale-server case, orthogonal to = parsing). Flagging in case a future psmux build changes = handling.
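
For context on the `=` prefix: in tmux, a bare target-session can also resolve by name prefix or pattern, so `cell` could match `cell-2`, whereas `=cell` accepts only an exact match. The sketch below shows how the target form could be centralised so both calls switch together if a future psmux build needs the bare name. The helper names and the `exact` flag are hypothetical.

```python
from typing import List


def session_target(name: str, *, exact: bool = True) -> str:
    """Build a -t target: '=name' for exact match, or the bare name as a fallback."""
    return f"={name}" if exact else name


def has_session_argv(name: str, *, tmux: str = "tmux", exact: bool = True) -> List[str]:
    """argv for has-session against the chosen target form."""
    return [tmux, "has-session", "-t", session_target(name, exact=exact)]


def kill_session_argv(name: str, *, tmux: str = "tmux", exact: bool = True) -> List[str]:
    """argv for kill-session, using the same target form as has-session."""
    return [tmux, "kill-session", "-t", session_target(name, exact=exact)]
```

Routing both commands through one helper keeps them from drifting apart, since they need to agree on the target form.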

🤖 Generated with Claude Code

darw007d and others added 2 commits June 18, 2026 15:05
A psmux (native-Windows) server that dies uncleanly leaves the session NAME
registered with no live server behind it. new-session then becomes a SILENT
no-op (rc=0, creates nothing) and the name can never come back — stranding a
peer's cell ("session got killed and won't return"; diagnosed on workstation-lc
2026-06-18). Real tmux on Linux/mac doesn't hit this.

_launch_via_tmux's create path now goes through _tmux_create_session, which:
  1. kill-session first to clear any stale (server-less) registration
     (harmless no-op when the name is truly absent),
  2. creates, then VERIFIES via has-session and RETRIES — because psmux clears
     the stale lock ASYNCHRONOUSLY (~1s) and the create's rc is 0 even on the
     silent no-op, so the exit code can't be trusted.
Harmless no-op on real tmux (create takes first try, loop breaks immediately).

Tests: harness made stateful (a created session then exists) + time.sleep
no-op'd; dropped the obsolete CalledProcessError case (check=True removed);
added stale-clear + async-retry coverage. 28 + 61 adjacent tests green.

OPEN for Lab review:
  - Confirm the branch base (cut off spawn-win-launch-fix @639fbda).
  - Verify psmux honors the `=` exact-match target prefix used by
    _tmux_has_session / the new kill-session; if not, BOTH need the bare-name
    form (couldn't confirm locally — psmux non-interactive state was polluted).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CHWUyK8b64sauTAK1GGQSo
…ws CI fix)

test_run_spawn_grok_assisted_memory_no_double_system_prompt asserts on the
composed grok argv (no doubled --system-prompt-override) but only mocked
os.execve. On Windows, grok launch() takes the subprocess.run branch (v0.12.1
pane-collapse fix), so the execve mock never fired, the real /bin/fake-grok exec
failed (WinError 2), exec_args stayed empty → assert 0 == 1. Red on Windows CI.

Pre-existing on main (verified on 98957f5 without the stale-session commit) — not
caused by PR #78; the full-suite CI just surfaces it. Fix per the file's own
convention (cf. test_claude_launch_posix_uses_execve): pin sys.platform to a POSIX
value so launch() takes the mocked os.execve branch. The assertion is platform-
independent (argv composition), so this preserves intent; the Windows-vs-POSIX
launch split itself is already covered by test_grok_launch_win32_blocks_posix_execve.

Full spawn suites green: 101 passed (PYTHONPATH=src).
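
A minimal sketch of the platform-pinning technique described above, assuming a launcher with a win32/POSIX split. The `launch` function is a hypothetical stand-in, not the project's grok launcher, and `capture_exec_argv` is an illustrative helper.

```python
import os
import subprocess
import sys
from unittest import mock


def launch(argv):
    """Hypothetical launcher: subprocess.run on Windows, os.execve elsewhere."""
    if sys.platform == "win32":
        subprocess.run(argv, check=False)
    else:
        os.execve(argv[0], argv, dict(os.environ))


def capture_exec_argv(argv):
    """Pin sys.platform to a POSIX value so launch() reaches the mocked os.execve."""
    with mock.patch.object(sys, "platform", "linux"), \
            mock.patch("os.execve") as fake_execve:
        launch(argv)
    # os.execve(path, args, env): the composed argv is the second positional argument.
    return fake_execve.call_args.args[1]
```

With the platform pinned, an argv-composition assertion runs the same way on a Windows CI runner as on Linux, because the real exec is never attempted.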

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01CHWUyK8b64sauTAK1GGQSo
@darw007d darw007d merged commit 0422c4b into main Jun 18, 2026
2 checks passed
@darw007d darw007d deleted the claude/lc/spawn-stale-session-fix branch June 18, 2026 13:26
darw007d added a commit that referenced this pull request Jun 18, 2026
Version bump to ship the already-merged + CI-green stranding fix to PyPI:
_tmux_create_session clears a stale (server-less) psmux session registration
before re-create (kill → create → verify-via-has-session + retry), so a peer
cell whose multiplexer died uncleanly can respawn instead of being stranded
('session got killed and won't return'). No-op on real tmux. Also carries the
test fix that cleared the pre-existing #77 Windows-CI red. Code merged in #78
(0422c4b); this is version + publish only.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>