Skip to content

DO NOT MERGE / DO NOT DEPLOY: Refactor vm-agent lifecycle shutdown ownership#1353

Draft
simple-agent-manager[bot] wants to merge 5 commits into
mainfrom
sam/task-implement-sam-idea-01kvcx
Draft

DO NOT MERGE / DO NOT DEPLOY: Refactor vm-agent lifecycle shutdown ownership#1353
simple-agent-manager[bot] wants to merge 5 commits into
mainfrom
sam/task-implement-sam-idea-01kvcx

Conversation

@simple-agent-manager

Copy link
Copy Markdown
Contributor

DO NOT MERGE / DO NOT DEPLOY TO STAGING

Human constraint for this /do task: stop at a draft/open PR only. Do not merge this PR. Do not deploy it to staging. Do not mark it ready unless Raphaël explicitly asks later.

Summary

Implements SAM idea 01KVCX0T7DR7JAGK3AS6HSX9GA for the narrow vm-agent lifecycle correctness slice:

  • makes Server.Stop idempotent with sync.Once and preserves the first shutdown result
  • stops ACP session hosts using collect-then-stop semantics outside sessionHostMu
  • removes PTY session cleanup from broad workspaceMu critical sections in touched shutdown paths
  • shuts down server-owned event store and resource monitor from Server.Stop
  • makes event store and resource monitor Close methods idempotent
  • makes errorreport and messagereport Shutdown idempotent and concurrent-safe
  • preserves errorreport final flush even when Shutdown is called before Start
  • adds focused tests for Server.Stop idempotency, owned resource shutdown, reporter idempotency/concurrency, and session-host lock discipline

Lifecycle Inventory

  • Server-owned: HTTP server, JWT validator, port scanners, ACP session-host map, workspace PTY managers, error reporter, per-workspace message reporters, persistence store, event store, resource monitor.
  • Workspace/session-owned: PTY sessions remain owned by PTY managers; ACP process/viewer cleanup remains owned by SessionHost.Stop.
  • Lock discipline: session-host map/metadata removal happens under sessionHostMu, while host.Stop runs after the lock is released. Workspace PTY managers are collected under workspaceMu and closed after workspaceMu is released.
  • Background loops: health/ACP heartbeat continue to observe s.done, now closed only via Server.stopOnce; resource monitor Close is idempotent and waits for its loop once.

Specialist Review Evidence

Reviewer Status Evidence
task-completion-validator PASS All research findings and checklist items are represented in the diff and tests; no UI/backend or multi-resource checks apply.
go-specialist ADDRESSED Found one preservation issue: errorreport Shutdown before Start should still flush queued entries. Fixed in commit 09189f3 and re-tested. No remaining blocker.

Tests Run

From packages/vm-agent:

  • go test ./internal/server ./internal/errorreport ./internal/messagereport
  • go test ./internal/server ./internal/auth ./internal/errorreport ./internal/messagereport ./internal/acp ./internal/pty
  • go test ./internal/server ./internal/errorreport ./internal/messagereport ./internal/eventstore ./internal/resourcemon
  • go test ./...
  • go test -race ./internal/server ./internal/errorreport ./internal/messagereport ./internal/acp ./internal/pty

Deployment

No staging deployment was performed. No production deployment was performed. This PR must remain unmerged and undeployed until Raphaël explicitly authorizes otherwise.

@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
4.8% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@simple-agent-manager

Copy link
Copy Markdown
Contributor Author

Behavioral Equivalence Audit — PR #1353 (VM Agent Lifecycle Shutdown)

This PR was reviewed for behavioral equivalence as part of the code elegance audit. Below is a summary of changes that go beyond pure restructuring.

Intentional Improvements (All Acceptable)

Change Type Rationale
sync.Once idempotent Stop() Defensive Prevents double-close panics if Stop() is called multiple times
Nil guards on httpServer, tlsServer, mux Defensive Prevents panics on partially-initialized servers (e.g., early shutdown during startup)
EventStore and ResourceMonitor now closed in shutdown Bugfix Fixes resource leaks — these were never closed in the original code
Session hosts stopped outside sessionHostMu lock Concurrency fix Original held the mutex during potentially slow I/O (Stop() calls); new code collects references under the lock, then stops outside it, reducing deadlock risk
PTY cleanup outside lock Same pattern Consistent with the session host change — avoids holding a lock during I/O
removeAllSessionHosts() + collectWorkspacePTYManagers() helpers Structural Extracted for clarity; behavior is equivalent to inline loops

Assessment

All behavior changes are intentional hardening improvements. The sync.Once pattern, nil guards, and lock-scope reduction are standard Go idioms for safe shutdown. The EventStore/ResourceMonitor closure fixes genuine resource leaks.

Recommendation: Accept as-is. No fixes needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant