Skip to content

Drain active_connections_ in McpServer::shutdown and cover the contract#218

Merged
gophergogo merged 3 commits into
mainfrom
feature/mcp-server-lifecycle-shutdown-drain
Apr 20, 2026
Merged

Drain active_connections_ in McpServer::shutdown and cover the contract#218
gophergogo merged 3 commits into
mainfrom
feature/mcp-server-lifecycle-shutdown-drain

Conversation

@gophergogo
Copy link
Copy Markdown
Collaborator

Summary

Fix the pre-existing McpServer teardown bug and cover the lifecycle+drain contract end-to-end, then drop the leak workaround the previous integration test needed around it.

Three commits, each independently reviewable:

  1. McpServer::shutdown drain fix (src/server/mcp_server.cc) — shutdown() now clears active_connections_ and lifecycle_callbacks_ inside the cleanup post on the dispatcher thread, so each ConnectionImpl destructor fires closeSocket(LocalClose) there rather than on whichever thread ends up running ~McpServer. Two correctness details are documented in the commit message:

    • server_running_ is already false by the time the post runs, so the container-unwind branch inside onConnectionLifecycleEvent is skipped — otherwise each synchronous LocalClose would try to erase its own element from the vector we're actively clearing.
    • Connections are cleared before their lifecycle adapters. The connection's callback list still references the adapter during the final closeSocket in ~ConnectionImpl; destroying the adapter first would be a use-after-free on that last callback.
  2. Integration test (tests/integration/test_mcp_server_connection_lifecycle.cc) — two real-IO tests against a loopback McpServer:

    • RepeatedConnectCloseCyclesStayStable — three sequential connect/close cycles, then verify the listener is still accepting.
    • ShutdownClosesLiveConnections — hold a client TCP connection open, call shutdown(), and assert both that the client reads EOF (proving the drain actually closed the server-side fd) and that server_.reset() returns normally (proving the drain emptied the containers before ~McpServer).

    I confirmed the second test reproduces the exact onConnectionLifecycleEvent off dispatcher assert at mcp_server.cc:804 against the pre-fix mcp_server.cc; under the fix both tests pass.

  3. Drop the leak workaround in the initialize-routing test — the earlier integration test deliberately release()d the server because shutdown couldn't clean up connections. With the drain fix it can, so the workaround goes away. As a side benefit the 500 ms drain sleep also disappears, cutting the test's wall time from ~650 ms to ~140 ms.

Test plan

  • ctest -R \"[Mm]cp|[Ss]erver|[Ss]se|[Hh]ttp\" — 38/38 pass
  • New lifecycle test passes 5× in a row (~270 ms each)
  • Reverted the fix locally and re-ran the new lifecycle test — reproduces the pre-fix line 804 assert, confirming the test is actually guarding the contract

gophergogo added 3 commits April 20, 2026 10:56
#218)

shutdown() previously flipped server_running_ = false on the caller
thread and then posted a cleanup task that disabled listeners and
cleared the connection-manager list -- but never cleared
active_connections_ or lifecycle_callbacks_. Those survived into
~McpServer, whose destruction runs on whichever thread owns the
server (typically the test thread or the app-main thread). Each
ConnectionImpl destructor fires closeSocket(LocalClose), which
forwards through the per-connection ConnectionLifecycleCallbacks
adapter into McpServer::onConnectionLifecycleEvent -- which
asserts it is running on the main dispatcher thread. The process
then crashes at teardown even when the body of work completed
cleanly.

Fix: drain active_connections_ and lifecycle_callbacks_ inside the
cleanup post, so each destruction fires on the dispatcher thread.
Two details that make this safe:

- server_running_ is already false by the time the post runs, so
  the container-unwind branch inside onConnectionLifecycleEvent is
  skipped. Otherwise each synchronous LocalClose would try to
  erase its own element from the vector we are actively clearing,
  invalidating iterators and possibly dropping adapters that are
  mid-callback.

- Connections are cleared before their lifecycle adapters. The
  connection's callback list still references the adapter during
  the last closeSocket in ~ConnectionImpl; destroying the adapter
  first would be a use-after-free on that last callback.
Two real-IO integration tests against a loopback McpServer.

RepeatedConnectCloseCyclesStayStable drives three sequential
connect/close cycles against the listener. Each cycle relies on the
per-connection ConnectionLifecycleCallbacks adapter firing RemoteClose
on the dispatcher thread and erasing its entry from active_connections_
and lifecycle_callbacks_ via the deferred-delete queue. A fourth
connect after the three cycles confirms the listener is still
accepting.

ShutdownClosesLiveConnections is the sharp one. It holds a client TCP
connection open, calls McpServer::shutdown(), and asserts two observable
consequences of the drain path:

  - The client's socket reads EOF. Only the dispatcher-thread drain of
    active_connections_ could have closed the server-side fd during
    shutdown; without the drain, the server fd would survive until
    ~McpServer and the client wouldn't see FIN until much later.

  - server_.reset() returns normally. Pre-drain, destroying the server
    with a live connection in active_connections_ fired
    ~ConnectionImpl -> closeSocket(LocalClose) on the test thread,
    which hit the "onConnectionLifecycleEvent off dispatcher" assert
    inside the adapter and crashed the process at teardown. Running
    this test against the pre-fix revision of mcp_server.cc reproduces
    the abort on the very first test; under the fix both tests pass.

A raw TCP client is used instead of a full McpClient because the
lifecycle adapter only observes transport-level ConnectionEvent values,
and keeping the client dumb isolates the test to the server-side
callback-routing and drain behavior.
The earlier revision of this test ended TearDown with server_.release()
because the McpServer teardown bug -- active_connections_ surviving
into ~McpServer on the caller thread -- would trip the
dispatcher-thread assert inside onConnectionLifecycleEvent whenever a
connection was still live at shutdown. The companion drain fix earlier
in this branch clears active_connections_ and lifecycle_callbacks_
on the dispatcher thread inside the shutdown post, so by the time
server_.reset() runs on the test thread the containers are empty and
~McpServer is a straightforward destruction.

As a side effect, teardown no longer needs the 500ms safety-margin
drain sleep either; the test completes in roughly a quarter of the
earlier wall time.
@gophergogo gophergogo force-pushed the feature/mcp-server-lifecycle-shutdown-drain branch from c88e7c5 to e7dd596 Compare April 20, 2026 17:57
@gophergogo gophergogo merged commit af8ca80 into main Apr 20, 2026
1 check passed
gophergogo pushed a commit that referenced this pull request Apr 20, 2026
#218)

shutdown() previously flipped server_running_ = false on the caller
thread and then posted a cleanup task that disabled listeners and
cleared the connection-manager list -- but never cleared
active_connections_ or lifecycle_callbacks_. Those survived into
~McpServer, whose destruction runs on whichever thread owns the
server (typically the test thread or the app-main thread). Each
ConnectionImpl destructor fires closeSocket(LocalClose), which
forwards through the per-connection ConnectionLifecycleCallbacks
adapter into McpServer::onConnectionLifecycleEvent -- which
asserts it is running on the main dispatcher thread. The process
then crashes at teardown even when the body of work completed
cleanly.

Fix: drain active_connections_ and lifecycle_callbacks_ inside the
cleanup post, so each destruction fires on the dispatcher thread.
Two details that make this safe:

- server_running_ is already false by the time the post runs, so
  the container-unwind branch inside onConnectionLifecycleEvent is
  skipped. Otherwise each synchronous LocalClose would try to
  erase its own element from the vector we are actively clearing,
  invalidating iterators and possibly dropping adapters that are
  mid-callback.

- Connections are cleared before their lifecycle adapters. The
  connection's callback list still references the adapter during
  the last closeSocket in ~ConnectionImpl; destroying the adapter
  first would be a use-after-free on that last callback.
gophergogo pushed a commit that referenced this pull request Apr 20, 2026
Two real-IO integration tests against a loopback McpServer.

RepeatedConnectCloseCyclesStayStable drives three sequential
connect/close cycles against the listener. Each cycle relies on the
per-connection ConnectionLifecycleCallbacks adapter firing RemoteClose
on the dispatcher thread and erasing its entry from active_connections_
and lifecycle_callbacks_ via the deferred-delete queue. A fourth
connect after the three cycles confirms the listener is still
accepting.

ShutdownClosesLiveConnections is the sharp one. It holds a client TCP
connection open, calls McpServer::shutdown(), and asserts two observable
consequences of the drain path:

  - The client's socket reads EOF. Only the dispatcher-thread drain of
    active_connections_ could have closed the server-side fd during
    shutdown; without the drain, the server fd would survive until
    ~McpServer and the client wouldn't see FIN until much later.

  - server_.reset() returns normally. Pre-drain, destroying the server
    with a live connection in active_connections_ fired
    ~ConnectionImpl -> closeSocket(LocalClose) on the test thread,
    which hit the "onConnectionLifecycleEvent off dispatcher" assert
    inside the adapter and crashed the process at teardown. Running
    this test against the pre-fix revision of mcp_server.cc reproduces
    the abort on the very first test; under the fix both tests pass.

A raw TCP client is used instead of a full McpClient because the
lifecycle adapter only observes transport-level ConnectionEvent values,
and keeping the client dumb isolates the test to the server-side
callback-routing and drain behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants