Detect dead connections via a read timeout and reconnect#373
Open
ptr727 wants to merge 4 commits into
Open
Conversation
The Host Command connections (event stream and command client) read with no timeout, relying on the peer to close the socket to surface a lost connection. On a hard power-off there is no TCP RST, and the event stream's ECHO keepalive writes only fail once the OS abandons the connection (~15 min). Until then: - the event stream's read blocks forever, so Connected/Disconnected/Reconnected events never fire and consumers think the controller is still online; - the command client's reads time out (it already passes read_timeout) but the timeout does not close the socket, and _get_connection only reconnects when the connection is closed, so every subsequent request reuses the dead socket and keeps timing out. Fix both: - Event stream: read with a timeout (2x KEEPALIVE_INTERVAL). The ECHO keepalive round-trips every 60s on a healthy link (verified: ECHO -> R:ECHO), so a 120s silence means the link is dead. ClientTimeoutError is a ClientConnectionError, so it already routes to the reconnect path. - Both clients: close the connection on ClientConnectionError so the next _get_connection establishes a fresh socket instead of reusing the stale one (which otherwise flaps Reconnected every timeout and never recovers). Validated end-to-end against a real controller (ping-correlated power-off/on): connection loss is detected ~120s after the link dies, the connection state no longer flaps while down, and both the event stream and per-object data recover within seconds of the controller returning (previously stuck until the ~15 min OS timeout). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pyright flags annotating a @contextmanager function's return type as Iterator as deprecated (reportDeprecated); use Generator instead. This is the only error currently failing the Lint workflow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover Python 3.14 in the pre-commit lint/type-check matrix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- examples/control_load.py: annotate the context manager as Generator[None, None, None] so it is valid at runtime on every supported Python version (a bare Generator[None] relies on type-parameter defaults that only exist in newer Python). - events.py: reword the READ_TIMEOUT comment to use the Disconnected event name instead of the looser "Disconnected/reconnect". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 1, 2026
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds explicit read timeouts and socket teardown on command/event streams to detect dead TCP links promptly and force clean reconnection instead of waiting on OS-level TCP timeouts.
Changes:
- Introduces a read timeout for event stream reads tied to the keepalive interval.
- Closes connections explicitly on
ClientConnectionError(including read timeouts) to avoid reusing stale sockets. - Updates an example type annotation and expands CI lint matrix to include Python 3.14.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/aiovantage/_command_client/events.py | Adds READ_TIMEOUT and closes the connection on timeout/connection loss to trigger reconnect behavior. |
| src/aiovantage/_command_client/client.py | Ensures command requests close the socket on connection/timeout errors to avoid reusing an unknown-state connection. |
| examples/loads/control_load.py | Adjusts contextmanager return type annotation to Generator. |
| .github/workflows/lint.yml | Adds Python 3.14 to the lint test matrix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
The Host Command connections (event stream and command client) read with no timeout and rely on the peer closing the socket to surface a lost connection. On a hard power-off there is no TCP RST, and the event stream's
ECHOkeepalive writes only fail once the OS abandons the connection (about 15 minutes). Until then:Connected/Disconnected/Reconnectednever fire and consumers keep treating the controller as online;read_timeout) but the timeout does not close the socket, and_get_connectiononly reconnects when the connection is closed, so every later request reuses the dead socket and keeps timing out.Net effect: a powered-off or crashed controller is not detected for about 15 minutes, and per-object data does not recover until then.
How it is implemented
_command_client/events.py): read with a timeout of2 * KEEPALIVE_INTERVAL. TheECHOkeepalive round-trips every 60s on a healthy link, so 120s of silence means the link is dead.ClientTimeoutErrorsubclassesClientConnectionError, so it already routes into the existing reconnect path.ClientConnectionError, close the connection so the next_get_connectionopens a fresh socket instead of reusing the stale one. Without this, a read timeout leaves the socket open and the event stream re-emitsReconnectedon a dead socket, flapping every timeout and never recovering.Two small follow-on commits keep CI green: a
Generator(notIterator) return type on a@contextmanagerexample to clear the only pyrightreportDeprecatederror, and Python 3.14 added to the lint matrix.How it was tested
Validated against a real InFusion controller by correlating Home Assistant entity state with ground-truth ICMP reachability across several power-off / power-on cycles:
ECHOround-trips every 60s (ECHO->R:ECHO), so a healthy but quiet link never reaches the read timeout.ruff check,ruff format --check, and pyright 1.1.410 pass across the 3.10 - 3.14 matrix.