Detect dead connections via a read timeout and reconnect#4
Closed
ptr727 wants to merge 4 commits into
Closed
Conversation
The Host Command connections (event stream and command client) read with no timeout, relying on the peer to close the socket to surface a lost connection. On a hard power-off there is no TCP RST, and the event stream's ECHO keepalive writes only fail once the OS abandons the connection (~15 min). Until then: - the event stream's read blocks forever, so Connected/Disconnected/Reconnected events never fire and consumers think the controller is still online; - the command client's reads time out (it already passes read_timeout) but the timeout does not close the socket, and _get_connection only reconnects when the connection is closed, so every subsequent request reuses the dead socket and keeps timing out. Fix both: - Event stream: read with a timeout (2x KEEPALIVE_INTERVAL). The ECHO keepalive round-trips every 60s on a healthy link (verified: ECHO -> R:ECHO), so a 120s silence means the link is dead. ClientTimeoutError is a ClientConnectionError, so it already routes to the reconnect path. - Both clients: close the connection on ClientConnectionError so the next _get_connection establishes a fresh socket instead of reusing the stale one (which otherwise flaps Reconnected every timeout and never recovers). Validated end-to-end against a real controller (ping-correlated power-off/on): connection loss is detected ~120s after the link dies, the connection state no longer flaps while down, and both the event stream and per-object data recover within seconds of the controller returning (previously stuck until the ~15 min OS timeout). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pyright flags annotating a @contextmanager function's return type as Iterator as deprecated (reportDeprecated); use Generator instead. This is the only error currently failing the Lint workflow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover Python 3.14 in the pre-commit lint/type-check matrix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR improves resiliency of the Host Command event stream and command client by detecting “dead” TCP connections via read timeouts and forcing a reconnect, reducing the time-to-detect for hard power-offs/crashes.
Changes:
- Add an event-stream read timeout tied to the keepalive interval so stalled reads surface disconnects promptly.
- Ensure both event stream and command client close the underlying socket on
ClientConnectionErrorso subsequent operations create a fresh connection. - Update an example type annotation and extend the lint workflow matrix to include Python 3.14.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
src/aiovantage/_command_client/events.py |
Adds a read timeout to detect dead event-stream connections and explicitly closes the socket on connection errors to enable clean reconnects. |
src/aiovantage/_command_client/client.py |
Closes the command client connection on ClientConnectionError so future requests don’t reuse a stale socket. |
examples/loads/control_load.py |
Updates the @contextmanager return annotation (but currently introduces an invalid Generator[...] parametrization). |
.github/workflows/lint.yml |
Expands the lint matrix to run on Python 3.14. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- examples/control_load.py: annotate the context manager as Generator[None, None, None] so it is valid at runtime on every supported Python version (a bare Generator[None] relies on type-parameter defaults that only exist in newer Python). - events.py: reword the READ_TIMEOUT comment to use the Disconnected event name instead of the looser "Disconnected/reconnect". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner
Author
|
Superseded by upstream PR loopj#373 (same branch). This fork-internal PR was only for the Copilot review pass. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
The Host Command connections (event stream and command client) read with no timeout and rely on the peer closing the socket to surface a lost connection. On a hard power-off there is no TCP RST, and the event stream's
ECHOkeepalive writes only fail once the OS abandons the connection (about 15 minutes). Until then:Connected/Disconnected/Reconnectednever fire and consumers keep treating the controller as online;read_timeout) but the timeout does not close the socket, and_get_connectiononly reconnects when the connection is closed, so every later request reuses the dead socket and keeps timing out.Net effect: a powered-off or crashed controller is not detected for about 15 minutes, and per-object data does not recover until then.
How it is implemented
_command_client/events.py): read with a timeout of2 * KEEPALIVE_INTERVAL. TheECHOkeepalive round-trips every 60s on a healthy link, so 120s of silence means the link is dead.ClientTimeoutErrorsubclassesClientConnectionError, so it already routes into the existing reconnect path.ClientConnectionError, close the connection so the next_get_connectionopens a fresh socket instead of reusing the stale one. Without this, a read timeout leaves the socket open and the event stream re-emitsReconnectedon a dead socket, flapping every timeout and never recovering.Two small follow-on commits keep CI green: a
Generator(notIterator) return type on a@contextmanagerexample to clear the only pyrightreportDeprecatederror, and Python 3.14 added to the lint matrix.How it was tested
Validated against a real InFusion controller by correlating Home Assistant entity state with ground-truth ICMP reachability across several power-off / power-on cycles:
ECHOround-trips every 60s (ECHO->R:ECHO), so a healthy but quiet link never reaches the read timeout.ruff check,ruff format --check, and pyright 1.1.410 pass across the 3.10 - 3.14 matrix.