Skip to content

Detect dead connections via a read timeout and reconnect#4

Closed
ptr727 wants to merge 4 commits into
mainfrom
fix/event-stream-read-timeout
Closed

Detect dead connections via a read timeout and reconnect#4
ptr727 wants to merge 4 commits into
mainfrom
fix/event-stream-read-timeout

Conversation

@ptr727

@ptr727 ptr727 commented Jun 1, 2026

Copy link
Copy Markdown
Owner

Why

The Host Command connections (event stream and command client) read with no timeout and rely on the peer closing the socket to surface a lost connection. On a hard power-off there is no TCP RST, and the event stream's ECHO keepalive writes only fail once the OS abandons the connection (about 15 minutes). Until then:

  • the event stream read blocks forever, so Connected / Disconnected / Reconnected never fire and consumers keep treating the controller as online;
  • the command client reads time out (it already passes read_timeout) but the timeout does not close the socket, and _get_connection only reconnects when the connection is closed, so every later request reuses the dead socket and keeps timing out.

Net effect: a powered-off or crashed controller is not detected for about 15 minutes, and per-object data does not recover until then.

How it is implemented

  • Event stream (_command_client/events.py): read with a timeout of 2 * KEEPALIVE_INTERVAL. The ECHO keepalive round-trips every 60s on a healthy link, so 120s of silence means the link is dead. ClientTimeoutError subclasses ClientConnectionError, so it already routes into the existing reconnect path.
  • Both clients: on ClientConnectionError, close the connection so the next _get_connection opens a fresh socket instead of reusing the stale one. Without this, a read timeout leaves the socket open and the event stream re-emits Reconnected on a dead socket, flapping every timeout and never recovering.

Two small follow-on commits keep CI green: a Generator (not Iterator) return type on a @contextmanager example to clear the only pyright reportDeprecated error, and Python 3.14 added to the lint matrix.

How it was tested

Validated against a real InFusion controller by correlating Home Assistant entity state with ground-truth ICMP reachability across several power-off / power-on cycles:

  • Connection loss detected about 108-120s after the link drops (previously not detected within 8+ minutes).
  • Connection state no longer flaps while the controller is unreachable (single latched transition).
  • Event stream and per-object command data both recover within a few seconds of the controller returning (previously stuck until the ~15 min OS timeout).
  • Confirmed ECHO round-trips every 60s (ECHO -> R:ECHO), so a healthy but quiet link never reaches the read timeout.

ruff check, ruff format --check, and pyright 1.1.410 pass across the 3.10 - 3.14 matrix.

ptr727 and others added 3 commits June 1, 2026 10:37
The Host Command connections (event stream and command client) read with no
timeout, relying on the peer to close the socket to surface a lost connection.
On a hard power-off there is no TCP RST, and the event stream's ECHO keepalive
writes only fail once the OS abandons the connection (~15 min). Until then:

- the event stream's read blocks forever, so Connected/Disconnected/Reconnected
  events never fire and consumers think the controller is still online;
- the command client's reads time out (it already passes read_timeout) but the
  timeout does not close the socket, and _get_connection only reconnects when the
  connection is closed, so every subsequent request reuses the dead socket and
  keeps timing out.

Fix both:
- Event stream: read with a timeout (2x KEEPALIVE_INTERVAL). The ECHO keepalive
  round-trips every 60s on a healthy link (verified: ECHO -> R:ECHO), so a 120s
  silence means the link is dead. ClientTimeoutError is a ClientConnectionError,
  so it already routes to the reconnect path.
- Both clients: close the connection on ClientConnectionError so the next
  _get_connection establishes a fresh socket instead of reusing the stale one
  (which otherwise flaps Reconnected every timeout and never recovers).

Validated end-to-end against a real controller (ping-correlated power-off/on):
connection loss is detected ~120s after the link dies, the connection state
no longer flaps while down, and both the event stream and per-object data
recover within seconds of the controller returning (previously stuck until the
~15 min OS timeout).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pyright flags annotating a @contextmanager function's return type as
Iterator as deprecated (reportDeprecated); use Generator instead. This is
the only error currently failing the Lint workflow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover Python 3.14 in the pre-commit lint/type-check matrix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 1, 2026 17:56

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves resiliency of the Host Command event stream and command client by detecting “dead” TCP connections via read timeouts and forcing a reconnect, reducing the time-to-detect for hard power-offs/crashes.

Changes:

  • Add an event-stream read timeout tied to the keepalive interval so stalled reads surface disconnects promptly.
  • Ensure both event stream and command client close the underlying socket on ClientConnectionError so subsequent operations create a fresh connection.
  • Update an example type annotation and extend the lint workflow matrix to include Python 3.14.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/aiovantage/_command_client/events.py Adds a read timeout to detect dead event-stream connections and explicitly closes the socket on connection errors to enable clean reconnects.
src/aiovantage/_command_client/client.py Closes the command client connection on ClientConnectionError so future requests don’t reuse a stale socket.
examples/loads/control_load.py Updates the @contextmanager return annotation (but currently introduces an invalid Generator[...] parametrization).
.github/workflows/lint.yml Expands the lint matrix to run on Python 3.14.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread examples/loads/control_load.py Outdated
Comment thread src/aiovantage/_command_client/events.py Outdated
- examples/control_load.py: annotate the context manager as
  Generator[None, None, None] so it is valid at runtime on every supported
  Python version (a bare Generator[None] relies on type-parameter defaults
  that only exist in newer Python).
- events.py: reword the READ_TIMEOUT comment to use the Disconnected event
  name instead of the looser "Disconnected/reconnect".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@ptr727

ptr727 commented Jun 1, 2026

Copy link
Copy Markdown
Owner Author

Superseded by upstream PR loopj#373 (same branch). This fork-internal PR was only for the Copilot review pass.

@ptr727 ptr727 closed this Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants