Skip to content

Stop monitor_connection thread from crashing on dropped connection#62

Merged
itg-karthicr merged 5 commits into
masterfrom
fix/monitor-connection-socketerror
Jun 25, 2026
Merged

Stop monitor_connection thread from crashing on dropped connection#62
itg-karthicr merged 5 commits into
masterfrom
fix/monitor-connection-socketerror

Conversation

@itg-karthicr

@itg-karthicr itg-karthicr commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Closes #3.

Problem

Manager.monitor_connection spawns a daemon thread that pings Asterisk on an interval to watch the connection. When the connection drops, that ping fails and the exception propagated out of the thread's target function. Python dumped a traceback to stderr and the monitor thread died. This is the case the original 2012 reporter hit when the Asterisk service was stopped.

The failure surfaces through more than one exception, and a guard on the loop condition alone cannot fix it: the connection can die between the is_connected() check and the send, so the send itself must be protected.

Fix

  • Catch both connection-loss exceptions in the ping loop and break, so the monitor stops cleanly:
    • ManagerSocketError — the socket broke during transmission.
    • ManagerErrorsend_action's own liveness re-check failed because the connection dropped just after the loop's check.
      This mirrors how _MessageReader.run already handles a dead socket.
  • Log the reason at debug level (when a logger is configured) so a stopped monitor thread is traceable instead of vanishing silently.
  • Return the monitoring thread from monitor_connection, so callers can join it. Existing callers that ignore the return value are unaffected.
  • Harden send_action against a related race: a concurrent disconnect() could clear _connection between the liveness check and the send, producing a raw AttributeError that the monitor's catch would miss and that would crash the thread. send_action now rechecks _connection under the lock, drops the just-registered request, and raises ManagerSocketError. This supersedes the separately-filed #63.

Tests

  • tests/test_ami_monitor.py: the monitor survives a ManagerSocketError and a ManagerError (each asserted via threading.excepthook to confirm nothing escapes the thread); pings repeatedly while connected and exits cleanly when is_connected() turns False; logs its exit reason; and still propagates an unexpected non-connection error rather than swallowing it.
  • tests/test_ami_send_action.py: send_action raises ManagerSocketError, not AttributeError, when the connection is cleared mid-send, and drops the registered request.

Written test-first: the regression tests fail against the old code and pass against the fix. Full suite: 50 passing, ruff clean.

itg-karthicr and others added 2 commits June 24, 2026 19:58
When Asterisk stops, the monitor thread's periodic Ping fails: send_action
raises ManagerSocketError. Nothing caught it, so the exception propagated
out of the thread target and dumped a traceback to stderr while the monitor
died silently.

Catch ManagerSocketError in the ping loop and break, mirroring how
_MessageReader.run already handles a dead socket. Also return the monitoring
thread so callers can join it, which makes the behavior testable without
relying on sleep timing.

Add a regression test that drives a monitor whose send_action raises and
asserts the thread exits cleanly with no unhandled exception.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review (Codex) found the catch was too narrow. The monitor checks
is_connected(), then send_action(Ping()) runs its own liveness re-check
and raises ManagerError -- a sibling of ManagerSocketError, not a subclass
-- if the connection dropped in between. That race re-introduced the #3
crash via an uncaught ManagerError.

Catch both ManagerError and ManagerSocketError, and log the reason at debug
level when a logger is set, so a stopped monitor thread is traceable.

Tests: add the ManagerError race path, the normal ping-until-disconnected
exit path (which also pins that a joinable thread is returned), and a guard
that an unexpected non-connection error still propagates rather than being
swallowed by a too-broad except.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copy link
Copy Markdown
Contributor Author

Review panel result for PR #62.

critical

  • None.

warning

  • pystrix/ami/ami.py:604: send_action() can still raise raw AttributeError if disconnect() clears _connection after the liveness check at line 594 and before self._connection.send_message(command). disconnect() sets _connection = None at line 420. The monitor catches ManagerError and ManagerSocketError at line 475, so this path can still kill the monitor thread. Suggested fix: capture and recheck _connection inside _connection_lock, raise a manager exception when missing, and remove the outstanding request when the send never happens.

info

  • tests/test_ami_monitor.py:88: The test checks three sends, but it does not prove each request is Ping. Suggested fix: assert [request["Action"] for request in pings] == ["Ping"] * 3.
  • pystrix/ami/ami.py:481: The debug logging branch has no direct test. Low risk.
  • PR body is stale. It says ManagerSocketError and 42 passing; current code covers ManagerError too, and local verification reports 45 passed.

Verification:

  • PYTHONPATH=. pytest -q: 45 passed
  • PYTHONPATH=. pytest -q tests/test_ami_monitor.py: 4 passed
  • ruff check .: passed
  • ruff format --check .: passed

itg-karthicr and others added 2 commits June 24, 2026 21:20
)

Review (PR #62 panel) found the monitor could still crash: send_action
passes its liveness check, registers the request, then a concurrent
disconnect() nulls _connection before the send, so send_message is called
on None -> AttributeError, which the monitor's (ManagerError,
ManagerSocketError) catch misses.

Recheck _connection under the lock already held for the send; if it is gone,
drop the just-registered request and raise ManagerSocketError instead of
dereferencing None. This converts the raw crash into the documented
exception the monitor already handles, completing the #3 fix. Supersedes the
separately-filed #63.

Also from the panel:
- Assert each monitored ping is actually a Ping request.
- Add a test for the debug-log-on-exit branch.
- Add a send_action test proving the race raises ManagerSocketError, not
  AttributeError.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
From the review panel on the race-fix delta:
- Tighten the monitor logging test to assert the exception detail survives
  into the message, not just the static prefix (a dropped "% exc" would
  otherwise pass).
- Add a synchronous-request variant of the race test. No shipped request sets
  synchronous = True, so a minimal subclass exercises the (events, finalisers)
  tracking entry and confirms the cleanup drops it too, so a synchronous caller
  is not left waiting on events that never arrive.
- Reword the ManagerSocketError message to match the "Asterisk manager" family
  used by the other socket-error raises.
- Note in the send_action docstring that a concurrent disconnect surfaces as
  ManagerSocketError.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@itg-karthicr

Copy link
Copy Markdown
Contributor Author

Thanks for the panel pass. Here is how each item was handled.

Warning — AttributeError race in send_action. Fixed in 8a036ba, refined in c06821a. send_action now rechecks _connection inside the same _connection_lock block as the send, drops the just-registered request, and raises ManagerSocketError instead of dereferencing None. The monitor already catches that exception, so the race no longer crashes the thread. This folds in the separately-filed race issue, which is now removed.

Info — the ping test did not prove each request is a Ping. Added assert [request["Action"] for request in pings] == ["Ping"] * 3.

Info — the debug-logging branch had no test. Added test_monitor_connection_logs_reason_on_exit. It asserts both the "monitor stopping" prefix and the underlying reason, so a dropped % exc would fail it.

Info — stale PR body. Refreshed. It now covers ManagerError and the race fix, and reports 48 tests passing.

A fresh review panel checked the race delta across five lanes (correctness, error handling, conventions, consistency, tests) plus an independent Codex pass. All clean. The synchronous-request cleanup path also has its own test now, confirming a synchronous caller is not left waiting on events that never arrive.

@itg-karthicr itg-karthicr left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review panel follow-up for PR #62.

critical

  • None.

warning

  • pystrix/ami/ami.py:613: send_action() still leaves _outstanding_requests populated after send_message() raises ManagerSocketError. The request is registered at line 603. The new cleanup only runs for _connection is None at lines 605-612. A live connection can fail inside _SynchronisedSocket.send_message() at lines 1359-1368, which exits before _serve_outstanding_request() at lines 671-673. Suggested fix: wrap self._connection.send_message(command) in except ManagerSocketError, pop the action id, then re-raise. Add a fake-connection test that raises from send_message() and asserts cleanup for async and synchronous requests.

info

  • The original None dereference race is fixed at pystrix/ami/ami.py:604-612.
  • monitor_connection() now catches only connection-loss exceptions at pystrix/ami/ami.py:472-483, and tests/test_ami_monitor.py:122-133 proves unrelated errors still escape.
  • PR body test count is stale. It says Full suite: 47 passing; local verification is 48 passed and the linked comment says 48.

Verification:

  • PYTHONPATH=. pytest -q: 48 passed
  • PYTHONPATH=. pytest -q tests/test_ami_monitor.py tests/test_ami_send_action.py: 7 passed
  • ruff check .: passed
  • ruff format --check .: passed

Comment thread pystrix/ami/ami.py Outdated
raise ManagerSocketError(
"Connection to Asterisk manager closed before the request could be sent"
)
self._connection.send_message(command)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

send_action() registers the request at line 603, but this call can raise ManagerSocketError before normal cleanup runs at lines 671-673. The new cleanup only covers _connection is None at lines 605-612. Suggested fix: wrap this call in except ManagerSocketError, pop the action id, then re-raise.

Review follow-up: the disconnect race fix only dropped the registered
request on the _connection-is-None path. A live socket that breaks inside
send_message raises ManagerSocketError before the normal cleanup at
_serve_outstanding_request runs, leaking the just-registered action_id in
_outstanding_requests.

Wrap send_message in except ManagerSocketError, pop the action_id, and
re-raise. Add fake-connection tests that raise from send_message and assert
the request is dropped for both async and synchronous requests.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@itg-karthicr

Copy link
Copy Markdown
Contributor Author

Fixed in 3ea0aac.

Warning — _outstanding_requests leak when send_message raises on a live connection. Good catch. The earlier cleanup only covered the _connection is None path. send_action now wraps send_message in except ManagerSocketError, pops the action id, and re-raises, so a socket that breaks mid-send drops the just-registered request too. Added two fake-connection tests that raise from send_message and assert the request is cleaned up for both async and synchronous requests.

Info — stale test count. The PR body now reads 50 passing.

The other two info notes (the original None-dereference race and the narrowed monitor catch) were already resolved and verified. Full suite: 50 passing, ruff clean.

@itg-karthicr itg-karthicr left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed the 3ea0aac follow-up.

critical

  • None.

warning

  • None.

info

  • None.

The previous warning is addressed. send_action() now catches ManagerSocketError from send_message(), drops the action id, and re-raises. The new tests cover async and synchronous requests.

Verification:

  • PYTHONPATH=. pytest -q: 50 passed
  • PYTHONPATH=. pytest -q tests/test_ami_send_action.py tests/test_ami_monitor.py: 9 passed
  • ruff check .: passed
  • ruff format --check .: passed

@itg-karthicr itg-karthicr merged commit d7c902b into master Jun 25, 2026
8 checks passed
@itg-karthicr itg-karthicr deleted the fix/monitor-connection-socketerror branch June 25, 2026 02:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

catching exception in monitor_connection() method of Manager class

1 participant