Stop monitor_connection thread from crashing on dropped connection by itg-karthicr · Pull Request #62 · IVRTech/pystrix

itg-karthicr · 2026-06-24T23:59:06Z

Closes #3.

Problem

Manager.monitor_connection spawns a daemon thread that pings Asterisk on an interval to watch the connection. When the connection drops, that ping fails and the exception propagated out of the thread's target function. Python dumped a traceback to stderr and the monitor thread died. This is the case the original 2012 reporter hit when the Asterisk service was stopped.

The failure surfaces through more than one exception, and a guard on the loop condition alone cannot fix it: the connection can die between the is_connected() check and the send, so the send itself must be protected.

Fix

Catch both connection-loss exceptions in the ping loop and break, so the monitor stops cleanly:
- ManagerSocketError — the socket broke during transmission.
- ManagerError — send_action's own liveness re-check failed because the connection dropped just after the loop's check.
  This mirrors how _MessageReader.run already handles a dead socket.
Log the reason at debug level (when a logger is configured) so a stopped monitor thread is traceable instead of vanishing silently.
Return the monitoring thread from monitor_connection, so callers can join it. Existing callers that ignore the return value are unaffected.
Harden send_action against a related race: a concurrent disconnect() could clear _connection between the liveness check and the send, producing a raw AttributeError that the monitor's catch would miss and that would crash the thread. send_action now rechecks _connection under the lock, drops the just-registered request, and raises ManagerSocketError. This supersedes the separately-filed #63.

Tests

tests/test_ami_monitor.py: the monitor survives a ManagerSocketError and a ManagerError (each asserted via threading.excepthook to confirm nothing escapes the thread); pings repeatedly while connected and exits cleanly when is_connected() turns False; logs its exit reason; and still propagates an unexpected non-connection error rather than swallowing it.
tests/test_ami_send_action.py: send_action raises ManagerSocketError, not AttributeError, when the connection is cleared mid-send, and drops the registered request.

Written test-first: the regression tests fail against the old code and pass against the fix. Full suite: 50 passing, ruff clean.

When Asterisk stops, the monitor thread's periodic Ping fails: send_action raises ManagerSocketError. Nothing caught it, so the exception propagated out of the thread target and dumped a traceback to stderr while the monitor died silently. Catch ManagerSocketError in the ping loop and break, mirroring how _MessageReader.run already handles a dead socket. Also return the monitoring thread so callers can join it, which makes the behavior testable without relying on sleep timing. Add a regression test that drives a monitor whose send_action raises and asserts the thread exits cleanly with no unhandled exception. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Review (Codex) found the catch was too narrow. The monitor checks is_connected(), then send_action(Ping()) runs its own liveness re-check and raises ManagerError -- a sibling of ManagerSocketError, not a subclass -- if the connection dropped in between. That race re-introduced the #3 crash via an uncaught ManagerError. Catch both ManagerError and ManagerSocketError, and log the reason at debug level when a logger is set, so a stopped monitor thread is traceable. Tests: add the ManagerError race path, the normal ping-until-disconnected exit path (which also pins that a joinable thread is returned), and a guard that an unexpected non-connection error still propagates rather than being swallowed by a too-broad except. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

itg-karthicr · 2026-06-25T01:14:32Z

Review panel result for PR #62.

critical

None.

warning

pystrix/ami/ami.py:604: send_action() can still raise raw AttributeError if disconnect() clears _connection after the liveness check at line 594 and before self._connection.send_message(command). disconnect() sets _connection = None at line 420. The monitor catches ManagerError and ManagerSocketError at line 475, so this path can still kill the monitor thread. Suggested fix: capture and recheck _connection inside _connection_lock, raise a manager exception when missing, and remove the outstanding request when the send never happens.

info

tests/test_ami_monitor.py:88: The test checks three sends, but it does not prove each request is Ping. Suggested fix: assert [request["Action"] for request in pings] == ["Ping"] * 3.
pystrix/ami/ami.py:481: The debug logging branch has no direct test. Low risk.
PR body is stale. It says ManagerSocketError and 42 passing; current code covers ManagerError too, and local verification reports 45 passed.

Verification:

PYTHONPATH=. pytest -q: 45 passed
PYTHONPATH=. pytest -q tests/test_ami_monitor.py: 4 passed
ruff check .: passed
ruff format --check .: passed

) Review (PR #62 panel) found the monitor could still crash: send_action passes its liveness check, registers the request, then a concurrent disconnect() nulls _connection before the send, so send_message is called on None -> AttributeError, which the monitor's (ManagerError, ManagerSocketError) catch misses. Recheck _connection under the lock already held for the send; if it is gone, drop the just-registered request and raise ManagerSocketError instead of dereferencing None. This converts the raw crash into the documented exception the monitor already handles, completing the #3 fix. Supersedes the separately-filed #63. Also from the panel: - Assert each monitored ping is actually a Ping request. - Add a test for the debug-log-on-exit branch. - Add a send_action test proving the race raises ManagerSocketError, not AttributeError. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

From the review panel on the race-fix delta: - Tighten the monitor logging test to assert the exception detail survives into the message, not just the static prefix (a dropped "% exc" would otherwise pass). - Add a synchronous-request variant of the race test. No shipped request sets synchronous = True, so a minimal subclass exercises the (events, finalisers) tracking entry and confirms the cleanup drops it too, so a synchronous caller is not left waiting on events that never arrive. - Reword the ManagerSocketError message to match the "Asterisk manager" family used by the other socket-error raises. - Note in the send_action docstring that a concurrent disconnect surfaces as ManagerSocketError. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

itg-karthicr · 2026-06-25T01:45:10Z

Thanks for the panel pass. Here is how each item was handled.

Warning — AttributeError race in send_action. Fixed in 8a036ba, refined in c06821a. send_action now rechecks _connection inside the same _connection_lock block as the send, drops the just-registered request, and raises ManagerSocketError instead of dereferencing None. The monitor already catches that exception, so the race no longer crashes the thread. This folds in the separately-filed race issue, which is now removed.

Info — the ping test did not prove each request is a Ping. Added assert [request["Action"] for request in pings] == ["Ping"] * 3.

Info — the debug-logging branch had no test. Added test_monitor_connection_logs_reason_on_exit. It asserts both the "monitor stopping" prefix and the underlying reason, so a dropped % exc would fail it.

Info — stale PR body. Refreshed. It now covers ManagerError and the race fix, and reports 48 tests passing.

A fresh review panel checked the race delta across five lanes (correctness, error handling, conventions, consistency, tests) plus an independent Codex pass. All clean. The synchronous-request cleanup path also has its own test now, confirming a synchronous caller is not left waiting on events that never arrive.

itg-karthicr

Review panel follow-up for PR #62.

critical

None.

warning

pystrix/ami/ami.py:613: send_action() still leaves _outstanding_requests populated after send_message() raises ManagerSocketError. The request is registered at line 603. The new cleanup only runs for _connection is None at lines 605-612. A live connection can fail inside _SynchronisedSocket.send_message() at lines 1359-1368, which exits before _serve_outstanding_request() at lines 671-673. Suggested fix: wrap self._connection.send_message(command) in except ManagerSocketError, pop the action id, then re-raise. Add a fake-connection test that raises from send_message() and asserts cleanup for async and synchronous requests.

info

The original None dereference race is fixed at pystrix/ami/ami.py:604-612.
monitor_connection() now catches only connection-loss exceptions at pystrix/ami/ami.py:472-483, and tests/test_ami_monitor.py:122-133 proves unrelated errors still escape.
PR body test count is stale. It says Full suite: 47 passing; local verification is 48 passed and the linked comment says 48.

Verification:

PYTHONPATH=. pytest -q: 48 passed
PYTHONPATH=. pytest -q tests/test_ami_monitor.py tests/test_ami_send_action.py: 7 passed
ruff check .: passed
ruff format --check .: passed

itg-karthicr · 2026-06-25T01:52:56Z

+                raise ManagerSocketError(
+                    "Connection to Asterisk manager closed before the request could be sent"
+                )
            self._connection.send_message(command)


send_action() registers the request at line 603, but this call can raise ManagerSocketError before normal cleanup runs at lines 671-673. The new cleanup only covers _connection is None at lines 605-612. Suggested fix: wrap this call in except ManagerSocketError, pop the action id, then re-raise.

Review follow-up: the disconnect race fix only dropped the registered request on the _connection-is-None path. A live socket that breaks inside send_message raises ManagerSocketError before the normal cleanup at _serve_outstanding_request runs, leaking the just-registered action_id in _outstanding_requests. Wrap send_message in except ManagerSocketError, pop the action_id, and re-raise. Add fake-connection tests that raise from send_message and assert the request is dropped for both async and synchronous requests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

itg-karthicr · 2026-06-25T01:57:22Z

Fixed in 3ea0aac.

Warning — _outstanding_requests leak when send_message raises on a live connection. Good catch. The earlier cleanup only covered the _connection is None path. send_action now wraps send_message in except ManagerSocketError, pops the action id, and re-raises, so a socket that breaks mid-send drops the just-registered request too. Added two fake-connection tests that raise from send_message and assert the request is cleaned up for both async and synchronous requests.

Info — stale test count. The PR body now reads 50 passing.

The other two info notes (the original None-dereference race and the narrowed monitor catch) were already resolved and verified. Full suite: 50 passing, ruff clean.

itg-karthicr

Re-reviewed the 3ea0aac follow-up.

critical

None.

warning

None.

info

None.

The previous warning is addressed. send_action() now catches ManagerSocketError from send_message(), drops the action id, and re-raises. The new tests cover async and synchronous requests.

Verification:

PYTHONPATH=. pytest -q: 50 passed
PYTHONPATH=. pytest -q tests/test_ami_send_action.py tests/test_ami_monitor.py: 9 passed
ruff check .: passed
ruff format --check .: passed

itg-karthicr and others added 2 commits June 24, 2026 19:58

itg-karthicr and others added 2 commits June 24, 2026 21:20

itg-karthicr commented Jun 25, 2026

View reviewed changes

itg-karthicr merged commit d7c902b into master Jun 25, 2026
8 checks passed

itg-karthicr deleted the fix/monitor-connection-socketerror branch June 25, 2026 02:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Stop monitor_connection thread from crashing on dropped connection#62

Stop monitor_connection thread from crashing on dropped connection#62
itg-karthicr merged 5 commits into
masterfrom
fix/monitor-connection-socketerror

itg-karthicr commented Jun 24, 2026 •

edited

Loading

Uh oh!

itg-karthicr commented Jun 25, 2026

Uh oh!

itg-karthicr commented Jun 25, 2026

Uh oh!

itg-karthicr left a comment

Uh oh!

itg-karthicr Jun 25, 2026

Uh oh!

itg-karthicr commented Jun 25, 2026

Uh oh!

itg-karthicr left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

itg-karthicr commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Tests

Uh oh!

itg-karthicr commented Jun 25, 2026

Uh oh!

itg-karthicr commented Jun 25, 2026

Uh oh!

itg-karthicr left a comment

Choose a reason for hiding this comment

Uh oh!

itg-karthicr Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

itg-karthicr commented Jun 25, 2026

Uh oh!

itg-karthicr left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

itg-karthicr commented Jun 24, 2026 •

edited

Loading