Stop monitor_connection thread from crashing on dropped connection#62
Conversation
When Asterisk stops, the monitor thread's periodic Ping fails: send_action raises ManagerSocketError. Nothing caught it, so the exception propagated out of the thread target and dumped a traceback to stderr while the monitor died silently. Catch ManagerSocketError in the ping loop and break, mirroring how _MessageReader.run already handles a dead socket. Also return the monitoring thread so callers can join it, which makes the behavior testable without relying on sleep timing. Add a regression test that drives a monitor whose send_action raises and asserts the thread exits cleanly with no unhandled exception. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review (Codex) found the catch was too narrow. The monitor checks is_connected(), then send_action(Ping()) runs its own liveness re-check and raises ManagerError -- a sibling of ManagerSocketError, not a subclass -- if the connection dropped in between. That race re-introduced the #3 crash via an uncaught ManagerError. Catch both ManagerError and ManagerSocketError, and log the reason at debug level when a logger is set, so a stopped monitor thread is traceable. Tests: add the ManagerError race path, the normal ping-until-disconnected exit path (which also pins that a joinable thread is returned), and a guard that an unexpected non-connection error still propagates rather than being swallowed by a too-broad except. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Review panel result for PR #62. critical
warning
info
Verification:
|
) Review (PR #62 panel) found the monitor could still crash: send_action passes its liveness check, registers the request, then a concurrent disconnect() nulls _connection before the send, so send_message is called on None -> AttributeError, which the monitor's (ManagerError, ManagerSocketError) catch misses. Recheck _connection under the lock already held for the send; if it is gone, drop the just-registered request and raise ManagerSocketError instead of dereferencing None. This converts the raw crash into the documented exception the monitor already handles, completing the #3 fix. Supersedes the separately-filed #63. Also from the panel: - Assert each monitored ping is actually a Ping request. - Add a test for the debug-log-on-exit branch. - Add a send_action test proving the race raises ManagerSocketError, not AttributeError. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
From the review panel on the race-fix delta: - Tighten the monitor logging test to assert the exception detail survives into the message, not just the static prefix (a dropped "% exc" would otherwise pass). - Add a synchronous-request variant of the race test. No shipped request sets synchronous = True, so a minimal subclass exercises the (events, finalisers) tracking entry and confirms the cleanup drops it too, so a synchronous caller is not left waiting on events that never arrive. - Reword the ManagerSocketError message to match the "Asterisk manager" family used by the other socket-error raises. - Note in the send_action docstring that a concurrent disconnect surfaces as ManagerSocketError. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks for the panel pass. Here is how each item was handled. Warning — Info — the ping test did not prove each request is a Info — the debug-logging branch had no test. Added Info — stale PR body. Refreshed. It now covers A fresh review panel checked the race delta across five lanes (correctness, error handling, conventions, consistency, tests) plus an independent Codex pass. All clean. The synchronous-request cleanup path also has its own test now, confirming a synchronous caller is not left waiting on events that never arrive. |
itg-karthicr
left a comment
There was a problem hiding this comment.
Review panel follow-up for PR #62.
critical
- None.
warning
pystrix/ami/ami.py:613:send_action()still leaves_outstanding_requestspopulated aftersend_message()raisesManagerSocketError. The request is registered at line 603. The new cleanup only runs for_connection is Noneat lines 605-612. A live connection can fail inside_SynchronisedSocket.send_message()at lines 1359-1368, which exits before_serve_outstanding_request()at lines 671-673. Suggested fix: wrapself._connection.send_message(command)inexcept ManagerSocketError, pop the action id, then re-raise. Add a fake-connection test that raises fromsend_message()and asserts cleanup for async and synchronous requests.
info
- The original
Nonedereference race is fixed atpystrix/ami/ami.py:604-612. monitor_connection()now catches only connection-loss exceptions atpystrix/ami/ami.py:472-483, andtests/test_ami_monitor.py:122-133proves unrelated errors still escape.- PR body test count is stale. It says
Full suite: 47 passing; local verification is 48 passed and the linked comment says 48.
Verification:
PYTHONPATH=. pytest -q: 48 passedPYTHONPATH=. pytest -q tests/test_ami_monitor.py tests/test_ami_send_action.py: 7 passedruff check .: passedruff format --check .: passed
| raise ManagerSocketError( | ||
| "Connection to Asterisk manager closed before the request could be sent" | ||
| ) | ||
| self._connection.send_message(command) |
There was a problem hiding this comment.
send_action() registers the request at line 603, but this call can raise ManagerSocketError before normal cleanup runs at lines 671-673. The new cleanup only covers _connection is None at lines 605-612. Suggested fix: wrap this call in except ManagerSocketError, pop the action id, then re-raise.
Review follow-up: the disconnect race fix only dropped the registered request on the _connection-is-None path. A live socket that breaks inside send_message raises ManagerSocketError before the normal cleanup at _serve_outstanding_request runs, leaking the just-registered action_id in _outstanding_requests. Wrap send_message in except ManagerSocketError, pop the action_id, and re-raise. Add fake-connection tests that raise from send_message and assert the request is dropped for both async and synchronous requests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Fixed in Warning — Info — stale test count. The PR body now reads 50 passing. The other two info notes (the original |
itg-karthicr
left a comment
There was a problem hiding this comment.
Re-reviewed the 3ea0aac follow-up.
critical
- None.
warning
- None.
info
- None.
The previous warning is addressed. send_action() now catches ManagerSocketError from send_message(), drops the action id, and re-raises. The new tests cover async and synchronous requests.
Verification:
PYTHONPATH=. pytest -q: 50 passedPYTHONPATH=. pytest -q tests/test_ami_send_action.py tests/test_ami_monitor.py: 9 passedruff check .: passedruff format --check .: passed
Closes #3.
Problem
Manager.monitor_connectionspawns a daemon thread that pings Asterisk on an interval to watch the connection. When the connection drops, that ping fails and the exception propagated out of the thread's target function. Python dumped a traceback to stderr and the monitor thread died. This is the case the original 2012 reporter hit when the Asterisk service was stopped.The failure surfaces through more than one exception, and a guard on the loop condition alone cannot fix it: the connection can die between the
is_connected()check and the send, so the send itself must be protected.Fix
break, so the monitor stops cleanly:ManagerSocketError— the socket broke during transmission.ManagerError—send_action's own liveness re-check failed because the connection dropped just after the loop's check.This mirrors how
_MessageReader.runalready handles a dead socket.monitor_connection, so callers can join it. Existing callers that ignore the return value are unaffected.send_actionagainst a related race: a concurrentdisconnect()could clear_connectionbetween the liveness check and the send, producing a rawAttributeErrorthat the monitor's catch would miss and that would crash the thread.send_actionnow rechecks_connectionunder the lock, drops the just-registered request, and raisesManagerSocketError. This supersedes the separately-filed #63.Tests
tests/test_ami_monitor.py: the monitor survives aManagerSocketErrorand aManagerError(each asserted viathreading.excepthookto confirm nothing escapes the thread); pings repeatedly while connected and exits cleanly whenis_connected()turns False; logs its exit reason; and still propagates an unexpected non-connection error rather than swallowing it.tests/test_ami_send_action.py:send_actionraisesManagerSocketError, notAttributeError, when the connection is cleared mid-send, and drops the registered request.Written test-first: the regression tests fail against the old code and pass against the fix. Full suite: 50 passing, ruff clean.