Fix MCU lockup recovery: hard-reboot watchdog, fleet-wide Slack alert, defense-in-depth WDTs by jantman · Pull Request #145 · DecaturMakers/machine-access-control

jantman · 2026-05-12T00:25:14Z

Summary

Follow-up to #137 addressing items 2, 3, and 4 of docs/2026-05-11-mcu-lockup-analysis.md. Item 1 (replace SATA cable / inspect drive on palantir) is operator-owned and tracked outside this PR.

#	Layer	Change
2	Firmware	Liveness watchdog calls `App.reboot()` (unconditional `esp_restart()`) instead of `App.safe_reboot()` — the latter hangs in `run_safe_shutdown_hooks()` when the `http_request` component cannot cleanly close, which is exactly the condition the watchdog is trying to recover from.
3	Server	New `FleetTimeoutTracker` fires a Slack alert when ≥ `FLEET_TIMEOUT_THRESHOLD` (default 2) distinct machines record a state-save timeout within `FLEET_TIMEOUT_WINDOW_SEC` (default 60 s), with a `FLEET_TIMEOUT_COOLDOWN_SEC` (default 300 s) between fires. Catches the 2026-05-11 pattern in which every machine's per-machine counter went 0 → 1 simultaneously and the existing per-machine rule never tripped.
4	Firmware	Defense-in-depth: enable interrupt WDT (`CONFIG_ESP_INT_WDT`, 300 ms) and bootloader/RTC WDT (`CONFIG_BOOTLOADER_WDT_ENABLE` / `_TIME_MS=9000`) on top of the existing task WDT panic.
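
The fleet-wide rule in item 3 can be sketched as a sliding window over recent timeouts plus a cooldown. This is a minimal illustration under stated assumptions, not the class added in `src/dm_mac/models/machine.py`: the class name `FleetTimeoutTrackerSketch`, the `record()` method, and the injectable `clock` argument are hypothetical. Only the three defaults (threshold 2, window 60 s, cooldown 300 s) come from the PR description.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class FleetTimeoutTrackerSketch:
    """Illustrative sliding-window tracker; names here are hypothetical."""

    def __init__(
        self,
        threshold: int = 2,          # distinct machines needed to fire
        window_sec: float = 60.0,    # how long a timeout stays relevant
        cooldown_sec: float = 300.0, # minimum gap between alerts
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.window_sec = window_sec
        self.cooldown_sec = cooldown_sec
        self._clock = clock
        self._events: Deque[Tuple[float, str]] = deque()
        self._last_fired: Optional[float] = None

    def record(self, machine_name: str) -> bool:
        """Record one timeout; return True if an alert should fire now."""
        now = self._clock()
        self._events.append((now, machine_name))
        # Age out events that have fallen outside the window.
        while self._events and now - self._events[0][0] > self.window_sec:
            self._events.popleft()
        # Count distinct machines, so one flapping machine cannot trip the rule.
        distinct = {name for _, name in self._events}
        if len(distinct) < self.threshold:
            return False
        # Suppress repeat alerts until the cooldown has elapsed.
        if self._last_fired is not None and now - self._last_fired < self.cooldown_sec:
            return False
        self._last_fired = now
        return True
```

Injecting the clock is a design choice made for this sketch so that window aging and cooldown expiry can be exercised without sleeping. Whether the real tracker does the same is not stated in the PR.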

What's in the PR

Firmware (`esphome-configs/2025.11.2/no-current-input.yaml`): one-line `App.safe_reboot()` → `App.reboot()` change with a comment explaining why; three new `sdkconfig_options` entries for the interrupt and bootloader WDTs, with a block comment summarizing the three watchdog layers.
Server (`src/dm_mac/models/machine.py`): new `FleetTimeoutTracker` class (~30 lines plus docstring) and three new module-level config constants. `MachineState._record_save_timeout` now also calls `_notify_fleet_save_timeout`, a sibling of the existing `_notify_save_timeout` that consults the tracker and posts to the same `SLACK_CONTROL_CHANNEL_ID`. The app factory in `src/dm_mac/__init__.py` instantiates one tracker per app.
Tests: 8 new unit tests on the tracker (distinct counting, window aging, cooldown arming, cooldown expiration, configurable threshold) and 3 integration tests on the save-cache path (fires through tracker; tracker absent → silent; below threshold → silent). The existing `test_timeout_raises_and_increments_counter` is updated to mock `current_app`, since the new call no longer short-circuits before touching app config.
Docs: `CLAUDE.md` updated to describe both Slack-notification rules. The full incident analysis was added in 95b97be (already on main).
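
The three integration paths listed under Tests (fires through tracker, tracker absent, below threshold) can be sketched with a plain function. This is a hedged illustration, not the PR's `_notify_fleet_save_timeout`: the function signature, the `FLEET_TIMEOUT_TRACKER` config key, the tracker's `record()` method, the `StubTracker` class, and the `post_message` callable are all assumptions made for the example. Only `SLACK_CONTROL_CHANNEL_ID` is a name taken from the PR.

```python
from typing import Any, Callable, Dict, Set


class StubTracker:
    """Hypothetical stand-in: fires once enough distinct machines are seen."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.seen: Set[str] = set()

    def record(self, machine_name: str) -> bool:
        self.seen.add(machine_name)
        return len(self.seen) >= self.threshold


def notify_fleet_save_timeout(
    config: Dict[str, Any],
    machine_name: str,
    post_message: Callable[[str, str], None],
) -> bool:
    """Post a fleet-wide alert if the tracker says so; return True if posted."""
    # Assumed config key; the PR only says the tracker lives in app config.
    tracker = config.get("FLEET_TIMEOUT_TRACKER")
    if tracker is None:
        # Tracker absent: stay silent rather than raise.
        return False
    if not tracker.record(machine_name):
        # Below threshold (or, in a real tracker, inside the cooldown).
        return False
    post_message(
        config["SLACK_CONTROL_CHANNEL_ID"],
        f"State-save timeouts on multiple machines (latest: {machine_name})",
    )
    return True
```

Passing the config and the Slack poster in as arguments keeps the sketch free of Flask, so each of the three paths can be checked with a dictionary and a list.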

Test plan

Captures the second post-PR-#137 lockup of the same four MCUs. Root cause is a SATA link-layer reset on palantir (third such event in a week); PR #137's server-side 503 path worked as designed but the firmware liveness watchdog hung in App.safe_reboot()'s synchronous shutdown hooks, leaving three relay-off MCUs unresponsive until manual power-cycle ~38 min later. Documents timeline, root cause, what worked vs. what didn't, and prioritized recommendations (fix SATA hardware, switch watchdog to App.reboot(), add fleet-wide Slack alert).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…, defense-in-depth WDTs.

Follow-up to PR #137 addressing items 2, 3, and 4 of docs/2026-05-11-mcu-lockup-analysis.md.

Firmware (esphome-configs/2025.11.2/no-current-input.yaml):
- Item 2: change watchdog action from App.safe_reboot() to App.reboot(). safe_reboot() runs every component's synchronous on_shutdown() hook and waits indefinitely for it to return; http_request's shutdown hook blocks when the network/server is unreachable, which is exactly the condition the watchdog is trying to recover from. On 2026-05-11 this left three relay-off MCUs unresponsive for ~40 minutes until manually power-cycled. App.reboot() calls esp_restart() directly.
- Item 4: enable interrupt WDT (CONFIG_ESP_INT_WDT, 300 ms) to catch loop blocks inside critical sections, and bootloader/RTC WDT (CONFIG_BOOTLOADER_WDT_ENABLE / _TIME_MS=9000) as a hardware last-resort beyond the existing task WDT panic.

Server (src/dm_mac/models/machine.py, src/dm_mac/__init__.py):
- Item 3: add FleetTimeoutTracker for cross-machine timeout accounting. Fires a Slack notification when at least FLEET_TIMEOUT_THRESHOLD distinct machines (default 2) record a state-save timeout within FLEET_TIMEOUT_WINDOW_SEC (default 60 s), with FLEET_TIMEOUT_COOLDOWN_SEC (default 300 s) between fires. The existing per-machine >=2-lifetime rule is preserved unchanged; the new rule catches the 2026-05-11 pattern (four machines each at count 1 simultaneously, no notification fired).

Tests:
- 8 new TestFleetTimeoutTracker tests covering distinct-machine counting, window expiration, cooldown, configurable threshold.
- 3 new TestFleetTimeoutNotification integration tests covering the Slack-fire-from-save-cache-timeout path and absent-tracker / below-threshold negative paths.
- Existing test_timeout_raises_and_increments_counter updated to mock current_app, since the new fleet-wide call no longer short-circuits before touching app config.

Docs:
- CLAUDE.md updated to describe both Slack-notification rules.

All 293 tests pass; 97% coverage; mypy / pre-commit / docs all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
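
To make the gap between the two rules concrete, here is a small sketch that evaluates both against a list of timeout events. It is an illustration under assumptions, not code from the PR: the function names, the machine names, and the timestamps are made up, and only the thresholds (per-machine lifetime count of 2, fleet threshold of 2 within 60 s) follow the commit message.

```python
from collections import Counter
from typing import List, Tuple

Event = Tuple[str, float]  # (machine name, timestamp in seconds)


def per_machine_rule_fires(events: List[Event], lifetime_threshold: int = 2) -> bool:
    """True if any single machine has accumulated enough timeouts."""
    counts = Counter(name for name, _ in events)
    return any(count >= lifetime_threshold for count in counts.values())


def fleet_rule_fires(
    events: List[Event], threshold: int = 2, window_sec: float = 60.0
) -> bool:
    """True if enough distinct machines time out within one window."""
    ordered = sorted(events, key=lambda event: event[1])
    for i, (_, end) in enumerate(ordered):
        # Distinct machines whose timeout falls in the window ending here.
        in_window = {
            name for name, ts in ordered[: i + 1] if end - ts <= window_sec
        }
        if len(in_window) >= threshold:
            return True
    return False


# Hypothetical reconstruction of the pattern: four machines, one timeout each,
# all within a couple of seconds.
incident = [
    ("machine-a", 0.0),
    ("machine-b", 0.5),
    ("machine-c", 1.0),
    ("machine-d", 1.5),
]
```

With these inputs `per_machine_rule_fires(incident)` returns `False` because every count is 1, while `fleet_rule_fires(incident)` returns `True`. That is the case the commit message describes as having produced no notification.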

github-actions · 2026-05-12T00:26:55Z

Coverage Report

File	Stmts	Miss	Branch	BrPart	Cover	Missing
src/dm_mac
__init__.py	73	0	6	0	100%
cli_utils.py	15	0	0	0	100%
neon_fob_adder.py	232	15	60	5	93%	79, 116–117, 124, 270, 333–334, 341, 364–367, 454–456
neongetter.py	211	1	54	3	99%	309
slack_handler.py	165	0	42	0	100%
utils.py	25	0	4	0	100%
src/dm_mac/models
__init__.py	0	0	0	0	100%
api_schemas.py	34	0	0	0	100%
machine.py	580	16	194	16	97%	589, 669, 961–963, 1075–1084, 1142
users.py	103	0	32	0	100%
src/dm_mac/views
__init__.py	0	0	0	0	100%
api.py	32	0	0	0	100%
machine.py	103	0	12	0	100%
prometheus.py	132	0	12	1	100%
TOTAL	1705	32	416	25	98%

Tests	Skipped	Failures	Errors	Time
295	0 💤	0 ❌	0 🔥	20.092s ⏱️

Copilot

Pull request overview

This PR is a follow-up hardening pass on the MCU lockup recovery path and on server observability for disk-related stalls, based on the 2026-05-11 incident analysis. It updates the ESPHome watchdog behavior to force an unconditional reboot, adds defense-in-depth watchdog layers in firmware, and introduces a server-side fleet-wide Slack alert for correlated state-save timeouts across multiple machines.

Changes:

Firmware: switch watchdog action from App.safe_reboot() to App.reboot(); enable interrupt WDT and bootloader/RTC WDT via sdkconfig_options.
Server: add FleetTimeoutTracker plus fleet-wide timeout window/threshold/cooldown constants; emit Slack alert when multiple distinct machines time out within the window.
Tests/docs: add unit/integration coverage for the fleet tracker and update docs to describe the two Slack paging rules.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

File	Description
`tests/models/test_machine_state.py`	Adds unit tests for `FleetTimeoutTracker` and integration tests verifying fleet-wide Slack notification behavior via `save_cache()` timeouts.
`src/dm_mac/models/machine.py`	Implements `FleetTimeoutTracker`, fleet-wide config constants, and hooks fleet-wide notification into the state-save timeout path.
`src/dm_mac/__init__.py`	Instantiates one `FleetTimeoutTracker` per app and stores it in `current_app.config`.
`esphome-configs/2025.11.2/no-current-input.yaml`	Enables additional watchdog layers and changes liveness watchdog reboot to unconditional `App.reboot()`.
`docs/2026-05-11-mcu-lockup-analysis.md`	Adds/updates the incident analysis and recommendations that this PR implements.
`CLAUDE.md`	Documents the new fleet-wide Slack notification rule alongside the existing per-machine rule.

claude · 2026-05-12T00:33:17Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Covers two PRs merged after 0.12.0:
- PR #145 (MCU lockup recovery): App.safe_reboot() → App.reboot() in firmware liveness watchdog, new FleetTimeoutTracker for fleet-wide Slack alerting, interrupt + bootloader/RTC WDTs as defense-in-depth.
- PR #146 (dependency refresh): poetry update rollup covering all six Dependabot PRs (#139–#144) plus mypy 2.0.0 → 2.1.0 and others; CI poetry pin bumped to 2.4.1.

Also folds in the unreleased 0.12.1 (ESPHome 2025.11.2 firmware compilation fixes from 6ca8ff7). Minor version bump (0.12.x → 0.13.0) reflects the new FleetTimeoutTracker feature; the firmware fix and dep refresh alone would have been a patch bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jantman and others added 2 commits May 11, 2026 20:08

jantman requested a review from Copilot May 12, 2026 00:25

Copilot started reviewing on behalf of jantman May 12, 2026 00:26

Copilot AI reviewed May 12, 2026

jantman merged commit 6de4fab into main May 12, 2026
21 checks passed

jantman deleted the fix/mcu-lockup-2026-05-11-followup branch May 12, 2026 09:49

jantman mentioned this pull request May 12, 2026

chore(deps): refresh all deps (supersedes Dependabot PRs #139–#144) #146

Merged

5 tasks
