Fix MCU lockup recovery: hard-reboot watchdog, fleet-wide Slack alert, defense-in-depth WDTs #145
Merged
Captures the second post-PR-#137 lockup of the same four MCUs. Root cause is a SATA link-layer reset on palantir (third such event in a week); PR #137's server-side 503 path worked as designed but the firmware liveness watchdog hung in App.safe_reboot()'s synchronous shutdown hooks, leaving three relay-off MCUs unresponsive until manual power-cycle ~38 min later. Documents timeline, root cause, what worked vs. what didn't, and prioritized recommendations (fix SATA hardware, switch watchdog to App.reboot(), add fleet-wide Slack alert). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, defense-in-depth WDTs.
Follow-up to PR #137 addressing items 2, 3, and 4 of docs/2026-05-11-mcu-lockup-analysis.md.
Firmware (esphome-configs/2025.11.2/no-current-input.yaml):
- Item 2: change watchdog action from App.safe_reboot() to App.reboot(). safe_reboot() runs every component's synchronous on_shutdown() hook and waits indefinitely for it to return; http_request's shutdown hook blocks when the network/server is unreachable, which is exactly the condition the watchdog is trying to recover from. On 2026-05-11 this left three relay-off MCUs unresponsive for ~40 minutes until manually power-cycled. App.reboot() calls esp_restart() directly.
- Item 4: enable interrupt WDT (CONFIG_ESP_INT_WDT, 300 ms) to catch loop blocks inside critical sections, and bootloader/RTC WDT (CONFIG_BOOTLOADER_WDT_ENABLE / _TIME_MS=9000) as a hardware last-resort beyond the existing task WDT panic.
Server (src/dm_mac/models/machine.py, src/dm_mac/__init__.py):
- Item 3: add FleetTimeoutTracker for cross-machine timeout accounting. Fires a Slack notification when at least FLEET_TIMEOUT_THRESHOLD distinct machines (default 2) record a state-save timeout within FLEET_TIMEOUT_WINDOW_SEC (default 60 s), with FLEET_TIMEOUT_COOLDOWN_SEC (default 300 s) between fires. The existing per-machine >=2-lifetime rule is preserved unchanged; the new rule catches the 2026-05-11 pattern (four machines each at count 1 simultaneously, no notification fired).
Tests:
- 8 new TestFleetTimeoutTracker tests covering distinct-machine counting, window expiration, cooldown, configurable threshold.
- 3 new TestFleetTimeoutNotification integration tests covering the Slack-fire-from-save-cache-timeout path and absent-tracker / below-threshold negative paths.
- Existing test_timeout_raises_and_increments_counter updated to mock current_app, since the new fleet-wide call no longer short-circuits before touching app config.
Docs:
- CLAUDE.md updated to describe both Slack-notification rules.
All 293 tests pass; 97% coverage; mypy / pre-commit / docs all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR is a follow-up hardening pass on the MCU lockup recovery path and on server observability for disk-related stalls, based on the 2026-05-11 incident analysis. It updates the ESPHome watchdog behavior to force an unconditional reboot, adds defense-in-depth watchdog layers in firmware, and introduces a server-side fleet-wide Slack alert for correlated state-save timeouts across multiple machines.
Changes:
- Firmware: switch watchdog action from `App.safe_reboot()` to `App.reboot()`; enable interrupt WDT and bootloader/RTC WDT via `sdkconfig_options`.
- Server: add `FleetTimeoutTracker` plus fleet-wide timeout window/threshold/cooldown constants; emit Slack alert when multiple distinct machines time out within the window.
- Tests/docs: add unit/integration coverage for the fleet tracker and update docs to describe the two Slack paging rules.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `tests/models/test_machine_state.py` | Adds unit tests for `FleetTimeoutTracker` and integration tests verifying fleet-wide Slack notification behavior via `save_cache()` timeouts. |
| `src/dm_mac/models/machine.py` | Implements `FleetTimeoutTracker`, fleet-wide config constants, and hooks fleet-wide notification into the state-save timeout path. |
| `src/dm_mac/__init__.py` | Instantiates one `FleetTimeoutTracker` per app and stores it in `current_app.config`. |
| `esphome-configs/2025.11.2/no-current-input.yaml` | Enables additional watchdog layers and changes liveness watchdog reboot to unconditional `App.reboot()`. |
| `docs/2026-05-11-mcu-lockup-analysis.md` | Adds/updates the incident analysis and recommendations that this PR implements. |
| `CLAUDE.md` | Documents the new fleet-wide Slack notification rule alongside the existing per-machine rule. |
Code review
No issues found. Checked for bugs and CLAUDE.md compliance.
jantman added a commit that referenced this pull request May 12, 2026
Covers two PRs merged after 0.12.0:
- PR #145 (MCU lockup recovery): App.safe_reboot() → App.reboot() in firmware liveness watchdog, new FleetTimeoutTracker for fleet-wide Slack alerting, interrupt + bootloader/RTC WDTs as defense-in-depth.
- PR #146 (dependency refresh): poetry update rollup covering all six Dependabot PRs (#139–#144) plus mypy 2.0.0 → 2.1.0 and others; CI poetry pin bumped to 2.4.1.
Also folds in the unreleased 0.12.1 (ESPHome 2025.11.2 firmware compilation fixes from 6ca8ff7).
Minor version bump (0.12.x → 0.13.0) reflects the new FleetTimeoutTracker feature; the firmware fix and dep refresh alone would have been a patch bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Follow-up to #137 addressing items 2, 3, and 4 of `docs/2026-05-11-mcu-lockup-analysis.md`. Item 1 (replace SATA cable / inspect drive on `palantir`) is operator-owned and tracked outside this PR.
- Item 2: the firmware liveness watchdog calls `App.reboot()` (unconditional `esp_restart()`) instead of `App.safe_reboot()`; the latter hangs in `run_safe_shutdown_hooks()` when the `http_request` component cannot cleanly close, which is exactly the condition the watchdog is trying to recover from.
- Item 3: `FleetTimeoutTracker` fires a Slack alert when ≥ `FLEET_TIMEOUT_THRESHOLD` (default 2) distinct machines record a state-save timeout within `FLEET_TIMEOUT_WINDOW_SEC` (default 60 s), with a `FLEET_TIMEOUT_COOLDOWN_SEC` (default 300 s) cooldown between fires. Catches the 2026-05-11 pattern in which every machine's per-machine counter went 0 → 1 simultaneously and the existing per-machine rule never tripped.
- Item 4: interrupt WDT (`CONFIG_ESP_INT_WDT`, 300 ms) and bootloader/RTC WDT (`CONFIG_BOOTLOADER_WDT_ENABLE` / `_TIME_MS=9000`) on top of the existing task WDT panic.
What's in the PR
- Firmware (`esphome-configs/2025.11.2/no-current-input.yaml`): one-line `App.safe_reboot()` → `App.reboot()` with comment explaining why; three new `sdkconfig_options` for the int / bootloader WDTs with a block comment summarizing the three layers.
- Server (`src/dm_mac/models/machine.py`): new `FleetTimeoutTracker` class (~30 lines plus docstring) and three new module-level config constants. `MachineState._record_save_timeout` now also calls `_notify_fleet_save_timeout`, a sibling of the existing `_notify_save_timeout` that consults the tracker and posts to the same `SLACK_CONTROL_CHANNEL_ID`. App factory in `src/dm_mac/__init__.py` instantiates one tracker per app.
- Tests: `test_timeout_raises_and_increments_counter` updated to mock `current_app` since the new call no longer short-circuits before touching app config.
- Docs: `CLAUDE.md` updated to describe both Slack-notification rules. The full incident analysis was added in `95b97be` (already on `main`).
Test plan
- `nox -s tests`: all 293 passing, 97 % coverage
- `nox -s mypy`: clean
- `nox -s pre-commit`: clean
- `nox -s docs`: builds clean
- … `mac-server` with no card inserted → firmware reboots within 90–120 s and the device fully restarts (no `safe_reboot` hang)
- … `:rotating_light:` fleet-wide notification arrives in control channel exactly once
- … `palantir` SATA hardware (item 1): independent operator work
See also
- `docs/2026-05-11-mcu-lockup-analysis.md`
- `docs/2026-05-05-mcu-lockup-analysis.md`
🤖 Generated with Claude Code
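
For readers who want a concrete picture of the fleet-wide rule, here is a minimal sketch of a sliding-window tracker with a cooldown. It is not the code from `src/dm_mac/models/machine.py`: the class name `FleetTimeoutTrackerSketch`, the `record()` method, the injectable `clock` parameter, and the machine names are all assumptions made for illustration. Only the three constant names and their defaults come from the PR description.

```python
import time
from typing import Callable, Dict, List, Optional

# Defaults quoted in the PR description.
FLEET_TIMEOUT_THRESHOLD = 2         # distinct machines
FLEET_TIMEOUT_WINDOW_SEC = 60.0     # sliding window, seconds
FLEET_TIMEOUT_COOLDOWN_SEC = 300.0  # minimum gap between alerts, seconds


class FleetTimeoutTrackerSketch:
    """Hypothetical stand-in for FleetTimeoutTracker; not the PR's code."""

    def __init__(
        self,
        threshold: int = FLEET_TIMEOUT_THRESHOLD,
        window_sec: float = FLEET_TIMEOUT_WINDOW_SEC,
        cooldown_sec: float = FLEET_TIMEOUT_COOLDOWN_SEC,
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self._threshold = threshold
        self._window = window_sec
        self._cooldown = cooldown_sec
        self._clock = clock
        self._last_seen: Dict[str, float] = {}  # machine -> latest timeout
        self._last_fired: Optional[float] = None

    def record(self, machine_name: str) -> bool:
        """Record one timeout; return True if an alert should fire now."""
        now = self._clock()
        self._last_seen[machine_name] = now
        # Keep only machines whose latest timeout is still inside the window.
        cutoff = now - self._window
        self._last_seen = {
            name: ts for name, ts in self._last_seen.items() if ts >= cutoff
        }
        if len(self._last_seen) < self._threshold:
            return False
        # Suppress repeat alerts until the cooldown has elapsed.
        if self._last_fired is not None and now - self._last_fired < self._cooldown:
            return False
        self._last_fired = now
        return True


# Replay the 2026-05-11 pattern with a fake clock: four machines (names are
# made up) each time out once, one second apart.
fake_now = [0.0]
tracker = FleetTimeoutTrackerSketch(clock=lambda: fake_now[0])
fired: List[bool] = []
for name in ("machine-a", "machine-b", "machine-c", "machine-d"):
    fired.append(tracker.record(name))
    fake_now[0] += 1.0
print(fired)  # [False, True, False, False]
```

In this sketch the alert fires on the second distinct machine and is then held back by the cooldown, which lines up with the "exactly once" expectation in the test plan. A per-machine lifetime counter would read 1 for every machine in the same replay, so a `>= 2` per-machine rule alone would stay silent.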