Fix MCU lockup recovery: hard-reboot watchdog, fleet-wide Slack alert, defense-in-depth WDTs #145

Merged
jantman merged 2 commits into main from fix/mcu-lockup-2026-05-11-followup on May 12, 2026

Conversation

jantman commented May 12, 2026

Contributor

Summary

Follow-up to #137 addressing items 2, 3, and 4 of docs/2026-05-11-mcu-lockup-analysis.md. Item 1 (replace SATA cable / inspect drive on palantir) is operator-owned and tracked outside this PR.

| # | Layer | Change |
| --- | --- | --- |
| 2 | Firmware | Liveness watchdog calls App.reboot() (unconditional esp_restart()) instead of App.safe_reboot(). The latter hangs in run_safe_shutdown_hooks() when the http_request component cannot cleanly close, which is exactly the condition the watchdog is trying to recover from. |
| 3 | Server | New FleetTimeoutTracker fires a Slack alert when ≥ FLEET_TIMEOUT_THRESHOLD (default 2) distinct machines record a state-save timeout within FLEET_TIMEOUT_WINDOW_SEC (default 60 s), with a FLEET_TIMEOUT_COOLDOWN_SEC (default 300 s) between fires. Catches the 2026-05-11 pattern in which every machine's per-machine counter went 0 → 1 simultaneously and the existing per-machine rule never tripped. |
| 4 | Firmware | Defense-in-depth: enable interrupt WDT (CONFIG_ESP_INT_WDT, 300 ms) and bootloader/RTC WDT (CONFIG_BOOTLOADER_WDT_ENABLE / _TIME_MS=9000) on top of the existing task WDT panic. |
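
To illustrate why item 3 is needed, here is a minimal worked example of the two alerting rules applied to the 2026-05-11 pattern. The machine names and function names are hypothetical and chosen only for illustration; this is not the code in src/dm_mac/models/machine.py, and the fleet check assumes for simplicity that every recorded timeout fell inside the window.

```python
# Hypothetical machine names. On 2026-05-11, four machines each had their
# per-machine save-timeout counter go 0 -> 1 at about the same time.
timeout_counts = {"machine-a": 1, "machine-b": 1, "machine-c": 1, "machine-d": 1}

PER_MACHINE_LIFETIME_THRESHOLD = 2  # existing rule: >= 2 timeouts on one machine
FLEET_TIMEOUT_THRESHOLD = 2         # new rule: >= 2 distinct machines in the window


def per_machine_rule_fires(counts):
    """Existing rule: alert only if some single machine has reached the threshold."""
    return any(n >= PER_MACHINE_LIFETIME_THRESHOLD for n in counts.values())


def fleet_rule_fires(counts):
    """New rule (simplified): alert if enough distinct machines timed out.

    Assumes all recorded timeouts are inside the fleet window.
    """
    distinct_machines = sum(1 for n in counts.values() if n > 0)
    return distinct_machines >= FLEET_TIMEOUT_THRESHOLD


per_machine_result = per_machine_rule_fires(timeout_counts)  # False: no machine is at 2
fleet_result = fleet_rule_fires(timeout_counts)              # True: 4 distinct machines
```

With every counter at 1, the per-machine rule stays silent while the fleet-wide rule fires, which is the gap the new tracker is meant to close.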

What's in the PR

  • Firmware (esphome-configs/2025.11.2/no-current-input.yaml): one-line App.safe_reboot() → App.reboot() with comment explaining why; three new sdkconfig_options for the interrupt / bootloader WDTs with a block comment summarizing the three layers.
  • Server (src/dm_mac/models/machine.py): new FleetTimeoutTracker class (~30 lines plus docstring) and three new module-level config constants. MachineState._record_save_timeout now also calls _notify_fleet_save_timeout, a sibling of the existing _notify_save_timeout that consults the tracker and posts to the same SLACK_CONTROL_CHANNEL_ID. App factory in src/dm_mac/__init__.py instantiates one tracker per app.
  • Tests: 8 new unit tests on the tracker (distinct counting, window aging, cooldown arming, cooldown expiration, configurable threshold) and 3 integration tests on the save-cache path (fires through tracker; tracker absent → silent; below threshold → silent). Existing test_timeout_raises_and_increments_counter updated to mock current_app since the new call no longer short-circuits before touching app config.
  • Docs: CLAUDE.md updated to describe both Slack-notification rules. The full incident analysis was added in 95b97be (already on main).
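
For readers who want a feel for the tracker's behavior, the following is a rough sketch of a sliding-window tracker with a threshold, a window, and a cooldown, as described in the Server bullet above. It is an assumption-laden illustration, not the FleetTimeoutTracker from this PR: the class name FleetTimeoutTrackerSketch, the record() method, the injectable clock parameter, and the machine names are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional


class FleetTimeoutTrackerSketch:
    """Illustrative only; the real class in machine.py may differ."""

    def __init__(
        self,
        threshold: int = 2,           # mirrors FLEET_TIMEOUT_THRESHOLD default
        window_sec: float = 60.0,     # mirrors FLEET_TIMEOUT_WINDOW_SEC default
        cooldown_sec: float = 300.0,  # mirrors FLEET_TIMEOUT_COOLDOWN_SEC default
        clock: Callable[[], float] = time.monotonic,  # injectable for tests
    ) -> None:
        self.threshold = threshold
        self.window_sec = window_sec
        self.cooldown_sec = cooldown_sec
        self._clock = clock
        self._last_seen: Dict[str, float] = {}  # machine name -> last timeout time
        self._last_fired: Optional[float] = None

    def record(self, machine_name: str) -> bool:
        """Record one timeout; return True if a fleet-wide alert should fire now."""
        now = self._clock()
        self._last_seen[machine_name] = now
        # Age out machines whose most recent timeout is older than the window.
        cutoff = now - self.window_sec
        self._last_seen = {m: t for m, t in self._last_seen.items() if t >= cutoff}
        if len(self._last_seen) < self.threshold:
            return False
        # Suppress repeat alerts until the cooldown has elapsed.
        if self._last_fired is not None and now - self._last_fired < self.cooldown_sec:
            return False
        self._last_fired = now
        return True


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
tracker = FleetTimeoutTrackerSketch(clock=lambda: fake_now[0])

first = tracker.record("machine-a")   # one machine only: below threshold
fake_now[0] = 10.0
second = tracker.record("machine-b")  # two distinct machines within 60 s: fires
fake_now[0] = 20.0
third = tracker.record("machine-c")   # still within the 300 s cooldown: suppressed
```

Injecting the clock is one way to test window aging and cooldown expiration without sleeping; whether the actual tests do this is not stated in the PR.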

Test plan

  • nox -s tests — all 293 passing, 97 % coverage
  • nox -s mypy — clean
  • nox -s pre-commit — clean
  • nox -s docs — builds clean
  • Operator-driven post-merge:
    • Flash firmware to one MCU (suggested: Bronte), soak ≥ 24 h
    • Induced-failure: stop mac-server with no card inserted → firmware reboots within 90–120 s and the device fully restarts (no safe_reboot hang)
    • Induced-failure: simulate two distinct machines hitting state-save timeouts within 60 s → Slack :rotating_light: fleet-wide notification arrives in control channel exactly once
    • Flash remaining MCUs after Bronte soak passes
    • Address palantir SATA hardware (item 1) — independent operator work

See also

🤖 Generated with Claude Code

jantman and others added 2 commits May 11, 2026 20:08
Captures the second post-PR-#137 lockup of the same four MCUs.
Root cause is a SATA link-layer reset on palantir (third such
event in a week); PR #137's server-side 503 path worked as
designed but the firmware liveness watchdog hung in
App.safe_reboot()'s synchronous shutdown hooks, leaving three
relay-off MCUs unresponsive until manual power-cycle ~38 min
later. Documents timeline, root cause, what worked vs. what
didn't, and prioritized recommendations (fix SATA hardware,
switch watchdog to App.reboot(), add fleet-wide Slack alert).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, defense-in-depth WDTs.

Follow-up to PR #137 addressing items 2, 3, and 4 of
docs/2026-05-11-mcu-lockup-analysis.md.

Firmware (esphome-configs/2025.11.2/no-current-input.yaml):

- Item 2: change watchdog action from App.safe_reboot() to
  App.reboot(). safe_reboot() runs every component's synchronous
  on_shutdown() hook and waits indefinitely for it to return;
  http_request's shutdown hook blocks when the network/server is
  unreachable, which is exactly the condition the watchdog is trying
  to recover from. On 2026-05-11 this left three relay-off MCUs
  unresponsive for ~40 minutes until manually power-cycled.
  App.reboot() calls esp_restart() directly.

- Item 4: enable interrupt WDT (CONFIG_ESP_INT_WDT, 300 ms) to catch
  loop blocks inside critical sections, and bootloader/RTC WDT
  (CONFIG_BOOTLOADER_WDT_ENABLE / _TIME_MS=9000) as a hardware
  last-resort beyond the existing task WDT panic.

Server (src/dm_mac/models/machine.py, src/dm_mac/__init__.py):

- Item 3: add FleetTimeoutTracker for cross-machine timeout
  accounting. Fires a Slack notification when at least
  FLEET_TIMEOUT_THRESHOLD distinct machines (default 2) record a
  state-save timeout within FLEET_TIMEOUT_WINDOW_SEC (default 60 s),
  with FLEET_TIMEOUT_COOLDOWN_SEC (default 300 s) between fires. The
  existing per-machine >=2-lifetime rule is preserved unchanged; the
  new rule catches the 2026-05-11 pattern (four machines each at
  count 1 simultaneously, no notification fired).

Tests:

- 8 new TestFleetTimeoutTracker tests covering distinct-machine
  counting, window expiration, cooldown, configurable threshold.
- 3 new TestFleetTimeoutNotification integration tests covering the
  Slack-fire-from-save-cache-timeout path and absent-tracker /
  below-threshold negative paths.
- Existing test_timeout_raises_and_increments_counter updated to
  mock current_app, since the new fleet-wide call no longer
  short-circuits before touching app config.

Docs:

- CLAUDE.md updated to describe both Slack-notification rules.

All 293 tests pass; 97% coverage; mypy / pre-commit / docs all
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Coverage

Coverage Report
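| File | Stmts | Miss | Branch | BrPart | Cover | Missing |
| --- | --- | --- | --- | --- | --- | --- |
| src/dm_mac | | | | | | |
| __init__.py | 73 | 0 | 6 | 0 | 100% | |
| cli_utils.py | 15 | 0 | 0 | 0 | 100% | |
| neon_fob_adder.py | 232 | 15 | 60 | 5 | 93% | 79, 116–117, 124, 270, 333–334, 341, 364–367, 454–456 |
| neongetter.py | 211 | 1 | 54 | 3 | 99% | 309 |
| slack_handler.py | 165 | 0 | 42 | 0 | 100% | |
| utils.py | 25 | 0 | 4 | 0 | 100% | |
| src/dm_mac/models | | | | | | |
| __init__.py | 0 | 0 | 0 | 0 | 100% | |
| api_schemas.py | 34 | 0 | 0 | 0 | 100% | |
| machine.py | 580 | 16 | 194 | 16 | 97% | 589, 669, 961–963, 1075–1084, 1142 |
| users.py | 103 | 0 | 32 | 0 | 100% | |
| src/dm_mac/views | | | | | | |
| __init__.py | 0 | 0 | 0 | 0 | 100% | |
| api.py | 32 | 0 | 0 | 0 | 100% | |
| machine.py | 103 | 0 | 12 | 0 | 100% | |
| prometheus.py | 132 | 0 | 12 | 1 | 100% | |
| TOTAL | 1705 | 32 | 416 | 25 | 98% | |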

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 295 | 0 💤 | 0 ❌ | 0 🔥 | 20.092s ⏱️ |

Copilot AI left a comment

Contributor

Pull request overview

This PR is a follow-up hardening pass on the MCU lockup recovery path and on server observability for disk-related stalls, based on the 2026-05-11 incident analysis. It updates the ESPHome watchdog behavior to force an unconditional reboot, adds defense-in-depth watchdog layers in firmware, and introduces a server-side fleet-wide Slack alert for correlated state-save timeouts across multiple machines.

Changes:

  • Firmware: switch watchdog action from App.safe_reboot() to App.reboot(); enable interrupt WDT and bootloader/RTC WDT via sdkconfig_options.
  • Server: add FleetTimeoutTracker plus fleet-wide timeout window/threshold/cooldown constants; emit Slack alert when multiple distinct machines time out within the window.
  • Tests/docs: add unit/integration coverage for the fleet tracker and update docs to describe the two Slack paging rules.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tests/models/test_machine_state.py | Adds unit tests for FleetTimeoutTracker and integration tests verifying fleet-wide Slack notification behavior via save_cache() timeouts. |
| src/dm_mac/models/machine.py | Implements FleetTimeoutTracker, fleet-wide config constants, and hooks fleet-wide notification into the state-save timeout path. |
| src/dm_mac/__init__.py | Instantiates one FleetTimeoutTracker per app and stores it in current_app.config. |
| esphome-configs/2025.11.2/no-current-input.yaml | Enables additional watchdog layers and changes liveness watchdog reboot to unconditional App.reboot(). |
| docs/2026-05-11-mcu-lockup-analysis.md | Adds/updates the incident analysis and recommendations that this PR implements. |
| CLAUDE.md | Documents the new fleet-wide Slack notification rule alongside the existing per-machine rule. |


claude Bot commented May 12, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

jantman merged commit 6de4fab into main May 12, 2026
21 checks passed
jantman deleted the fix/mcu-lockup-2026-05-11-followup branch May 12, 2026 09:49
jantman added a commit that referenced this pull request May 12, 2026
Covers two PRs merged after 0.12.0:

- PR #145 (MCU lockup recovery): App.safe_reboot() → App.reboot()
  in firmware liveness watchdog, new FleetTimeoutTracker for
  fleet-wide Slack alerting, interrupt + bootloader/RTC WDTs as
  defense-in-depth.
- PR #146 (dependency refresh): poetry update rollup covering all
  six Dependabot PRs (#139–#144) plus mypy 2.0.0 → 2.1.0 and
  others; CI poetry pin bumped to 2.4.1.

Also folds in the unreleased 0.12.1 (ESPHome 2025.11.2 firmware
compilation fixes from 6ca8ff7).

Minor version bump (0.12.x → 0.13.0) reflects the new
FleetTimeoutTracker feature; the firmware fix and dep refresh
alone would have been a patch bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>