
fix(gateway): refresh MQTT token on broker auth-failure to self-heal expired JWT#4074

Merged: springfall2008 merged 2 commits into main from fix/gateway-mqtt-token-refresh-on-auth on Jun 16, 2026


@mgazza mgazza commented Jun 16, 2026

Collaborator

Problem

The gateway MQTT access token is a short-lived (24h) JWT. When it expires, GatewayMQTT._mqtt_loop reconnects forever with the same rejected token — the broker returns CONNACK reason code 134 ("Bad user name or password") on every attempt — so PredBat silently loses all gateway control until the AppDaemon pod (and its rendered config) is rebuilt.

The existing proactive refresh (_check_token_refresh) runs from the housekeeping run() loop, which does not rescue a listener already stuck in the reconnect loop. Observed in production across multiple gateway instances: every instance lost control ~24h after its last token render, with predbat.status reporting Inverter 0 write to charge_rate/scheduled_discharge_enable failed (a side effect — batpred could not publish anything).

Fix

Make the reconnect loop self-healing: when the broker rejects authentication, force a token refresh before reconnecting, instead of retrying the dead token.

  • _is_auth_failure(error) — classify CONNACK 134/135 auth rejections vs ordinary drops (e.g. "Disconnected during message iteration", network errors), so we only refresh when a new token can actually help.
  • _maybe_refresh_on_auth_error(error) — force a refresh only for auth failures.
  • _do_token_refresh() / _apply_refresh_response() — extracted from _check_token_refresh so the proactive near-expiry path and the reconnect path share one refresh implementation.
  • Wired into _mqtt_loop's exception handler (before the reconnect backoff sleep).
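As a rough illustration of the classification step, here is a minimal sketch. It is not the code in gateway.py: the standalone function name `is_auth_failure`, the phrase list, and the regex are assumptions for illustration, and the real helper may inspect the exception differently. The only facts taken from the PR are that reason codes 134 and 135 count as auth rejections and that ordinary drops such as "Disconnected during message iteration" do not.

```python
import re

# MQTT v5 CONNACK reason codes for rejected credentials:
# 134 = "Bad User Name or Password", 135 = "Not authorized".
AUTH_REASON_CODES = {134, 135}
AUTH_PHRASES = ("bad user name or password", "not authorized")


def is_auth_failure(error):
    """Return True if the error text looks like a broker auth rejection.

    Hypothetical helper: matches on the stringified exception only.
    """
    text = str(error).lower()
    if any(phrase in text for phrase in AUTH_PHRASES):
        return True
    # Only trust a number that directly follows the word "code",
    # e.g. "[code:134]" or "reason code 135", to avoid false positives.
    codes = re.findall(r"code\W*(\d+)", text)
    return any(int(code) in AUTH_REASON_CODES for code in codes)
```

Matching on the word "code" rather than any bare number keeps messages like "timed out after 134 seconds" classified as ordinary failures, so a refresh is only attempted when a new token could plausibly help.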

No behaviour change on the happy path or for non-auth disconnects.
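The control flow described above (refresh on auth failure, then back off, then reconnect) can be sketched as follows. This is a simplified stand-in, not the actual `_mqtt_loop`: the `ReconnectLoop` class, its `connect` and `refresh_token` callables, and the bounded `max_attempts` are all hypothetical, chosen so the example runs without a broker.

```python
import asyncio


class ReconnectLoop:
    """Hypothetical stand-in for a gateway MQTT listener."""

    def __init__(self, connect, refresh_token, token, backoff=0.0):
        self.connect = connect              # async callable(token); raises on failure
        self.refresh_token = refresh_token  # async callable() -> new token or None
        self.token = token
        self.backoff = backoff

    @staticmethod
    def is_auth_failure(error):
        text = str(error).lower()
        return "bad user name or password" in text or "not authorized" in text

    async def maybe_refresh_on_auth_error(self, error):
        """Force a refresh only when the broker rejected our credentials."""
        if not self.is_auth_failure(error):
            return False
        new_token = await self.refresh_token()
        if new_token:
            self.token = new_token
        return bool(new_token)

    async def run(self, max_attempts):
        for _ in range(max_attempts):
            try:
                await self.connect(self.token)
                return True
            except Exception as error:
                # Refresh before the backoff sleep so the next attempt
                # does not reuse the rejected token.
                await self.maybe_refresh_on_auth_error(error)
                await asyncio.sleep(self.backoff)
        return False


async def demo():
    attempts = []

    async def connect(token):
        attempts.append(token)
        if token == "expired":
            raise ConnectionError("[code:134] Bad user name or password")

    async def refresh():
        return "fresh"

    loop = ReconnectLoop(connect, refresh, token="expired")
    ok = await loop.run(max_attempts=3)
    return ok, attempts
```

In `demo()`, the first attempt fails with the expired token, the handler swaps in the fresh one, and the second attempt succeeds. A non-auth error would skip the refresh and go straight to the backoff.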

Testing

New apps/predbat/tests/test_gateway_token_refresh.py (12 tests): auth-failure classification, response application (epoch + ISO expiry, failure leaves token unchanged), and that a refresh is forced only on auth failure. Existing gateway tests remain green.
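To make the "response application" cases concrete, here is a minimal sketch of the behaviour the tests are described as covering: expiry given as epoch seconds or as an ISO 8601 string, and a failed response leaving the current token untouched. The function names, the `state` dict, and the `token` / `expires_at` field names are assumptions; the real refresh response format is not shown in this PR description.

```python
from datetime import datetime, timezone


def parse_expiry(value):
    """Return expiry as epoch seconds, or None if it cannot be parsed."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            # fromisoformat rejects a trailing "Z" before Python 3.11.
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return None
        if parsed.tzinfo is None:
            # Assumption: naive timestamps are treated as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.timestamp()
    return None


def apply_refresh_response(state, response):
    """Update state in place; return True only if a usable token was applied."""
    if not isinstance(response, dict):
        return False
    token = response.get("token")              # hypothetical field name
    expiry = parse_expiry(response.get("expires_at"))  # hypothetical field name
    if not token or expiry is None:
        return False  # leave the existing token and expiry unchanged
    state["token"] = token
    state["expires_at"] = expiry
    return True
```

Validating both fields before touching `state` is what keeps a failed or malformed refresh from clobbering a token that may still be valid.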

🤖 Generated with Claude Code

…expired JWT

The gateway MQTT access token is a short-lived (24h) JWT. When it expired,
_mqtt_loop reconnected forever with the same rejected token — the broker
returns CONNACK code 134 ("Bad user name or password") on every attempt — so
PredBat silently lost all gateway control until the pod/config was rebuilt.
The proactive near-expiry refresh in _check_token_refresh runs from the
housekeeping loop, which does not help once the listener is stuck reconnecting.

Make the reconnect loop self-healing: on a broker auth-failure it now forces a
token refresh before retrying, instead of looping on the dead token.

- _is_auth_failure(): classify CONNACK 134/135 auth rejections vs ordinary drops
- _maybe_refresh_on_auth_error(): force a refresh only for auth failures
- _do_token_refresh() / _apply_refresh_response(): extracted from
  _check_token_refresh so the proactive and reconnect paths share one refresh
- wired into _mqtt_loop's exception handler

Adds test_gateway_token_refresh.py (12 tests); existing gateway tests stay green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Contributor


Pull request overview

This PR updates the GatewayMQTT component to self-heal when the MQTT broker rejects authentication (e.g., expired 24h JWT) by forcing a token refresh inside the reconnect loop, preventing infinite reconnect attempts with a dead token.

Changes:

  • Refresh MQTT token on broker auth-failure during _mqtt_loop reconnect handling.
  • Refactor token refresh logic into shared helpers (_is_auth_failure, _maybe_refresh_on_auth_error, _apply_refresh_response, _do_token_refresh) used by both proactive refresh and reconnect paths.
  • Add a new test module for auth-failure classification and refresh response handling, plus cspell dictionary entries.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| apps/predbat/gateway.py | Adds auth-failure detection and forces MQTT JWT refresh before reconnecting; refactors refresh implementation into helper methods. |
| apps/predbat/tests/test_gateway_token_refresh.py | Adds regression tests for auth-failure classification and refresh response application (note: currently not wired into the repo’s default test runner). |
| .cspell/custom-dictionary-workspace.txt | Adds connack and emqx to suppress spellcheck noise in new/updated text. |

Comment thread apps/predbat/tests/test_gateway_token_refresh.py
Comment thread apps/predbat/gateway.py
@springfall2008 springfall2008 merged commit c546df2 into main Jun 16, 2026
1 check passed
@springfall2008 springfall2008 deleted the fix/gateway-mqtt-token-refresh-on-auth branch June 16, 2026 19:34
