fix(gateway): refresh MQTT token on broker auth-failure to self-heal expired JWT #4074
Merged
…expired JWT
The gateway MQTT access token is a short-lived (24h) JWT. When it expired,
_mqtt_loop reconnected forever with the same rejected token — the broker
returns CONNACK code 134 ("Bad user name or password") on every attempt — so
PredBat silently lost all gateway control until the pod/config was rebuilt.
The proactive near-expiry refresh in _check_token_refresh runs from the
housekeeping loop, which does not help once the listener is stuck reconnecting.
Make the reconnect loop self-healing: on a broker auth-failure it now forces a
token refresh before retrying, instead of looping on the dead token.
- _is_auth_failure(): classify CONNACK 134/135 auth rejections vs ordinary drops
- _maybe_refresh_on_auth_error(): force a refresh only for auth failures
- _do_token_refresh() / _apply_refresh_response(): extracted from
_check_token_refresh so the proactive and reconnect paths share one refresh
- wired into _mqtt_loop's exception handler
Adds test_gateway_token_refresh.py (12 tests); existing gateway tests stay green.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
This PR updates the GatewayMQTT component to self-heal when the MQTT broker rejects authentication (e.g., expired 24h JWT) by forcing a token refresh inside the reconnect loop, preventing infinite reconnect attempts with a dead token.
Changes:
- Refresh MQTT token on broker auth-failure during `_mqtt_loop` reconnect handling.
- Refactor token refresh logic into shared helpers (`_is_auth_failure`, `_maybe_refresh_on_auth_error`, `_apply_refresh_response`, `_do_token_refresh`) used by both proactive refresh and reconnect paths.
- Add a new test module for auth-failure classification and refresh response handling, plus cspell dictionary entries.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| apps/predbat/gateway.py | Adds auth-failure detection and forces MQTT JWT refresh before reconnecting; refactors refresh implementation into helper methods. |
| apps/predbat/tests/test_gateway_token_refresh.py | Adds regression tests for auth-failure classification and refresh response application (note: currently not wired into the repo’s default test runner). |
| .cspell/custom-dictionary-workspace.txt | Adds connack and emqx to suppress spellcheck noise in new/updated text. |
Problem
The gateway MQTT access token is a short-lived (24h) JWT. When it expires, `GatewayMQTT._mqtt_loop` reconnects forever with the same rejected token: the broker returns CONNACK reason code `134` ("Bad user name or password") on every attempt, so PredBat silently loses all gateway control until the AppDaemon pod (and its rendered config) is rebuilt.

The existing proactive refresh (`_check_token_refresh`) runs from the housekeeping `run()` loop, which does not rescue a listener already stuck in the reconnect loop. Observed in production across multiple gateway instances: every instance lost control ~24h after its last token render, with `predbat.status` reporting `Inverter 0 write to charge_rate/scheduled_discharge_enable failed` (a side effect: batpred could not publish anything).

Fix
Make the reconnect loop self-healing: when the broker rejects authentication, force a token refresh before reconnecting, instead of retrying the dead token.
- `_is_auth_failure(error)`: classify CONNACK `134`/`135` auth rejections vs ordinary drops (e.g. "Disconnected during message iteration", network errors), so we only refresh when a new token can actually help.
- `_maybe_refresh_on_auth_error(error)`: force a refresh only for auth failures.
- `_do_token_refresh()` / `_apply_refresh_response()`: extracted from `_check_token_refresh` so the proactive near-expiry path and the reconnect path share one refresh implementation.
- Wired into `_mqtt_loop`'s exception handler (before the reconnect backoff sleep).

No behaviour change on the happy path or for non-auth disconnects.
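
As a rough illustration of the classify-then-refresh idea above, the following is a minimal, synchronous sketch and not the code in `gateway.py`. The names `is_auth_failure`, `connect_with_self_heal` and `FakeBrokerError` are hypothetical, as is the assumption that the client error exposes an integer `rc` attribute or carries the reason string in its message; the real MQTT client may surface the CONNACK reason code differently.

```python
# Hypothetical sketch: classify broker auth rejections and refresh the token
# before the next reconnect attempt. Not the actual PredBat implementation.

# MQTT v5 CONNACK reason codes for rejected credentials:
# 134 (0x86) "Bad user name or password", 135 (0x87) "Not authorized".
AUTH_FAILURE_CODES = {134, 135}
AUTH_FAILURE_PHRASES = ("bad user name or password", "not authorized")


class FakeBrokerError(Exception):
    """Stand-in for a client library error (assumed shape: message + rc)."""

    def __init__(self, message, rc=None):
        super().__init__(message)
        self.rc = rc


def is_auth_failure(error):
    """Return True if the error looks like a broker auth rejection."""
    rc = getattr(error, "rc", None)
    if isinstance(rc, int):
        # Assumption: a numeric reason code is authoritative when present.
        return rc in AUTH_FAILURE_CODES
    # Otherwise fall back to matching the reason string in the message.
    text = str(error).lower()
    return any(phrase in text for phrase in AUTH_FAILURE_PHRASES)


def connect_with_self_heal(connect, refresh_token, token, max_attempts=3):
    """Try to connect; on an auth failure, fetch a new token before retrying.

    `connect(token)` raises on failure; `refresh_token()` returns a new token.
    Returns the token that finally connected.
    """
    for _ in range(max_attempts):
        try:
            connect(token)
            return token
        except Exception as error:  # broad on purpose for this sketch
            if is_auth_failure(error):
                # Only refresh when a new token can actually help.
                token = refresh_token()
            # A real loop would sleep for its reconnect backoff here.
    raise ConnectionError("gave up after %d attempts" % max_attempts)
```

Keeping the classifier separate from the refresh means ordinary drops retry with the current token and never trigger an unnecessary refresh call.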
Testing
New `apps/predbat/tests/test_gateway_token_refresh.py` (12 tests): auth-failure classification, response application (epoch + ISO expiry, failure leaves token unchanged), and that a refresh is forced only on auth failure. Existing gateway tests remain green.

🤖 Generated with Claude Code
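
To make the "response application" cases in the testing notes concrete, here is a small sketch of how a refresh response with either an epoch or an ISO 8601 expiry might be applied, leaving the stored token untouched on failure. The function names and the `token` / `expires_at` field names are assumptions for illustration only; the actual response shape used by the gateway is not shown in this PR text.

```python
# Hypothetical sketch of applying a token refresh response.
# Field names ("token", "expires_at") are assumed, not taken from gateway.py.
from datetime import datetime, timezone


def parse_expiry(value):
    """Return the expiry as epoch seconds, or None if it cannot be parsed."""
    if isinstance(value, bool):
        return None  # bool is an int subclass; never a valid expiry
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            # Map a trailing "Z" to an explicit UTC offset for older Pythons.
            parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
        except ValueError:
            return None
        if parsed.tzinfo is None:
            # Assumption: naive timestamps are treated as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.timestamp()
    return None


def apply_refresh_response(state, response):
    """Update `state` in place from `response`; return True on success.

    On any failure the existing token and expiry are left unchanged.
    """
    if not isinstance(response, dict):
        return False
    token = response.get("token")
    expires_at = parse_expiry(response.get("expires_at"))
    if not token or expires_at is None:
        return False
    state["token"] = token
    state["expires_at"] = expires_at
    return True
```

Validating both fields before mutating `state` is what guarantees that a failed refresh cannot overwrite a token that might still be usable.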