Fix TCP bridge hang when a client stalls or dies silently #49
Open
marklynch wants to merge 3 commits into
The client socket returned by accept() was blocking with no send timeout, and all bus-forwarding sends used a blocking send(). If a connected client stopped draining (network drop with no FIN/RST), the kernel send buffer filled and send() blocked the bridge task forever. That task is also the only UART reader, so the whole device froze with no panic, no logs and no recovery, matching the observed weeks-stable-then-dead crash.

Make the client socket non-blocking and give send_to_client a keep/drop policy: a full send keeps the client, a full send buffer (EAGAIN) drops the message but keeps the connection, and a partial write or hard error drops the client. Add TCP keepalive (idle 30s, interval 5s, count 3) so a silently dead peer is detected and the single client slot freed. Factor the client teardown into a shared helper.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018oMk4krXAUqhX5gWtvc1mf
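For readers without the diff open, the keep/drop policy and keepalive settings described in this commit could look roughly like the sketch below. It is not the code from this PR: `send_result_t`, `set_nonblocking`, `enable_keepalive` and `drop_client` are hypothetical names, and only `send_to_client` and the 30s/5s/3 keepalive values come from the commit message. It uses plain BSD socket calls, which ESP-IDF's lwIP exposes when TCP keepalive support is enabled.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MSG_NOSIGNAL
#define MSG_NOSIGNAL 0 /* flag missing on some platforms; fall back to none */
#endif

/* Hypothetical result type: what the caller should do after a send attempt. */
typedef enum {
    SEND_OK,           /* whole message written: keep the client */
    SEND_DROP_MESSAGE, /* send buffer full (EAGAIN): lose the message, keep the client */
    SEND_DROP_CLIENT   /* partial write or hard error: tear the client down */
} send_result_t;

/* Put an accepted socket into non-blocking mode. Returns 0 on success. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Enable TCP keepalive with the values from the commit message. */
int enable_keepalive(int fd)
{
    int on = 1, idle_s = 30, interval_s = 5, count = 3;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof(interval_s)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0) {
        return -1;
    }
    return 0;
}

/* One non-blocking send attempt, classified by the keep/drop policy. */
send_result_t send_to_client(int fd, const void *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);
    if (n == (ssize_t)len) {
        return SEND_OK;
    }
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return SEND_DROP_MESSAGE;
    }
    return SEND_DROP_CLIENT; /* short write, or any other errno */
}

/* Shared teardown: close the socket and mark the single client slot free. */
void drop_client(int *client_fd)
{
    if (*client_fd >= 0) {
        close(*client_fd);
        *client_fd = -1;
    }
}
```

Because `send_to_client` never blocks, the bridge task always gets back to reading the UART, whatever the client does. With idle 30s, interval 5s and count 3, a peer that vanished without FIN/RST should be detected after roughly 30 + 3 × 5 = 45 seconds, assuming the stack applies the options as configured.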
The /status JSON already exposed free/min-free heap, but a web readout only helps while the device is still responsive; it is useless once the device has locked up. Add a low-priority heap_monitor task that logs free heap, the minimum-free watermark, and the largest free block every 5 minutes, so a slow leak or growing fragmentation shows up as a trend in the console/TCP log history before it exhausts memory. Largest-free-block specifically surfaces fragmentation (e.g. from repeated MQTT discovery republishes), which total free heap alone can hide.

Also surface the heap figures on the home page: the System table now shows a Memory row (free / min-free) built from the memory fields already present in the /status response.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018oMk4krXAUqhX5gWtvc1mf
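A minimal sketch of such a monitor follows; it is not the task from this PR. The snapshot struct, the formatting helper, the log tag, the stack size and the task priority are all assumptions. The fragmentation percentage is an illustrative extra metric that the commit message does not mention. The ESP-IDF part is guarded by `ESP_PLATFORM` so that the helpers can also be compiled on a host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical snapshot of the three figures the monitor logs. */
typedef struct {
    size_t free_bytes;     /* current free heap */
    size_t min_free_bytes; /* low-water mark since boot */
    size_t largest_block;  /* largest contiguous free block */
} heap_snapshot_t;

/* Illustrative metric (not from the PR): share of free heap that is NOT
 * available as one contiguous block. 0 means no visible fragmentation. */
unsigned heap_fragmentation_pct(const heap_snapshot_t *s)
{
    assert(s != NULL);
    if (s->free_bytes == 0 || s->largest_block >= s->free_bytes) {
        return 0;
    }
    return (unsigned)(100u - (100u * s->largest_block) / s->free_bytes);
}

/* Format one log line; returns the snprintf() result. */
int heap_format_line(char *out, size_t cap, const heap_snapshot_t *s)
{
    assert(out != NULL && s != NULL);
    return snprintf(out, cap, "free=%zu min_free=%zu largest=%zu frag=%u%%",
                    s->free_bytes, s->min_free_bytes, s->largest_block,
                    heap_fragmentation_pct(s));
}

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
#include "esp_log.h"
#include "esp_system.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"

static void heap_monitor_task(void *arg)
{
    (void)arg;
    for (;;) {
        heap_snapshot_t s = {
            .free_bytes = esp_get_free_heap_size(),
            .min_free_bytes = esp_get_minimum_free_heap_size(),
            .largest_block = heap_caps_get_largest_free_block(MALLOC_CAP_DEFAULT),
        };
        char line[96];
        heap_format_line(line, sizeof(line), &s);
        ESP_LOGI("heap", "%s", line);
        vTaskDelay(pdMS_TO_TICKS(5 * 60 * 1000)); /* every 5 minutes */
    }
}

/* Assumed stack size and priority: just above idle, so it never competes
 * with the bridge task. */
void heap_monitor_start(void)
{
    xTaskCreate(heap_monitor_task, "heap_monitor", 3072, NULL,
                tskIDLE_PRIORITY + 1, NULL);
}
#endif
```

For example, a snapshot with 100000 bytes free but a largest block of 25000 bytes reports 75% fragmentation, even though the total free figure alone looks healthy.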
esp_mqtt_client_stop() blocks waiting for the MQTT task to acknowledge. Calling it from WIFI_EVENT_STA_DISCONNECTED runs it on the system event-loop task, so if the MQTT task was itself stuck on a dead socket during the same network outage, the stop would wedge the event loop and stall all further WiFi/IP event processing, a plausible silent-hang path under WiFi flapping.

Follow the documented esp-mqtt pattern instead: start the client once on first connectivity and let esp-mqtt manage reconnection internally across WiFi drops and restores. mqtt_client_start() is now idempotent so the GOT_IP handler can keep calling it on every reconnect as a no-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_018oMk4krXAUqhX5gWtvc1mf
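The idempotent start could be as small as a guard flag, as in the sketch below. This is an assumption about the shape of the change, not the PR's code: `s_started`, `s_client` and `backend_start` are hypothetical, and the non-ESP branch is a host stub that only counts calls so the guard can be exercised off-target.

```c
#include <stdbool.h>

#ifdef ESP_PLATFORM
#include "mqtt_client.h"
/* Assumed to be created elsewhere with esp_mqtt_client_init(). */
static esp_mqtt_client_handle_t s_client;
static bool backend_start(void)
{
    return esp_mqtt_client_start(s_client) == ESP_OK;
}
#else
/* Host stub: counts how often the underlying client would be started. */
static int s_backend_starts;
static bool backend_start(void)
{
    s_backend_starts++;
    return true;
}
#endif

static bool s_started; /* true once the client has been started successfully */

/* Safe to call on every GOT_IP event: only the first successful call starts
 * the client, later calls are no-ops. Returns true if the client is running. */
bool mqtt_client_start(void)
{
    if (s_started) {
        return true;
    }
    s_started = backend_start();
    return s_started;
}
```

A plain flag suffices only if every caller runs on the same task, which holds when the function is called solely from event handlers on the default event loop. A failed first start leaves the flag clear, so the next GOT_IP event retries.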