XS⚠️ ◾ fix: liveness probe tcpSocket default to prevent Raft catch-up restart loops #6
Merged
…t loops

Typesense returns 503 from /health when queued_writes exceeds --healthy-write-lag (default 500). This threshold is always exceeded during Raft log catch-up after a pod restart or new node join, causing liveness probes to kill pods before they finish recovering and breaking quorum in a cascading loop.

Changes:
- Default livenessProbe switched to tcpSocket (checks process alive only)
- Probe type is now controlled via a .type field (httpGet or tcpSocket) rendered by a new typesense.probe helper, preventing Helm map-merge from producing both handlers simultaneously when users override the type
- startupProbe and readinessProbe remain on httpGet /health (correct)
- Default CPU request set to 2000m per Typesense minimum requirements
- Sane probe thresholds: liveness failureThreshold 6/20s, startup 60/10s, readiness 12/10s

Closes #4
PR Metrics: ✔ Thanks for keeping your pull request small.
Problem
Typesense returns 503 from /health when queued_writes exceeds --healthy-write-lag (default: 500). This threshold is always exceeded during Raft log catch-up after a pod restart or new node join.

With the previous default livenessProbe using httpGet /health, this caused a death spiral:
1. queued_writes > 500 → /health returns 503
2. The liveness probe kills the pod after failureThreshold failures
3. The pod restarts before it finishes recovering and must catch up again, breaking quorum in a cascading loop

The previous default failureThreshold: 2 / periodSeconds: 10 made this worse: pods were killed after just 20 seconds of catch-up.

Additionally, users who tried to switch to tcpSocket via HelmRelease values overrides would hit a Kubernetes validation error ("may not specify more than 1 handler type") because Helm merges the override with the chart default, producing both httpGet and tcpSocket in the rendered probe.

Changes
- Default livenessProbe switched to tcpSocket: checks only that the process is alive and listening, not queue depth
- New typesense.probe helper renders only the handler selected by a .type field (httpGet or tcpSocket), preventing Helm map-merge from producing multiple handlers when users override the type
- startupProbe and readinessProbe remain on httpGet /health: correct behaviour, takes the pod out of rotation while catching up without killing it
- Default CPU request set to 2000m per Typesense minimum requirements
- Sane probe thresholds: liveness failureThreshold: 6 / periodSeconds: 20, startup failureThreshold: 60 / periodSeconds: 10, readiness failureThreshold: 12 / periodSeconds: 10

Switching probe type
Users can now switch probe type cleanly without merge conflicts:
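A minimal sketch of such an override, assuming the chart reads the handler from livenessProbe.type as described above. Only the type key and its two allowed values come from this PR. The nesting of handler settings under httpGet and the placement of the threshold fields are assumptions about the values layout, so check the chart's values.yaml before relying on them. 8108 is Typesense's default API port.

```yaml
# values override (sketch; every key other than `type` is an assumed layout)
livenessProbe:
  # Select exactly one handler. Per this PR, the typesense.probe helper
  # renders only the handler named here, so the chart's default tcpSocket
  # block is not merged into the rendered probe.
  type: httpGet            # allowed: httpGet or tcpSocket (tcpSocket is the default)
  httpGet:                 # assumed: handler settings nested under the handler name
    path: /health
    port: 8108             # Typesense default API port
  periodSeconds: 20        # new liveness default stated in this PR
  failureThreshold: 6      # 6 failures x 20s = 120s before a restart
```

With a HelmRelease, the same keys would sit under spec.values. Note that switching liveness back to httpGet /health reintroduces the restart risk during Raft catch-up, so a generous failureThreshold is advisable in that case.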
Test plan
- Unit tests pass (helm unittest .)
- Default values render tcpSocket in livenessProbe
- Setting livenessProbe.type: httpGet renders only httpGet

Closes #4