XS⚠️ ◾ fix: liveness probe tcpSocket default to prevent Raft catch-up restart loops #6

Merged: dimoschi merged 2 commits into main from fix/liveness-probe-raft-catchup on Apr 21, 2026

Conversation

@dimoschi
Collaborator

$(cat <<'EOF'

Problem

Typesense returns 503 from /health when queued_writes exceeds --healthy-write-lag (default: 500). This threshold is always exceeded during Raft log catch-up after a pod restart or new node join.

With the previous default livenessProbe using httpGet /health, this caused a death spiral:

  1. Pod restarts and joins the cluster
  2. Raft catch-up causes queued_writes > 500 → /health returns 503
  3. Liveness probe kills the pod after failureThreshold failures
  4. Pod restarts and must catch up from scratch again
  5. Queue pressure from the churning node causes liveness failures on other pods too
  6. Multiple pods restart simultaneously → quorum lost

The previous default failureThreshold: 2 / periodSeconds: 10 made this worse — pods were killed after just 20 seconds of catch-up.

Additionally, users who tried to switch to tcpSocket via HelmRelease values overrides would hit a Kubernetes validation error ("may not specify more than 1 handler type") because Helm merges the override with the chart default, producing both httpGet and tcpSocket in the rendered probe.
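
To make the merge problem concrete, here is a rough sketch of the three stages. The port (8108, Typesense's default API port) and the exact shape of the old default are assumptions for illustration, not copied from the chart.

```yaml
# 1. Old chart default (values.yaml); field layout assumed for illustration
livenessProbe:
  httpGet:
    path: /health
    port: 8108
  failureThreshold: 2
  periodSeconds: 10
---
# 2. User override intended to replace the handler
livenessProbe:
  tcpSocket:
    port: 8108
---
# 3. Effective values after Helm deep-merges the two maps:
#    both handlers are present, so Kubernetes rejects the probe with
#    "may not specify more than 1 handler type"
livenessProbe:
  httpGet:
    path: /health
    port: 8108
  tcpSocket:
    port: 8108
  failureThreshold: 2
  periodSeconds: 10
```

The override adds a key instead of replacing one, because Helm merges nested maps key by key.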

Changes

  • Default livenessProbe switched to tcpSocket — checks only that the process is alive and listening, not queue depth
  • New typesense.probe helper renders only the handler selected by a .type field (httpGet or tcpSocket), so Helm's map merge can no longer produce multiple handlers when users override the type
  • startupProbe and readinessProbe remain on httpGet /health, which is the correct behaviour: a failing readiness probe takes the pod out of rotation while it catches up, without killing it
  • Default CPU request set to 2000m per Typesense minimum requirements
  • Sane probe thresholds: liveness failureThreshold: 6 / periodSeconds: 20, startup failureThreshold: 60 / periodSeconds: 10, readiness failureThreshold: 12 / periodSeconds: 10
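
As a rough sketch, the defaults listed above might look like this in values.yaml. The key layout (top-level probe keys, resources.requests.cpu) is an assumption; only the numbers and probe types come from the list. The comments work out the approximate failure windows, ignoring timeoutSeconds and any initial delay.

```yaml
# Sketch only: key layout assumed, values taken from the Changes list
resources:
  requests:
    cpu: 2000m
livenessProbe:
  type: tcpSocket
  failureThreshold: 6
  periodSeconds: 20     # 6 x 20s = ~120s of consecutive failures before a restart
startupProbe:
  type: httpGet
  failureThreshold: 60
  periodSeconds: 10     # 60 x 10s = ~600s allowed for startup
readinessProbe:
  type: httpGet
  failureThreshold: 12
  periodSeconds: 10     # 12 x 10s = ~120s of failures before the pod is marked unready
```

For comparison, the previous liveness default of failureThreshold: 2 / periodSeconds: 10 gave a window of only 2 x 10s = 20s.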

Switching probe type

Users can now switch probe type cleanly without merge conflicts:

# Switch liveness to httpGet
livenessProbe:
  type: httpGet

# Switch readiness to tcpSocket
readinessProbe:
  type: tcpSocket
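
For HelmRelease users, the same override would sit under spec.values. The sketch below assumes Flux's HelmRelease API; the release name, namespace, chart name, and repository name are hypothetical placeholders.

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: typesense              # hypothetical release name
  namespace: search            # hypothetical namespace
spec:
  interval: 10m
  chart:
    spec:
      chart: typesense         # hypothetical chart name
      sourceRef:
        kind: HelmRepository
        name: typesense-charts # hypothetical repository name
  values:
    # Only the type is set, so the merged values never carry two handler maps
    livenessProbe:
      type: httpGet
```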

Test plan

  • All 91 helm-unittest tests pass (helm unittest .)
  • Verify rendered StatefulSet has only tcpSocket in livenessProbe
  • Verify overriding livenessProbe.type: httpGet renders only httpGet
  • Deploy to a test cluster and confirm no liveness-triggered restarts during Raft catch-up
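
The two rendering checks above could be expressed as helm-unittest cases along these lines. This is a sketch, not one of the tests shipped in this PR: the template file name and the containers[0] index are assumptions about the chart layout.

```yaml
suite: liveness probe handler selection
templates:
  - statefulset.yaml   # assumed template file name
tests:
  - it: renders only tcpSocket by default
    asserts:
      - exists:
          path: spec.template.spec.containers[0].livenessProbe.tcpSocket
      - notExists:
          path: spec.template.spec.containers[0].livenessProbe.httpGet

  - it: renders only httpGet when the type is overridden
    set:
      livenessProbe.type: httpGet
    asserts:
      - exists:
          path: spec.template.spec.containers[0].livenessProbe.httpGet
      - notExists:
          path: spec.template.spec.containers[0].livenessProbe.tcpSocket
```

Asserting both the presence of the selected handler and the absence of the other catches a regression to the merged two-handler output.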

Closes #4
EOF
)

…t loops

Typesense returns 503 from /health when queued_writes exceeds
--healthy-write-lag (default 500). This threshold is always exceeded
during Raft log catch-up after a pod restart or new node join, causing
liveness probes to kill pods before they finish recovering and breaking
quorum in a cascading loop.

Changes:
- Default livenessProbe switched to tcpSocket (checks process alive only)
- Probe type is now controlled via a .type field (httpGet or tcpSocket)
  rendered by a new typesense.probe helper, avoiding Helm map-merge from
  producing both handlers simultaneously when users override the type
- startupProbe and readinessProbe remain on httpGet /health (correct)
- Default CPU request set to 2000m per Typesense minimum requirements
- Sane probe thresholds: liveness failureThreshold 6/20s, startup 60/10s,
  readiness 12/10s

Closes #4
@dimoschi dimoschi requested a review from a team as a code owner April 21, 2026 17:54
@github-actions

github-actions Bot commented Apr 21, 2026

PR Metrics

Thanks for keeping your pull request small.
⚠️ Consider adding additional tests.

Category        Lines
Product Code       69
Test Code           9
Subtotal           78
Ignored Code       44
Total             122

@github-actions github-actions Bot changed the title fix: liveness probe tcpSocket default to prevent Raft catch-up restart loops XS⚠️ ◾ fix: liveness probe tcpSocket default to prevent Raft catch-up restart loops Apr 21, 2026
@dimoschi dimoschi merged commit 970e6b6 into main Apr 21, 2026
3 checks passed
@dimoschi dimoschi deleted the fix/liveness-probe-raft-catchup branch April 21, 2026 17:55

Development

Successfully merging this pull request may close these issues.

Default liveness probe causes restart death spiral during Raft catch-up
