XS⚠️ ◾ fix: liveness probe tcpSocket default to prevent Raft catch-up restart loops by dimoschi · Pull Request #6 · hackthebox/typesense-helm

dimoschi · 2026-04-21T17:54:05Z

Problem

Typesense returns 503 from /health when queued_writes exceeds --healthy-write-lag (default: 500). This threshold is always exceeded during Raft log catch-up after a pod restart or new node join.

With the previous default livenessProbe using httpGet /health, this caused a death spiral:

1. Pod restarts and joins the cluster
2. Raft catch-up causes queued_writes > 500 → /health returns 503
3. Liveness probe kills the pod after failureThreshold consecutive failures
4. Pod restarts and must catch up from scratch again
5. Queue pressure from the churning node causes liveness failures on other pods too
6. Multiple pods restart simultaneously → quorum lost

The previous defaults (failureThreshold: 2, periodSeconds: 10) made this worse: a pod was killed after just 20 seconds of catch-up (2 failed checks × 10 s).
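
For reference, the failing configuration described above would look roughly like the sketch below. It is reconstructed from the values quoted in this description, not copied from the chart, and the port is an assumption (8108 is Typesense's default API port).

```yaml
# Sketch of the previous default liveness probe (reconstructed, not the chart's exact output)
livenessProbe:
  httpGet:
    path: /health        # returns 503 while queued_writes > --healthy-write-lag
    port: 8108           # assumption: Typesense default API port
  periodSeconds: 10
  failureThreshold: 2    # 2 failed checks x 10 s = restart after about 20 s
```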

Additionally, users who tried to switch to tcpSocket via HelmRelease values overrides would hit a Kubernetes validation error ("may not specify more than 1 handler type") because Helm merges the override with the chart default, producing both httpGet and tcpSocket in the rendered probe.
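
A minimal illustration of that merge, assuming the chart default defined httpGet directly under livenessProbe and the override supplied only tcpSocket. The three YAML documents below show the default, the override, and the effective values; the port numbers are illustrative.

```yaml
# 1) Chart default (values.yaml), illustrative
livenessProbe:
  httpGet:
    path: /health
    port: 8108           # assumption: Typesense default API port
---
# 2) User override, for example via HelmRelease values
livenessProbe:
  tcpSocket:
    port: 8108
---
# 3) Effective values after Helm merges the two maps key by key:
#    both handlers survive, and Kubernetes rejects the probe with
#    "may not specify more than 1 handler type"
livenessProbe:
  httpGet:
    path: /health
    port: 8108
  tcpSocket:
    port: 8108
```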

Changes

- Default livenessProbe switched to tcpSocket, which checks only that the process is alive and listening, not queue depth
- New typesense.probe helper renders only the handler selected by a .type field (httpGet or tcpSocket), so Helm's map merge can no longer produce multiple handlers when users override the type
- startupProbe and readinessProbe remain on httpGet /health. This is the intended behaviour: a failing readiness check takes the pod out of rotation while it catches up, without killing it
- Default CPU request set to 2000m per Typesense minimum requirements
- Sane probe thresholds: liveness failureThreshold: 6 / periodSeconds: 20, startup failureThreshold: 60 / periodSeconds: 10, readiness failureThreshold: 12 / periodSeconds: 10
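
Put together, the new defaults might look like the following values sketch. Only the type field, the thresholds, and the 2000m CPU request come from this description; the surrounding key layout (for example resources.requests.cpu) is an assumption about how the chart structures its values.

```yaml
# Sketch of the new defaults; key layout is assumed, numbers are from the PR description
livenessProbe:
  type: tcpSocket          # process is alive and listening; ignores queue depth
  periodSeconds: 20
  failureThreshold: 6      # 6 x 20 s = about 120 s of failures before a restart
startupProbe:
  type: httpGet            # checks /health
  periodSeconds: 10
  failureThreshold: 60     # 60 x 10 s = up to about 600 s to become healthy
readinessProbe:
  type: httpGet            # checks /health; pod leaves rotation while catching up
  periodSeconds: 10
  failureThreshold: 12     # 12 x 10 s = about 120 s of failures before unready
resources:
  requests:
    cpu: 2000m             # assumption: standard Kubernetes resources layout
```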

Switching probe type

Users can now switch probe type cleanly without merge conflicts:

# Switch liveness to httpGet
livenessProbe:
  type: httpGet

# Switch readiness to tcpSocket
readinessProbe:
  type: tcpSocket
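
For users deploying through a Flux HelmRelease, the same override would sit under spec.values. The fragment below is a sketch only: the release name is hypothetical, the apiVersion depends on the installed Flux version, and required fields such as spec.chart and spec.interval are omitted.

```yaml
# Fragment of a Flux HelmRelease (incomplete on purpose; see note above)
apiVersion: helm.toolkit.fluxcd.io/v2   # assumption: Flux HelmRelease v2 API
kind: HelmRelease
metadata:
  name: typesense                        # hypothetical release name
spec:
  values:
    livenessProbe:
      type: httpGet                      # only the type is overridden; no handler map to merge
```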

Test plan

All 91 helm-unittest tests pass (helm unittest .)
Verify rendered StatefulSet has only tcpSocket in livenessProbe
Verify overriding livenessProbe.type: httpGet renders only httpGet
Deploy to a test cluster and confirm no liveness-triggered restarts during Raft catch-up

Closes #4

…t loops

Typesense returns 503 from /health when queued_writes exceeds --healthy-write-lag (default 500). This threshold is always exceeded during Raft log catch-up after a pod restart or new node join, causing liveness probes to kill pods before they finish recovering and breaking quorum in a cascading loop.

Changes:
- Default livenessProbe switched to tcpSocket (checks process alive only)
- Probe type is now controlled via a .type field (httpGet or tcpSocket) rendered by a new typesense.probe helper, avoiding Helm map-merge from producing both handlers simultaneously when users override the type
- startupProbe and readinessProbe remain on httpGet /health (correct)
- Default CPU request set to 2000m per Typesense minimum requirements
- Sane probe thresholds: liveness failureThreshold 6/20s, startup 60/10s, readiness 12/10s

Closes #4

github-actions · 2026-04-21T17:54:14Z

PR Metrics

✔ Thanks for keeping your pull request small.
⚠️ Consider adding additional tests.

	Lines
Product Code	69
Test Code	9
Subtotal	78
Ignored Code	44
Total	122

Metrics computed by PR Metrics. Add it to your Azure DevOps and GitHub PRs!

dimoschi requested a review from a team as a code owner April 21, 2026 17:54

github-actions Bot changed the title ~~fix: liveness probe tcpSocket default to prevent Raft catch-up restart loops~~ XS⚠️ ◾ fix: liveness probe tcpSocket default to prevent Raft catch-up restart loops Apr 21, 2026

📝 update README and values

6d3382d

dimoschi merged commit 970e6b6 into main Apr 21, 2026
3 checks passed

dimoschi deleted the fix/liveness-probe-raft-catchup branch April 21, 2026 17:55

dimoschi merged 2 commits into main from fix/liveness-probe-raft-catchup
