Make liveness and readiness probes independently configurable (#116)#161
Open
hocmemini wants to merge 1 commit into
Open
Make liveness and readiness probes independently configurable (#116)#161hocmemini wants to merge 1 commit into
hocmemini wants to merge 1 commit into
Conversation
…t#116) The chart hardcoded identical livenessProbe and readinessProbe definitions in templates/deployment.yaml, both pointing at GET / on the http port. This prevented operators from diverging readiness from liveness, which blocked graceful drain: any condition that flipped readiness to non-200 would also fail liveness and kill the pod, terminating in-flight WebRTC sessions. Expose both probes as fully structured blocks in values.yaml. Defaults render to the same endpoint as before (GET / on the http named port) with explicit threshold and timing fields. Every explicit field matches the Kubernetes implicit default that prior chart users were already relying on, except livenessProbe.initialDelaySeconds, which moves from 0 to 10. That shift is strictly more lenient (kubelet waits 10s after container start before its first liveness check) and cannot kill a pod earlier than the prior chart did. A one-time Deployment rollout will occur on helm upgrade because the rendered probe spec has additional explicit fields. Operators can now point readinessProbe at a drain-aware endpoint (built-in to livekit-server, a sidecar, or a custom path) without touching livenessProbe. Wraps both probes with {{- with }} so setting either value to null cleanly omits the field from the rendered Deployment. This is the Path B approach (operator configurability via the chart) rather than a Path A approach (server-side readiness state in livekit-server itself). Path B is non-invasive, unblocks operators today, and is forward compatible with a future Path A: when livekit-server gains a dedicated drain endpoint, operators just point readinessProbe at it. Add helm-unittest coverage in livekit-server/tests/ across two suites: probe_configuration_test.yaml exercises defaults, partial overrides, both probes nulled, mixed null, and the named-port contract under a custom livekit.port; upgrade_compatibility_test.yaml locks in the rendered before-vs-after delta for default values and asserts non-probe fields are unchanged. 13 tests total, all passing under "helm unittest livekit-server/". Closes livekit#116
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Make liveness and readiness probes independently configurable
Closes #116.
Summary
The chart previously hardcoded identical
livenessProbeandreadinessProbedefinitions in
templates/deployment.yaml, both pointing atGET /on thehttpport. This made it impossible for operators to diverge readiness fromliveness, which in turn blocked any kind of graceful drain: anything that
caused the readiness check to fail (intentionally, during shutdown) would also
fail liveness and kill the pod, terminating in-flight WebRTC sessions.
This PR exposes both probes as fully structured blocks in
values.yamlsooperators can override them independently. The defaults render to the same
endpoint as before (
GET /on thehttpnamed port), with explicit, sensiblethreshold/timing fields.
Changes
livekit-server/values.yaml: New top-levellivenessProbeandreadinessProbeblocks with sensible defaults. Comment block explains whythey should not share a definition and how to use the readiness override
for graceful shutdown.
livekit-server/templates/deployment.yaml: Hardcoded probe definitionsreplaced with
{{- with .Values.livenessProbe }} ... {{- toYaml . | nindent 12 }} {{- end }}(same for readinessProbe). Setting either value tonullcleanly omits that probe from the rendered Deployment.
server-sample.yaml: Added a commented-out readiness override snippet sothe override pattern is discoverable from the sample values file.
livekit-server/tests/probe_configuration_test.yaml(new): helm-unittestsuite covering five scenarios. Details in the Testing section below.
livekit-server/tests/upgrade_compatibility_test.yaml(new): helm-unittestsuite that locks in the rendered before-vs-after delta with default values.
Default behavior vs prior chart
Before: the chart relied on Kubernetes implicit probe defaults
(
failureThreshold: 3,periodSeconds: 10,timeoutSeconds: 1,initialDelaySeconds: 0) for both probes.After: the chart sets those values explicitly, with one intentional
behavioral delta:
livenessProbe.failureThresholdlivenessProbe.periodSecondslivenessProbe.timeoutSecondslivenessProbe.initialDelaySecondsreadinessProbe.failureThresholdreadinessProbe.periodSecondsreadinessProbe.timeoutSecondsreadinessProbe.initialDelaySecondsThe single behavioral delta is
livenessProbe.initialDelaySeconds: 0 -> 10.Replacing implicit Kubernetes defaults with explicit chart values means
picking opinionated numbers for every operator who does not override them,
so this choice deserves to be deliberate rather than incidental. 10 seconds
covers a realistic worst case for cold-start work: image pull on a node that
does not already have the layer cached, the livekit-server process startup
itself, configmap mount and environment population, and any redis or other
dependency handshake when configured. The value also lines up with the
initialDelaySeconds defaults used by production-grade Helm charts in the
broader community (Bitnami, the prometheus-community charts, and similar)
for HTTP-served workloads, so operators familiar with those will not be
surprised. The choice is strictly safer than the prior implicit 0: the
kubelet's first liveness probe now lands 10 seconds later, never sooner,
so no pod can be killed earlier than the prior chart would have killed it,
and the chart's existing 30-seconds-of-failures-before-kill margin
(
failureThreshold=3withperiodSeconds=10) is preserved after theinitial delay window. Operators who explicitly prefer the prior timing
can set
livenessProbe.initialDelaySeconds: 0in their values.Backwards compatibility
livenessProbe/readinessProbekeys did not exist before, so noexisting user could already have them set.
rendered probe spec has additional explicit fields (each of which matches
the Kubernetes implicit default the rollout was already using). The
upgrade-compatibility test suite documents this exactly.
Testing
Tests are written for the helm-unittest plugin
(
helm plugin install https://github.com/helm-unittest/helm-unittest).Scenarios covered
tests/probe_configuration_test.yaml(10 tests):target
GET /on thehttpnamed port; bothfailureThresholddefaultsare 3; the two probes are independently rendered, proven by their
different default
initialDelaySecondsvalues.readinessProbe.httpGet.path,readinessProbe.failureThreshold, andlivenessProbe.failureThresholdoverrides only those leaves; all otherdefaults on both probes survive.
nullvia a values file omits both fields from the rendered Deployment, with the
rest of the container spec (name, image, ports) intact.
default omits liveness only and vice versa.
livekit.port(1 test): overridinglivekit.portchanges the container's HTTP port number but probes still reference the
named port
http, documenting that operators do not need to touch probeport references when changing
livekit.port.tests/upgrade_compatibility_test.yaml(3 tests): asserts the exact set ofexplicit probe fields rendered after the change, asserts each rendered field
matches the Kubernetes implicit default except the documented liveness
initialDelaySecondsdelta, and asserts non-probe fields (image, ports,container name, hostNetwork, dnsPolicy, terminationGracePeriodSeconds) are
unchanged. The header comments include the full diff captured at commit time
and a copy-paste reproduction recipe.
Coverage gaps (out of scope for this PR)
No live-cluster install test of the new probe values. The repo's
existing
ct installmatrix does runhelm installon microk8s, whichwould catch a chart that fails to render or apply, but it does not assert
runtime probe behavior (e.g., that
/actually returns 200 fromlivekit-server). Runtime probe behavior is unchanged from prior chart.
No test for the graceful-shutdown drain endpoint behavior itself.
This PR exposes the configuration surface; the SIGTERM/drain endpoint
belongs in livekit-server. See "Path Considered But Not Taken" below.
No test for non-httpGet probe types (
tcpSocket,exec,grpc).toYamlis type-agnostic so an operator override of any ofthese should render correctly, but no assertion locks that in.
No test for
livenessProbe: {}(empty map) as distinct fromlivenessProbe: ~(null). Both should result in the field being omittedbecause Go templates treat both as falsy under
{{- with }}, but onlynull is asserted.
No test for override conflicts, e.g., setting
livenessProbe.tcpSocketwhile leaving the default
httpGetblock intact. Helm's value-mergecannot deep-replace a sibling key, so the rendered probe would contain
both
httpGetandtcpSocket, which Kubernetes rejects. Operatorsswitching probe types must replace the entire block. This footgun is
documented in the values.yaml comment but not test-asserted.
No test for
successThreshold, which K8s allows on readiness probesbut is not in the chart defaults.
No
startupProbesupport added. Operators who need one can add itvia a chart values addition in a follow-up; this PR is scoped to the
liveness/readiness pairing called out in Readiness probe must not be the same as the liveness one #116.
No CI step yet runs
helm unittestin this repo. The current CIworkflow runs
ct lintandct installonly. To wire these tests intoCI, a maintainer would add a step similar to:
inserted after the existing "Set up Helm" step (CI uses Helm v3.11.2;
Helm v4 users locally may need
--verify=falseon the plugin install).I have intentionally not modified the workflow file in this PR to keep
the chart-change diff focused; happy to add the CI step in this PR or a
follow-up if maintainers prefer.
No snapshot tests. helm-unittest supports
matchSnapshotfor full-doccomparison; I have not added any because the explicit-field assertions in
upgrade_compatibility_test.yamlare easier to read and maintain whenunrelated chart changes (image tag bump, label addition) inevitably arrive.
Running the tests
Test names are scenario-prefixed (
A1.,B1.,C1.,D1.,D2.,E1.)so a failure line points at the exact scenario it broke.
Local verification
helm templatewas verified to produce output identical to the previous chartfor default values, except for the probe-field additions documented in the
table above.
Path Considered But Not Taken
This PR implements operator-configurable probes (Path B). A more complete fix
(Path A) would require server-side support: a separate
/readyendpoint inlivekit-server itself that returns a non-200 response when the process has
received SIGTERM but is still draining active sessions. The current PR gives
operators the configuration surface to point readinessProbe at such an
endpoint when it exists upstream, without requiring server-side changes to
land first.
Path B was chosen because:
(for example via a
lifecycle.preStophook or a sidecar that drains stateand flips its own readiness endpoint to 503).
called out in Readiness probe must not be the same as the liveness one #116.
A follow-up issue in livekit/livekit tracking the server-side endpoint work
would be the right place to capture Path A.