Skip to content

feat(charts): valkey-search Helm chart for self-host and cloud tenants#278

Open
KIvanow wants to merge 4 commits into
masterfrom
feat/valkey-search-chart
Open

feat(charts): valkey-search Helm chart for self-host and cloud tenants#278
KIvanow wants to merge 4 commits into
masterfrom
feat/valkey-search-chart

Conversation

@KIvanow

@KIvanow KIvanow commented Jun 24, 2026

Copy link
Copy Markdown
Member

Single chart for both the self-host (marketing) profile and cloud per-tenant provisioning. Ships Valkey with the search module (FT.*), an ACL-hardened named user loaded from an aclfile at boot (survives restarts), optional AOF persistence + maxmemory, and an optional Traefik IngressRouteTCP HostSNI route for public TLS exposure.

The container runs the bundle image via args (not command) so its entrypoint auto-loads the search module.

Summary

Changes

Checklist

  • Unit / integration tests added
  • Docs added / updated
  • Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
  • Competitive analysis done / discussed (internal)
  • Blog post about it discussed (internal)

Note

Medium Risk
New deploy surface for credentials, ACLs, and optional public TLS routing; misconfiguration (GitOps password rotation, external Secret ACLs) could break clients or expose Valkey incorrectly.

Overview
Adds a new charts/valkey-search Helm chart that deploys Valkey with the search module (valkey/valkey-bundle) as a single-replica StatefulSet, ClusterIP Service, and install notes.

Self-host defaults use chart-managed auth: a Secret with password and boot-loaded users.acl (default user off, optional -@dangerous hardening with +info for monitoring), stable passwords on upgrade via Helm lookup, and whitespace validation on passwords. Cloud tenants can point at auth.existingSecret and layer values-cloud-tenant.yaml for persistence, maxmemory, resources, and public exposure.

The workload starts valkey-server via args so the bundle entrypoint still loads the search module; readiness checks FT.SEARCH registration, not just PING. Optional Traefik IngressRouteTCP registers a per-tenant HostSNI route when exposure.public and exposure.sniRoute are enabled. Pod rollout annotations hash chart-rendered or external Secrets so credential/ACL changes trigger restarts.

Reviewed by Cursor Bugbot for commit 3877eda. Bugbot is set up for automated code reviews on this repo. Configure here.

Single chart for both the self-host (marketing) profile and cloud
per-tenant provisioning. Ships Valkey with the search module (FT.*),
an ACL-hardened named user loaded from an aclfile at boot (survives
restarts), optional AOF persistence + maxmemory, and an optional
Traefik IngressRouteTCP HostSNI route for public TLS exposure.

The container runs the bundle image via args (not command) so its
entrypoint auto-loads the search module.
@KIvanow KIvanow requested a review from jamby77 June 24, 2026 14:53
Comment thread charts/valkey-search/templates/statefulset.yaml
*/}}
{{- $rules := "~* &* +@all" }}
{{- if .Values.hardening.disableDangerousCommands }}
{{- $rules = "~* &* +@all -@dangerous" }}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-@dangerous blocks INFO. Verified on valkey-bundle 9.1: INFO is in @dangerous (alongside DEBUG/FLUSHALL), so +@all -@dangerous denies it. The BetterDB Monitor's metric collection — and common library health/version checks — then get NOPERM on every instance provisioned with the default hardening. CONFIG/SLOWLOG/CLIENT are not in @dangerous, so only INFO regresses; easy to grant back:

Suggested change
{{- $rules = "~* &* +@all -@dangerous" }}
{{- $rules = "~* &* +@all -@dangerous +info" }}

Re-grant any other @dangerous command a client needs (check with ACL CAT dangerous).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, fixed in b8be535 — ACL is now +@all -@dangerous +info.

{{- if .Values.hardening.disableDangerousCommands }}
{{- $rules = "~* &* +@all -@dangerous" }}
{{- end }}
{{- $acl := printf "user default off\nuser %s on >%s %s\n" $username $password $rules }}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bring-your-own auth.password with whitespace/special chars corrupts the aclfile → CrashLoop. ACL rules are space-delimited, so auth.password: "p@ss w0rd" renders user betterdb on >p@ss w0rd ~* &* +@allw0rd… becomes stray ACL tokens, the aclfile fails to load, and Valkey CrashLoopBackOffs. Generated randAlphaNum passwords are safe, but the documented BYO value isn't validated. A guard would catch it early, e.g. {{- if regexMatch "\\s" $password }}{{- fail "auth.password must not contain whitespace" }}{{- end }}.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b8be535: added a fail guard rejecting whitespace in auth.password at template time.

Comment on lines +14 to +20
{{- $existing := (lookup "v1" "Secret" .Release.Namespace $secretName) }}
{{- $password := .Values.auth.password }}
{{- if not $password }}
{{- if and $existing $existing.data (index $existing.data "password") }}
{{- $password = index $existing.data "password" | b64dec }}
{{- else }}
{{- $password = randAlphaNum 32 }}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lookup-based password reuse silently rotates the password under GitOps. The "never rotates out from under connected clients" guarantee depends on lookup reading the live Secret. Argo CD / Flux render with helm template (no cluster access), so lookup returns empty → randAlphaNum 32 mints a new password on every sync, rotating the Secret + ACL and breaking connected clients. Worth documenting that GitOps installs must set auth.existingSecret (or a fixed auth.password), since the reuse path can't work without a live cluster read.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented in b8be535 (values.yaml): GitOps installs must set existingSecret or a fixed auth.password.

size: 2Gi # ~2.5x maxmemory: holds RDB + AOF + rewrite headroom

valkey:
maxmemory: 768mb # ~75% of the 1Gi limit

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxmemory at ~75% of the pod memory limit risks OOMKill during AOF/RDB rewrite. With maxmemory: 768mb, limits.memory: 1Gi, and appendonly: true, a BGREWRITEAOF/RDB-save fork's copy-on-write pages plus AOF rewrite buffers under write load can push RSS past 1Gi → OOMKilled mid-rewrite. ~256Mi of headroom is tight for a forking persistence model; 50–60% of the limit is the usual guidance once persistence is on.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b8be535: dropped to 512mb (~50% of the limit).

Comment on lines +86 to +91
readinessProbe:
exec:
command:
- sh
- -c
- 'valkey-cli --user {{ .Values.auth.username }} --pass "$VALKEY_PASSWORD" ping'

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readiness only pings — it never verifies the search module actually loaded. If the module fails to load but valkey-server stays up (FT.* unknown, not a crash), the pod reports Ready while the chart's entire purpose is broken. A readiness probe that runs FT._LIST (or COMMAND DOCS FT.SEARCH) would catch a silent module-load failure. The :latest double-load case is already covered since it crashes, but a non-crashing miss isn't.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in b8be535: readiness now runs COMMAND INFO FT.SEARCH | grep -q to confirm the module loaded.

- statefulset: hash the live external Secret's data for checksum/secret so
  pods roll when an entitlement-owned existingSecret rotates
- secret: re-grant +info (INFO is in @dangerous) so Monitor metric
  collection keeps working under hardening
- secret: fail fast when a bring-your-own auth.password contains whitespace
  (it corrupts the space-delimited ACL file and CrashLoops the pod)
- statefulset: readiness now verifies the search module loaded
  (COMMAND INFO FT.SEARCH) instead of only pinging
- values-cloud-tenant: drop maxmemory to ~50% of the limit to avoid
  OOMKills during AOF/RDB rewrite forks
- values: document that GitOps installs must set existingSecret or a fixed
  password, since lookup-based reuse can't work without a live cluster read

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
{{- if and $existing $existing.data (index $existing.data "password") }}
{{- $password = index $existing.data "password" | b64dec }}
{{- else }}
{{- $password = randAlphaNum 32 }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitOps sync rotates password

High Severity

When auth.password is empty, the chart falls back to lookup and then randAlphaNum if no Secret exists. GitOps render paths (helm template, many Argo CD flows) cannot lookup, so every sync generates a new password, rewrites the Secret and ACL, and disconnects clients despite the upgrade-reuse intent.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b8be535. Configure here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented in b8be535 (values.yaml): GitOps installs must set existingSecret or a fixed auth.password since the lookup-reuse path needs a live cluster read.

Comment thread charts/valkey-search/templates/NOTES.txt
When auth.existingSecret is set the chart renders no users.acl, so the
hardening flag does not control runtime permissions. Show an external-ACL
note instead of misstating the tenant's actual permissions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 36b8592. Configure here.

Comment thread charts/valkey-search/templates/secret.yaml Outdated
… explicit

The whitespace guard only checked .Values.auth.password, so a password
loaded from an existing Secret (the upgrade lookup path) bypassed it and
could still corrupt the space-delimited ACL file. Guard the resolved
$password instead, covering explicit, looked-up, and generated values.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants