feat(cloud): self-serve managed Valkey + search instances #279
Lets a logged-in cloud user provision and tear down a managed Valkey instance (with the Search module) over a public TLS endpoint, straight from the Monitor UI.

- entitlement: ValkeyInstance Prisma model + migration; valkey-instance module (AdminGuard CRUD, cap-at-1 per workspace in the service layer, global name uniqueness since the name is the SNI host); provisioning service renders charts/valkey-search via `helm template` and applies it with the k8s SDK (KubernetesObjectApi), reads credentials from the k8s Secret (password never stored in Postgres)
- cloud-auth: /workspace/databases endpoints proxy the cloud session to the entitlement admin API
- web: Databases page (create / status polling / connection card / delete) plus API client, route, and sidebar entry
- infra: bundle helm + the chart into the entitlement image; RBAC for statefulsets, PVCs, and Traefik IngressRouteTCP; shared-infra manifests and runbook for the public exposure layer (Traefik NLB + SNI + cert-manager wildcard)
- IDOR: scope /workspace/databases credentials + delete to the caller's tenant; entitlement verifies instance.tenantId matches (404 otherwise)
- name uniqueness race: add a DB unique constraint on valkey_instances.name and surface P2002 as a clean conflict
- provision/delete race: only flip status to ready/error if the row is still 'provisioning' (updateMany guard) so a concurrent delete wins
- per-tenant cap race: do the count + create in a Serializable transaction
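The provision/delete guard above is a compare-and-set on the status column. A minimal in-memory sketch of that idea follows; the `InstanceTable` class, its `updateWhere` method, and `finishProvisioning` are hypothetical stand-ins for the real Prisma `updateMany` call, which is not shown in this PR text.

```typescript
// Hypothetical in-memory stand-in for the valkey_instances table.
type Status = "pending" | "provisioning" | "ready" | "error" | "deleting";
interface Row { id: string; status: Status }

class InstanceTable {
  private rows = new Map<string, Row>();

  insert(row: Row): void { this.rows.set(row.id, { ...row }); }
  get(id: string): Row | undefined { return this.rows.get(id); }

  // Same shape as a conditional UPDATE ... WHERE id = ? AND status = ?:
  // returns the number of matched rows and changes nothing when none match.
  updateWhere(id: string, expected: Status, next: Status): number {
    const row = this.rows.get(id);
    if (!row || row.status !== expected) return 0;
    row.status = next;
    return 1;
  }
}

// Flip to ready/error only while the row is still 'provisioning'.
// Returns false when a concurrent delete already moved or removed the row.
function finishProvisioning(table: InstanceTable, id: string, ok: boolean): boolean {
  return table.updateWhere(id, "provisioning", ok ? "ready" : "error") === 1;
}
```

Because the check and the write are one conditional update, a delete that lands first leaves the provisioner with a zero count instead of letting it overwrite 'deleting'.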
…race

If a concurrent delete moves the row out of 'provisioning' (or removes it) while provisioning is still applying manifests or waiting on the StatefulSet, the terminal conditional update finds no row and provisioning exited without tearing down the objects it just created, orphaning the StatefulSet/Service/Secret/IngressRouteTCP/PVCs. Extract the k8s teardown into teardownValkeyResources and call it from the provision lost-race branch as well as deprovision. deleteManifests tolerates not-found, so double teardown is idempotent.
The success-path race handler cleans up freshly created k8s objects when a concurrent delete moves the row out of 'provisioning', but the failure path only updated status and left objects behind in the same race. Apply the same teardownValkeyResources cleanup when the error-path conditional update finds no 'provisioning' row.
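The lost-race cleanup described in the two commits above can be sketched as follows. This is an illustration under assumptions, not the PR's code: `ClusterClient`, `teardownResources`, and `finishOrCleanUp` are hypothetical names, every object is keyed by a single name for simplicity, and a missing object is assumed to surface as an error carrying `code: 404`.

```typescript
// Hypothetical minimal cluster client; delete() rejects with { code: 404 } when the object is gone.
interface ClusterClient {
  delete(kind: string, name: string): Promise<void>;
}

const KINDS = ["IngressRouteTCP", "Service", "StatefulSet", "Secret", "PersistentVolumeClaim"];

// Deletes every object kind, treating not-found as success so a second call is a no-op.
async function teardownResources(client: ClusterClient, name: string): Promise<string[]> {
  const removed: string[] = [];
  for (const kind of KINDS) {
    try {
      await client.delete(kind, name);
      removed.push(kind);
    } catch (err) {
      if ((err as { code?: number }).code !== 404) throw err;
    }
  }
  return removed;
}

// matchedRows is the count returned by the guarded status update (success or error path).
// Zero means a concurrent delete won, so the objects just created must be removed.
async function finishOrCleanUp(
  matchedRows: number,
  client: ClusterClient,
  name: string,
): Promise<"status-updated" | "torn-down"> {
  if (matchedRows > 0) return "status-updated";
  await teardownResources(client, name);
  return "torn-down";
}
```

Calling the same helper from both the success and the failure branch keeps the two paths from drifting apart again.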
Two issues blocked a provisioned Valkey pod from running and being reachable:

- The tenant ResourceQuota (300m/320Mi req, 2 pods) had no room for the Valkey pod on top of the Monitor app, so the pod could not schedule and provisioning timed out. Add an includeValkey budget (450m/640Mi req, 1200m/2Gi lim, 3 pods) and apply it from the Valkey provision path so existing tenants are widened too.
- The tenant-isolation NetworkPolicy only allowed ingress from kube-system, so the shared Traefik proxy could not reach the pod on 6379. Add an ingress rule for the traefik namespace on 6379 and ensure the policy is brought up to spec during Valkey provisioning.
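To make the two fixes concrete, here is a small sketch of the quota budgets and the extra ingress rule. The numbers come from the commit message; the function name `tenantQuota`, the object shapes, and the use of the `kubernetes.io/metadata.name` namespace label as the selector are assumptions, since the real manifests are not shown here. The base budget's limits are not stated in the source, so they are left out.

```typescript
// Subset of ResourceQuota spec.hard keys used in this sketch.
interface QuotaHard {
  "requests.cpu": string;
  "requests.memory": string;
  "limits.cpu"?: string;
  "limits.memory"?: string;
  pods: string;
}

// Budget selection as described: the wider budget applies whenever Valkey is included.
function tenantQuota(includeValkey: boolean): QuotaHard {
  if (!includeValkey) {
    return { "requests.cpu": "300m", "requests.memory": "320Mi", pods: "2" };
  }
  return {
    "requests.cpu": "450m",
    "requests.memory": "640Mi",
    "limits.cpu": "1200m",
    "limits.memory": "2Gi",
    pods: "3",
  };
}

// NetworkPolicy ingress rule letting pods in the traefik namespace reach port 6379.
// Selecting by the automatic namespace-name label is an assumption of this sketch.
const traefikIngressRule = {
  from: [{ namespaceSelector: { matchLabels: { "kubernetes.io/metadata.name": "traefik" } } }],
  ports: [{ protocol: "TCP", port: 6379 }],
};
```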
… policy

ensureTenantNetworkPolicy did a PUT replace without a resourceVersion, which can be rejected with 409 against an existing policy. Read the current policy first and carry its resourceVersion into the replace; fall back to create on 404.
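The read-then-replace flow might look roughly like this. `PolicyApi` and `ensurePolicy` are hypothetical; they abstract over whatever Kubernetes client the service uses, and a missing policy is assumed to reject with `code: 404`.

```typescript
interface Policy {
  metadata: { name: string; resourceVersion?: string };
  spec: unknown;
}

// Hypothetical thin wrapper over the cluster API for one policy kind.
interface PolicyApi {
  read(name: string): Promise<Policy>; // rejects with { code: 404 } when missing
  replace(name: string, body: Policy): Promise<Policy>;
  create(body: Policy): Promise<Policy>;
}

async function ensurePolicy(api: PolicyApi, desired: Policy): Promise<"created" | "replaced"> {
  const name = desired.metadata.name;
  // Read first; a 404 means there is nothing to replace yet.
  const current = await api.read(name).catch((err: unknown) => {
    if ((err as { code?: number }).code !== 404) throw err;
    return undefined;
  });
  if (!current) {
    await api.create(desired);
    return "created";
  }
  // Carry the live resourceVersion so the replace targets the version just read.
  await api.replace(name, {
    ...desired,
    metadata: { ...desired.metadata, resourceVersion: current.metadata.resourceVersion },
  });
  return "replaced";
}
```

A write that lands between the read and the replace can still produce a 409; this sketch does not retry in that case.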
patchNamespacedResourceQuota defaults to the JSON Patch content type, which
expects an array of ops; the { spec } object was rejected with a 400
("cannot unmarshal object into []handlers.jsonPatchOp"). This path is always
taken when widening an existing tenant's quota during Valkey provisioning, so
provisioning failed for any tenant that already had a quota. Pass the
merge-patch content type via setHeaderOptions. Verified end to end on a kind
cluster.
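The 400 comes down to the body shape not matching the declared content type. The sketch below contrasts the two standard patch formats using the widened quota as payload; `bodyMatchesContentType` is a hypothetical helper written for illustration, not part of the Kubernetes client.

```typescript
const JSON_PATCH = "application/json-patch+json";   // RFC 6902: an array of operations
const MERGE_PATCH = "application/merge-patch+json"; // RFC 7386: a partial object

interface JsonPatchOp {
  op: "add" | "replace" | "remove";
  path: string;
  value?: unknown;
}

// Rough shape check: JSON Patch wants an array, merge patch wants a plain object.
function bodyMatchesContentType(contentType: string, body: unknown): boolean {
  if (contentType === JSON_PATCH) return Array.isArray(body);
  if (contentType === MERGE_PATCH) {
    return typeof body === "object" && body !== null && !Array.isArray(body);
  }
  return false;
}

// The { spec } object from the commit message, shown here with the widened request budget.
const mergeBody = {
  spec: { hard: { "requests.cpu": "450m", "requests.memory": "640Mi", pods: "3" } },
};

// The same change written as JSON Patch operations.
const jsonPatchBody: JsonPatchOp[] = [
  { op: "replace", path: "/spec/hard", value: mergeBody.spec.hard },
];
```

Sending `mergeBody` under the JSON Patch content type is the mismatch the commit describes; either body works once it is paired with its own content type.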
…riendly ACL

The instance name doubled as the global SNI host, forcing global name uniqueness and risking cross-tenant collisions on the shared valkey.app.betterdb.com wildcard. Derive the public host from an opaque hash of the instance id instead (globally unique by construction, with a DB-level host unique constraint as a guard), and scope name uniqueness to the tenant.

Replace the app user's blanket -@dangerous ACL with +@ALL minus an explicit deny-list of destructive commands: -@dangerous also strips the observability commands (INFO/CLIENT/SLOWLOG/LATENCY/CONFIG GET/MONITOR) that Monitor needs, and INFO runs during capability detection so the connection was rejected outright.

Cap instance names at 25 chars (StatefulSet/pod label limit) and bundle prisma.config.ts in the image so `prisma migrate deploy` can resolve the datasource URL in-cluster.
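A sketch of both ideas, under stated assumptions: the commit does not say which hash, how many characters are kept, or which commands are denied, so SHA-256, a 16-character label, and the example deny-list below are illustrative choices only. `publicHostFor` and `aclCommandRules` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Derive an opaque DNS label from the instance id (assumed: SHA-256, first 16 hex chars).
// A truncated hash can in principle collide, which is what the DB unique constraint guards.
function publicHostFor(instanceId: string, domain: string): string {
  const label = createHash("sha256").update(instanceId).digest("hex").slice(0, 16);
  return `${label}.${domain}`;
}

// Build the command part of an ACL rule: allow everything, then subtract a deny-list.
// The deny-list passed in is the caller's choice; the PR's actual list is not shown.
function aclCommandRules(denied: string[]): string {
  return ["+@all", ...denied.map((cmd) => `-${cmd.toLowerCase()}`)].join(" ");
}

// Example with an illustrative deny-list.
const exampleRules = aclCommandRules(["FLUSHALL", "FLUSHDB"]);
```

Denying named commands rather than the whole `@dangerous` category is what leaves INFO and the other observability commands available to Monitor.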
Move the standalone Databases page into a new "BetterDB Valkey instances" tab in the Add Connection modal. When an instance becomes ready, auto-add a direct Monitor connection to it (named after the instance), and remove that connection again when the instance is deleted so it doesn't linger. Send the TLS SNI servername (the host, unless it is a bare IP) so HostSNI-routed endpoints like the managed Valkey Traefik front end serve the right cert instead of returning a non-RESP response. Only set the DELETE Content-Type when there is a body so bodyless deletes aren't rejected.
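The SNI and DELETE header rules from this commit can be sketched as two small helpers. Both function names are hypothetical; `isIP` is the Node.js `node:net` helper, which returns 0 when the input is not an IP literal.

```typescript
import { isIP } from "node:net";

// Send the host as the TLS servername unless it is a bare IPv4/IPv6 literal;
// SNI carries host names, not IP addresses.
function sniServername(host: string): string | undefined {
  return isIP(host) === 0 ? host : undefined;
}

// Only declare a JSON Content-Type when the DELETE request actually has a body.
function deleteHeaders(body?: unknown): Record<string, string> {
  return body === undefined ? {} : { "Content-Type": "application/json" };
}
```

The first result would typically feed the `servername` option of a TLS connection, so a HostSNI-routed front end can pick the right backend and certificate.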
….com

Use the existing app.betterdb.com Route53 zone (pinned hosted zone id) for the cert-manager DNS-01 solver and issue the wildcard for *.valkey.app.betterdb.com, avoiding a separate delegated zone.
// Valkey instance provisioning config (chart bundled into the image)
this.valkeyChartPath = this.config.get<string>('VALKEY_CHART_PATH', '/app/charts/valkey-search');
this.valkeyDomain = this.config.get<string>('VALKEY_DOMAIN', 'valkey.betterdb.com');
Wrong default Valkey public domain
Low Severity
The create-instance input’s HTML pattern accepts two-letter names like ab, but the entitlement API enforces @MinLength(3). Users can pass browser validation yet get a failed create from the server.
Reviewed by Cursor Bugbot for commit b2b39d9. Configure here.
…t quota

Guard the pending->provisioning claim so a concurrent delete that moved the row to 'deleting' is not clobbered back to 'provisioning' (which would flip a deleted instance to 'ready'). Also keep the Valkey quota headroom when re-provisioning a tenant that already has a managed instance, so the quota is not shrunk below what the running Valkey pod needs.
The wildcard cert, ClusterIssuer, and DNS all target *.valkey.app.betterdb.com, but the code default was valkey.betterdb.com, so instances would only get a matching host when the env var was overridden. Align the default with the provisioned infra.
Move the managed-instance auto-link out of the Add Connection dialog into an always-mounted hook (useValkeyAutoLink in AppLayout). Previously the effect lived in ValkeyInstancesTab, which only mounts while the dialog is open, so an instance provisioned and then left alone never got mirrored into the connection list despite the copy promising it. The hook is gated to cloud admin/owner (the credentials endpoint is admin/owner-only and connections are tenant-scoped) and reconciles on an interval, which also retries a transiently failed link.
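One way to picture a single reconcile pass of such a hook is as a pure planning step that each interval tick would run. Everything here is hypothetical: the PR text does not show how a connection is tied to its instance, so the `managedInstanceId` field and the `planAutoLink` function are assumptions made for illustration.

```typescript
interface Instance { id: string; name: string; status: string }
// managedInstanceId is a hypothetical marker linking a connection to a managed instance.
interface Connection { id: string; managedInstanceId?: string }

interface AutoLinkPlan { toAdd: Instance[]; toRemove: Connection[] }

// Compare the instance list with existing connections and decide what to change.
function planAutoLink(instances: Instance[], connections: Connection[]): AutoLinkPlan {
  const linked = new Set<string>();
  for (const c of connections) {
    if (c.managedInstanceId !== undefined) linked.add(c.managedInstanceId);
  }
  const live = new Set(instances.map((i) => i.id));
  return {
    // Ready instances with no mirrored connection yet.
    toAdd: instances.filter((i) => i.status === "ready" && !linked.has(i.id)),
    // Managed connections whose instance no longer exists.
    toRemove: connections.filter(
      (c) => c.managedInstanceId !== undefined && !live.has(c.managedInstanceId),
    ),
  };
}
```

Because the plan is recomputed from current state on every tick, a link that failed once simply shows up in `toAdd` again on the next pass.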
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 8da7694. Configure here.
required
maxLength={VALKEY_NAME_MAX_LENGTH}
pattern="[a-z][a-z0-9-]*[a-z0-9]"
title="Lowercase letters, digits and hyphens; must start with a letter"
Form allows two character names
Low Severity
The Valkey name input’s HTML pattern accepts two-character values like ab, while the entitlement API requires MinLength(3) on name. Browser validation passes and create fails at the API with a validation error.
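The mismatch is easy to reproduce with plain regular expressions. The sketch assumes the browser matches the `pattern` attribute against the whole value, which is modelled here with explicit anchors; the second pattern is one possible fix, not necessarily the one the author will choose (adding a `minLength` attribute would be another).

```typescript
// The pattern from the form, anchored the way a pattern attribute is applied to the full value.
const currentPattern = /^(?:[a-z][a-z0-9-]*[a-z0-9])$/;

// Possible alternative: require at least one middle character, so the minimum length is 3.
const atLeastThreePattern = /^(?:[a-z][a-z0-9-]+[a-z0-9])$/;

const twoChars = "ab";
const acceptedToday = currentPattern.test(twoChars);         // passes browser validation
const acceptedWithFix = atLeastThreePattern.test(twoChars);  // rejected before reaching the API
```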


Summary
Lets a logged-in cloud user provision and tear down a managed Valkey instance (with the Search module), reachable over a public TLS endpoint, straight from the Monitor UI. Builds on the `charts/valkey-search` Helm chart by wiring it into the entitlement provisioner and exposing a user-facing flow.

Changes
- entitlement: `ValkeyInstance` Prisma model + migration; new `valkey-instance` module (AdminGuard CRUD, cap-at-1 per workspace enforced in the service layer, global name uniqueness since the name doubles as the SNI host). `provisioning.service.ts` renders `charts/valkey-search` via `helm template` and applies the manifests with the k8s SDK (`KubernetesObjectApi`, the only client that can create the Traefik `IngressRouteTCP` CRD). Credentials are read back from the k8s Secret; the password is never stored in Postgres.
- cloud-auth: `/workspace/databases` endpoints (list / create / credentials / delete) proxy the cloud session to the entitlement admin API, gated by `requireAdminOrOwner`.
- web: Databases page (create, status polling, connection card with `rediss://` URL + `valkey-cli` line, show/hide credentials, delete) plus API client, route, and sidebar entry.
- infra: the entitlement image bundles `helm` + the chart; RBAC gains `statefulsets`, `persistentvolumeclaims`, and Traefik `ingressroutetcps`; shared-infra manifests + runbook for the public exposure layer (Traefik NLB + SNI routing + cert-manager wildcard cert) under `proprietary/infra/k8s/valkey-public/`.

Checklist
- `roborev review --branch` or `/roborev-review-branch` in Claude Code (internal)

Note
High Risk
Introduces async K8s provisioning, public TLS exposure, credential retrieval from secrets, and ACL/network-policy changes—security- and infrastructure-critical paths with limited automated test coverage in the diff.
Overview
Self-serve managed Valkey lets cloud workspaces create, monitor, and delete a capped (one per tenant) Valkey instance with Search, exposed on a shared wildcard TLS endpoint.
Backend: Entitlement gains a `ValkeyInstance` model (migrations: per-tenant name uniqueness, globally unique opaque SNI `host`), a `valkey-instance` API module, and async provisioning in `provisioning.service.ts` that creates K8s secrets (password not in Postgres), runs `helm template` on `charts/valkey-search`, applies manifests (including Traefik `IngressRouteTCP`), widens tenant quotas/network policy for Traefik, and handles provision/delete races with guarded status updates and orphan cleanup. The entitlement image bundles Helm + the chart; RBAC adds StatefulSets, PVCs, and IngressRouteTCPs. cloud-auth proxies `/workspace/databases` to entitlement and fixes bodyless DELETE requests (no JSON `Content-Type`).

Monitor: New `databases` API client; cloud Add Connection tab to create/list/delete instances, poll status, and show `rediss://` / `valkey-cli` credentials; `useValkeyAutoLink` (in `AppLayout`) auto-registers TLS connections for ready instances (admin/owner). `UnifiedDatabaseAdapter` sets TLS `servername` to the hostname for SNI-routed endpoints.

Hardening: Valkey user ACLs switch from `-@dangerous` to an explicit deny list so Monitor observability commands still work (aligned in chart `secret.yaml` and provisioning). Infra runbook under `proprietary/infra/k8s/valkey-public/` documents shared Traefik NLB, cert-manager wildcard, and SNI routing.

Reviewed by Cursor Bugbot for commit 8da7694. Bugbot is set up for automated code reviews on this repo.