feat(cloud): self-serve managed Valkey + search instances#279

Open
KIvanow wants to merge 13 commits into feat/valkey-search-chart from feat/valkey-instances

@KIvanow KIvanow commented Jun 25, 2026

Member

Summary

Lets a logged-in cloud user provision and tear down a managed Valkey instance (with the Search module), reachable over a public TLS endpoint, straight from the Monitor UI. Builds on the charts/valkey-search Helm chart by wiring it into the entitlement provisioner and exposing a user-facing flow.

Changes

  • entitlement: ValkeyInstance Prisma model + migration; new valkey-instance module (AdminGuard CRUD, cap-at-1 per workspace enforced in the service layer, global name uniqueness since the name doubles as the SNI host). provisioning.service.ts renders charts/valkey-search via helm template and applies the manifests with the k8s SDK (KubernetesObjectApi, the only client that can create the Traefik IngressRouteTCP CRD). Credentials are read back from the k8s Secret; the password is never stored in Postgres.
  • cloud-auth: /workspace/databases endpoints (list / create / credentials / delete) proxy the cloud session to the entitlement admin API, gated by requireAdminOrOwner.
  • web: Databases page (create form capped at 1, status polling, connection card with rediss:// URL + valkey-cli line, show/hide credentials, delete) plus API client, route, and sidebar entry.
  • infra: entitlement image bundles helm + the chart; RBAC gains statefulsets, persistentvolumeclaims, and Traefik ingressroutetcps; shared-infra manifests + runbook for the public exposure layer (Traefik NLB + SNI routing + cert-manager wildcard cert) under proprietary/infra/k8s/valkey-public/.
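
As a rough illustration of the render-then-apply flow in the entitlement bullet: `helm template` emits a multi-document YAML stream, which has to be split before each object is handed to the Kubernetes client. The sketch below covers only the splitting step. `splitManifests` is a hypothetical helper, not a function from this PR, and real code would more likely use a YAML parser's multi-document loader than a regular expression.

```typescript
// Hypothetical helper: split a rendered multi-document YAML stream into
// individual documents. This is a simplification: it does not parse YAML,
// it only cuts on lines that consist of the `---` document separator.
function splitManifests(rendered: string): string[] {
  return rendered
    .split(/^---[ \t]*$/m)
    .map((doc) => doc.trim())
    // Drop empty documents and documents that contain only comments.
    .filter((doc) =>
      doc.split("\n").some((line) => {
        const t = line.trim();
        return t.length > 0 && !t.startsWith("#");
      }),
    );
}
```

Each returned string would then be parsed and applied on its own, which is where a generic object client is needed for CRDs such as IngressRouteTCP.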

Checklist

  • Unit / integration tests added
  • Docs added / updated
  • Roborev review passed — run roborev review --branch or /roborev-review-branch in Claude Code (internal)
  • Competitive analysis done / discussed (internal)
  • Blog post about it discussed (internal)

Note

High Risk
Introduces async K8s provisioning, public TLS exposure, credential retrieval from secrets, and ACL/network-policy changes—security- and infrastructure-critical paths with limited automated test coverage in the diff.

Overview
Self-serve managed Valkey lets cloud workspaces create, monitor, and delete a capped (one per tenant) Valkey instance with Search, exposed on a shared wildcard TLS endpoint.

Backend: Entitlement gains a ValkeyInstance model (migrations: per-tenant name uniqueness, globally unique opaque SNI host), a valkey-instance API module, and async provisioning in provisioning.service.ts that creates K8s secrets (password not in Postgres), runs helm template on charts/valkey-search, applies manifests (including Traefik IngressRouteTCP), widens tenant quotas/network policy for Traefik, and handles provision/delete races with guarded status updates and orphan cleanup. The entitlement image bundles Helm + the chart; RBAC adds StatefulSets, PVCs, and IngressRouteTCPs. cloud-auth proxies /workspace/databases to entitlement and fixes bodyless DELETE requests (no JSON Content-Type).

Monitor: New databases API client; cloud Add Connection tab to create/list/delete instances, poll status, and show rediss:// / valkey-cli credentials; useValkeyAutoLink (in AppLayout) auto-registers TLS connections for ready instances (admin/owner). UnifiedDatabaseAdapter sets TLS servername to the hostname for SNI-routed endpoints.

Hardening: Valkey user ACLs switch from -@dangerous to an explicit deny list so Monitor observability commands still work (aligned in chart secret.yaml and provisioning). Infra runbook under proprietary/infra/k8s/valkey-public/ documents shared Traefik NLB, cert-manager wildcard, and SNI routing.

Reviewed by Cursor Bugbot for commit 8da7694.

Lets a logged-in cloud user provision and tear down a managed Valkey
instance (with the Search module) over a public TLS endpoint, straight
from the Monitor UI.

- entitlement: ValkeyInstance Prisma model + migration; valkey-instance
  module (AdminGuard CRUD, cap-at-1 per workspace in the service layer,
  global name uniqueness since the name is the SNI host); provisioning
  service renders charts/valkey-search via `helm template` and applies it
  with the k8s SDK (KubernetesObjectApi), reads credentials from the k8s
  Secret (password never stored in Postgres)
- cloud-auth: /workspace/databases endpoints proxy the cloud session to
  the entitlement admin API
- web: Databases page (create / status polling / connection card / delete)
  plus API client, route, and sidebar entry
- infra: bundle helm + the chart into the entitlement image; RBAC for
  statefulsets, PVCs, and Traefik IngressRouteTCP; shared-infra manifests
  and runbook for the public exposure layer (Traefik NLB + SNI +
  cert-manager wildcard)
Comment thread proprietary/cloud-auth/workspace/workspace.controller.ts Outdated
Comment thread proprietary/entitlement/src/valkey-instance/valkey-instance.service.ts Outdated
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts
Comment thread proprietary/entitlement/src/valkey-instance/valkey-instance.service.ts Outdated
- IDOR: scope /workspace/databases credentials + delete to the caller's
  tenant; entitlement verifies instance.tenantId matches (404 otherwise)
- name uniqueness race: add a DB unique constraint on valkey_instances.name
  and surface P2002 as a clean conflict
- provision/delete race: only flip status to ready/error if the row is
  still 'provisioning' (updateMany guard) so a concurrent delete wins
- per-tenant cap race: do the count + create in a Serializable transaction
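
The provision/delete guard in the list above can be sketched as a conditional update. This is a minimal model, not the service code: the in-memory `Map` stands in for the `valkey_instances` table, and `finishProvisioning` is a hypothetical name. With Prisma, the same idea would be an `updateMany` whose `where` clause includes the expected status, checked via the returned `count`.

```typescript
type InstanceStatus = "pending" | "provisioning" | "ready" | "error" | "deleting";

interface InstanceRow {
  id: string;
  tenantId: string;
  status: InstanceStatus;
}

// Stand-in for UPDATE ... WHERE id = ? AND status = 'provisioning'.
// Returns the number of rows changed, mirroring the count of an updateMany.
function finishProvisioning(
  rows: Map<string, InstanceRow>,
  id: string,
  outcome: "ready" | "error",
): number {
  const row = rows.get(id);
  if (!row || row.status !== "provisioning") {
    return 0; // a concurrent delete (or removal) won; leave the row alone
  }
  row.status = outcome;
  return 1;
}
```

A return value of 0 is the signal that the caller lost the race and must not treat the instance as live.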
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts
…race

If a concurrent delete moves the row out of 'provisioning' (or removes
it) while provisioning is still applying manifests or waiting on the
StatefulSet, the terminal conditional update finds no row. Previously provisioning
exited at that point without tearing down the objects it had just created, orphaning the
StatefulSet/Service/Secret/IngressRouteTCP/PVCs.

Extract the k8s teardown into teardownValkeyResources and call it from the
provision lost-race branch as well as deprovision. deleteManifests tolerates
not-found, so double teardown is idempotent.
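
The idempotency claim can be illustrated with a small sketch. It assumes the delete call throws an error carrying the HTTP status in a `code` field, which is an assumption about the client, and it is written synchronously for brevity even though a real Kubernetes client is asynchronous. The function body is illustrative and is not the PR's `deleteManifests`.

```typescript
interface ObjectRef {
  kind: string;
  name: string;
}

// Assumed error shape: the HTTP status is exposed as `code` on the error.
interface ApiError {
  code?: number;
}

// Delete every object, treating "already gone" (404) as success so that a
// second teardown of the same resources is a no-op.
function deleteManifests(del: (ref: ObjectRef) => void, refs: ObjectRef[]): number {
  let deleted = 0;
  for (const ref of refs) {
    try {
      del(ref);
      deleted += 1;
    } catch (err) {
      // Anything other than not-found is a real failure and still surfaces.
      if ((err as ApiError).code !== 404) throw err;
    }
  }
  return deleted;
}
```

Because not-found is swallowed, calling teardown from both the lost-race branch and deprovision is safe regardless of which runs first.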
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts
The success-path race handler cleans up freshly created k8s objects when a
concurrent delete moves the row out of 'provisioning', but the failure path
only updated status and left objects behind in the same race. Apply the same
teardownValkeyResources cleanup when the error-path conditional update finds
no 'provisioning' row.
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts
Two issues blocked a provisioned Valkey pod from running and being reachable:

- The tenant ResourceQuota (300m/320Mi req, 2 pods) had no room for the
  Valkey pod on top of the Monitor app, so the pod could not schedule and
  provisioning timed out. Add an includeValkey budget (450m/640Mi req,
  1200m/2Gi lim, 3 pods) and apply it from the Valkey provision path so
  existing tenants are widened too.
- The tenant-isolation NetworkPolicy only allowed ingress from kube-system,
  so the shared Traefik proxy could not reach the pod on 6379. Add an ingress
  rule for the traefik namespace on 6379 and ensure the policy is brought up
  to spec during Valkey provisioning.
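
For illustration, the added ingress rule might look like the object below, following the `networking.k8s.io/v1` NetworkPolicy schema. The namespace selector is an assumption: it relies on the `kubernetes.io/metadata.name` label that Kubernetes sets on namespaces, and the PR may select the Traefik namespace differently. `ensureRule` is a hypothetical helper showing how the rule could be added without duplicating it on repeated provisioning.

```typescript
interface IngressRule {
  from: Array<{ namespaceSelector: { matchLabels: Record<string, string> } }>;
  ports?: Array<{ protocol: "TCP" | "UDP"; port: number }>;
}

// Allow the shared Traefik proxy to reach the Valkey pod on 6379.
const traefikToValkeyIngress: IngressRule = {
  from: [
    { namespaceSelector: { matchLabels: { "kubernetes.io/metadata.name": "traefik" } } },
  ],
  ports: [{ protocol: "TCP", port: 6379 }],
};

// Hypothetical helper: append the rule only when an identical one is missing,
// so bringing an existing policy "up to spec" is repeatable.
function ensureRule(existing: IngressRule[], rule: IngressRule): IngressRule[] {
  const wanted = JSON.stringify(rule);
  return existing.some((r) => JSON.stringify(r) === wanted) ? existing : [...existing, rule];
}
```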
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts
… policy

ensureTenantNetworkPolicy did a PUT replace without a resourceVersion, which
can be rejected with 409 against an existing policy. Read the current policy
first and carry its resourceVersion into the replace; fall back to create on
404.
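
The read-then-replace logic can be reduced to a small decision step, sketched here under the assumption that the caller has already performed the read and mapped a 404 to `undefined`. `planPolicyWrite` and the `PolicyObject` shape are hypothetical and only model the fields that matter for the conflict.

```typescript
interface PolicyObject {
  metadata: { name: string; namespace: string; resourceVersion?: string };
  spec: unknown;
}

type PolicyWritePlan =
  | { action: "create"; body: PolicyObject }
  | { action: "replace"; body: PolicyObject };

// `current` is the result of the initial read, or undefined when that read
// returned 404. The replace body carries the live object's resourceVersion.
function planPolicyWrite(
  current: PolicyObject | undefined,
  desired: PolicyObject,
): PolicyWritePlan {
  if (!current) {
    return { action: "create", body: desired };
  }
  return {
    action: "replace",
    body: {
      ...desired,
      metadata: { ...desired.metadata, resourceVersion: current.metadata.resourceVersion },
    },
  };
}
```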
@KIvanow KIvanow requested a review from jamby77 June 25, 2026 09:09
patchNamespacedResourceQuota defaults to the JSON Patch content type, which
expects an array of ops; the { spec } object was rejected with a 400
("cannot unmarshal object into []handlers.jsonPatchOp"). This path is always
taken when widening an existing tenant's quota during Valkey provisioning, so
provisioning failed for any tenant that already had a quota. Pass the
merge-patch content type via setHeaderOptions. Verified end to end on a kind
cluster.
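
To make the 400 concrete, the sketch below contrasts the two body shapes. The media types are the standard ones for JSON Patch (RFC 6902) and JSON Merge Patch (RFC 7396); the quota figures are the includeValkey budget quoted earlier in this PR. `contentTypeFor` is a hypothetical helper for illustration only and does not reflect how the client library picks the header.

```typescript
const JSON_PATCH = "application/json-patch+json"; // body: array of operations
const MERGE_PATCH = "application/merge-patch+json"; // body: partial object

const valkeyQuotaHard = {
  "requests.cpu": "450m",
  "requests.memory": "640Mi",
  "limits.cpu": "1200m",
  "limits.memory": "2Gi",
  pods: "3",
};

// What the provisioning code sent: a partial object, valid only as a merge patch.
const mergePatchBody = { spec: { hard: valkeyQuotaHard } };

// What the JSON Patch content type expects instead: a list of operations.
const jsonPatchBody = [{ op: "replace", path: "/spec/hard", value: valkeyQuotaHard }];

// Hypothetical helper: pick the content type that matches the body shape.
function contentTypeFor(body: unknown): string {
  return Array.isArray(body) ? JSON_PATCH : MERGE_PATCH;
}
```

Sending `mergePatchBody` under the JSON Patch content type is the mismatch that produced the unmarshal error quoted above.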
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts Outdated
KIvanow added 3 commits June 25, 2026 21:33
…riendly ACL

The instance name doubled as the global SNI host, forcing global name
uniqueness and risking cross-tenant collisions on the shared
valkey.app.betterdb.com wildcard. Derive the public host from an opaque
hash of the instance id instead (globally unique by construction, with a
DB-level host unique constraint as a guard), and scope name uniqueness to
the tenant.

Replace the app user's blanket -@dangerous ACL with +@ALL minus an
explicit deny-list of destructive commands: -@dangerous also strips the
observability commands (INFO/CLIENT/SLOWLOG/LATENCY/CONFIG GET/MONITOR)
that Monitor needs, and INFO runs during capability detection so the
connection was rejected outright.

Cap instance names at 25 chars (StatefulSet/pod label limit) and bundle
prisma.config.ts in the image so `prisma migrate deploy` can resolve the
datasource URL in-cluster.
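
The ACL change can be sketched as a rule list for `ACL SETUSER`. The rule tokens (`on`, `>password`, `~*`, `&*`, `+@all`, `-command`) are standard ACL syntax, but everything else here is assumed: the PR does not enumerate its deny list or its key and channel patterns, so `EXAMPLE_DENIED` and `buildAclRules` are purely illustrative.

```typescript
// Illustrative deny list only; the commands actually blocked by the PR are
// not listed in this conversation.
const EXAMPLE_DENIED = ["flushall", "flushdb", "shutdown", "debug"];

// Build the rule tokens for a user that keeps observability commands
// (INFO, CLIENT, SLOWLOG, ...) but loses the explicitly denied ones.
function buildAclRules(password: string, denied: string[]): string[] {
  return [
    "on", // enable the user
    `>${password}`, // set the password
    "~*", // all key patterns (assumption)
    "&*", // all pub/sub channels (assumption)
    "+@all", // start from every command...
    ...denied.map((cmd) => `-${cmd.toLowerCase()}`), // ...then remove specific ones
  ];
}
```

Order matters because ACL rules are applied left to right: the removals only take effect when they follow `+@all`.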
Move the standalone Databases page into a new "BetterDB Valkey instances"
tab in the Add Connection modal. When an instance becomes ready, auto-add
a direct Monitor connection to it (named after the instance), and remove
that connection again when the instance is deleted so it doesn't linger.

Send the TLS SNI servername (the host, unless it is a bare IP) so
HostSNI-routed endpoints like the managed Valkey Traefik front end serve
the right cert instead of returning a non-RESP response. Only set the
DELETE Content-Type when there is a body so bodyless deletes aren't
rejected.
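
Both client-side fixes are small enough to sketch. The IP check below is deliberately simplified for illustration (Node's `net.isIP` would be the more robust choice), and `tlsOptionsFor` and `requestHeaders` are hypothetical names rather than functions from the PR.

```typescript
// Simplified check: dotted-quad IPv4, or anything containing ':' as IPv6.
function isBareIp(host: string): boolean {
  const looksLikeIpv4 = /^\d{1,3}(\.\d{1,3}){3}$/.test(host);
  const looksLikeIpv6 = host.includes(":");
  return looksLikeIpv4 || looksLikeIpv6;
}

// SNI carries a hostname, so servername is only set for non-IP hosts.
function tlsOptionsFor(host: string): { servername?: string } {
  return isBareIp(host) ? {} : { servername: host };
}

// Only declare a JSON Content-Type when there is actually a body to send.
function requestHeaders(body?: unknown): Record<string, string> {
  return body === undefined ? {} : { "Content-Type": "application/json" };
}
```

With `servername` set, a HostSNI-based router can match the connection to the right backend and present the matching certificate.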
….com

Use the existing app.betterdb.com Route53 zone (pinned hosted zone id) for
the cert-manager DNS-01 solver and issue the wildcard for
*.valkey.app.betterdb.com, avoiding a separate delegated zone.

// Valkey instance provisioning config (chart bundled into the image)
this.valkeyChartPath = this.config.get<string>('VALKEY_CHART_PATH', '/app/charts/valkey-search');
this.valkeyDomain = this.config.get<string>('VALKEY_DOMAIN', 'valkey.betterdb.com');

Wrong default Valkey public domain

Low Severity

The create-instance input’s HTML pattern accepts two-letter names like ab, but the entitlement API enforces @MinLength(3). Users can pass browser validation yet get a failed create from the server.

Reviewed by Cursor Bugbot for commit b2b39d9.

…t quota

Guard the pending->provisioning claim so a concurrent delete that moved the
row to 'deleting' is not clobbered back to 'provisioning' (which would flip a
deleted instance to 'ready'). Also keep the Valkey quota headroom when
re-provisioning a tenant that already has a managed instance, so the quota is
not shrunk below what the running Valkey pod needs.
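
The quota half of this fix amounts to choosing the wider budget whenever a Valkey pod is, or is about to be, running in the tenant. The sketch below uses the request and pod figures quoted earlier in this PR; limits are left out because the base limits are not stated. `quotaFor` and its option names are hypothetical.

```typescript
interface QuotaBudget {
  requestsCpu: string;
  requestsMemory: string;
  pods: number;
}

const BASE_BUDGET: QuotaBudget = { requestsCpu: "300m", requestsMemory: "320Mi", pods: 2 };
const VALKEY_BUDGET: QuotaBudget = { requestsCpu: "450m", requestsMemory: "640Mi", pods: 3 };

// Keep the Valkey headroom both when provisioning Valkey and when
// re-provisioning a tenant that already owns a managed instance.
function quotaFor(opts: { provisioningValkey: boolean; hasManagedInstance: boolean }): QuotaBudget {
  return opts.provisioningValkey || opts.hasManagedInstance ? VALKEY_BUDGET : BASE_BUDGET;
}
```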
Comment thread proprietary/entitlement/src/provisioning/provisioning.service.ts Outdated
The wildcard cert, ClusterIssuer, and DNS all target *.valkey.app.betterdb.com,
but the code default was valkey.betterdb.com, so instances would only get a
matching host when the env var was overridden. Align the default with the
provisioned infra.
Comment thread apps/web/src/components/ConnectionSelector.tsx Outdated
Comment thread apps/web/src/components/ConnectionSelector.tsx Outdated
Comment thread apps/web/src/components/ConnectionSelector.tsx Outdated
Move the managed-instance auto-link out of the Add Connection dialog into an
always-mounted hook (useValkeyAutoLink in AppLayout). Previously the effect
lived in ValkeyInstancesTab, which only mounts while the dialog is open, so an
instance provisioned and then left alone never got mirrored into the connection
list despite the copy promising it. The hook is gated to cloud admin/owner (the
credentials endpoint is admin/owner-only and connections are tenant-scoped) and
reconciles on an interval, which also retries a transiently failed link.
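
The reconcile step can be modelled as a pure diff between managed instances and existing connections. This is a sketch under assumed shapes: `ManagedInstance`, `LinkedConnection`, and `reconcileLinks` are hypothetical, and the real hook also has to fetch credentials and call the connection API.

```typescript
interface ManagedInstance {
  id: string;
  name: string;
  status: string;
}

interface LinkedConnection {
  instanceId: string;
}

// Compute which ready instances still need a connection and which
// connections point at instances that no longer exist.
function reconcileLinks(
  instances: ManagedInstance[],
  linked: LinkedConnection[],
): { toLink: string[]; toUnlink: string[] } {
  const linkedIds = new Set(linked.map((c) => c.instanceId));
  const liveIds = new Set(instances.map((i) => i.id));
  return {
    toLink: instances
      .filter((i) => i.status === "ready" && !linkedIds.has(i.id))
      .map((i) => i.id),
    toUnlink: linked.filter((c) => !liveIds.has(c.instanceId)).map((c) => c.instanceId),
  };
}
```

Because the diff is recomputed on every tick, a link that failed once simply reappears in `toLink` on the next pass, which is what gives the retry behaviour described above.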

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 8da7694.

required
maxLength={VALKEY_NAME_MAX_LENGTH}
pattern="[a-z][a-z0-9-]*[a-z0-9]"
title="Lowercase letters, digits and hyphens; must start with a letter"

Form allows two character names

Low Severity

The Valkey name input’s HTML pattern accepts two-character values like ab, while the entitlement API requires MinLength(3) on name. Browser validation passes and create fails at the API with a validation error.
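
The mismatch is easy to reproduce outside the browser. An HTML `pattern` attribute must match the whole value, so the sketch anchors it with `^(?:...)$`. The stricter pattern is one possible tightening and an assumption on my part, not necessarily the fix applied in the PR; adding a `minLength` attribute would be another option.

```typescript
// The pattern from the form, anchored the way the browser applies it.
const currentNamePattern = /^(?:[a-z][a-z0-9-]*[a-z0-9])$/;

// Possible tightening (assumption): require at least one middle character,
// which raises the minimum length to 3 and matches the API's MinLength(3).
const stricterNamePattern = /^(?:[a-z][a-z0-9-]+[a-z0-9])$/;

// Hypothetical helper mirroring the client-side check.
function passesClientValidation(value: string, pattern: RegExp): boolean {
  return pattern.test(value);
}
```

With the current pattern `ab` passes client validation, while the stricter pattern rejects it and still accepts three-character names such as `a-b`.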

Reviewed by Cursor Bugbot for commit 8da7694.
