Postgres: installed os-apps wiped on pod restart — Phase 8b only runs for Turso store #150

@nerdsane

Summary

When Temper runs with --storage postgres, previously-installed OS apps are lost on process restart. The SpecRegistry repopulates platform + Agent specs at boot, but there is no code path that replays tenant os-app installs from Postgres — the Phase 8b recovery exists only for the Turso store.

Observable effect: after POST /observe/os-apps/<app>/install, the app's entity sets (e.g. /tdata/GitTokens) work until the next pod restart. After the restart, requests to any entity set the app defined fail with an error such as {"error":{"code":"EntitySetNotFound","message":"Entity set 'GitTokens' not found"}}.

Environment

  • Temper HEAD 078777d2aea5.
  • Storage: --storage postgres (Cloud SQL Postgres 16).
  • Deployment: k8s Deployment, 2 replicas.
  • Repro was against the dark-helix OS app with 21 entity types.

Root cause

Two gaps working together:

  1. temper-store-postgres has no installed-apps tracking. Trait methods on PlatformStore:

    • is_app_installed(tenant, app)
    • record_installed_app(tenant, app)
    • list_all_installed_apps()

    are implemented in temper-store-turso::store::specs (see crates/temper-store-turso/src/store/specs.rs:246) but have no equivalent in crates/temper-store-postgres/. Grepping the crate returns zero hits for list_all_installed_apps / record_installed_app / is_app_installed.

  2. Phase 8b only wires the Turso path. In crates/temper-cli/src/serve/bootstrap.rs::bootstrap_installed_apps (~line 564):

    if let Some(ref store) = state.server.event_store
        && let Some(turso) = store.platform_turso_store()
    {
        match turso.list_all_installed_apps().await { ... }
    }

    With a Postgres-only deployment, platform_turso_store() returns None, so the whole replay block is skipped without any error or warning. There is no fallback to PostgresStore::list_all_installed_apps() (which doesn't exist) or to a file-catalog scan.

    The earlier comment at line 517 of the same file — "OS app specs are already restored from the specs table by restore_registry_from_turso (Phase 2) and Cedar policies by recover_cedar_policies (Phase 6), so no reinstall loop is needed" — explicitly depends on the Turso restore path, which doesn't exist for Postgres.
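
The effect of the Turso-only gate can be illustrated with a small stand-alone model. Everything in it (`EventStore`, `TursoStore`, `replay_installed_apps`) is a hypothetical stand-in rather than Temper's real types; only the shape of the nested `if let` gate follows the snippet from bootstrap.rs quoted above.

```rust
// Toy model of the Phase 8b gate. All types here are hypothetical stand-ins,
// not Temper's real API; only the shape of the gate follows the issue.

/// Stand-in for the Turso-backed platform store.
struct TursoStore {
    installed: Vec<(String, String)>, // (tenant, app)
}

impl TursoStore {
    fn list_all_installed_apps(&self) -> Vec<(String, String)> {
        self.installed.clone()
    }
}

/// Stand-in for the configured event store.
enum EventStore {
    Turso(TursoStore),
    Postgres,
}

impl EventStore {
    /// Mirrors the accessor in the issue: `Some` only for the Turso backend.
    fn platform_turso_store(&self) -> Option<&TursoStore> {
        match self {
            EventStore::Turso(t) => Some(t),
            EventStore::Postgres => None,
        }
    }
}

/// Returns the (tenant, app) installs that would be replayed at boot.
fn replay_installed_apps(event_store: Option<&EventStore>) -> Vec<(String, String)> {
    if let Some(store) = event_store {
        if let Some(turso) = store.platform_turso_store() {
            return turso.list_all_installed_apps();
        }
    }
    // Postgres (or no store at all): the replay block is skipped silently.
    Vec::new()
}
```

In this model, calling `replay_installed_apps(Some(&EventStore::Postgres))` yields an empty list even if installs were recorded elsewhere, which matches the symptom: nothing is replayed and nothing is logged.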

Reproduce

  1. Run Temper with --storage postgres against a fresh Postgres database.
  2. Ship an OS app bundle into TEMPER_OS_APPS_DIR (e.g. an initContainer extracting a tarball to /apps/dark-helix).
  3. POST /observe/os-apps/dark-helix/install with {"tenant": "dark-helix"}. Returns 200 with added: [...entities].
  4. GET /tdata/<any-entity-from-the-bundle> with X-Tenant-Id: dark-helix → 200 OK.
  5. Delete the pod (kubectl delete pod -l app=temper). Wait for the new pod.
  6. Re-issue the same GET → 404 EntitySetNotFound.

Suggested fix

Two orthogonal layers are worth fixing — a short-term unblocker and a long-term proper solution:

Short-term: add a filesystem-catalog reinstall on startup

Gate behind an env var or a CLI flag so existing Turso-based users aren't affected. Pseudocode:

// Phase 8b: re-install apps found on the local catalog into TEMPER_TENANT.
if std::env::var("TEMPER_AUTO_INSTALL_APPS").ok().as_deref() == Some("true") {
    let tenant = std::env::var("TEMPER_TENANT").unwrap_or_else(|_| "default".into());
    for entry in os_apps::list_os_apps() {
        if let Err(e) = os_apps::install_os_app(state, &tenant, &entry.name).await {
            tracing::warn!("auto-install failed for app='{}': {e}", entry.name);
        }
    }
}

This sidesteps the Postgres-tracking gap entirely. It's also the right behavior for deployments where the app catalog is shipped via image/initContainer/PVC (i.e. the canonical k8s pattern) — the filesystem IS the source of truth, so no DB tracking is needed.
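
To make the gate testable without touching the process environment, the decision logic could be pulled out into pure functions. This is a minimal sketch, not Temper code: `auto_install_enabled`, `target_tenant`, and `reinstall_all` are hypothetical names, and the `install` closure stands in for whatever `os_apps::install_os_app` actually does.

```rust
// Sketch of the gating logic as pure functions so it can be unit-tested
// without reading the real environment. All function names are hypothetical.

/// True only when the flag is exactly "true", matching the pseudocode above.
fn auto_install_enabled(flag: Option<&str>) -> bool {
    flag == Some("true")
}

/// Tenant to install into; falls back to "default" when unset.
fn target_tenant(tenant: Option<&str>) -> String {
    tenant.unwrap_or("default").to_string()
}

/// Runs `install(tenant, app)` for every catalog entry and returns the names
/// that failed, so one broken app does not stop the rest from installing.
fn reinstall_all<F>(apps: &[&str], tenant: &str, mut install: F) -> Vec<String>
where
    F: FnMut(&str, &str) -> Result<(), String>,
{
    let mut failed = Vec::new();
    for &app in apps {
        if let Err(e) = install(tenant, app) {
            eprintln!("auto-install failed for app='{app}': {e}");
            failed.push(app.to_string());
        }
    }
    failed
}
```

At the call site the inputs would come from `std::env::var("TEMPER_AUTO_INSTALL_APPS").ok().as_deref()` and `std::env::var("TEMPER_TENANT").ok().as_deref()`. Returning the failed names, rather than only logging them, leaves room for the caller to fail startup or surface the list in a health check.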

Long-term: Postgres implementations of the three trait methods

Add a tenant_installed_apps table in temper-store-postgres::schema, implement the three PlatformStore trait methods, and remove the Turso-only gate in bootstrap_installed_apps. This is the symmetric solution but requires a migration.
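
As a rough illustration of the contract the Postgres implementation would need to satisfy, here is a synchronous in-memory model. The DDL string, its column names, the `InstalledApps` trait, and `MemoryInstalledApps` are all assumptions made for this sketch; the real PlatformStore methods are async and their exact signatures are not shown in this issue.

```rust
use std::collections::BTreeSet;

// Hypothetical DDL for the proposed table; column names and types are
// assumptions, not taken from Temper's schema.
const CREATE_TENANT_INSTALLED_APPS: &str = "\
CREATE TABLE IF NOT EXISTS tenant_installed_apps (
    tenant       TEXT NOT NULL,
    app          TEXT NOT NULL,
    installed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    PRIMARY KEY (tenant, app)
)";

/// Simplified, synchronous stand-in for the three PlatformStore methods.
trait InstalledApps {
    fn is_app_installed(&self, tenant: &str, app: &str) -> bool;
    fn record_installed_app(&mut self, tenant: &str, app: &str);
    fn list_all_installed_apps(&self) -> Vec<(String, String)>;
}

/// In-memory model of the table; the set plays the role of the primary key.
#[derive(Default)]
struct MemoryInstalledApps {
    rows: BTreeSet<(String, String)>,
}

impl InstalledApps for MemoryInstalledApps {
    fn is_app_installed(&self, tenant: &str, app: &str) -> bool {
        self.rows.contains(&(tenant.to_string(), app.to_string()))
    }

    fn record_installed_app(&mut self, tenant: &str, app: &str) {
        // Idempotent, like INSERT ... ON CONFLICT (tenant, app) DO NOTHING.
        self.rows.insert((tenant.to_string(), app.to_string()));
    }

    fn list_all_installed_apps(&self) -> Vec<(String, String)> {
        self.rows.iter().cloned().collect()
    }
}
```

The composite primary key makes recording idempotent, which matters with 2 replicas: both pods may replay or record the same install at boot without producing duplicate rows.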

Context

Hit this bringing up Temper on GKE as the control plane for the dark-helix factory. Currently working around by (a) scaling the Temper Deployment to 1 replica and (b) re-running the install after every pod restart. Neither is acceptable for a production control plane. Planning to implement the short-term fix on a local fix branch and build from that while the upstream work lands.

Related: #148 (tenant_secrets migration — another Postgres-specific init-path bug from the same pipeline).
