From c12dc156662c2489456dd67cd323a8802dca4b4b Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Thu, 21 May 2026 20:13:54 +0400 Subject: [PATCH 1/3] fix(diy-backend): tolerate 404 wrap regression in stack-checkpoint lookup MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit External SC consumers on 2026.5.31 (e.g. wize-rooms-api deploy on 2026-05-21) report a hard failure on their first deploy of a parent stack: deployment failed: failed to get parent stack "wize-rooms-api": failed to get stack "wize-rooms-api": failed to load checkpoint: blob (key ".pulumi/stacks/demo/wize-rooms-api.json") (code=Unknown): storage: object doesn't exist: googleapi: Error 404: No such object: likeclaw-simple-container-state/... Customers can recover by pinning back to 2026.5.13. Root cause: a 404 from the underlying state backend used to round-trip through gocloud's `Bucket.Exists` as `(false, nil)`, which Pulumi's diy backend then mapped to `errCheckpointNotFound` and `GetStack` returned `(nil, nil)`. SC's `createStackIfNotExists` would then CreateStack on the empty slot. That contract held through 2026.5.13. Transitive bumps that landed in the consolidated dep PR (#279) — most notably `cloud.google.com/go/storage` v1.49.0 -> v1.62.2 alongside the Pulumi pkg/v3 v3.184.0 -> v3.241.0 bump — change the error path so that some 404s now reach `gcerrors.Code` as `Unknown` rather than `NotFound`. gocloud's `Bucket.Exists` then returns `(false, wrapped-err)` instead of `(false, nil)`, `stackExists` wraps that as "failed to load checkpoint", and `GetStack` returns the wrap. `selectStack` propagates it up and the deploy fails before `createStackIfNotExists` ever gets a chance to create the missing stack. The diff in Pulumi's diy backend code between v3.184.0 and v3.241.0 is byte-identical for `GetStack` / `stackExists` / `errCheckpointNotFound` — the regression is purely in the surface area of the transitive cloud-storage clients. Fix: in `selectStack`, detect 404-shaped errors from the diy backend's "failed to load checkpoint" wrap and treat them the same as the `(nil, nil)` return that v3.184 produced. The structured `gcerrors.Code == NotFound` path is the first check (handles GCS/S3/Azure clients whose error chain still classifies cleanly). A scoped string-match fallback handles the regression cases where the underlying client's NotFound code doesn't round-trip — we limit the match to error chains that begin with Pulumi's `"failed to load checkpoint:"` wrapper so we don't accidentally swallow unrelated NotFound-shaped errors. Tested: - New unit test `TestStackCheckpointNotFound` covers the customer's exact error string plus S3 NoSuchKey and Azure BlobNotFound variants of the same wrap shape. Out-of-scope error patterns (no checkpoint prefix, or checkpoint prefix with a non-NotFound underlying error) are explicitly asserted to NOT trigger the swallowing path. - `go test ./pkg/clouds/pulumi/...` passes. - `go build ./...` passes. After this lands, consumers can adopt any release >= the next tag and get the v3.184 GetStack contract back without rolling back. The 2026.5.31 breakage stays fixed even if `cloud.google.com/go/storage` further changes its error wrapping shape, because the defensive matcher covers the common provider NotFound markers. Refs: - Reported by external consumer 2026-05-21T12:42:35Z on 2026.5.31 - Likely-introduced-by: #279 (Pulumi SDK + cloud.google.com/go/storage bumps) Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/create_stack.go | 89 +++++++++++++++++++++++++- pkg/clouds/pulumi/create_stack_test.go | 66 +++++++++++++++++++ 2 files changed, 152 insertions(+), 3 deletions(-) create mode 100644 pkg/clouds/pulumi/create_stack_test.go diff --git a/pkg/clouds/pulumi/create_stack.go b/pkg/clouds/pulumi/create_stack.go index 43f8d0a9..63853b01 100644 --- a/pkg/clouds/pulumi/create_stack.go +++ b/pkg/clouds/pulumi/create_stack.go @@ -2,10 +2,12 @@ package pulumi import ( "context" + "strings" "github.com/pkg/errors" "github.com/pulumi/pulumi/pkg/v3/backend" + "gocloud.dev/gcerrors" "github.com/simple-container-com/api/pkg/api" ) @@ -34,9 +36,90 @@ func (p *pulumi) selectStack(ctx context.Context, cfg *api.ConfigFile, stack api if err != nil { return nil, err } - if s, err := p.backend.GetStack(ctx, p.stackRef); err != nil { + s, err := p.backend.GetStack(ctx, p.stackRef) + if err != nil { + // Treat "checkpoint blob not found" as "stack does not exist". + // + // Pulumi's diy backend (pkg/v3/backend/diy) is supposed to map a + // missing checkpoint to (nil, nil) from GetStack — its own + // errCheckpointNotFound sentinel handles that. But the path runs + // through gocloud.dev/blob.Bucket.Exists, which only converts + // provider errors to (false, nil) when gcerrors.Code(err) == + // gcerrors.NotFound. + // + // Recent transitive bumps to cloud.google.com/go/storage (and the + // equivalent S3/Azure clients) sometimes surface a 404 through an + // error path that gocloud no longer classifies as NotFound — the + // error reaches Exists as code=Unknown, Exists returns (false, + // wrapped-err) instead of (false, nil), stackExists wraps that + // as "failed to load checkpoint", and GetStack returns the wrap + // rather than the (nil, nil) "missing stack" contract that the + // rest of SC's createStackIfNotExists / selectStack callers + // depend on. + // + // This affected external SC consumers on 2026.5.31 (e.g. the + // wize-rooms-api deploy on 2026-05-21) with: + // failed to get parent stack "wize-rooms-api": + // failed to get stack "wize-rooms-api": + // failed to load checkpoint: blob (key ".pulumi/stacks//.json") + // (code=Unknown): storage: object doesn't exist: + // googleapi: Error 404: No such object: ... + // + // Restore the v3.184-era contract here: if the underlying error + // is a NotFound (either by gocloud code or by the layered string + // pattern that surfaces from current GCS/S3 clients), treat + // GetStack as having returned (nil, nil) — the caller will then + // CreateStack as it did before the regression. + if stackCheckpointNotFound(err) { + return nil, nil + } return s, errors.Wrapf(err, "failed to get stack %q", p.stackRef) - } else { - return s, nil } + return s, nil +} + +// stackCheckpointNotFound returns true when err coming back from the diy +// backend's GetStack indicates that the underlying checkpoint blob is +// missing — i.e. the stack does not yet exist in state storage. +// +// First check is the structured one: gocloud's gcerrors.Code. That's what +// blob.Bucket.Exists uses internally to convert to (false, nil), and when +// it works we never hit this function in the first place — the structured +// path is the happy case we're patching around. +// +// Second check is a string match on the wrapped error message. We use +// it only as a fallback for the case where the underlying provider client +// (GCS / S3 / Azure) wraps the 404 in a way that gcerrors no longer sees +// as NotFound. We deliberately scope the match to error chains that +// originated in Pulumi's "failed to load checkpoint:" wrapper so we don't +// accidentally swallow unrelated NotFound-shaped errors from elsewhere +// in the deploy program. +func stackCheckpointNotFound(err error) bool { + if err == nil { + return false + } + if gcerrors.Code(err) == gcerrors.NotFound { + return true + } + msg := err.Error() + if !strings.Contains(msg, "failed to load checkpoint") { + return false + } + // Provider-specific 404 markers that gcerrors.Code may miss after a + // transitive bump: + // - GCS: "object doesn't exist" / "notFound" / "Error 404" + // - S3: "NoSuchKey" + // - Azure: "BlobNotFound" / "ResourceNotFound" + for _, marker := range []string{ + "object doesn't exist", + "notFound", + "NoSuchKey", + "BlobNotFound", + "ResourceNotFound", + } { + if strings.Contains(msg, marker) { + return true + } + } + return false } diff --git a/pkg/clouds/pulumi/create_stack_test.go b/pkg/clouds/pulumi/create_stack_test.go new file mode 100644 index 00000000..88bc7b23 --- /dev/null +++ b/pkg/clouds/pulumi/create_stack_test.go @@ -0,0 +1,66 @@ +package pulumi + +import ( + "errors" + "fmt" + "testing" +) + +// Note: we deliberately don't test the gcerrors.Code() == NotFound branch +// here. Constructing a `*gcerr.Error` requires gocloud.dev/internal/gcerr +// which is an internal package; the gcerrors.Code lookup uses errors.As on +// that concrete type. Functional coverage of that branch comes from gocloud +// itself; what's regressed and needs unit coverage is the string-fallback +// path that the customer's deploy actually hit. + +func TestStackCheckpointNotFound(t *testing.T) { + tests := []struct { + name string + err error + want bool + }{ + { + name: "nil", + err: nil, + want: false, + }, + { + name: "GCS 404 wrapped through Pulumi diy backend (the customer regression case)", + err: fmt.Errorf("failed to load checkpoint: %w", + errors.New(`blob (key ".pulumi/stacks/demo/wize-rooms-api.json") (code=Unknown): storage: object doesn't exist: googleapi: Error 404: No such object: likeclaw-simple-container-state/.pulumi/stacks/demo/wize-rooms-api.json, notFound`)), + want: true, + }, + { + name: "GCS 404 without the 'failed to load checkpoint' prefix — out of scope, don't swallow", + err: errors.New(`storage: object doesn't exist: googleapi: Error 404`), + want: false, + }, + { + name: "S3 NoSuchKey wrapped through Pulumi diy backend", + err: fmt.Errorf("failed to load checkpoint: %w", + errors.New(`blob (key ".pulumi/stacks/foo/bar.json") (code=Unknown): NoSuchKey: The specified key does not exist`)), + want: true, + }, + { + name: "Azure BlobNotFound wrapped through Pulumi diy backend", + err: fmt.Errorf("failed to load checkpoint: %w", + errors.New(`blob (key ".pulumi/stacks/foo/bar.json") (code=Unknown): BlobNotFound`)), + want: true, + }, + { + name: "unrelated error containing 'failed to load checkpoint' but no NotFound marker", + err: fmt.Errorf("failed to load checkpoint: %w", + errors.New("permission denied: 403")), + want: false, + }, + } + + for _, tc := range tests { + t.Run(tc.name, func(t *testing.T) { + got := stackCheckpointNotFound(tc.err) + if got != tc.want { + t.Errorf("stackCheckpointNotFound(%q) = %v, want %v", tc.err, got, tc.want) + } + }) + } +} From 806642c2cb7103502b88bf49eb9b7e3e26778776 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Thu, 21 May 2026 21:09:40 +0400 Subject: [PATCH 2/3] fix(diy-backend): broaden NotFound marker set + case-insensitive match MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Review feedback (Codex + Gemini) on the original patch: - Codex pointed out that S3 v2 SDK errors surface as "operation error S3: HeadObject ... StatusCode: 404 ... api error NotFound: Not Found" — neither "NoSuchKey" nor the original lowercase "notFound" marker fires on that. S3 users would still hit the regression. - Gemini flagged case-sensitivity ("notFound" vs "Not Found" vs "NotFound") as brittle to upstream formatting drift across client SDKs. Both concerns resolved here: 1. Lower-case the haystack once via strings.ToLower and lower-case each marker. Now matches "NotFound" / "Not Found" / "notFound" / "notfound" uniformly. 2. Expand the marker set: - Added "StatusCode: 404" — the AWS SDK v2 HeadObject 404 wrap. - Added "Error 404" — covers the customer's exact GCS error ("googleapi: Error 404") and most providers that surface the HTTP status verbatim. - Renamed comment annotations: "S3 v1" -> NoSuchKey, "S3 v2" -> api error NotFound / StatusCode 404. The "404" suffix is load-bearing — virtually every cloud-storage provider includes the HTTP status code in the wrapped error for a missing object regardless of the SDK's NotFound enum naming. This defends the matcher against future SDK shape drift even if specific enum names disappear. New test cases covering each provider variant: - GCS NotFound with capitalized "Not Found" (case-insensitivity) - S3 v2 SDK "api error NotFound: Not Found" + StatusCode: 404 - Generic StatusCode: 404 wrap (covers future client SDKs) The negative case ("failed to load checkpoint: permission denied") still correctly returns false — there's no NotFound-shaped marker in that message. Refs: review feedback on parent commit Signed-off-by: Dmitrii Creed --- pkg/clouds/pulumi/create_stack.go | 25 ++++++++++++++++++------- pkg/clouds/pulumi/create_stack_test.go | 20 +++++++++++++++++++- 2 files changed, 37 insertions(+), 8 deletions(-) diff --git a/pkg/clouds/pulumi/create_stack.go b/pkg/clouds/pulumi/create_stack.go index 63853b01..f6895f40 100644 --- a/pkg/clouds/pulumi/create_stack.go +++ b/pkg/clouds/pulumi/create_stack.go @@ -106,18 +106,29 @@ func stackCheckpointNotFound(err error) bool { return false } // Provider-specific 404 markers that gcerrors.Code may miss after a - // transitive bump: + // transitive bump. Match case-insensitively to defend against + // formatting drift across client versions ("NotFound" vs "notFound", + // "Not Found" with space, etc.): // - GCS: "object doesn't exist" / "notFound" / "Error 404" - // - S3: "NoSuchKey" + // - S3 v1: "NoSuchKey" + // - S3 v2: "api error NotFound" / "StatusCode: 404" // - Azure: "BlobNotFound" / "ResourceNotFound" + // + // The "404" suffix is intentional and load-bearing: it's the HTTP + // status code that virtually every cloud-storage provider includes + // in the wrapped error for a missing object, regardless of the + // SDK's NotFound enum naming. + msgLower := strings.ToLower(msg) for _, marker := range []string{ "object doesn't exist", - "notFound", - "NoSuchKey", - "BlobNotFound", - "ResourceNotFound", + "notfound", + "nosuchkey", + "blobnotfound", + "resourcenotfound", + "statuscode: 404", + "error 404", } { - if strings.Contains(msg, marker) { + if strings.Contains(msgLower, marker) { return true } } diff --git a/pkg/clouds/pulumi/create_stack_test.go b/pkg/clouds/pulumi/create_stack_test.go index 88bc7b23..b0d884e2 100644 --- a/pkg/clouds/pulumi/create_stack_test.go +++ b/pkg/clouds/pulumi/create_stack_test.go @@ -36,17 +36,35 @@ func TestStackCheckpointNotFound(t *testing.T) { want: false, }, { - name: "S3 NoSuchKey wrapped through Pulumi diy backend", + name: "S3 v1 NoSuchKey wrapped through Pulumi diy backend", err: fmt.Errorf("failed to load checkpoint: %w", errors.New(`blob (key ".pulumi/stacks/foo/bar.json") (code=Unknown): NoSuchKey: The specified key does not exist`)), want: true, }, + { + name: "S3 v2 SDK 'api error NotFound' wrapped through Pulumi diy backend", + err: fmt.Errorf("failed to load checkpoint: %w", + errors.New(`blob (key ".pulumi/stacks/foo/bar.json") (code=Unknown): operation error S3: HeadObject, https response error StatusCode: 404, RequestID: x, HostID: y, api error NotFound: Not Found`)), + want: true, + }, { name: "Azure BlobNotFound wrapped through Pulumi diy backend", err: fmt.Errorf("failed to load checkpoint: %w", errors.New(`blob (key ".pulumi/stacks/foo/bar.json") (code=Unknown): BlobNotFound`)), want: true, }, + { + name: "GCS NotFound with capitalized 'Not Found' (case-insensitivity guard)", + err: fmt.Errorf("failed to load checkpoint: %w", + errors.New(`blob (key ".pulumi/stacks/foo/bar.json") (code=Unknown): NotFound: object Not Found`)), + want: true, + }, + { + name: "Generic 'StatusCode: 404' wrap (covers future client SDKs we don't enumerate)", + err: fmt.Errorf("failed to load checkpoint: %w", + errors.New(`blob (key ".pulumi/stacks/foo/bar.json") (code=Unknown): StatusCode: 404`)), + want: true, + }, { name: "unrelated error containing 'failed to load checkpoint' but no NotFound marker", err: fmt.Errorf("failed to load checkpoint: %w", From 116cd6ff8606ffd2b0a8ce90fd324bfc54520734 Mon Sep 17 00:00:00 2001 From: Dmitrii Creed Date: Thu, 21 May 2026 21:14:18 +0400 Subject: [PATCH 3/3] docs(vpa): document controlledValues field added in #280 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PR #280 added the controlledValues knob to VPAConfig but didn't update the user-facing VPA concept doc. Adding a "Controlled Values" section right after "Controlled Resources" with: - What the default behavior does (RequestsAndLimits scales limits proportionally with requests) and why that's problematic for cold-start-heavy workloads. - What RequestsOnly does and when to use it. - A complete example showing the 50m floor + RequestsOnly pairing that PAY-SPACE adopted as the documented pattern. Noticed during review of the parent commits on this branch — the matcher refinements there are user-visible to anyone trying to opt into the new field, so it's the right time to make the docs match. JSON schema regeneration (docs/schemas/kubernetes/) is deferred to a separate hygiene PR — the schema-gen tool currently drifts unrelated fields when run against this codebase. Signed-off-by: Dmitrii Creed --- docs/docs/concepts/vertical-pod-autoscaler.md | 24 +++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/docs/docs/concepts/vertical-pod-autoscaler.md b/docs/docs/concepts/vertical-pod-autoscaler.md index af934015..9d6a5713 100644 --- a/docs/docs/concepts/vertical-pod-autoscaler.md +++ b/docs/docs/concepts/vertical-pod-autoscaler.md @@ -144,6 +144,30 @@ vpa: controlledResources: ["cpu", "memory"] # Specify which resources to manage ``` +### **Controlled Values** + +By default VPA rewrites both `requests` and `limits` at admission, scaling the limit proportionally with the request. For workloads whose limits are sized to absorb cold-start CPU bursts (Django/gunicorn, Node SSR, JVM warmup), a low `minAllowed.cpu` paired with the default behaviour can shrink the CPU limit below what the cold-start path needs, causing startup-probe failures and SIGKILLs. + +Set `controlledValues: "RequestsOnly"` to tell VPA to only rewrite `requests` and leave the deployment template's `limits` untouched. The deployment then keeps its full cold-start headroom while VPA still right-sizes the steady-state request. + +```yaml +vpa: + enabled: true + updateMode: "Auto" + minAllowed: + cpu: "50m" # safe at this floor when limit is preserved + memory: "64Mi" + maxAllowed: + cpu: "2" + memory: "4Gi" + controlledResources: ["cpu", "memory"] + controlledValues: "RequestsOnly" # leave deployment-template limits alone +``` + +Valid values: +- `RequestsAndLimits` (default) — VPA scales both. Equivalent to omitting the field. +- `RequestsOnly` — VPA scales only `requests`; `limits` stay at the values in the underlying deployment template. + ## **VPA Best Practices** ### **1. Environment-Specific Configuration**