fix(diy-backend): tolerate 404 wrap regression in stack-checkpoint lookup#286
Conversation
…okup External SC consumers on 2026.5.31 (e.g. wize-rooms-api deploy on 2026-05-21) report a hard failure on their first deploy of a parent stack: deployment failed: failed to get parent stack "wize-rooms-api": failed to get stack "wize-rooms-api": failed to load checkpoint: blob (key ".pulumi/stacks/demo/wize-rooms-api.json") (code=Unknown): storage: object doesn't exist: googleapi: Error 404: No such object: likeclaw-simple-container-state/... Customers can recover by pinning back to 2026.5.13. Root cause: a 404 from the underlying state backend used to round-trip through gocloud's `Bucket.Exists` as `(false, nil)`, which Pulumi's diy backend then mapped to `errCheckpointNotFound` and `GetStack` returned `(nil, nil)`. SC's `createStackIfNotExists` would then CreateStack on the empty slot. That contract held through 2026.5.13. Transitive bumps that landed in the consolidated dep PR (#279) — most notably `cloud.google.com/go/storage` v1.49.0 -> v1.62.2 alongside the Pulumi pkg/v3 v3.184.0 -> v3.241.0 bump — change the error path so that some 404s now reach `gcerrors.Code` as `Unknown` rather than `NotFound`. gocloud's `Bucket.Exists` then returns `(false, wrapped-err)` instead of `(false, nil)`, `stackExists` wraps that as "failed to load checkpoint", and `GetStack` returns the wrap. `selectStack` propagates it up and the deploy fails before `createStackIfNotExists` ever gets a chance to create the missing stack. The diff in Pulumi's diy backend code between v3.184.0 and v3.241.0 is byte-identical for `GetStack` / `stackExists` / `errCheckpointNotFound` — the regression is purely in the surface area of the transitive cloud-storage clients. Fix: in `selectStack`, detect 404-shaped errors from the diy backend's "failed to load checkpoint" wrap and treat them the same as the `(nil, nil)` return that v3.184 produced. The structured `gcerrors.Code == NotFound` path is the first check (handles GCS/S3/Azure clients whose error chain still classifies cleanly). A scoped string-match fallback handles the regression cases where the underlying client's NotFound code doesn't round-trip — we limit the match to error chains that begin with Pulumi's `"failed to load checkpoint:"` wrapper so we don't accidentally swallow unrelated NotFound-shaped errors. Tested: - New unit test `TestStackCheckpointNotFound` covers the customer's exact error string plus S3 NoSuchKey and Azure BlobNotFound variants of the same wrap shape. Out-of-scope error patterns (no checkpoint prefix, or checkpoint prefix with a non-NotFound underlying error) are explicitly asserted to NOT trigger the swallowing path. - `go test ./pkg/clouds/pulumi/...` passes. - `go build ./...` passes. After this lands, consumers can adopt any release >= the next tag and get the v3.184 GetStack contract back without rolling back. The 2026.5.31 breakage stays fixed even if `cloud.google.com/go/storage` further changes its error wrapping shape, because the defensive matcher covers the common provider NotFound markers. Refs: - Reported by external consumer 2026-05-21T12:42:35Z on 2026.5.31 - Likely-introduced-by: #279 (Pulumi SDK + cloud.google.com/go/storage bumps) Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
Semgrep Scan ResultsRepository:
Scanned at 2026-05-21 17:15 UTC |
Security Scan ResultsRepository:
Scanned at 2026-05-21 17:15 UTC |
Review feedback (Codex + Gemini) on the original patch:
- Codex pointed out that S3 v2 SDK errors surface as
"operation error S3: HeadObject ... StatusCode: 404 ... api error
NotFound: Not Found" — neither "NoSuchKey" nor the original
lowercase "notFound" marker fires on that. S3 users would still
hit the regression.
- Gemini flagged case-sensitivity ("notFound" vs "Not Found" vs
"NotFound") as brittle to upstream formatting drift across client
SDKs.
Both concerns resolved here:
1. Lower-case the haystack once via strings.ToLower and lower-case
each marker. Now matches "NotFound" / "Not Found" / "notFound" /
"notfound" uniformly.
2. Expand the marker set:
- Added "StatusCode: 404" — the AWS SDK v2 HeadObject 404 wrap.
- Added "Error 404" — covers the customer's exact GCS error
("googleapi: Error 404") and most providers that surface the
HTTP status verbatim.
- Renamed comment annotations: "S3 v1" -> NoSuchKey,
"S3 v2" -> api error NotFound / StatusCode 404.
The "404" suffix is load-bearing — virtually every cloud-storage
provider includes the HTTP status code in the wrapped error for a
missing object regardless of the SDK's NotFound enum naming. This
defends the matcher against future SDK shape drift even if specific
enum names disappear.
New test cases covering each provider variant:
- GCS NotFound with capitalized "Not Found" (case-insensitivity)
- S3 v2 SDK "api error NotFound: Not Found" + StatusCode: 404
- Generic StatusCode: 404 wrap (covers future client SDKs)
The negative case ("failed to load checkpoint: permission denied")
still correctly returns false — there's no NotFound-shaped marker
in that message.
Refs: review feedback on parent commit
Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
PR #280 added the controlledValues knob to VPAConfig but didn't update the user-facing VPA concept doc. Adding a "Controlled Values" section right after "Controlled Resources" with: - What the default behavior does (RequestsAndLimits scales limits proportionally with requests) and why that's problematic for cold-start-heavy workloads. - What RequestsOnly does and when to use it. - A complete example showing the 50m floor + RequestsOnly pairing that PAY-SPACE adopted as the documented pattern. Noticed during review of the parent commits on this branch — the matcher refinements there are user-visible to anyone trying to opt into the new field, so it's the right time to make the docs match. JSON schema regeneration (docs/schemas/kubernetes/) is deferred to a separate hygiene PR — the schema-gen tool currently drifts unrelated fields when run against this codebase. Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
|
Reviewed by Codex + Gemini (independent runs). Codex
Gemini
Both points addressed in follow-up commit
Style note (not addressed)
DocsFollowup commit Final tests |
Summary
External SC consumers on
2026.5.31are seeing a hard failure on the first deploy of a parent stack:Pinning back to
2026.5.13works around it, but blocks adopting any newer release (including #280 which is what PAY-SPACE needs for VPAcontrolledValues).Root cause
A 404 from the state backend used to round-trip through gocloud's
Bucket.Existsas(false, nil). Pulumi's diy backend mapped that toerrCheckpointNotFound,GetStackreturned(nil, nil), and SC'screateStackIfNotExistswould create the missing parent stack. That contract held through2026.5.13.The transitive bumps in #279 — most notably
cloud.google.com/go/storagev1.49.0 → v1.62.2 alongsidepkg/v3v3.184.0 → v3.241.0 — change the error path so some GCS 404s reachgcerrors.CodeasUnknownrather thanNotFound.Bucket.Existsthen returns(false, wrapped-err),stackExistswraps as "failed to load checkpoint", andGetStackreturns the wrap —selectStackpropagates and the deploy fails beforecreateStackIfNotExistsruns.The diff in Pulumi's diy backend (
GetStack,stackExists,errCheckpointNotFound) between v3.184.0 and v3.241.0 is byte-identical — the regression is entirely in the transitive cloud-storage client.Fix
In
selectStack, detect 404-shaped errors from the diy backend's "failed to load checkpoint:" wrap and treat them the same as the(nil, nil)return that v3.184 produced.gcerrors.Code(err) == gcerrors.NotFound(the structured path; handles future-fixed clients)."failed to load checkpoint:" + (NotFound marker)across GCS / S3 / Azure provider 404 shapes. Scoped to the checkpoint wrapper to avoid swallowing unrelated NotFound-shaped errors.Test plan
go test ./pkg/clouds/pulumi/...— passes locally.go build ./...— passes locally.TestStackCheckpointNotFound:NoSuchKeywrapped → trueBlobNotFoundwrapped → trueWhy a string fallback
The structured path is preferred —
gcerrors.Codeis the documented surface. But the regression bites precisely because the cloud-storage client's wrap stopped classifying as NotFound. Defending only at the structured layer leaves the regression in place. A scoped string check on a known marker is the smallest patch that restores the contract until upstream gocloud / cloud-storage settles.Why not just downgrade the storage dep
Could be argued, but:
createStackIfNotExists.Refs: external consumer report 2026-05-21T12:42:35Z, #279, #280