fix(diy-backend): tolerate 404 wrap regression in stack-checkpoint lookup#286

Merged
Cre-eD merged 3 commits into main from fix/parent-stack-404-tolerance
May 21, 2026

Conversation

@Cre-eD
Contributor

@Cre-eD Cre-eD commented May 21, 2026

Summary

External SC consumers on 2026.5.31 are seeing a hard failure on the first deploy of a parent stack:

deployment failed: failed to get parent stack "wize-rooms-api":
failed to get stack "wize-rooms-api":
failed to load checkpoint: blob (key ".pulumi/stacks/demo/wize-rooms-api.json") (code=Unknown):
storage: object doesn't exist: googleapi: Error 404: No such object:
likeclaw-simple-container-state/.pulumi/stacks/demo/wize-rooms-api.json, notFound

Pinning back to 2026.5.13 works around it but blocks adopting any newer release, including #280, which PAY-SPACE needs for VPA controlledValues.

Root cause

A 404 from the state backend used to round-trip through gocloud's Bucket.Exists as (false, nil). Pulumi's diy backend mapped that to errCheckpointNotFound, GetStack returned (nil, nil), and SC's createStackIfNotExists would create the missing parent stack. That contract held through 2026.5.13.

The transitive bumps in #279 — most notably cloud.google.com/go/storage v1.49.0 → v1.62.2 alongside pkg/v3 v3.184.0 → v3.241.0 — change the error path so some GCS 404s reach gcerrors.Code as Unknown rather than NotFound. Bucket.Exists then returns (false, wrapped-err), stackExists wraps as "failed to load checkpoint", and GetStack returns the wrap — selectStack propagates and the deploy fails before createStackIfNotExists runs.

Pulumi's diy backend code (GetStack, stackExists, errCheckpointNotFound) is byte-identical between v3.184.0 and v3.241.0, so the regression is entirely in the transitive cloud-storage client.
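
To make the contract change concrete, here is a minimal simulation of the two behaviors described above. It is a sketch only: `Stack`, `existsOld`, `existsNew`, `getStack`, and `selectOrCreate` are hypothetical stand-ins, not the real gocloud, Pulumi, or SC functions, and the error text is abbreviated.

```go
package main

import "fmt"

// Stack is a hypothetical stand-in for a backend stack handle.
type Stack struct{ Name string }

// existsFunc models the shape of an existence check on the state bucket.
type existsFunc func(key string) (bool, error)

// existsOld models the pre-regression behavior: a 404 means "absent, no error".
func existsOld(string) (bool, error) { return false, nil }

// existsNew models the regressed behavior: the 404 surfaces as an error.
func existsNew(key string) (bool, error) {
	return false, fmt.Errorf("blob (key %q) (code=Unknown): storage: object doesn't exist", key)
}

// getStack mimics the contract described above: (nil, nil) for a missing
// checkpoint, or a "failed to load checkpoint" wrap when the check errors.
func getStack(name string, exists existsFunc) (*Stack, error) {
	ok, err := exists(".pulumi/stacks/demo/" + name + ".json")
	if err != nil {
		return nil, fmt.Errorf("failed to load checkpoint: %w", err)
	}
	if !ok {
		return nil, nil
	}
	return &Stack{Name: name}, nil
}

// selectOrCreate is a simplified caller that only creates on a clean (nil, nil).
func selectOrCreate(name string, exists existsFunc) (string, error) {
	s, err := getStack(name, exists)
	if err != nil {
		return "", fmt.Errorf("failed to get stack %q: %w", name, err)
	}
	if s == nil {
		return "created", nil
	}
	return "selected", nil
}

func main() {
	out, err := selectOrCreate("wize-rooms-api", existsOld)
	fmt.Println(out, err)
	out, err = selectOrCreate("wize-rooms-api", existsNew)
	fmt.Println(out, err)
}
```

With the old behavior the caller reaches the create branch; with the new one the error propagates before any create can happen.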

Fix

In selectStack, detect 404-shaped errors from the diy backend's "failed to load checkpoint:" wrap and treat them the same as the (nil, nil) return that v3.184 produced.

  • Primary: gcerrors.Code(err) == gcerrors.NotFound (the structured path; handles future-fixed clients).
  • Fallback: scoped string match for "failed to load checkpoint:" + (NotFound marker) across GCS / S3 / Azure provider 404 shapes. Scoped to the checkpoint wrapper to avoid swallowing unrelated NotFound-shaped errors.

Test plan

  • go test ./pkg/clouds/pulumi/... — passes locally.
  • go build ./... — passes locally.
  • New unit test TestStackCheckpointNotFound:
    • nil → false
    • GCS 404 wrapped through Pulumi diy backend (the exact customer error string) → true
    • GCS 404 without the checkpoint prefix → false (out of scope)
    • S3 NoSuchKey wrapped → true
    • Azure BlobNotFound wrapped → true
    • "failed to load checkpoint: permission denied" → false (not a NotFound shape)

Why a string fallback

The structured path is preferred — gcerrors.Code is the documented surface. But the regression bites precisely because the cloud-storage client's wrap stopped classifying as NotFound. Defending only at the structured layer leaves the regression in place. A scoped string check on a known marker is the smallest patch that restores the contract until upstream gocloud / cloud-storage settles.

Why not just downgrade the storage dep

Downgrading is an option, but:

  1. We'd lose security fixes that came with the bump.
  2. The transitive-dep landscape is unstable enough that another bump could re-trigger the same regression in a different shape.
  3. Defensive normalization at the SC seam matches the contract we already advertise to createStackIfNotExists.

Refs: external consumer report 2026-05-21T12:42:35Z, #279, #280

…okup

External SC consumers on 2026.5.31 (e.g. wize-rooms-api deploy on
2026-05-21) report a hard failure on their first deploy of a parent
stack:

  deployment failed: failed to get parent stack "wize-rooms-api":
  failed to get stack "wize-rooms-api":
  failed to load checkpoint: blob (key ".pulumi/stacks/demo/wize-rooms-api.json")
  (code=Unknown): storage: object doesn't exist:
  googleapi: Error 404: No such object: likeclaw-simple-container-state/...

Customers can recover by pinning back to 2026.5.13.

Root cause: a 404 from the underlying state backend used to round-trip
through gocloud's `Bucket.Exists` as `(false, nil)`, which Pulumi's diy
backend then mapped to `errCheckpointNotFound` and `GetStack` returned
`(nil, nil)`. SC's `createStackIfNotExists` would then CreateStack on
the empty slot. That contract held through 2026.5.13.

Transitive bumps that landed in the consolidated dep PR (#279) — most
notably `cloud.google.com/go/storage` v1.49.0 -> v1.62.2 alongside the
Pulumi pkg/v3 v3.184.0 -> v3.241.0 bump — change the error path so that
some 404s now reach `gcerrors.Code` as `Unknown` rather than `NotFound`.
gocloud's `Bucket.Exists` then returns `(false, wrapped-err)` instead
of `(false, nil)`, `stackExists` wraps that as "failed to load
checkpoint", and `GetStack` returns the wrap. `selectStack` propagates
it up and the deploy fails before `createStackIfNotExists` ever gets a
chance to create the missing stack.

The diff in Pulumi's diy backend code between v3.184.0 and v3.241.0 is
byte-identical for `GetStack` / `stackExists` / `errCheckpointNotFound`
— the regression is purely in the surface area of the transitive
cloud-storage clients.

Fix: in `selectStack`, detect 404-shaped errors from the diy backend's
"failed to load checkpoint" wrap and treat them the same as the
`(nil, nil)` return that v3.184 produced. The structured `gcerrors.Code
== NotFound` path is the first check (handles GCS/S3/Azure clients
whose error chain still classifies cleanly). A scoped string-match
fallback handles the regression cases where the underlying client's
NotFound code doesn't round-trip — we limit the match to error chains
that begin with Pulumi's `"failed to load checkpoint:"` wrapper so we
don't accidentally swallow unrelated NotFound-shaped errors.

Tested:
- New unit test `TestStackCheckpointNotFound` covers the customer's
  exact error string plus S3 NoSuchKey and Azure BlobNotFound variants
  of the same wrap shape. Out-of-scope error patterns (no checkpoint
  prefix, or checkpoint prefix with a non-NotFound underlying error)
  are explicitly asserted to NOT trigger the swallowing path.
- `go test ./pkg/clouds/pulumi/...` passes.
- `go build ./...` passes.

After this lands, consumers can adopt any release >= the next tag and
get the v3.184 GetStack contract back without rolling back. The 2026.5.31
breakage stays fixed even if `cloud.google.com/go/storage` further
changes its error wrapping shape, because the defensive matcher covers
the common provider NotFound markers.

Refs:
- Reported by external consumer 2026-05-21T12:42:35Z on 2026.5.31
- Likely-introduced-by: #279 (Pulumi SDK + cloud.google.com/go/storage bumps)

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD Cre-eD requested a review from smecsia as a code owner May 21, 2026 16:14
@github-actions

github-actions Bot commented May 21, 2026

Semgrep Scan Results

Repository: api | Commit: 6a655eb

| Check | Status | Details |
| --- | --- | --- |
| Semgrep | ⚠️ Warning | 10 warning(s), 10 total |

Scanned at 2026-05-21 17:15 UTC

@github-actions

github-actions Bot commented May 21, 2026

Security Scan Results

Repository: api | Commit: 6a655eb

| Check | Status | Details |
| --- | --- | --- |
| Secret Scan | ✅ Pass | No secrets detected |
| Dependencies (Trivy) | ✅ Pass | 0 total (no critical/high) |
| Dependencies (Grype) | ✅ Pass | 0 total (no critical/high) |
| SBOM | 📦 Generated | 528 components (CycloneDX) |

Scanned at 2026-05-21 17:15 UTC

Cre-eD added 2 commits May 21, 2026 21:09
Review feedback (Codex + Gemini) on the original patch:
- Codex pointed out that S3 v2 SDK errors surface as
  "operation error S3: HeadObject ... StatusCode: 404 ... api error
  NotFound: Not Found" — neither "NoSuchKey" nor the original
  lowercase "notFound" marker fires on that. S3 users would still
  hit the regression.
- Gemini flagged case-sensitivity ("notFound" vs "Not Found" vs
  "NotFound") as brittle to upstream formatting drift across client
  SDKs.

Both concerns resolved here:

1. Lower-case the haystack once via strings.ToLower and lower-case
   each marker. Now matches "NotFound" / "Not Found" / "notFound" /
   "notfound" uniformly.
2. Expand the marker set:
   - Added "StatusCode: 404" — the AWS SDK v2 HeadObject 404 wrap.
   - Added "Error 404" — covers the customer's exact GCS error
     ("googleapi: Error 404") and most providers that surface the
     HTTP status verbatim.
   - Renamed comment annotations: "S3 v1" -> NoSuchKey,
     "S3 v2" -> api error NotFound / StatusCode 404.

The "404" suffix is load-bearing — virtually every cloud-storage
provider includes the HTTP status code in the wrapped error for a
missing object regardless of the SDK's NotFound enum naming. This
defends the matcher against future SDK shape drift even if specific
enum names disappear.

New test cases covering each provider variant:
- GCS NotFound with capitalized "Not Found" (case-insensitivity)
- S3 v2 SDK "api error NotFound: Not Found" + StatusCode: 404
- Generic StatusCode: 404 wrap (covers future client SDKs)

The negative case ("failed to load checkpoint: permission denied")
still correctly returns false — there's no NotFound-shaped marker
in that message.

Refs: review feedback on parent commit
Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
PR #280 added the controlledValues knob to VPAConfig but didn't
update the user-facing VPA concept doc. Adding a "Controlled Values"
section right after "Controlled Resources" with:
- What the default behavior does (RequestsAndLimits scales limits
  proportionally with requests) and why that's problematic for
  cold-start-heavy workloads.
- What RequestsOnly does and when to use it.
- A complete example showing the 50m floor + RequestsOnly pairing
  that PAY-SPACE adopted as the documented pattern.

Noticed during review of the parent commits on this branch — the
matcher refinements there are user-visible to anyone trying to
opt into the new field, so it's the right time to make the docs
match.

JSON schema regeneration (docs/schemas/kubernetes/) is deferred
to a separate hygiene PR — the schema-gen tool currently drifts
unrelated fields when run against this codebase.

Signed-off-by: Dmitrii Creed <creeed22@gmail.com>
@Cre-eD
Contributor Author

Cre-eD commented May 21, 2026

Reviewed by Codex + Gemini (independent runs).

Codex

The GCS-specific path appears covered, but the fallback is too narrow for a documented S3 missing-object 404 format.

[P2] For S3 v2 SDK errors that surface as api error NotFound: Not Found / StatusCode: 404, the original marker list (NoSuchKey + lowercase notFound) wouldn't fire.

Gemini

[P0] String matching is brittle and case-sensitive — vulnerable to upstream formatting drift ("notFound" vs "Not Found" vs "NotFound").

[P3] Stronger alternative: errors.As(err, &googleapi.Error{}) typed-error path.

Both points addressed in follow-up commit 806642c:

| Change | Why |
| --- | --- |
| strings.ToLower(msg) + lowercase markers | Defends against case-drift |
| Added statuscode: 404 marker | Catches AWS SDK v2 HeadObject 404 wraps + future SDKs |
| Added error 404 marker | Matches the customer's exact googleapi: Error 404 wrap |
| 3 new test cases (S3 v2, case-insensitive, generic 404) | Locks in coverage |

Style note (not addressed)

  • Gemini P3: typed errors via errors.As would be more robust but means coupling SC to provider-specific error types (*googleapi.Error, AWS SDK v2 *types.NoSuchKey, Azure *azcore.ResponseError). Worth opening as a follow-up if the string fallback ever misfires; for the current regression the broadened matcher is sufficient and self-contained.

Docs

Follow-up commit 116cd6f documents the controlledValues field added in #280. The doc gap was noticed while explaining the matcher during this review pass.

Final tests

go build ./...                                       ✓
go vet ./pkg/clouds/pulumi/...                       ✓
go test ./pkg/clouds/pulumi/... -count=1             ✓
9/9 unit tests pass in TestStackCheckpointNotFound:
  nil, GCS, GCS-no-prefix, S3 v1 NoSuchKey,
  S3 v2 NotFound, Azure BlobNotFound,
  case-insensitivity guard, generic StatusCode: 404,
  permission-denied (negative)

@Cre-eD Cre-eD merged commit bc3a08f into main May 21, 2026
21 checks passed
2 participants