fix(ondemand): wait_for_rollout=false on the build worker (stop apply failing on its cold image pull) by wdvr · Pull Request #201 · wdvr/osdc

wdvr · 2026-06-02T06:15:16Z

Symptom

tofu apply errors on:

Error: Waiting for rollout to finish: 1 replicas wanted; 0 replicas Ready
  with kubernetes_deployment_v1.pytorch_ondemand

…but the deployment is actually healthy right after (revision converges, pod 1/1 Running).

Cause

The on-demand build worker is a Recreate-strategy Deployment pinned to the build node, which has no image-prepuller. On every image change it tears down the old pod first, then cold-pulls the ~28 GB image (measured at 5m42s), leaving 0 replicas Ready in the gap. That exceeds terraform's default wait_for_rollout window, so the apply fails even though the rollout finishes seconds to minutes later. This will recur on every image rebuild.

Fix

wait_for_rollout = false on this resource. It's a background worker (requesters fall through to in-pod builds while it's restarting), so the apply shouldn't block or fail on it. Mirrors the existing pytorch-snapshot DaemonSet (git-cache.tf:55).
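As a rough illustration of the change described above, a minimal sketch of the resource might look like the following. Only the resource address (kubernetes_deployment_v1.pytorch_ondemand), the Recreate strategy, the single replica, and wait_for_rollout = false come from this PR; the metadata name, labels, container name, and image reference are hypothetical placeholders, and the real resource carries more configuration (node pinning, volumes, and so on) that is not reproduced here.

```hcl
# Sketch only: every value marked "hypothetical" is a placeholder, not the
# repository's actual configuration.
resource "kubernetes_deployment_v1" "pytorch_ondemand" {
  # Return as soon as the Deployment object is accepted by the API server,
  # instead of blocking (and possibly failing) while the pod cold-pulls
  # its image.
  wait_for_rollout = false

  metadata {
    name = "pytorch-ondemand" # hypothetical
  }

  spec {
    replicas = 1

    strategy {
      # The old pod is torn down before the new one starts, so there are
      # 0 Ready replicas for the duration of the image pull.
      type = "Recreate"
    }

    selector {
      match_labels = {
        app = "pytorch-ondemand" # hypothetical
      }
    }

    template {
      metadata {
        labels = {
          app = "pytorch-ondemand" # hypothetical
        }
      }

      spec {
        container {
          name  = "builder"                      # hypothetical
          image = "registry.example/builder:tag" # hypothetical
        }
      }
    }
  }
}
```

With this setting the apply no longer reports on rollout health for this resource, so convergence has to be confirmed separately (for example by checking that the pod reaches 1/1 Running).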

Terraform-only change; it doesn't touch the docker image hash.

The on-demand builder is a Recreate-strategy Deployment on the build node, which has no image-prepuller, so every image change makes it cold-pull the ~28GB image (~6min) with 0 replicas Ready in the gap. That exceeds the default wait_for_rollout window and FAILS the apply ('1 replicas wanted; 0 replicas Ready') even though the rollout converges seconds later. The worker isn't user-facing (requesters fall through to in-pod builds while it's down), so don't gate the apply on it — same as the pytorch-snapshot DaemonSet (git-cache.tf:55).

wdvr merged commit 8104304 into main Jun 2, 2026
3 checks passed

wdvr deleted the fix/ondemand-wait-for-rollout branch June 2, 2026 06:16
