fix(ondemand): wait_for_rollout=false on the build worker (stop apply failing on its cold image pull) #201
Merged
The on-demand builder is a Recreate-strategy Deployment on the build node, which
has no image-prepuller, so every image change makes it cold-pull the ~28GB image
(~6min) with 0 replicas Ready in the gap. That exceeds the default
wait_for_rollout window and FAILS the apply ('1 replicas wanted; 0 replicas
Ready') even though the rollout converges seconds later. The worker isn't
user-facing (requesters fall through to in-pod builds while it's down), so don't
gate the apply on it — same as the pytorch-snapshot DaemonSet (git-cache.tf:55).
Symptom
tofu apply errors on: … but the deployment is actually healthy right after (revision converges, pod 1/1 Running).

Cause
The on-demand build worker is a Recreate-strategy Deployment pinned to the build node, which has no image-prepuller. On every image change it tears down the old pod first, then cold-pulls the ~28 GB image (measured 5m42s) with 0 replicas Ready in the gap. That exceeds terraform's default wait_for_rollout window → the apply fails, even though the rollout finishes seconds/minutes later. This will recur on every image rebuild.

Fix
wait_for_rollout = false on this resource. It's a background worker (requesters fall through to in-pod builds while it's restarting), so the apply shouldn't block or fail on it. Mirrors the existing pytorch-snapshot DaemonSet (git-cache.tf:55).

tf-only, doesn't touch the docker image hash.
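
For readers who haven't seen the setting, here is a minimal sketch of roughly what the change looks like on a `kubernetes_deployment` resource. Only `wait_for_rollout = false` and the Recreate strategy come from this PR. The resource name, labels, node selector, and the `builder_image` variable are hypothetical placeholders, not the actual contents of the repo.

```hcl
# Hypothetical variable; the real image reference lives elsewhere in the repo.
variable "builder_image" {
  type        = string
  description = "Image for the on-demand build worker (assumed name)."
}

# Hypothetical resource name and labels, for illustration only.
resource "kubernetes_deployment" "ondemand_builder" {
  # Return as soon as the Deployment object is accepted by the API server,
  # instead of waiting for the new pod to become Ready. With a cold image
  # pull of several minutes, waiting is what made the apply fail.
  wait_for_rollout = false

  metadata {
    name = "ondemand-builder" # assumed name
  }

  spec {
    replicas = 1

    # Recreate tears the old pod down before starting the new one,
    # so there are 0 Ready replicas while the image is being pulled.
    strategy {
      type = "Recreate"
    }

    selector {
      match_labels = { app = "ondemand-builder" } # assumed label
    }

    template {
      metadata {
        labels = { app = "ondemand-builder" } # assumed label
      }

      spec {
        # Assumed node label; the real pinning mechanism may differ.
        node_selector = { role = "build" }

        container {
          name  = "builder"
          image = var.builder_image
        }
      }
    }
  }
}
```

The trade-off is that a genuinely broken rollout (bad image tag, crash loop) no longer fails the apply, so it has to be caught some other way, such as `kubectl rollout status` or existing alerting. That is acceptable here because requesters fall through to in-pod builds while the worker is down.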