fix(image): pin dev/warm pods + prepuller to immutable hash tag (stop stale :latest serving + warm thrash)#202

Merged: wdvr merged 1 commit into main from fix/pin-pods-to-hash-tag on Jun 2, 2026

@wdvr (Owner) commented Jun 2, 2026

Root cause (the '3 hours and codex still hangs' bug)

Pods used :latest + imagePullPolicy: IfNotPresent. Images are cached per node. After a rebuild, a node that already has an old :latest cached does not re-pull: the kubelet serves the stale image to every new pod on that node until the prepuller finishes re-pulling the 27 GB image (~5–6 min per node, across 24 nodes).

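Consequences observed live this session:

  • A cold reserve during that window got the old (broken-codex) image, so codex stayed broken after apply.
  • The #199 warm rotation recycled old-image pods, which instantly came back on the cached old :latest and were recycled again (thrash).

Verified: the :latest digest was the new image, but 0 pods were running it; all 24 prepuller pods were still mid-pull (Init:0/1).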
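The stale-cache behaviour and the rotation thrash can be reproduced in a toy model. This is a minimal sketch, not the kubelet's implementation: the `Node` class, the `recycle_until_current` helper, the `gpu-dev` repository name, the `abc123` hash, and the `sha256:old`/`sha256:new` digests are all hypothetical, and the model ignores the prepuller eventually refreshing the cache.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node with a local image cache mapping image ref -> digest (hypothetical model)."""
    cache: dict = field(default_factory=dict)

    def start_pod(self, ref: str, registry: dict, pull_policy: str = "IfNotPresent") -> str:
        """Return the digest the new pod actually runs."""
        if pull_policy == "Always" or ref not in self.cache:
            # Pull: resolve the ref against the registry and cache the result.
            self.cache[ref] = registry[ref]
        # IfNotPresent with a cached ref: no registry lookup, the cached image wins.
        return self.cache[ref]


def recycle_until_current(node, ref, registry, wanted, max_rounds=5):
    """Model a warm rotation: recycle the pod until it runs `wanted`.

    Returns the number of pod starts needed, or None if it never converges.
    """
    for rounds in range(1, max_rounds + 1):
        if node.start_pod(ref, registry) == wanted:
            return rounds
    return None


# Registry after a rebuild: :latest moved to the new digest and a new hash tag exists.
registry = {
    "gpu-dev:latest": "sha256:new",
    "gpu-dev:latest-abc123": "sha256:new",  # hypothetical context hash
}

# This node pulled :latest before the rebuild, so its cache holds the old digest.
stale_node = Node(cache={"gpu-dev:latest": "sha256:old"})
stale_digest = stale_node.start_pod("gpu-dev:latest", registry)
pinned_digest = stale_node.start_pod("gpu-dev:latest-abc123", registry)

# Rotation on fresh copies of the same stale node, once per tag style.
thrash = recycle_until_current(
    Node(cache={"gpu-dev:latest": "sha256:old"}), "gpu-dev:latest", registry, "sha256:new")
converged = recycle_until_current(
    Node(cache={"gpu-dev:latest": "sha256:old"}), "gpu-dev:latest-abc123", registry, "sha256:new")

print(stale_digest, pinned_digest, thrash, converged)
# sha256:old sha256:new None 1
```

In this model the mutable tag never converges, because every recycled pod resolves against the cache rather than the registry, while a tag the node has never seen forces a pull on the first start.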

Fix

Pin the dev and warm pods (GPU_DEV_CONTAINER_IMAGE) and the prepuller to the immutable hash tag latest-<context-hash> (local.full_image_uri), the same pattern #191 already uses for the build/ondemand jobs. Each rebuild produces a tag the node has never seen, so IfNotPresent has to pull the new image. That guarantees the correct image with no stale window, and the warm rotation converges: a recycled pod cannot come up on the old cache, because it must pull the new tag. The prepuller is pinned to the same tag so that it pre-warms the exact ref the pods use.

  • The tag is immutable and stable → OOM-restart still works (a stable tag was the original reason :latest was used).
  • ami-baker/eks user-data keep :latest (boot-time layer prewarm; same digest → the later pod pull is manifest-only and fast). Correctness now comes from the hash tag, not the prewarm.
  • No docker files changed → no image rebuild on apply. This is a lambda-env + prepuller-DS change only (fast apply).
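To illustrate why a content-derived tag changes exactly when the image inputs change, here is a minimal sketch of one way a latest-<context-hash> tag could be computed. The PR does not show how local.full_image_uri derives its hash, so the `context_hash` and `full_image_uri` functions, the 12-character truncation, the file names, and the `example.registry/gpu-dev` repository are all assumptions made for illustration.

```python
import hashlib


def context_hash(files: dict[str, bytes], length: int = 12) -> str:
    """Hash build-context contents (path -> bytes) in a stable order.

    Assumption: the real hash may cover different inputs or use another scheme.
    """
    h = hashlib.sha256()
    for path in sorted(files):  # sort so dict ordering never changes the hash
        data = files[path]
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
        h.update(data)
    return h.hexdigest()[:length]


def full_image_uri(repo: str, files: dict[str, bytes]) -> str:
    """Build a repo:latest-<context-hash> reference (hypothetical helper)."""
    return f"{repo}:latest-{context_hash(files)}"


context = {
    "Dockerfile": b"FROM base:1\n",
    "entrypoint.sh": b"#!/bin/sh\nexec \"$@\"\n",
}

before = full_image_uri("example.registry/gpu-dev", context)
same = full_image_uri("example.registry/gpu-dev", dict(context))
after = full_image_uri(
    "example.registry/gpu-dev", {**context, "Dockerfile": b"FROM base:2\n"})

print(before == same, before != after)
# True True
```

Under these assumptions, an apply that touches no docker files yields the same tag (no rebuild, nothing new to pull), while any change to the build context yields a tag no node has cached.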

Root cause of 'new image never reaches pods' (codex stayed broken after apply) and
the warm-rotation thrash:

Pods used :latest + imagePullPolicy=IfNotPresent. After a rebuild, a node that
already has an old :latest cached does NOT re-pull — kubelet serves the stale
image to every new pod until the prepuller finishes re-pulling 27GB (~5-6min/node,
24 nodes). A cold reserve in that window gets the old (broken-codex) image; and
the #199 warm rotation recycles old-image pods that instantly come back on the
cached old :latest -> recycled again -> thrash.

Fix (the pattern #191 already uses for build jobs): pin pods to the immutable
hash tag latest-<context-hash> (local.full_image_uri). Each rebuild = a tag the
node has never seen, so IfNotPresent pulls the NEW image -> guaranteed-correct, no
stale window, and the warm rotation converges (the recycled pod can't come up on
the old cache; it pulls the new tag). Prepuller pinned to the same tag so it
pre-warms the exact ref. Tag is immutable/stable, so OOM-restart still works.

ami-baker/eks user-data keep :latest (boot-time LAYER prewarm; same digest, fast
manifest-only pod pull). No docker files changed -> no image rebuild on apply;
this is a lambda-env + prepuller-DS change only.
@wdvr wdvr merged commit 3abb081 into main Jun 2, 2026
3 checks passed
@wdvr wdvr deleted the fix/pin-pods-to-hash-tag branch June 2, 2026 07:00