fix(image): pin dev/warm pods + prepuller to immutable hash tag (stop stale :latest serving + warm thrash) by wdvr · Pull Request #202 · wdvr/osdc

wdvr · 2026-06-02T06:59:19Z

Root cause (the '3 hours and codex still hangs' bug)

Pods used :latest + imagePullPolicy: IfNotPresent. Images are cached per node. After a rebuild, a node that already has an old :latest cached does not re-pull — the kubelet serves the stale image to every new pod until the prepuller finishes re-pulling the 27 GB image (~5–6 min × 24 nodes).
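The pull decision described above can be reduced to a small model. This is a sketch, not kubelet code: the `Node` class, the `dev:latest` reference, and the `sha256:old` / `sha256:new` digests are all made up for illustration. The only behavior it assumes is the one stated above, namely that `IfNotPresent` skips the pull whenever the reference is already on the node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of one node's image cache: reference -> cached digest."""
    cache: dict[str, str] = field(default_factory=dict)

    def start_pod(self, ref: str, registry: dict[str, str]) -> str:
        """Return the digest a new pod runs under IfNotPresent semantics."""
        if ref not in self.cache:
            # Reference is not on the node: pull what the registry serves now.
            self.cache[ref] = registry[ref]
        # Reference already present: no registry contact, cached digest is used.
        return self.cache[ref]


registry = {"dev:latest": "sha256:old"}
node = Node()
first = node.start_pod("dev:latest", registry)   # first pod pulls the old build

registry["dev:latest"] = "sha256:new"            # a rebuild moves the tag
second = node.start_pod("dev:latest", registry)  # cache hit: still the old build

print(first, second)
```

In this model the second pod keeps running the old digest even though the registry already serves the new one, which is the stale window described above.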

Consequences observed live this session:

A fresh cold reserve in that window got the old (broken-codex) image.
The warm rotation from #199 (fix(warm): free DNS name on recycle + rotate stale-image warm pods + unify pod name) recycled old-image pods, which instantly came back on the cached old :latest and were recycled again: a thrash loop (pods recreated every 1–2 s, none ever on the new image).
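The thrash can be illustrated with a toy rotation loop. This is a simplified sketch under stated assumptions: `creations_until_fresh`, the `dev:latest-abc123` tag, and the digests are hypothetical, and the model ignores the prepuller eventually refreshing the node cache. It only shows why recycling cannot converge while the pod reference is a cached mutable tag.

```python
from typing import Optional


def creations_until_fresh(ref: str, cache: dict[str, str], registry: dict[str, str],
                          wanted: str, limit: int = 5) -> Optional[int]:
    """Recreate a pod until it runs `wanted`.

    Returns how many pod creations it took, or None if the limit was reached
    without the pod ever running the wanted digest.
    """
    for attempt in range(1, limit + 1):
        # IfNotPresent: pull only when the reference is missing from the node.
        running = cache.setdefault(ref, registry[ref])
        if running == wanted:
            return attempt
        # Otherwise the rotation sees a stale pod and recycles it again.
    return None


# The registry already serves the new build under both references.
registry = {"dev:latest": "sha256:new", "dev:latest-abc123": "sha256:new"}
# The node still has the old build cached under the mutable tag.
cache = {"dev:latest": "sha256:old"}

mutable = creations_until_fresh("dev:latest", cache, registry, "sha256:new")
pinned = creations_until_fresh("dev:latest-abc123", cache, registry, "sha256:new")
print(mutable, pinned)
```

With the mutable tag the loop hits its limit without ever reaching the new digest; with a tag the node has never seen, the first creation pulls the new image.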

Verified: :latest digest was the new image, but 0 pods were running it; all 24 prepuller pods were still mid-pull (Init:0/1).
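A check like this one can be approximated by counting the image IDs that running containers report. The sketch below is not the procedure used in this PR; it assumes the input is the JSON pod list printed by `kubectl get pods -o json`, reads the `status.containerStatuses[].imageID` field, and uses invented sample data. `count_image_ids` is a hypothetical helper name.

```python
import json
from collections import Counter


def count_image_ids(pod_list_json: str) -> Counter:
    """Count containers per reported imageID in a JSON pod list."""
    counts: Counter = Counter()
    for pod in json.loads(pod_list_json).get("items", []):
        for status in pod.get("status", {}).get("containerStatuses", []):
            image_id = status.get("imageID")
            if image_id:  # skip containers that have not started yet
                counts[image_id] += 1
    return counts


# Invented sample: two pods on the old digest, one pod with no container status yet.
sample = json.dumps({"items": [
    {"status": {"containerStatuses": [{"imageID": "repo/dev@sha256:old"}]}},
    {"status": {"containerStatuses": [{"imageID": "repo/dev@sha256:old"}]}},
    {"status": {}},
]})

print(count_image_ids(sample))
```

Comparing the keys of the result against the digest the tag currently resolves to shows at a glance how many containers are on the new image.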

Fix

Pin dev + warm pods (GPU_DEV_CONTAINER_IMAGE) and the prepuller to the immutable hash tag latest-<context-hash> (local.full_image_uri) — the exact pattern #191 already uses for the build/ondemand jobs. Each rebuild = a tag the node has never seen, so IfNotPresent pulls the new image → guaranteed-correct, no stale window, and the warm rotation converges (a recycled pod can't come up on the old cache — it must pull the new tag). Prepuller pinned to the same tag so it pre-warms the exact ref.
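For intuition, a content-derived tag of the form latest-<context-hash> could be computed as below. This is only a sketch: how local.full_image_uri actually derives its hash is not shown in this PR, and the `context_tag` function, the SHA-256 choice, the 12-character truncation, and the sample file names are all assumptions.

```python
import hashlib


def context_tag(files: dict[str, bytes], length: int = 12) -> str:
    """Derive a 'latest-<hash>' tag from build-context paths and contents."""
    digest = hashlib.sha256()
    for path in sorted(files):  # sorted so the tag does not depend on dict order
        digest.update(path.encode())
        digest.update(b"\0")    # separator between path and content
        digest.update(files[path])
        digest.update(b"\0")    # separator between entries
    return f"latest-{digest.hexdigest()[:length]}"


before = context_tag({"Dockerfile": b"FROM base:1\n", "entrypoint.sh": b"exec bash\n"})
after = context_tag({"Dockerfile": b"FROM base:2\n", "entrypoint.sh": b"exec bash\n"})
print(before, after, before != after)
```

The property that matters for the fix is that an unchanged context yields the same tag (so restarts keep resolving to the same image), while any change to the context yields a tag no node has cached.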

Tag is immutable/stable → OOM-restart still works (the original reason :latest was used).
ami-baker/eks user-data keep :latest (boot-time layer prewarm; same digest → pod pull is manifest-only/fast). Correctness now comes from the hash tag, not the prewarm.
No docker files changed → no image rebuild on apply. This is a lambda-env + prepuller-DS change only (fast apply).

Root cause of 'new image never reaches pods' (codex stayed broken after apply) and the warm-rotation thrash: Pods used :latest + imagePullPolicy=IfNotPresent. After a rebuild, a node that already has an old :latest cached does NOT re-pull; kubelet serves the stale image to every new pod until the prepuller finishes re-pulling 27 GB (~5–6 min/node, 24 nodes). A cold reserve in that window gets the old (broken-codex) image, and the #199 warm rotation recycles old-image pods that instantly come back on the cached old :latest -> recycled again -> thrash.

Fix (the pattern #191 already uses for build jobs): pin pods to the immutable hash tag latest-<context-hash> (local.full_image_uri). Each rebuild = a tag the node has never seen, so IfNotPresent pulls the NEW image -> guaranteed-correct, no stale window, and the warm rotation converges (the recycled pod can't come up on the old cache; it pulls the new tag). Prepuller pinned to the same tag so it pre-warms the exact ref.

Tag is immutable/stable, so OOM-restart still works. ami-baker/eks user-data keep :latest (boot-time LAYER prewarm; same digest, fast manifest-only pod pull). No docker files changed -> no image rebuild on apply; this is a lambda-env + prepuller-DS change only.

wdvr merged commit 3abb081 into main Jun 2, 2026
3 checks passed

wdvr deleted the fix/pin-pods-to-hash-tag branch June 2, 2026 07:00

wdvr merged 1 commit into main from fix/pin-pods-to-hash-tag
