fix(warm): free DNS name on recycle + rotate stale-image warm pods + unify pod name #199
Merged
…unify pod name

Root cause of '-2' suffixes on fresh reservations: a recycled warm pod deleted the pod + service but never deleted its placeholder SSH domain mapping, written with a 7-day expiry and reservation_id='warm'. generate_unique_name() counts every non-expired mapping as taken, so the random adjective_animal picker was colliding against a junk pile of orphans (prod: 449 non-expired mappings, ALL 'warm', vs only ~24 live warm pods).

- _recycle_warm_pod now delete_domain_mapping(warm-domain) so the name frees immediately; placeholder expiry shortened from 7d to WARM_POD_MAX_AGE_HOURS+2h so any missed orphan self-cleans via DynamoDB TTL within hours, not days.
- reconcile_warm_pool rotates idle warm pods off a stale image after a rebuild: compares each ready pod's running digest to the :latest digest in ECR and recycles at most ONE per type per tick (gradual; never touches claimed pods). Needs ecr:DescribeImages on the processor role (added in lambda.tf).
- warm claim stamps GPU_DEV_HOSTLABEL=gpu-dev-<resid8> into the shell-ext files; the image prompt (zshrc/bashrc) prefers it over %m/\h so a warm-claimed pod's prompt shows the same handle you connect with (== the SSH alias), instead of the warm-pool hostname. Cold pods leave it unset (hostname already matches).

Tests: +18 unit (recycle mapping cleanup, digest helpers, rotation cap/guards, short placeholder expiry, HOSTLABEL stamp). Full suite 1166 passed.
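To illustrate the root cause described in this commit message, here is a minimal sketch of how a leaked placeholder mapping forces a `-2` suffix, and how the shortened expiry bounds the damage. It is not the project's code: `pick_name_sketch`, `placeholder_expiry`, the `domain`/`expires_at` field names, the suffixing loop, and the value 12 used for `WARM_POD_MAX_AGE_HOURS` in the usage lines are all assumptions made for illustration.

```python
HOUR = 3600  # seconds


def pick_name_sketch(base, mappings, now):
    """Return `base`, or `base-2`, `base-3`, ... if non-expired mappings hold it.

    Hypothetical stand-in for the behaviour described above: every mapping
    whose expiry is still in the future counts as taken, whether or not a
    live pod is behind it.
    """
    taken = {m["domain"] for m in mappings if m["expires_at"] > now}
    if base not in taken:
        return base
    n = 2
    while f"{base}-{n}" in taken:
        n += 1
    return f"{base}-{n}"


def placeholder_expiry(now, warm_pod_max_age_hours):
    """Expiry for a warm placeholder: max pod age plus a 2h margin."""
    return now + (warm_pod_max_age_hours + 2) * HOUR


now = 1_000_000
# An orphan left behind by a recycled warm pod, written with the old 7-day expiry.
orphan = {"domain": "bright_fox", "reservation_id": "warm",
          "expires_at": now + 7 * 24 * HOUR}

print(pick_name_sketch("bright_fox", [orphan], now))  # collides with the orphan
print(pick_name_sketch("bright_fox", [], now))        # free once the mapping is deleted
print(placeholder_expiry(now, 12) - now)              # seconds until TTL cleanup (assumed max age 12h)
```

Deleting the mapping on recycle removes the collision immediately; the shorter expiry only limits how long a missed delete can hold a name.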
wdvr added a commit that referenced this pull request Jun 2, 2026
…202)

Root cause of 'new image never reaches pods' (codex stayed broken after apply) and the warm-rotation thrash:

Pods used :latest + imagePullPolicy=IfNotPresent. After a rebuild, a node that already has an old :latest cached does NOT re-pull: kubelet serves the stale image to every new pod until the prepuller finishes re-pulling 27GB (~5-6min/node, 24 nodes). A cold reserve in that window gets the old (broken-codex) image; and the #199 warm rotation recycles old-image pods that instantly come back on the cached old :latest -> recycled again -> thrash.

Fix (the pattern #191 already uses for build jobs): pin pods to the immutable hash tag latest-<context-hash> (local.full_image_uri). Each rebuild = a tag the node has never seen, so IfNotPresent pulls the NEW image -> guaranteed-correct, no stale window, and the warm rotation converges (the recycled pod can't come up on the old cache; it pulls the new tag). Prepuller pinned to the same tag so it pre-warms the exact ref. Tag is immutable/stable, so OOM-restart still works.

ami-baker/eks user-data keep :latest (boot-time LAYER prewarm; same digest, fast manifest-only pod pull).

No docker files changed -> no image rebuild on apply; this is a lambda-env + prepuller-DS change only.
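The caching behaviour this commit message relies on can be shown with a deliberately simplified model. The sketch below treats a node's image cache as a set of image references and applies the documented meaning of the three Kubernetes pull policies; it is not how kubelet is implemented. The function name `would_pull`, the repository name `gpu-dev`, and the hash `3f9a1c` are hypothetical.

```python
def would_pull(node_cache, image_ref, pull_policy="IfNotPresent"):
    """Simplified model: does the node pull from the registry before starting a pod?

    node_cache is the set of image references already present on the node.
    """
    if pull_policy == "Always":
        return True
    if pull_policy == "Never":
        return False
    # IfNotPresent: pull only when the reference is not already on the node.
    return image_ref not in node_cache


# A node that cached :latest before the rebuild.
node_cache = {"gpu-dev:latest"}

# Mutable tag: the reference is present, so the stale image is reused.
print(would_pull(node_cache, "gpu-dev:latest"))

# Immutable per-build tag: the node has never seen it, so it pulls the new image.
print(would_pull(node_cache, "gpu-dev:latest-3f9a1c"))
```

Under this model a recycled warm pod pinned to the new tag cannot start on the cached old image, which is why the rotation converges instead of thrashing.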
Why
Diagnosing why fresh reservations kept getting `-2` suffixes (e.g. `bright_fox-2.devservers.io`). It's not the birthday paradox over the ~13.7k name pool; it's a leak.

A recycled warm pod (`_recycle_warm_pod`) deleted the pod + SSH service but never deleted its placeholder domain mapping, which is written with a 7-day expiry and `reservation_id="warm"`. `generate_unique_name()` counts every non-expired mapping as taken, so the picker collides against a growing junk pile.

Prod data at time of diagnosis: 449 non-expired mappings, every single one `reservation_id="warm"`, vs only ~24 live warm pods. ~425 orphans, accumulating in 30–65-count bursts every ~12h (the warm-pool recycle cycles from image rebuilds), each holding a name for 7 days.

What

- `_recycle_warm_pod` now calls `delete_domain_mapping(warm-domain)`. Placeholder expiry shortened 7d → `WARM_POD_MAX_AGE_HOURS + 2h`, so any orphan a missed-delete leaves self-cleans via DynamoDB TTL within hours, not days.
- `reconcile_warm_pool` compares each ready pod's running digest to the `:latest` digest in ECR and recycles at most one per type per tick (gradual; the deficit backfill recreates it on the fresh image). Never touches `warm-state=claimed` pods. Needs `ecr:DescribeImages` on the processor role (added).
- Warm claim stamps `GPU_DEV_HOSTLABEL=gpu-dev-<resid8>` into the shell-ext files; the image prompt prefers it over `%m`/`\h`, so a warm-claimed pod's prompt shows the same handle you connect with (== the SSH alias from #185, "fix(cli): SSH alias keys off reservation id (warm pods reachable by resid)") instead of the warm-pool hostname `gpu-dev-b200-<hex>`. Cold pods leave it unset (their hostname already matches).

Tests
+18 unit tests (recycle mapping cleanup + error swallow, digest parse/resolve helpers, rotation: rotates one / caps one-per-tick / skips on digest match / skips on unknown digest / never claimed, short placeholder expiry, HOSTLABEL stamp). Full suite: 1166 passed.
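The rotation guards exercised by these tests (rotate one, cap one per tick, skip on digest match, skip on unknown digest, never touch claimed pods) can be sketched as a pure selection function. This is an illustration under assumptions, not the code in `reconcile_warm_pool`: the `WarmPod` tuple, its field names, the `"ready"` state value, and the `h100` type are hypothetical; `claimed` and `b200` come from the description above.

```python
from collections import namedtuple

# Hypothetical view of a warm pod; the real pods carry this in labels/annotations.
WarmPod = namedtuple("WarmPod", "name gpu_type state digest")


def pods_to_rotate(pods, latest_digest):
    """Pick at most one ready, stale-image pod per GPU type for this tick."""
    if not latest_digest:
        return []  # registry digest unknown: do nothing rather than guess
    chosen = {}
    for pod in pods:
        if pod.state != "ready":
            continue  # claimed pods are never touched
        if not pod.digest or pod.digest == latest_digest:
            continue  # unknown running digest, or already on the fresh image
        chosen.setdefault(pod.gpu_type, pod.name)  # keep only the first per type
    return list(chosen.values())


pods = [
    WarmPod("warm-a", "b200", "ready", "sha256:old"),
    WarmPod("warm-b", "b200", "ready", "sha256:old"),    # same type: waits for a later tick
    WarmPod("warm-c", "h100", "claimed", "sha256:old"),  # claimed: skipped
    WarmPod("warm-d", "h100", "ready", "sha256:new"),    # digest matches: skipped
]
print(pods_to_rotate(pods, "sha256:new"))
```

Capping at one pod per type per tick keeps most of the pool available while the deficit backfill recreates each recycled pod on the fresh image.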
Deploy notes
- `tofu apply` (prod) for the lambda + IAM. The image prompt change needs the image rebuild (rides the pending #198 rebuild, "feat(image): Codex CLI on GPT-5.5 via Bedrock (no per-user key)").
- … `warm` mappings not referenced by a live warm pod's `warm-domain` annotation.
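The set named in the last deploy note, `warm` mappings not referenced by a live warm pod's `warm-domain` annotation, amounts to a set difference. A minimal sketch of identifying it follows, assuming mappings and pod annotations have already been fetched into plain dicts; the function name and the `domain` field are hypothetical, and the sketch only lists candidates without deleting anything.

```python
def orphaned_warm_mappings(mappings, live_pod_annotations):
    """Return domains of 'warm' mappings that no live warm pod references."""
    live_domains = {
        ann["warm-domain"] for ann in live_pod_annotations if "warm-domain" in ann
    }
    return [
        m["domain"]
        for m in mappings
        if m["reservation_id"] == "warm" and m["domain"] not in live_domains
    ]


mappings = [
    {"domain": "bright_fox", "reservation_id": "warm"},  # still backs a live warm pod
    {"domain": "calm_otter", "reservation_id": "warm"},  # orphan
    {"domain": "swift_lynx", "reservation_id": "abc12345"},  # real reservation: ignored
]
live = [{"warm-domain": "bright_fox"}, {}]
print(orphaned_warm_mappings(mappings, live))
```

Filtering on `reservation_id == "warm"` first ensures mappings that belong to real reservations are never considered.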