Fix CI false positives from Docker Hub outage: add registry-mirror fallback to buildx boot (issue #100) #101
Adding .gitkeep for PR creation (default mode). This file will be removed when the task is complete. Issue: #100
A full Docker Hub registry outage (registry-1.docker.io unreachable for ~2.5 min) made setup-buildx-resilient's pre-pull retries exhaust and the docker-container driver boot's own moby/buildkit pull fail, taking down build-languages-amd64 (java) and cascading 'box*:<ver>-amd64: not found' into every dependent dind build (build-dind java + full, amd64 + arm64). The pre-pull now falls back to a pull-through registry mirror (mirror.gcr.io, on independent infrastructure) and re-tags the image to its canonical reference so the boot reuses the local copy and never touches the failing registry. Adds an opt-in verbose/RUNNER_DEBUG trace and a unit test (experiments/test-issue100-buildx-mirror-fallback.sh) that extracts the real pre-pull script and exercises canonical-ok / mirror-recovery / both-down.
Deep case study for the run 27314587149 failure: timeline, requirement list, root-cause analysis (Docker Hub registry outage during buildx boot + the 3-job cascade), the mirror-fallback fix, prior-art survey, and template comparison. Records the two upstream reports filed against the rust and js pipeline templates (which share the unhardened buildx boot). Adds a patch changeset to trigger a release run that validates the fix end-to-end.
Raw job logs are git-ignored (*.log); this commits the decisive excerpts (root cause + the three cascade 'not found' failures) with source job URLs and a re-download command, matching the repo's curated case-study convention.
Working session summary
All work is complete. Here's the summary.
What was wrong (issue #100)
Release run 27314587149 had 4 red jobs, all from a single root cause:
The existing
The fix (PR #101)
Deliverables addressing every issue requirement
Status
PR #101 is updated, marked ready, working tree clean, main already merged. CI: 3 pass / 24 skip, 0 failures (the build matrix correctly skips for a CI/docs-only PR; the fix executes on the post-merge release run that the changeset triggers).
This summary was automatically extracted from the AI working session output.
🤖 Solution Draft Log
This log file contains the complete execution trace of the AI solution draft process.
💰 Cost estimation:
📊 Context and tokens usage:
Total: (4.1K new + 165.3K cache writes + 4.6M cache reads) input tokens, 47.0K output tokens, $4.543868 cost
🤖 Models used:
📎 Log file uploaded as Gist (1848KB)
The working session has now ended; feel free to review and add any feedback on the solution draft.
🎉 Auto-merged
This pull request has been automatically merged by hive-mind with the --auto-merge flag.
Summary
Fixes #100. The release run 27314587149 had four red jobs, all tracing to a single root cause: Docker Hub's registry (`registry-1.docker.io`) was unreachable from one amd64 runner for ~2.5 minutes, so booting the buildx `docker-container` driver could not pull `moby/buildkit:buildx-stable-1`. That one failure cascaded:
- `build-languages-amd64 (java)`: the root failure (the buildx boot could not pull the BuildKit image)
- `build-dind-amd64 (java)`: `box-java:2.3.0-amd64: not found` (java's push never happened)
- `build-dind-amd64 (full)`: `box:2.3.0-amd64: not found` (`docker-build-push` skipped because the java build failed)
- `build-dind-arm64 (full)`: `box:2.3.0-arm64: not found` (same skip chain)
Three of the four reds are downstream symptoms, the "false positives" the issue refers to. Eliminating the root cause removes all four together.
Root cause
Issue #97 already added `setup-buildx-resilient`, which pre-pulls the BuildKit image with retries so the boot reuses a cached copy. But both the pre-pull and the boot pull only ever talk to Docker Hub, so a full (not intermittent) outage longer than the retry budget defeats it. Retrying the same unreachable host harder is the wrong lever.
Fix
`setup-buildx-resilient` now seeds the BuildKit image from a pull-through registry mirror on independent infrastructure when Docker Hub is down: it pulls `mirror.gcr.io/moby/buildkit:buildx-stable-1` (Google's public Docker Hub mirror) and uses `docker tag` to re-tag it to the canonical reference, so the boot finds the image locally and never touches the failing registry.
Plus: an opt-in `verbose` input and automatic tracing under `RUNNER_DEBUG=1` (requirement: add debug output for next-iteration diagnosis), and a tunable retry budget for fast tests.
Codebase-wide: all 12 `Set up Docker Buildx` steps in `release.yml` route through this one composite action, so every build job (JS, essentials, every language, the full `box` image, both arch push jobs, and every dind variant) is hardened by the single edit. No other workflow boots buildx.
Tests
`experiments/test-issue100-buildx-mirror-fallback.sh` extracts the real pre-pull script out of `action.yml` and drives it with a mock `docker` across three cases: canonical-ok, mirror-recovery, and both-down.
Case study & template comparison (issue requirements)
`docs/case-studies/issue-100/CASE-STUDY.md` (+ raw failed-job logs under `ci-logs/`, original issue, and `template-reports.md`).
Release
Patch changeset added (`.changeset/issue-100-buildx-mirror-fallback.md`) so the next release run validates the fix end-to-end.
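For reviewers who want the shape of the fallback without opening `action.yml`, the following is a minimal sketch of the logic described under Fix, not the script shipped in this PR. The function names, the variable names (`CANONICAL_IMAGE`, `MIRROR_IMAGE`, `PULL_ATTEMPTS`, `PULL_DELAY_SECONDS`, `VERBOSE`), the `source=...` status lines, and the choice to retry the mirror pull and to return non-zero when both registries are down are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: every name below is hypothetical, not taken from action.yml.

CANONICAL_IMAGE="${CANONICAL_IMAGE:-moby/buildkit:buildx-stable-1}"
MIRROR_IMAGE="${MIRROR_IMAGE:-mirror.gcr.io/moby/buildkit:buildx-stable-1}"
PULL_ATTEMPTS="${PULL_ATTEMPTS:-3}"              # retry budget, tunable for fast tests
PULL_DELAY_SECONDS="${PULL_DELAY_SECONDS:-5}"    # pause between attempts

# Turn on shell tracing when the (assumed) verbose flag is set or when
# GitHub Actions debug logging is enabled (RUNNER_DEBUG=1).
enable_trace_if_requested() {
  if [ "${VERBOSE:-false}" = "true" ] || [ "${RUNNER_DEBUG:-0}" = "1" ]; then
    set -x
  fi
}

# Pull one image, retrying up to PULL_ATTEMPTS times.
pull_with_retries() {
  local image="$1"
  local attempt=1
  while [ "$attempt" -le "$PULL_ATTEMPTS" ]; do
    if docker pull "$image"; then
      return 0
    fi
    echo "pull of $image failed (attempt $attempt/$PULL_ATTEMPTS)" >&2
    attempt=$((attempt + 1))
    if [ "$attempt" -le "$PULL_ATTEMPTS" ]; then
      sleep "$PULL_DELAY_SECONDS"
    fi
  done
  return 1
}

# Try the canonical reference first; if that is exhausted, pull from the
# mirror and re-tag to the canonical reference so a later buildx boot can
# reuse the local copy instead of contacting the failing registry.
prepull_buildkit() {
  if pull_with_retries "$CANONICAL_IMAGE"; then
    echo "source=canonical"
    return 0
  fi
  echo "canonical registry unreachable; trying mirror" >&2
  if pull_with_retries "$MIRROR_IMAGE" \
    && docker tag "$MIRROR_IMAGE" "$CANONICAL_IMAGE"; then
    echo "source=mirror"
    return 0
  fi
  echo "source=none"
  return 1
}
```

Because `docker` is resolved at call time, a test can define a shell function named `docker` before calling `prepull_buildkit` and steer the canonical-ok, mirror-recovery, and both-down paths without network access; setting `PULL_DELAY_SECONDS=0` keeps such a run fast.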