
Fix CI false positives from Docker Hub outage: add registry-mirror fallback to buildx boot (issue #100) #101

Merged
konard merged 4 commits into main from issue-100-45581ad55388 on Jun 11, 2026


@konard konard commented Jun 11, 2026

Member

Summary

Fixes #100. The release run 27314587149 had four red jobs, all tracing to a single root cause: Docker Hub's registry (registry-1.docker.io) was unreachable from one amd64 runner for ~2.5 minutes, so booting the buildx docker-container driver could not pull moby/buildkit:buildx-stable-1. That one failure cascaded:

| Job | Why it was red |
| --- | --- |
| build-languages-amd64 (java) | root cause: buildx boot pull timed out against Docker Hub |
| build-dind-amd64 (java) | cascade: box-java:2.3.0-amd64: not found (java's push never happened) |
| build-dind-amd64 (full) | cascade: box:2.3.0-amd64: not found (docker-build-push skipped because the java build failed) |
| build-dind-arm64 (full) | cascade: box:2.3.0-arm64: not found (same skip chain) |

Three of the four reds are downstream symptoms — the "false positives" the issue refers to. Eliminating the root cause removes all four together.

Root cause

Issue #97 already added setup-buildx-resilient, which pre-pulls the BuildKit image with retries so the boot reuses a cached copy. But both the pre-pull and the boot pull only ever talk to Docker Hub, so a full (not intermittent) outage longer than the retry budget defeats it. Retrying the same unreachable host harder is the wrong lever.

Fix

setup-buildx-resilient now seeds the BuildKit image from a pull-through registry mirror on independent infrastructure when Docker Hub is down:

  1. Pull from Docker Hub with the existing 5× exponential-backoff retries (common blip case).
  2. New: if those retries are exhausted, pull mirror.gcr.io/moby/buildkit:buildx-stable-1 (Google's public Docker Hub mirror) and docker tag it back to the canonical reference, so the boot finds the image locally and never touches the failing registry.
  3. If even the mirror fails, fall through to the prior behaviour (no worse than before).
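
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script in action.yml: the function names, the RETRY_MAX and RETRY_BASE_DELAY variables, and the 2-second base delay are hypothetical. Only the image references, the 5 attempts, and the docker pull / docker tag commands come from the description.

```shell
# Sketch only. Hypothetical names: pull_with_retries, prepull_with_fallback,
# RETRY_MAX, RETRY_BASE_DELAY. The 2s base delay is an assumed value.
CANONICAL_IMAGE="moby/buildkit:buildx-stable-1"
MIRROR_IMAGE="mirror.gcr.io/moby/buildkit:buildx-stable-1"

# Pull image $1 up to RETRY_MAX times, doubling the delay after each failure.
pull_with_retries() {
  local image="$1"
  local max="${RETRY_MAX:-5}"
  local delay="${RETRY_BASE_DELAY:-2}"
  local attempt=1
  while [ "$attempt" -le "$max" ]; do
    # docker output goes to stderr so stdout carries only the result line.
    if docker pull "$image" >&2; then
      return 0
    fi
    echo "pull of $image failed (attempt $attempt/$max)" >&2
    if [ "$attempt" -lt "$max" ]; then
      sleep "$delay"
      delay=$((delay * 2))
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

prepull_with_fallback() {
  # Step 1: canonical registry, with retries (covers the common blip).
  if pull_with_retries "$CANONICAL_IMAGE"; then
    echo "source=canonical"
    return 0
  fi
  # Step 2: pull from the mirror and re-tag to the canonical reference,
  # so a later boot can find the image in the local store.
  if docker pull "$MIRROR_IMAGE" >&2 &&
     docker tag "$MIRROR_IMAGE" "$CANONICAL_IMAGE" >&2; then
    echo "source=mirror"
    return 0
  fi
  # Step 3: both sources failed. Return success anyway so the boot still
  # attempts its own pull, which is the behaviour before this change.
  echo "source=none"
  return 0
}
```

Calling prepull_with_fallback before the buildx boot prints one line naming where the image came from. Returning 0 in the both-down case is a design choice in this sketch: it keeps the pre-pull from becoming a new failure point.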

Plus: an opt-in verbose input + automatic tracing under RUNNER_DEBUG=1 (requirement: add debug output for next-iteration diagnosis), and a tunable retry budget for fast tests.
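
One way the verbose switch could be wired is sketched below. RUNNER_DEBUG=1 is what GitHub Actions sets when debug logging is enabled; the VERBOSE variable (standing in for the action's verbose input) and both function names are assumptions made for illustration.

```shell
# Sketch only. VERBOSE is a hypothetical env var carrying the verbose input;
# trace_enabled and debug are hypothetical helper names.

# Succeeds when tracing was requested explicitly or runner debug is on.
trace_enabled() {
  [ "${VERBOSE:-false}" = "true" ] || [ "${RUNNER_DEBUG:-0}" = "1" ]
}

# Print a diagnostic line to stderr only when tracing is enabled.
debug() {
  if trace_enabled; then
    echo "[debug] $*" >&2
  fi
}
```

A script using these helpers might call debug before and after each pull attempt, so that a re-run with debug logging shows which registry was tried and in what order.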

Codebase-wide: all 12 Set up Docker Buildx steps in release.yml route through this one composite action, so every build job — JS, essentials, every language, the full box image, both arch push jobs, and every dind variant — is hardened by the single edit. No other workflow boots buildx.

Tests

experiments/test-issue100-buildx-mirror-fallback.sh extracts the real pre-pull script out of action.yml and drives it with a mock docker:

PASS=15 FAIL=0
All issue #100 buildx mirror-fallback checks passed.
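
For readers unfamiliar with the technique, a mock docker of the kind such a test relies on might look like the sketch below. It is not the repository's test: make_mock_docker, MOCK_LOG and MOCK_FAIL_PATTERN are hypothetical names, and how the real test extracts the script from action.yml is not shown here.

```shell
# Sketch only. make_mock_docker, MOCK_LOG and MOCK_FAIL_PATTERN are
# hypothetical names, not taken from the repository's test.

# Write DIR/docker: a stand-in that records every call and fails any
# "pull" whose image matches the glob in MOCK_FAIL_PATTERN.
make_mock_docker() {
  local dir="$1"
  mkdir -p "$dir"
  cat > "$dir/docker" <<'EOF'
#!/bin/sh
echo "$*" >> "${MOCK_LOG:-/dev/null}"
if [ "$1" = "pull" ] && [ -n "${MOCK_FAIL_PATTERN:-}" ]; then
  case "$2" in
    $MOCK_FAIL_PATTERN) exit 1 ;;
  esac
fi
exit 0
EOF
  chmod +x "$dir/docker"
}
```

Putting the directory first on PATH and setting MOCK_FAIL_PATTERN='moby/*' would simulate Docker Hub being down while the mirror still answers; a pattern of '*' would simulate both being down. The call log can then be inspected to check which pulls and tags were attempted.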

Case study & template comparison (issue requirements)

  • Deep case study with timeline, full requirement list, root-cause analysis, prior-art survey and verification: docs/case-studies/issue-100/CASE-STUDY.md (+ raw failed-job logs under ci-logs/, original issue, and template-reports.md).
  • Compared all four pipeline templates. The rust and js templates boot buildx with no mirror fallback and share the defect; python and csharp don't use buildx. Reported upstream with reproducible example, workaround and suggested fix:
    • rust-…-template#69
    • js-…-template#75

Release

Patch changeset added (.changeset/issue-100-buildx-mirror-fallback.md) so the next release run validates the fix end-to-end.

Adding .gitkeep for PR creation (default mode).
This file will be removed when the task is complete.

Issue: #100
@konard konard self-assigned this Jun 11, 2026
konard added 2 commits June 11, 2026 07:17
A full Docker Hub registry outage (registry-1.docker.io unreachable for
~2.5 min) made setup-buildx-resilient's pre-pull retries exhaust and the
docker-container driver boot's own moby/buildkit pull fail, taking down
build-languages-amd64 (java) and cascading 'box*:<ver>-amd64: not found'
into every dependent dind build (build-dind java + full, amd64 + arm64).

The pre-pull now falls back to a pull-through registry mirror
(mirror.gcr.io, on independent infrastructure) and re-tags the image to its
canonical reference so the boot reuses the local copy and never touches the
failing registry. Adds an opt-in verbose/RUNNER_DEBUG trace and a unit test
(experiments/test-issue100-buildx-mirror-fallback.sh) that extracts the real
pre-pull script and exercises canonical-ok / mirror-recovery / both-down.

Deep case study for the run 27314587149 failure: timeline, requirement list,
root-cause analysis (Docker Hub registry outage during buildx boot + the
3-job cascade), the mirror-fallback fix, prior-art survey, and template
comparison. Records the two upstream reports filed against the rust and js
pipeline templates (which share the unhardened buildx boot). Adds a patch
changeset to trigger a release run that validates the fix end-to-end.
@konard konard changed the title [WIP] Fix all false positives and errors at latest CI/CD run Fix CI false positives from Docker Hub outage: add registry-mirror fallback to buildx boot (issue #100) Jun 11, 2026
@konard konard marked this pull request as ready for review June 11, 2026 07:21
Raw job logs are git-ignored (*.log); this commits the decisive excerpts
(root cause + the three cascade 'not found' failures) with source job URLs
and a re-download command, matching the repo's curated case-study convention.

konard commented Jun 11, 2026

Member Author

Working session summary

All work is complete. Here's the summary.

What was wrong (issue #100)

Release run 27314587149 had 4 red jobs, all from a single root cause: registry-1.docker.io was unreachable from one amd64 runner for ~2.5 min (00:16:40–00:19:13 UTC). The buildx docker-container driver couldn't pull moby/buildkit:buildx-stable-1, so build-languages-amd64 (java) failed at boot. That one failure cascaded into three "image not found" jobs (build-dind-amd64 (java) and build-dind full on both arches): the java image was never pushed, and the failed java build caused docker-build-push to be skipped, so the images the dind jobs build on did not exist. Three of the four reds are downstream symptoms, the "false positives" the issue names.

The existing setup-buildx-resilient (issue #97) only retries Docker Hub, so a full outage longer than its retry budget defeated it.

The fix (PR #101)

  • setup-buildx-resilient now falls back to a pull-through registry mirror (mirror.gcr.io, independent infra) when Docker Hub is unreachable, then re-tags the image to its canonical reference so the boot reuses the local copy and never touches the failing registry. Strictly no worse than before if even the mirror is down.
  • Verbose/RUNNER_DEBUG tracing added (requirement: add debug output for next-iteration diagnosis).
  • Codebase-wide: all 12 Set up Docker Buildx steps in release.yml route through this one composite action, so every build job is hardened by the single edit.
  • Unit test experiments/test-issue100-buildx-mirror-fallback.sh extracts the real pre-pull script and exercises canonical-ok / mirror-recovery / both-down (15/15 pass).

Deliverables addressing every issue requirement

  • Deep case study in docs/case-studies/issue-100/CASE-STUDY.md (timeline, full requirement list, root-cause analysis, prior-art survey, verification), issue.md, curated ci-logs/README.md evidence, and template-reports.md.
  • Template comparison of all four pipelines: rust + js boot buildx with no mirror fallback (same defect) → reported upstream with reproducible example, workaround and suggested fix (rust #69, js #75); python + csharp don't use buildx.
  • Patch changeset so the next release run validates the fix end-to-end.

Status

PR #101 is updated, marked ready, working tree clean, main already merged. CI: 3 pass / 24 skip, 0 failures (the build matrix correctly skips for a CI/docs-only PR; the fix executes on the post-merge release run that the changeset triggers).


This summary was automatically extracted from the AI working session output.


konard commented Jun 11, 2026

Member Author

🤖 Solution Draft Log

This log file contains the complete execution trace of the AI solution draft process.

💰 Cost estimation:

  • Public pricing estimate: $4.543868
  • Calculated by Anthropic: $5.163615
  • Difference: $0.619747 (+13.64%)

📊 Context and tokens usage:

  • 108.0K / 1M (11%) input tokens, 47.0K / 128K (37%) output tokens

Total: (4.1K new + 165.3K cache writes + 4.6M cache reads) input tokens, 47.0K output tokens, $4.543868 cost

🤖 Models used:

  • Tool: Anthropic Claude Code
  • Requested: opus
  • Model: Claude Opus 4.8 (claude-opus-4-8)

📎 Log file uploaded as Gist (1848KB)


The working session has now ended; feel free to review and add feedback on the solution draft.

@konard konard merged commit 064a8c8 into main Jun 11, 2026
27 checks passed

konard commented Jun 11, 2026

Member Author

🎉 Auto-merged

This pull request has been automatically merged by hive-mind.

  • All CI checks have passed

Auto-merged by hive-mind with --auto-merge flag


Development

Successfully merging this pull request may close these issues.

Fix all false positives and errors at latest CI/CD run

1 participant