ci: retry Docker Buildx setup on transient failure#21735

Merged
yperbasis merged 1 commit into main from yperbasis/kurtosis-buildx-retry
Jun 10, 2026

@yperbasis

Member

In merge-queue run 27280175556, the kurtosis / build-erigon-image job failed at the "Set up Docker Buildx" step: booting BuildKit pulls moby/buildkit:buildx-stable-1 from Docker Hub, and the pull timed out (Get "https://registry-1.docker.io/v2/": context deadline exceeded), failing the CI Gate for #21723.

The job already retries the erigon image build on transient failures, but the buildx setup step runs before that retry and wasn't covered. This applies the same pattern: the first setup attempt is continue-on-error, and a second attempt runs only if the first failed. Both attempts failing still fails the job.
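
A minimal sketch of the pattern described above, not the exact diff from this PR. The step id `buildx` and the `@v3` action version are assumptions made for illustration. `steps.<id>.outcome` reports the result of a step before `continue-on-error` is applied, which is what lets the second step detect the first attempt's failure.

```yaml
# Fragment of a job's `steps:` list. Step id and action version are assumed.
- name: Set up Docker Buildx
  id: buildx                      # hypothetical id, used by the retry condition
  continue-on-error: true         # a failed first attempt does not fail the job
  uses: docker/setup-buildx-action@v3

- name: Set up Docker Buildx (retry)
  # `outcome` is the raw result, before continue-on-error turns it into success.
  if: steps.buildx.outcome == 'failure'
  uses: docker/setup-buildx-action@v3   # no continue-on-error: a second failure fails the job
```

Because the retry step is not marked `continue-on-error`, the job still fails when both attempts fail, matching the behavior stated above.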

@lystopad left a comment

Member

Interesting, let's see how it will help.

Copilot AI left a comment

Contributor

Pull request overview

This PR hardens the Kurtosis Assertoor CI workflow against transient Docker Hub connectivity issues by adding a one-time retry around the docker/setup-buildx-action step (which pulls moby/buildkit during bootstrapping).

Changes:

  • Make the first “Set up Docker Buildx” attempt continue-on-error and record its outcome via a step id.
  • Add a conditional second Buildx setup attempt that runs only if the first attempt failed.
  • Preserve failing behavior when both attempts fail (the retry step is not continue-on-error).

@yperbasis enabled auto-merge June 10, 2026 14:24
@yperbasis added this pull request to the merge queue Jun 10, 2026
Merged via the queue into main with commit 5a7c7ad Jun 10, 2026
93 checks passed
@yperbasis deleted the yperbasis/kurtosis-buildx-retry branch June 10, 2026 15:47
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request Jun 11, 2026
…rigontech#21741)

## Why

`docker/setup-buildx-action` boots BuildKit by pulling `moby/buildkit`
from Docker Hub on every run — the last uncached Docker Hub dependency
in the kurtosis jobs. In merge-queue run
[27280175556](https://github.com/erigontech/erigon/actions/runs/27280175556/job/80572198019)
that pull timed out (`Get "https://registry-1.docker.io/v2/": context
deadline exceeded`), failing the CI Gate for erigontech#21723 — ironically the PR
closing the equivalent gap for the kurtosis engine-bootstrap images. The
same Docker Hub connectivity window took out three other CI Gate runs
that morning at the `kurtosis engine start` step.

erigontech#21735 added a retry around buildx setup, but a retry doesn't survive an
outage longer than its window. This removes the hard dependency the same
way as the rest of the image-caching series (erigontech#21695, erigontech#21703, erigontech#21723).

## What

- Pin the BuildKit image as `BUILDKIT_IMAGE: moby/buildkit:v0.30.0`
(what the moving `buildx-stable-1` tag currently resolves to) and pass
it to `setup-buildx-action` via `driver-opts: image=...`, in both
`test-kurtosis-assertoor.yml` (`build-erigon-image`) and
`test-kurtosis-gloas.yml` (`gloas_test`). Bump alongside the other
pinned images.
- Cache it under a single shared key (`docker-buildkit-<image>`) with
the established docker save/load + actions/cache pattern, and `docker
load` it before buildx setup. buildx's docker-container driver falls
back to a locally present image when its pull fails ("pulling failed,
using local image"), so with a warm cache the builder boots even while
Docker Hub is fully unreachable. Pinning via driver-opts is what makes
the fallback engage — the local image name must match what buildx wants
to boot.
- The cache-fill pull in the test jobs is best-effort
(`continue-on-error`, save gated on pull success): buildx pulls the
image itself either way, so a failed seed must not fail an
otherwise-good run, and a failed pull never poisons the cache key with
an empty archive.
- Warm jobs (`warm-third-party-images` in both files) pull strictly and
save the same key — producing the cache is their purpose. Both
cache-warming workflows already path-filter on the edited files, so the
cache is created on main right after this merges and refreshed daily
against LRU eviction.
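
The test-job side of the list above could look roughly like the following. This is a sketch under stated assumptions rather than the workflow as merged: the step names and ids, the archive path `/tmp/buildkit-image.tar`, the runner label, and the `@v4`/`@v3` action versions are all hypothetical. The image tag, the `docker-buildkit-<image>` key shape, and the `driver-opts: image=...` input come from the description.

```yaml
env:
  # Pinned tag from the description; bump alongside the other pinned images.
  BUILDKIT_IMAGE: moby/buildkit:v0.30.0

jobs:
  build-erigon-image:
    runs-on: ubuntu-latest          # assumed runner
    steps:
      # Restore a previously saved image archive, if the shared key exists.
      - name: Restore BuildKit image cache
        id: buildkit-cache          # hypothetical step id
        uses: actions/cache@v4      # version is an assumption
        with:
          path: /tmp/buildkit-image.tar   # hypothetical archive path
          key: docker-buildkit-${{ env.BUILDKIT_IMAGE }}

      # Warm cache: make the pinned image locally present before buildx boots.
      - name: Load cached BuildKit image
        if: steps.buildkit-cache.outputs.cache-hit == 'true'
        run: docker load -i /tmp/buildkit-image.tar

      # Cold cache: best-effort seed. A failed pull must not fail the run.
      - name: Pull BuildKit image (best-effort)
        id: buildkit-pull           # hypothetical step id
        if: steps.buildkit-cache.outputs.cache-hit != 'true'
        continue-on-error: true
        run: docker pull "$BUILDKIT_IMAGE"

      # Save only after a successful pull, so no empty archive is written.
      - name: Save BuildKit image for caching
        if: steps.buildkit-pull.outcome == 'success'
        run: docker save -o /tmp/buildkit-image.tar "$BUILDKIT_IMAGE"

      # The image name must match the loaded one for the local fallback to apply.
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3   # version is an assumption
        with:
          driver-opts: image=${{ env.BUILDKIT_IMAGE }}
```

A warm job would presumably use the same key and archive path but drop `continue-on-error` and the pull-success gate, since producing the cache is its purpose.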

In gloas, buildx setup previously ran before any caching; the buildkit
cache steps are inserted ahead of it.

Not covered (non-gating, can follow up if wanted):
`ci-cd-main-branch-docker-images.yml` and `release.yml` also use
`setup-buildx-action` but don't block PRs or the merge queue.

actionlint is clean (the two SC2086 infos it reports pre-exist on main
in the kurtosis CLI install step).