ci: retry kurtosis erigon image build on transient registry/cache failures #21693
Merged
Pull request overview
This PR improves CI resilience for Kurtosis-based workflows by retrying the Erigon Docker image build once when docker/build-push-action fails due to transient Docker registry or GitHub Actions cache backend errors. This reduces merge-queue evictions caused by infrastructure flakes while still failing the job if the build is genuinely broken.
Changes:
- Add a first `docker/build-push-action@v6` build attempt with `continue-on-error: true` and a step `id`.
- Add a second, identical build step that runs only when the first attempt's `outcome` is `failure` (single retry).
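The two-step pattern listed above can be sketched as a workflow fragment. This is a minimal illustration under stated assumptions, not the PR's actual diff: the step id `build_erigon_image` is taken from the PR description, while the step names, `context`, `load`, `tags`, and the `type=gha` cache settings are hypothetical placeholders.

```yaml
# Fragment of a job's `steps:` list. Values marked "assumed" are illustrative only.
steps:
  - name: Build erigon image (attempt 1)
    id: build_erigon_image
    uses: docker/build-push-action@v6
    continue-on-error: true            # a failure here does not fail the job yet
    with:
      context: .                       # assumed
      load: true                       # assumed
      tags: erigon:ci                  # assumed tag
      cache-from: type=gha             # assumed cache backend settings
      cache-to: type=gha,mode=max      # assumed

  - name: Build erigon image (retry)
    # `outcome` is the step result before `continue-on-error` is applied,
    # so it is 'failure' exactly when attempt 1 failed.
    if: steps.build_erigon_image.outcome == 'failure'
    uses: docker/build-push-action@v6
    # No continue-on-error here: a second failure fails the job.
    with:
      context: .                       # assumed, identical to attempt 1
      load: true                       # assumed
      tags: erigon:ci                  # assumed tag
      cache-from: type=gha             # assumed
      cache-to: type=gha,mode=max      # assumed
```

Checking `outcome` rather than `conclusion` matters: with `continue-on-error: true`, the first step's `conclusion` is reported as `success` even when the build failed, so a condition on `conclusion` would never trigger the retry.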
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .github/workflows/test-kurtosis-gloas.yml | Adds a conditional one-time retry around the Erigon image build to mitigate transient registry/cache failures. |
| .github/workflows/test-kurtosis-assertoor.yml | Applies the same conditional retry pattern to the Erigon image build used by the assertoor Kurtosis workflow. |
yperbasis approved these changes on Jun 9, 2026
Lord1Egypt pushed a commit to Lord1Egypt/erigon that referenced this pull request on Jun 10, 2026

…-queue runs (erigontech#21695)

## Problem

The kurtosis matrix's third-party images (lighthouse, teku, assertoor, kurtosis engine/core/expander, vector, fluent-bit) are cached via `actions/cache`, but that cache is **only ever written in PR scope, never on the default branch**, so the merge queue gets no protection from it.

Why: `cache-warming.yml` calls this workflow with `cache-warming-only: true`, which skips the `assertoor_test` matrix job, the only job that pulls + saves those images. So `docker-cl-*` is never saved to `refs/heads/main`. (Every existing `docker-cl-*` cache entry is scoped to `refs/pull/NNNNN/merge`.)

Merge-queue runs execute on ephemeral `gh-readonly-queue/main/*` branches and can only restore caches from the default branch, so **every merge-queue run misses and pulls all 8 images from Docker Hub**, fully exposed to Docker Hub flakes on the path that gates merges.

## Example failure

https://github.com/erigontech/erigon/actions/runs/27067265745/job/79891001617 is a merge-queue run for erigontech#21659. The `assertoor_caplin-minimal_parallel_test` shard missed the cache, tried to pull `sigp/lighthouse:v7.0.1`, hit `registry-1.docker.io … context deadline exceeded` on all 3 retry attempts, and fast-cancelled the whole merge-group run.

## Fix

Warm the `docker-cl-*` cache **on the default branch** via a dedicated workflow (`warm-kurtosis-cl-images.yml`):

- **Triggers:** `push` to `main`/`release/**` filtered on `paths: [test-kurtosis-assertoor.yml]` (the cache key is derived from the pinned image versions, which live in that file, so it only needs re-warming on a version bump), **plus a daily `schedule`** to repopulate the cache if it's LRU-evicted between bumps (the repo sits at the 500 GB cache ceiling, so eviction is active).
- It calls `test-kurtosis-assertoor.yml` with a new **`cl-images-only`** input that runs *only* the warm job. `build-erigon-image` and the test matrix are gated off, so the schedule/paths runs don't rebuild the image or run tests.
- The warm job uses **`actions/cache/restore@v5` with `lookup-only: true`** + an explicit `actions/cache/save@v5` on miss: when the cache already exists it's a ~10 s no-op (no download), and it only pulls + saves on a genuine miss.

Net effect: merge-queue and first-PR runs restore the CL cache from the default branch instead of pulling from Docker Hub.

## Scope

- **`build-erigon-image` is intentionally left on its every-push cadence.** Its BuildKit layer cache is source-dependent (the base + `go mod download` layers track `Dockerfile`/`go.mod`/`go.sum`, the compile layer changes every commit), a different concern from these static version-pinned images. Optimizing *its* warming (paths on `Dockerfile`/`go.mod`/`go.sum` + a deps-stage split so the warm skips the compile) is a possible follow-up, not in this PR.
- Complementary to erigontech#21693 (retry on the erigon image build), which covers a different Docker Hub touchpoint.

## Validation

- `actionlint` clean on both workflows (the one `SC2086` info is pre-existing, in an untouched `run:` block).
- Gating verified: with `cl-images-only`, only `warm-third-party-images` runs; `cache-warming.yml` (`cache-warming-only`) still warms `build-erigon-image` every push; PR/merge_group still run the full matrix and now restore the default-branch CL cache.
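
The lookup-then-save behaviour of the warm job described in this commit message might look roughly like the fragment below. It is a sketch under assumptions, not the commit's actual workflow: the step ids and names, the `/tmp/docker-cl-images` path, the `hashFiles`-based key, and the single-image pull are all hypothetical (the commit only says the key is derived from the pinned image versions).

```yaml
# Fragment of a warm job's `steps:` list; path, key, and image list are assumed.
steps:
  - name: Check whether the CL image cache already exists
    id: cl_cache
    uses: actions/cache/restore@v5
    with:
      path: /tmp/docker-cl-images      # assumed location
      # Assumed key; the real key is derived from the pinned image versions.
      key: docker-cl-${{ hashFiles('.github/workflows/test-kurtosis-assertoor.yml') }}
      lookup-only: true                # report hit or miss without downloading

  - name: Pull and export images on a miss
    if: steps.cl_cache.outputs.cache-hit != 'true'
    run: |
      mkdir -p /tmp/docker-cl-images
      # One image shown for brevity; a real job would loop over all pinned images.
      docker pull sigp/lighthouse:v7.0.1
      docker save -o /tmp/docker-cl-images/lighthouse.tar sigp/lighthouse:v7.0.1

  - name: Save the cache on a miss
    if: steps.cl_cache.outputs.cache-hit != 'true'
    uses: actions/cache/save@v5
    with:
      path: /tmp/docker-cl-images
      key: ${{ steps.cl_cache.outputs.cache-primary-key }}
```

Splitting restore and save this way keeps the hit path free of any download, and reusing `cache-primary-key` from the lookup step avoids computing the key twice.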
Problem
The `build-erigon-image` job in the kurtosis workflows (`test-kurtosis-assertoor.yml`, `test-kurtosis-gloas.yml`) intermittently fails on transient infrastructure errors unrelated to the code: Docker Hub registry/auth blips and GitHub Actions cache-backend hiccups. `docker/build-push-action` has no retry input, so any such blip fails the whole job (and, in the merge queue, the run).
Evidence (last ~30 days)
A scan of every failed `build-erigon-image` job across Cache Warming + CI Gate runs found that 4 of 15 failures were transient infra (the other 10 were real compile breaks clustered on feature branches, plus one GitHub-CDN action-download blip):
- `auth.docker.io/token` → `504 Gateway Timeout`
- `registry-1.docker.io … context deadline exceeded`
- `registry-1.docker.io … request canceled` (timeout)
- GitHub Actions cache backend (`…blob.core.windows.net/actions-cache…`)
That's ~3–4/month (≈ once a week), and a floor: flakes that someone re-ran to green don't show as failed runs.
Change
Retry the `docker/build-push-action` step once: the first attempt gets an `id` and `continue-on-error: true`, and a second step re-runs the identical build only `if: steps.build_erigon_image.outcome == 'failure'`. The BuildKit layer cache (and the in-job builder) make the retry cheap: it reuses the slow Go compile and only re-attempts whatever flaked (pull / auth / cache export). Applied to both kurtosis workflows.
Why this approach
- `docker/build-push-action@v6` has no `retry` input (verified against its `action.yml`), so retry must be external.
- A second action step is preferred over a `docker buildx build` bash loop: it preserves the auto-wired GHA cache runtime token and avoids adding a third-party action (`crazy-max/ghaction-github-runtime`).
Tradeoff
A retry can't distinguish a transient flake from a real compile error, so genuine build breaks now take 2 attempts before failing (~2× feedback time on a broken build). The retry step has no `continue-on-error`, so real breaks still go red; they are not masked.
Validation
`actionlint` clean for both files (the two `SC2086` infos it prints are pre-existing, in untouched `run:` blocks).
Precedent
Same pattern as #21604 (retry SonarCloud scan), #21602 (retry kurtosis engine bootstrap), and #21504 (retry hive image-build/registry failures).