
ci: retry kurtosis erigon image build on transient registry/cache failures #21693

Merged
taratorio merged 2 commits into main from taratorio/kurtosis-retry-image-build on Jun 9, 2026

@taratorio

Member

Problem

The build-erigon-image job in the kurtosis workflows (test-kurtosis-assertoor.yml, test-kurtosis-gloas.yml) intermittently fails on transient infrastructure errors unrelated to the code — Docker Hub registry/auth blips and GitHub Actions cache-backend hiccups. docker/build-push-action has no retry input, so any such blip fails the whole job (and, in the merge queue, the run).

Evidence (last ~30 days)

Scanning every failed build-erigon-image job across Cache Warming + CI Gate runs, 4 of 15 failures were transient infra. The remaining 11 were 10 real compile breaks clustered on feature branches plus one GitHub-CDN action-download blip. The 4 transient failures:

| Date | Branch | Signature |
| --- | --- | --- |
| 06-08 | main | Docker Hub auth.docker.io/token 504 Gateway Timeout |
| 06-02 | main | registry-1.docker.io … context deadline exceeded |
| 06-02 | glamsterdam-devnet-4 | registry-1.docker.io … request canceled (timeout) |
| 05-28 | main | GHA cache blob write 5xx (…blob.core.windows.net/actions-cache…) |

That's ~3–4/month (≈ once a week), and it is a floor: flakes that someone re-ran to green don't show up as failed runs, so the true rate is likely higher.

Change

Retry the docker/build-push-action step once: the first attempt gets an id and continue-on-error: true, and a second step re-runs the identical build, gated on if: steps.build_erigon_image.outcome == 'failure'. The BuildKit layer cache (and the in-job builder) make the retry cheap: it reuses any layers already built or cached, including the slow Go compile, and only re-attempts whatever flaked (pull / auth / cache export). Applied to both kurtosis workflows.
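
The two-step pattern described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: the step id build_erigon_image and the action version come from the description, while the step names and every value under with: (context, tags, cache settings) are assumptions.

```yaml
# Fragment of a job's steps list. Sketch only; all `with:` values are assumed.
steps:
  - name: Build erigon image (attempt 1)   # hypothetical step name
    id: build_erigon_image
    # Let the job continue if this attempt fails, so the retry step can run.
    continue-on-error: true
    uses: docker/build-push-action@v6
    with:
      context: .
      load: true
      tags: erigon:ci                       # hypothetical tag
      cache-from: type=gha
      cache-to: type=gha,mode=max

  - name: Build erigon image (retry)        # hypothetical step name
    # `outcome` is the step result before continue-on-error is applied,
    # so it is 'failure' here even though the step's `conclusion` is 'success'.
    if: steps.build_erigon_image.outcome == 'failure'
    uses: docker/build-push-action@v6
    # No continue-on-error on the retry: a second failure fails the job.
    with:
      context: .
      load: true
      tags: erigon:ci                       # hypothetical tag
      cache-from: type=gha
      cache-to: type=gha,mode=max
```

The condition has to read outcome rather than conclusion, because continue-on-error: true rewrites a failed step's conclusion to success.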

Why this approach

  • docker/build-push-action@v6 has no retry input (verified against its action.yml), so retry must be external.
  • Keeping the action — vs. converting to a raw docker buildx build bash loop — preserves the auto-wired GHA cache runtime token and avoids adding a third-party action (crazy-max/ghaction-github-runtime).
  • Retrying the whole step covers cache-export failures too, not just registry pulls.
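
For contrast, the rejected alternative from the second bullet might look something like the fragment below. It is a sketch under stated assumptions: the action version pins, the image tag, and the loop shape are illustrative and do not come from the PR.

```yaml
# Rejected alternative (sketch): raw buildx inside a bash retry loop.
steps:
  - uses: docker/setup-buildx-action@v3          # version pin is an assumption
  # Third-party action that exposes the GHA cache runtime token and URLs to
  # later `run:` steps. docker/build-push-action wires these up on its own.
  - uses: crazy-max/ghaction-github-runtime@v3   # version pin is an assumption
  - name: Build erigon image with retry loop      # hypothetical step name
    run: |
      for attempt in 1 2; do
        if docker buildx build \
            --cache-from type=gha \
            --cache-to type=gha,mode=max \
            --load \
            --tag erigon:ci \
            .; then
          exit 0
        fi
        echo "build attempt ${attempt} failed" >&2
      done
      exit 1
```

Compared with duplicating the action step, this adds an extra third-party dependency and moves the build inputs out of the action's declarative with: block into shell flags.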

Tradeoff

A retry can't distinguish a transient flake from a real compile error, so genuine build breaks now take 2 attempts before failing (~2× feedback time on a broken build). The retry step has no continue-on-error, so real breaks still go red — they are not masked.

Validation

  • actionlint clean for both files (the two SC2086 infos it prints are pre-existing, in untouched run: blocks).
  • Behaviour: attempt 1 succeeds → retry skipped; attempt 1 fails + attempt 2 succeeds → job green, image present for the downstream artifact/test steps; both fail → job red.

Precedent

Same pattern as #21604 (retry SonarCloud scan), #21602 (retry kurtosis engine bootstrap), and #21504 (retry hive image-build/registry failures).

Copilot AI left a comment

Contributor


Pull request overview

This PR improves CI resilience for Kurtosis-based workflows by retrying the Erigon Docker image build once when docker/build-push-action fails due to transient Docker registry or GitHub Actions cache backend errors. This reduces merge-queue evictions caused by infrastructure flakes while still failing the job if the build is genuinely broken.

Changes:

  • Add a first docker/build-push-action@v6 build attempt with continue-on-error: true and a step id.
  • Add a second, identical build step that runs only when the first attempt’s outcome is failure (single retry).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/test-kurtosis-gloas.yml | Adds a conditional one-time retry around the Erigon image build to mitigate transient registry/cache failures. |
| .github/workflows/test-kurtosis-assertoor.yml | Applies the same conditional retry pattern to the Erigon image build used by the assertoor Kurtosis workflow. |


@yperbasis yperbasis added the QA label Jun 9, 2026
@taratorio taratorio enabled auto-merge June 9, 2026 10:57
@taratorio taratorio added this pull request to the merge queue Jun 9, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 9, 2026
@taratorio taratorio added this pull request to the merge queue Jun 9, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 9, 2026
@taratorio taratorio enabled auto-merge June 9, 2026 14:04
@taratorio taratorio added this pull request to the merge queue Jun 9, 2026
Merged via the queue into main with commit bee49cb Jun 9, 2026
169 of 172 checks passed
@taratorio taratorio deleted the taratorio/kurtosis-retry-image-build branch June 9, 2026 15:07
Lord1Egypt pushed a commit to Lord1Egypt/erigon that referenced this pull request Jun 10, 2026
…-queue runs (erigontech#21695)

## Problem

The kurtosis matrix's third-party images (lighthouse, teku, assertoor,
kurtosis engine/core/expander, vector, fluent-bit) are cached via
`actions/cache`, but that cache is **only ever written in PR scope,
never on the default branch** — so the merge queue gets no protection
from it.

Why: `cache-warming.yml` calls this workflow with `cache-warming-only:
true`, which skips the `assertoor_test` matrix job — the only job that
pulls + saves those images. So `docker-cl-*` is never saved to
`refs/heads/main`. (Every existing `docker-cl-*` cache entry is scoped
to `refs/pull/NNNNN/merge`.) Merge-queue runs execute on ephemeral
`gh-readonly-queue/main/*` branches and can only restore caches from the
default branch — so **every merge-queue run misses and pulls all 8
images from Docker Hub**, fully exposed to Docker Hub flakes on the path
that gates merges.

## Example failure


https://github.com/erigontech/erigon/actions/runs/27067265745/job/79891001617
— a merge-queue run for erigontech#21659. The
`assertoor_caplin-minimal_parallel_test` shard missed the cache, tried
to pull `sigp/lighthouse:v7.0.1`, hit `registry-1.docker.io … context
deadline exceeded` on all 3 retry attempts, and fast-cancelled the whole
merge-group run.

## Fix

Warm the `docker-cl-*` cache **on the default branch** via a dedicated
workflow (`warm-kurtosis-cl-images.yml`):

- **Triggers:** `push` to `main`/`release/**` filtered on `paths:
[test-kurtosis-assertoor.yml]` — the cache key is derived from the
pinned image versions, which live in that file, so it only needs
re-warming on a version bump — **plus a daily `schedule`** to repopulate
the cache if it's LRU-evicted between bumps (the repo sits at the 500 GB
cache ceiling, so eviction is active).
- It calls `test-kurtosis-assertoor.yml` with a new **`cl-images-only`**
input that runs *only* the warm job — `build-erigon-image` and the test
matrix are gated off, so the schedule/paths runs don't rebuild the image
or run tests.
- The warm job uses **`actions/cache/restore@v5` with `lookup-only:
true`** + an explicit `actions/cache/save@v5` on miss: when the cache
already exists it's a ~10 s no-op (no download), and it only pulls +
saves on a genuine miss.
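
The lookup-then-save flow in the last bullet could be sketched as below. This is
an illustration, not the workflow's actual content: the cache path, the key
derivation, the step names, and the pull/export script are assumptions. Only the
action names, `lookup-only: true`, and the `sigp/lighthouse:v7.0.1` image come
from this description.

```yaml
# Fragment of the warm job's steps list. Sketch only.
steps:
  - name: Check for an existing CL image cache    # hypothetical step name
    id: cl_cache
    uses: actions/cache/restore@v5
    with:
      path: /tmp/docker-cl-images                  # hypothetical path
      # Hypothetical key: hashes the file that pins the image versions.
      key: docker-cl-${{ hashFiles('.github/workflows/test-kurtosis-assertoor.yml') }}
      lookup-only: true                            # report hit/miss, download nothing

  - name: Pull and export images on miss           # hypothetical step name
    if: steps.cl_cache.outputs.cache-hit != 'true'
    run: |
      mkdir -p /tmp/docker-cl-images
      docker pull sigp/lighthouse:v7.0.1
      docker save sigp/lighthouse:v7.0.1 -o /tmp/docker-cl-images/lighthouse.tar

  - name: Save cache on miss                        # hypothetical step name
    if: steps.cl_cache.outputs.cache-hit != 'true'
    uses: actions/cache/save@v5
    with:
      path: /tmp/docker-cl-images
      key: ${{ steps.cl_cache.outputs.cache-primary-key }}
```

Comparing against `'true'` treats both a miss (empty output) and a partial match
as reasons to pull and save.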

Net effect: merge-queue and first-PR runs restore the CL cache from the
default branch instead of pulling from Docker Hub.
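
A rough sketch of what the dedicated warm workflow could look like, under the
assumptions that the cron time, the workflow name, and the job name are
placeholders (none of them are stated here) and that the `paths:` filter uses
the file's full repository path:

```yaml
# .github/workflows/warm-kurtosis-cl-images.yml (sketch only)
name: Warm kurtosis CL images        # hypothetical name
on:
  push:
    branches:
      - main
      - 'release/**'
    paths:
      # Re-warm only when the pinned image versions may have changed.
      - .github/workflows/test-kurtosis-assertoor.yml
  schedule:
    # Daily re-warm in case the entry was evicted; the actual time is assumed.
    - cron: '0 3 * * *'

jobs:
  warm:                               # hypothetical job name
    uses: ./.github/workflows/test-kurtosis-assertoor.yml
    with:
      cl-images-only: true
```

Scheduled runs execute against the default branch, so the entries they save land
in the scope that merge-queue runs are able to restore from.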

## Scope

- **`build-erigon-image` is intentionally left on its every-push
cadence.** Its BuildKit layer cache is source-dependent (the base + `go
mod download` layers track `Dockerfile`/`go.mod`/`go.sum`, the compile
layer changes every commit), a different concern from these static
version-pinned images. Optimizing *its* warming (paths on
`Dockerfile`/`go.mod`/`go.sum` + a deps-stage split so the warm skips
the compile) is a possible follow-up, not in this PR.
- Complementary to erigontech#21693 (retry on the erigon image build) — a
different Docker Hub touchpoint.

## Validation

- `actionlint` clean on both workflows (the one `SC2086` info is
pre-existing, in an untouched `run:` block).
- Gating verified: with `cl-images-only`, only `warm-third-party-images`
runs; `cache-warming.yml` (`cache-warming-only`) still warms
`build-erigon-image` every push; PR/merge_group still run the full
matrix and now restore the default-branch CL cache.
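
The gating described above might be expressed with job-level `if:` conditions
along these lines. The job and input names come from this description; the
input types, defaults, exact expressions, runner labels, and placeholder steps
are assumptions, and the real workflow also has its own PR/merge_group triggers,
which are omitted here.

```yaml
# Fragment of test-kurtosis-assertoor.yml (sketch only)
on:
  workflow_call:
    inputs:
      cache-warming-only:
        type: boolean
        default: false
      cl-images-only:
        type: boolean
        default: false

jobs:
  warm-third-party-images:
    # Assumed: runs only for the dedicated warm workflow.
    if: ${{ inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "lookup, then pull + save third-party images on miss"

  build-erigon-image:
    # Gated off for cl-images-only; still runs for cache-warming-only.
    if: ${{ !inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "build the erigon image"

  assertoor_test:
    # Full matrix only when neither warm-only mode is requested.
    if: ${{ !inputs.cache-warming-only && !inputs.cl-images-only }}
    needs: build-erigon-image
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the kurtosis assertoor tests"
```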