ci: warm kurtosis third-party image cache on default branch for merge-queue runs #21695
Merged
Pull request overview
This PR updates the Kurtosis Assertoor reusable workflow to ensure the docker-cl-* third-party image cache is written on the default branch during cache-warming runs, so merge-queue runs can reliably restore it and avoid pulling these images from Docker Hub.
Changes:
- Add a `warm-third-party-images` job that runs only when `inputs.cache-warming-only` is true.
- Restore (and, on miss, populate) the `docker-cl-*` cache by pulling and `docker save`-ing the pinned third-party images into `/tmp/docker-cache`.
lystopad (Member) approved these changes on Jun 9, 2026:
LGTM.
Lord1Egypt pushed a commit to Lord1Egypt/erigon that referenced this pull request on Jun 10, 2026:

…orkflow (erigontech#21703)

## Problem

`test-kurtosis-gloas.yml` restores the `docker-cl-*` third-party image cache but nothing ever warms it on the default branch:

- It triggers on `pull_request` + `workflow_dispatch` only: no `push` / `merge_group` / `schedule`.
- Its cache `save` is gated `github.event_name != 'pull_request'` (since erigontech#21602), and its only non-PR trigger is manual `workflow_dispatch`.

So the cache is never systematically populated on the default branch, and gloas PR runs cold-pull all 7 third-party images from Docker Hub on nearly every run, exposed to the same Docker Hub flakes as the assertoor workflow and protected only by the 3-attempt retry.

This is a **different cache key** from the assertoor warmer (erigontech#21695): gloas's key omits `TEKU`, so that warmer doesn't cover it. And unlike assertoor, gloas does **not** run in `merge_group`, so a flake here fails a PR check rather than bouncing the merge queue: lower blast radius, same root gap.

## Fix

Mirror erigontech#21695 for gloas:

- Make `test-kurtosis-gloas.yml` callable (`workflow_call`) with a new **`cl-images-only`** input that runs *only* a new `warm-third-party-images` job; the `gloas_test` matrix is gated off under it.
- The warm job uses **`actions/cache/restore@v5` (`lookup-only`)** + an explicit **`actions/cache/save@v5`** on miss: a ~10 s no-op when the cache already exists, pull + save only on a genuine miss.
- New **`cache-warming-kurtosis-gloas-images.yml`** drives it on `push` to `main`/`release/**` filtered on `paths: [test-kurtosis-gloas.yml]` (re-warm on version bumps) + a daily `schedule` (repopulate after LRU eviction at the 500 GB cache ceiling) + `workflow_dispatch`.

Net effect: gloas PR runs restore the cache from the default branch instead of pulling from Docker Hub.

## Notes

- Follow-up to erigontech#21695 (assertoor CL cache); same pattern, gloas's own key/image set (no teku).
- The pinned image versions are duplicated across `test-kurtosis-assertoor.yml` and `test-kurtosis-gloas.yml` (and now their two warmers). Unifying them into a single source so one warmer covers both is a sensible future cleanup; it is called out, not done here.

## Validation

- `actionlint` clean on both workflows (the one `SC2086` info is pre-existing, in an untouched `run:` block).
- Gating: with `cl-images-only`, only `warm-third-party-images` runs; on `pull_request` / `workflow_dispatch` (no input), `gloas_test` runs as before.

The PR-merge touches `test-kurtosis-gloas.yml`, so the `paths` trigger warms the cache on merge, leaving no cold-start gap.

Co-authored-by: lystopad <oleksandr.lystopad@erigon.tech>
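For illustration, the driver workflow described in that commit message might look roughly like the sketch below. It is not copied from the repository: the workflow name, the job name `warm`, the cron time, and the full `.github/workflows/` path in the `paths` filter are assumptions (the message only says `paths: [test-kurtosis-gloas.yml]` and "daily").

```yaml
# Hypothetical sketch of cache-warming-kurtosis-gloas-images.yml.
# Names, cron time, and the full workflow path are assumptions.
name: Cache warming (kurtosis gloas images)

on:
  push:
    branches:
      - main
      - 'release/**'
    # Re-warm only when the file holding the pinned image versions changes.
    paths:
      - '.github/workflows/test-kurtosis-gloas.yml'
  schedule:
    # Daily run to repopulate the cache after LRU eviction (time is assumed).
    - cron: '0 3 * * *'
  workflow_dispatch:

jobs:
  warm:
    # Call the reusable workflow and request only the image-warming job.
    uses: ./.github/workflows/test-kurtosis-gloas.yml
    with:
      cl-images-only: true
```

Scheduled and `push` runs on `main` write the cache in default-branch scope, which is the scope that PR runs are allowed to restore from.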
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request on Jun 11, 2026:

…rigontech#21741)

## Why

`docker/setup-buildx-action` boots BuildKit by pulling `moby/buildkit` from Docker Hub on every run, the last uncached Docker Hub dependency in the kurtosis jobs. In merge-queue run [27280175556](https://github.com/erigontech/erigon/actions/runs/27280175556/job/80572198019) that pull timed out (`Get "https://registry-1.docker.io/v2/": context deadline exceeded`), failing the CI Gate for erigontech#21723, ironically the PR closing the equivalent gap for the kurtosis engine-bootstrap images. The same Docker Hub connectivity window took out three other CI Gate runs that morning at the `kurtosis engine start` step.

erigontech#21735 added a retry around buildx setup, but a retry doesn't survive an outage longer than its window. This removes the hard dependency the same way as the rest of the image-caching series (erigontech#21695, erigontech#21703, erigontech#21723).

## What

- Pin the BuildKit image as `BUILDKIT_IMAGE: moby/buildkit:v0.30.0` (what the moving `buildx-stable-1` tag currently resolves to) and pass it to `setup-buildx-action` via `driver-opts: image=...`, in both `test-kurtosis-assertoor.yml` (`build-erigon-image`) and `test-kurtosis-gloas.yml` (`gloas_test`). Bump alongside the other pinned images.
- Cache it under a single shared key (`docker-buildkit-<image>`) with the established docker save/load + actions/cache pattern, and `docker load` it before buildx setup. buildx's docker-container driver falls back to a locally present image when its pull fails ("pulling failed, using local image"), so with a warm cache the builder boots even while Docker Hub is fully unreachable. Pinning via driver-opts is what makes the fallback engage: the local image name must match what buildx wants to boot.
- The cache-fill pull in the test jobs is best-effort (`continue-on-error`, save gated on pull success): buildx pulls the image itself either way, so a failed seed must not fail an otherwise-good run, and a failed pull never poisons the cache key with an empty archive.
- Warm jobs (`warm-third-party-images` in both files) pull strictly and save the same key, since producing the cache is their purpose. Both cache-warming workflows already path-filter on the edited files, so the cache is created on main right after this merges and refreshed daily against LRU eviction.

In gloas, buildx setup previously ran before any caching; the buildkit cache steps are inserted ahead of it.

Not covered (non-gating, can follow up if wanted): `ci-cd-main-branch-docker-images.yml` and `release.yml` also use `setup-buildx-action` but don't block PRs or the merge queue.

actionlint is clean (the two SC2086 infos it reports pre-exist on main in the kurtosis CLI install step).
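A minimal sketch of the best-effort seeding pattern that commit message describes, for a test job rather than a warm job. Only `BUILDKIT_IMAGE`, the `docker-buildkit-<image>` key shape, `continue-on-error`, and `driver-opts: image=...` come from the message; the trigger, job name, step ids, the `/tmp/buildkit-cache` path, the archive name, and the `setup-buildx-action` major version are assumptions.

```yaml
# Hypothetical sketch; trigger, job/step names, cache path, and action
# versions other than actions/cache v5 are assumptions.
name: buildkit-cache-sketch

on:
  workflow_dispatch:

jobs:
  build:
    runs-on: ubuntu-latest
    env:
      BUILDKIT_IMAGE: moby/buildkit:v0.30.0
    steps:
      - name: Restore BuildKit image cache
        id: buildkit-cache
        uses: actions/cache/restore@v5
        with:
          path: /tmp/buildkit-cache
          key: docker-buildkit-${{ env.BUILDKIT_IMAGE }}

      - name: Load cached BuildKit image
        if: steps.buildkit-cache.outputs.cache-hit == 'true'
        run: docker load -i /tmp/buildkit-cache/buildkit.tar

      # Best effort: a failed pull here must not fail the job, because
      # buildx will try to pull the image itself during setup anyway.
      - name: Seed BuildKit image on cache miss
        id: seed
        if: steps.buildkit-cache.outputs.cache-hit != 'true'
        continue-on-error: true
        run: |
          mkdir -p /tmp/buildkit-cache
          docker pull "$BUILDKIT_IMAGE"
          docker save "$BUILDKIT_IMAGE" -o /tmp/buildkit-cache/buildkit.tar

      # Save only when the seed step really succeeded, so a failed pull
      # never stores an empty archive under the key.
      - name: Save BuildKit image cache
        if: steps.seed.outcome == 'success'
        uses: actions/cache/save@v5
        with:
          path: /tmp/buildkit-cache
          key: ${{ steps.buildkit-cache.outputs.cache-primary-key }}

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          # Pin the same image name that was loaded, so the local copy
          # matches what buildx wants to boot.
          driver-opts: image=${{ env.BUILDKIT_IMAGE }}
```

The save step checks `steps.seed.outcome` rather than `conclusion`, because `continue-on-error` turns a failed step's conclusion into `success` while its outcome stays `failure`.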
Problem
The kurtosis matrix's third-party images (lighthouse, teku, assertoor, kurtosis engine/core/expander, vector, fluent-bit) are cached via `actions/cache`, but that cache is only ever written in PR scope, never on the default branch, so the merge queue gets no protection from it.

Why:

- `cache-warming.yml` calls this workflow with `cache-warming-only: true`, which skips the `assertoor_test` matrix job, the only job that pulls + saves those images. So `docker-cl-*` is never saved to `refs/heads/main`. (Every existing `docker-cl-*` cache entry is scoped to `refs/pull/NNNNN/merge`.)
- Merge-queue runs execute on ephemeral `gh-readonly-queue/main/*` branches and can only restore caches from the default branch, so every merge-queue run misses and pulls all 8 images from Docker Hub, fully exposed to Docker Hub flakes on the path that gates merges.

Example failure
https://github.com/erigontech/erigon/actions/runs/27067265745/job/79891001617 is a merge-queue run for #21659. The `assertoor_caplin-minimal_parallel_test` shard missed the cache, tried to pull `sigp/lighthouse:v7.0.1`, hit `registry-1.docker.io … context deadline exceeded` on all 3 retry attempts, and fast-cancelled the whole merge-group run.

Fix
Warm the `docker-cl-*` cache on the default branch via a dedicated workflow (`warm-kurtosis-cl-images.yml`):

- `push` to `main`/`release/**` filtered on `paths: [test-kurtosis-assertoor.yml]` (the cache key is derived from the pinned image versions, which live in that file, so it only needs re-warming on a version bump), plus a daily `schedule` to repopulate the cache if it's LRU-evicted between bumps (the repo sits at the 500 GB cache ceiling, so eviction is active).
- `test-kurtosis-assertoor.yml` with a new `cl-images-only` input that runs only the warm job: `build-erigon-image` and the test matrix are gated off, so the schedule/paths runs don't rebuild the image or run tests.
- `actions/cache/restore@v5` with `lookup-only: true` + an explicit `actions/cache/save@v5` on miss: when the cache already exists it's a ~10 s no-op (no download), and it only pulls + saves on a genuine miss.

Net effect: merge-queue and first-PR runs restore the CL cache from the default branch instead of pulling from Docker Hub.
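The lookup-then-save pattern from the Fix can be sketched as the warm job below. This is an illustration under assumptions, not the merged workflow: the env variable `LIGHTHOUSE_VERSION`, the exact cache key layout, the step ids, and the archive file name are hypothetical, and only one of the pinned images (`sigp/lighthouse:v7.0.1`, taken from the example failure) is shown where the real job would cover the whole image set.

```yaml
# Hypothetical sketch of the warm job inside the reusable workflow.
# Key layout, env names, and file names are assumptions.
name: warm-job-sketch

on:
  workflow_call:
    inputs:
      cl-images-only:
        type: boolean
        default: false

jobs:
  warm-third-party-images:
    if: ${{ inputs.cl-images-only }}
    runs-on: ubuntu-latest
    env:
      LIGHTHOUSE_VERSION: v7.0.1
    steps:
      # lookup-only checks whether the key exists without downloading it,
      # which keeps the already-warm case to a few seconds.
      - name: Look up third-party image cache
        id: lookup
        uses: actions/cache/restore@v5
        with:
          path: /tmp/docker-cache
          key: docker-cl-${{ env.LIGHTHOUSE_VERSION }}
          lookup-only: true

      # Strict pull: producing the cache is this job's purpose, so a
      # failed pull fails the job instead of saving a partial cache.
      - name: Pull and save images on cache miss
        if: steps.lookup.outputs.cache-hit != 'true'
        run: |
          mkdir -p /tmp/docker-cache
          docker pull "sigp/lighthouse:${LIGHTHOUSE_VERSION}"
          docker save "sigp/lighthouse:${LIGHTHOUSE_VERSION}" \
            -o /tmp/docker-cache/lighthouse.tar

      - name: Save third-party image cache
        if: steps.lookup.outputs.cache-hit != 'true'
        uses: actions/cache/save@v5
        with:
          path: /tmp/docker-cache
          key: ${{ steps.lookup.outputs.cache-primary-key }}
```

Splitting restore and save into the two sub-actions is what makes the save explicit: the combined `actions/cache` action saves in a post step, whereas here the save runs only when the lookup missed and the pull succeeded.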
Scope
`build-erigon-image` is intentionally left on its every-push cadence. Its BuildKit layer cache is source-dependent (the base + `go mod download` layers track `Dockerfile`/`go.mod`/`go.sum`, the compile layer changes every commit), a different concern from these static version-pinned images. Optimizing its warming (`paths` on `Dockerfile`/`go.mod`/`go.sum` + a deps-stage split so the warm skips the compile) is a possible follow-up, not in this PR.

Validation

- `actionlint` clean on both workflows (the one `SC2086` info is pre-existing, in an untouched `run:` block).
- With `cl-images-only`, only `warm-third-party-images` runs; `cache-warming.yml` (`cache-warming-only`) still warms `build-erigon-image` every push; PR/merge_group still run the full matrix and now restore the default-branch CL cache.
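The gating behaviour listed under Validation could be expressed with job-level `if:` conditions along these lines. The job and input names come from the PR text, but the exact expressions and the placeholder steps are assumptions made for the sketch.

```yaml
# Hypothetical gating sketch; the real conditions may be written differently.
name: gating-sketch

on:
  workflow_call:
    inputs:
      cache-warming-only:
        type: boolean
        default: false
      cl-images-only:
        type: boolean
        default: false

jobs:
  build-erigon-image:
    # Runs for PR/merge_group callers and for cache-warming-only,
    # but not when only the third-party images are being warmed.
    if: ${{ !inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "build and cache the erigon image"

  assertoor_test:
    # Full matrix: skipped by both warming modes.
    if: ${{ !inputs.cl-images-only && !inputs.cache-warming-only }}
    needs: build-erigon-image
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the test matrix"

  warm-third-party-images:
    # Only the dedicated warmer sets cl-images-only.
    if: ${{ inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "warm the docker-cl-* cache"
```

With both inputs left at their `false` defaults, the first two jobs run and the warm job is skipped, which matches the "PR/merge_group still run the full matrix" line.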