ci: warm kurtosis GLOAS image cache via dedicated paths + scheduled workflow #21703
Merged
Pull request overview
This PR addresses a CI gap where test-kurtosis-gloas.yml restores a docker-cl-* image cache but (due to triggers + save-gating) the cache is not systematically populated on the default branch, causing PR runs to frequently cold-pull third-party images from Docker Hub.
Changes:
- Adds a reusable-workflow entrypoint to `test-kurtosis-gloas.yml` (`workflow_call`) with a `cl-images-only` input that runs a dedicated `warm-third-party-images` job and gates off the main GLOAS test matrix.
- Implements cache probing via `actions/cache/restore@v5` with `lookup-only: true`, pulling + saving images only when the cache key is missing.
- Introduces a new scheduled/path-filtered workflow to warm the GLOAS cache on `main` / `release/**` and daily via `schedule`.
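The entrypoint and gating described in the first bullet might look roughly like the fragment below. This is a sketch, not the merged workflow: only the names `cl-images-only`, `warm-third-party-images` and `gloas_test` come from the PR text, while the runner label, the input description and the placeholder steps are assumptions.

```yaml
# Sketch of the trigger and gating shape in test-kurtosis-gloas.yml.
# Runner labels and step bodies are placeholders, not the real workflow.
on:
  pull_request:
  workflow_dispatch:
  workflow_call:
    inputs:
      cl-images-only:
        description: Warm the third-party image cache and skip the tests
        type: boolean
        required: false
        default: false

jobs:
  warm-third-party-images:
    # Runs only when a caller passes cl-images-only: true.
    if: ${{ inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "probe the cache, then pull and save on a miss"

  gloas_test:
    # On pull_request / workflow_dispatch the input is unset (falsy),
    # so the test matrix runs as before.
    if: ${{ !inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "existing GLOAS test matrix"
```

Because the input defaults to `false` and is absent on the existing triggers, the two `if:` conditions are mutually exclusive and the pre-existing behaviour is unchanged unless a caller opts in.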
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `.github/workflows/test-kurtosis-gloas.yml` | Adds `workflow_call` + `cl-images-only` gating and a dedicated cache-warming job that only pulls/saves on genuine cache misses. |
| `.github/workflows/cache-warming-kurtosis-gloas-images.yml` | New driver workflow to populate the default-branch cache via push (paths-filtered), schedule, and manual dispatch. |
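For orientation, a driver workflow of the kind described in the second row could be as small as the sketch below. The workflow `name`, the job id `warm` and the cron time are assumptions (the PR only says "daily"); the branch and path filters follow the PR description.

```yaml
# Hypothetical shape of cache-warming-kurtosis-gloas-images.yml.
name: cache-warming-kurtosis-gloas-images

on:
  push:
    branches:
      - main
      - 'release/**'
    paths:
      # Re-warm whenever the pinned image versions in this file change.
      - .github/workflows/test-kurtosis-gloas.yml
  schedule:
    # Assumed time of day; repopulates the cache after eviction.
    - cron: '0 4 * * *'
  workflow_dispatch:

jobs:
  warm:
    # Call the reusable workflow in warm-only mode.
    uses: ./.github/workflows/test-kurtosis-gloas.yml
    with:
      cl-images-only: true
```

Scheduled runs execute on the default branch, which is what makes the resulting cache entry restorable from pull request runs.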
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request on Jun 11, 2026:

…rigontech#21741)

## Why

`docker/setup-buildx-action` boots BuildKit by pulling `moby/buildkit` from Docker Hub on every run, the last uncached Docker Hub dependency in the kurtosis jobs. In merge-queue run [27280175556](https://github.com/erigontech/erigon/actions/runs/27280175556/job/80572198019) that pull timed out (`Get "https://registry-1.docker.io/v2/": context deadline exceeded`), failing the CI Gate for erigontech#21723, ironically the PR closing the equivalent gap for the kurtosis engine-bootstrap images. The same Docker Hub connectivity window took out three other CI Gate runs that morning at the `kurtosis engine start` step.

erigontech#21735 added a retry around buildx setup, but a retry doesn't survive an outage longer than its window. This removes the hard dependency the same way as the rest of the image-caching series (erigontech#21695, erigontech#21703, erigontech#21723).

## What

- Pin the BuildKit image as `BUILDKIT_IMAGE: moby/buildkit:v0.30.0` (what the moving `buildx-stable-1` tag currently resolves to) and pass it to `setup-buildx-action` via `driver-opts: image=...`, in both `test-kurtosis-assertoor.yml` (`build-erigon-image`) and `test-kurtosis-gloas.yml` (`gloas_test`). Bump alongside the other pinned images.
- Cache it under a single shared key (`docker-buildkit-<image>`) with the established docker save/load + actions/cache pattern, and `docker load` it before buildx setup. buildx's docker-container driver falls back to a locally present image when its pull fails ("pulling failed, using local image"), so with a warm cache the builder boots even while Docker Hub is fully unreachable. Pinning via driver-opts is what makes the fallback engage: the local image name must match what buildx wants to boot.
- The cache-fill pull in the test jobs is best-effort (`continue-on-error`, save gated on pull success): buildx pulls the image itself either way, so a failed seed must not fail an otherwise-good run, and a failed pull never poisons the cache key with an empty archive.
- Warm jobs (`warm-third-party-images` in both files) pull strictly and save the same key; producing the cache is their purpose.

Both cache-warming workflows already path-filter on the edited files, so the cache is created on main right after this merges and refreshed daily against LRU eviction.

In gloas, buildx setup previously ran before any caching; the buildkit cache steps are inserted ahead of it.

Not covered (non-gating, can follow up if wanted): `ci-cd-main-branch-docker-images.yml` and `release.yml` also use `setup-buildx-action` but don't block PRs or the merge queue.

actionlint is clean (the two SC2086 infos it reports pre-exist on main in the kurtosis CLI install step).
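The pin-and-preload pattern described in this commit message can be sketched as a job-level fragment like the one below. It is an illustration under assumptions, not the committed workflow: the archive path `/tmp/buildkit.tar`, the step ids, and the `@v3` tag on `setup-buildx-action` are hypothetical; `BUILDKIT_IMAGE`, the `docker-buildkit-<image>` key shape and the `driver-opts: image=...` input come from the text above.

```yaml
# Job-level fragment (env + steps), test-job variant with a best-effort seed.
env:
  BUILDKIT_IMAGE: moby/buildkit:v0.30.0

steps:
  - name: Restore BuildKit image archive
    id: buildkit-cache
    uses: actions/cache/restore@v5
    with:
      path: /tmp/buildkit.tar            # assumed archive location
      key: docker-buildkit-${{ env.BUILDKIT_IMAGE }}

  - name: Load cached BuildKit image
    if: steps.buildkit-cache.outputs.cache-hit == 'true'
    run: docker load -i /tmp/buildkit.tar

  - name: Seed the cache (best effort)
    id: seed
    if: steps.buildkit-cache.outputs.cache-hit != 'true'
    continue-on-error: true              # a failed seed must not fail the run
    run: |
      docker pull "$BUILDKIT_IMAGE"
      docker save -o /tmp/buildkit.tar "$BUILDKIT_IMAGE"

  - name: Save BuildKit image archive
    # outcome is 'failure' when the pull failed and 'skipped' on a cache hit,
    # so an empty or partial archive is never saved under the key.
    if: steps.seed.outcome == 'success'
    uses: actions/cache/save@v5
    with:
      path: /tmp/buildkit.tar
      key: docker-buildkit-${{ env.BUILDKIT_IMAGE }}

  - name: Set up Docker Buildx
    uses: docker/setup-buildx-action@v3  # version tag is an assumption
    with:
      # Must name the same image that was loaded so the local fallback applies.
      driver-opts: image=${{ env.BUILDKIT_IMAGE }}
```

A warm job would use the same key but drop `continue-on-error`, since failing loudly on a bad pull is the point there.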
Problem
`test-kurtosis-gloas.yml` restores the `docker-cl-*` third-party image cache but nothing ever warms it on the default branch:
- It triggers on `pull_request` + `workflow_dispatch` only; there is no `push` / `merge_group` / `schedule`.
- The cache `save` is gated on `github.event_name != 'pull_request'` (since "ci: cache kurtosis infra images and retry engine bootstrap" #21602), and its only non-PR trigger is manual `workflow_dispatch`.

So the cache is never systematically populated on the default branch, and gloas PR runs cold-pull all 7 third-party images from Docker Hub on nearly every run. That leaves them exposed to the same Docker Hub flakes as the assertoor workflow, protected only by the 3-attempt retry.

This is a different cache key from the assertoor warmer (#21695): gloas's key omits `TEKU`, so that warmer doesn't cover it. And unlike assertoor, gloas does not run in `merge_group`, so a flake here fails a PR check rather than bouncing the merge queue. The blast radius is lower, but the root gap is the same.

Fix
Mirror #21695 for gloas:
- Make `test-kurtosis-gloas.yml` callable (`workflow_call`) with a new `cl-images-only` input that runs only a new `warm-third-party-images` job; the `gloas_test` matrix is gated off under it.
- Probe with `actions/cache/restore@v5` (`lookup-only`) + an explicit `actions/cache/save@v5` on miss: a ~10 s no-op when the cache already exists, pull + save only on a genuine miss.
- `cache-warming-kurtosis-gloas-images.yml` drives it on `push` to `main` / `release/**` filtered on `paths: [test-kurtosis-gloas.yml]` (re-warm on version bumps) + a daily `schedule` (repopulate after LRU eviction at the 500 GB cache ceiling) + `workflow_dispatch`.

Net effect: gloas PR runs restore the cache from the default branch instead of pulling from Docker Hub.
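The probe-then-save sequence in the warm job could be sketched as the steps below. The cache key `docker-cl-example-key`, the archive path and the image reference `example/image:tag` are placeholders; the real workflow derives its `docker-cl-*` key from its pinned image versions, which are not reproduced here.

```yaml
# Sketch of the warm job's steps; key, path and image are placeholders.
steps:
  - name: Probe cache (no download)
    id: probe
    uses: actions/cache/restore@v5
    with:
      path: /tmp/docker-cl-images.tar
      key: docker-cl-example-key
      lookup-only: true                  # check for the key, skip the restore

  - name: Pull and archive images on a miss
    if: steps.probe.outputs.cache-hit != 'true'
    run: |
      docker pull example/image:tag
      docker save -o /tmp/docker-cl-images.tar example/image:tag

  - name: Save cache on a miss
    if: steps.probe.outputs.cache-hit != 'true'
    uses: actions/cache/save@v5
    with:
      path: /tmp/docker-cl-images.tar
      key: docker-cl-example-key
```

Comparing against `'true'` rather than `'false'` is deliberate in this sketch: on a complete miss the `cache-hit` output can be empty, and `!= 'true'` treats that case as a miss as well.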
Notes
- The third-party image definitions are duplicated across `test-kurtosis-assertoor.yml` and `test-kurtosis-gloas.yml` (and now their two warmers). Unifying them into a single source so one warmer covers both is a sensible future cleanup; it is called out here, not done.

Validation
- `actionlint` clean on both workflows (the one `SC2086` info is pre-existing, in an untouched `run:` block).
- With `cl-images-only`, only `warm-third-party-images` runs; on `pull_request` / `workflow_dispatch` (no input), `gloas_test` runs as before.
- The PR merge touches `test-kurtosis-gloas.yml`, so the `paths` trigger warms the cache on merge, leaving no cold-start gap.