ci: cache kurtosis engine-bootstrap images (curl-jq, traefik, alpine) #21723
Merged
`kurtosis engine start` launches three helper containers beyond the engine image itself: the logs-aggregator healthcheck (`badouralix/curl-jq`), the reverse proxy (`traefik`) and a volume-init container (`alpine`). These were the only engine-start images missing from the `docker-cl-*` cache, so an engine start still required Docker Hub and failed when registry egress was down, exhausting all three retries, e.g. https://github.com/erigontech/erigon/actions/runs/27271494252/job/80543276225

Pre-loaded images are used without pulling (image download mode "missing"), so caching these three makes engine start fully cache-served. The cache-warming workflows path-filter on the edited workflow files, so merging this re-warms the new key on main automatically.
Pull request overview
This PR improves the robustness of the Kurtosis-based CI workflows by extending the existing Docker image cache to include additional Kurtosis engine-bootstrap helper images, reducing test flakiness when Docker Hub connectivity is unreliable.
Changes:
- Add env vars for three Kurtosis engine-bootstrap helper images (curl-jq, traefik, alpine).
- Include these images in the pull → `docker save` → cache → `docker load` pipeline.
- Extend the `docker-cl-*` cache keys to incorporate the additional images.
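The pull → `docker save` → cache → `docker load` pipeline listed above could look roughly like the following workflow fragment. This is a minimal sketch, not the PR's actual diff: the env var names, job name, step ids, cache path, key layout and the `actions/cache@v4` version are assumptions chosen for illustration. Only the three image references and the `docker-cl-*` key prefix come from the PR text.

```yaml
# Sketch only: env var names, ids, paths and the exact key layout are hypothetical.
env:
  CURL_JQ_IMAGE: badouralix/curl-jq:latest
  TRAEFIK_IMAGE: traefik:2.10.6
  ALPINE_IMAGE: alpine:3.17

jobs:
  kurtosis-test:
    runs-on: ubuntu-latest
    steps:
      - name: Restore cached engine-bootstrap images
        id: bootstrap-cache
        uses: actions/cache@v4
        with:
          path: /tmp/docker-bootstrap
          # Any change to an image reference yields a new key and a fresh pull.
          key: docker-cl-example-${{ env.CURL_JQ_IMAGE }}-${{ env.TRAEFIK_IMAGE }}-${{ env.ALPINE_IMAGE }}

      - name: Pull and save images on cache miss
        if: steps.bootstrap-cache.outputs.cache-hit != 'true'
        run: |
          mkdir -p /tmp/docker-bootstrap
          docker pull "$CURL_JQ_IMAGE"
          docker pull "$TRAEFIK_IMAGE"
          docker pull "$ALPINE_IMAGE"
          docker save -o /tmp/docker-bootstrap/images.tar \
            "$CURL_JQ_IMAGE" "$TRAEFIK_IMAGE" "$ALPINE_IMAGE"

      - name: Load images into the local Docker daemon
        run: docker load -i /tmp/docker-bootstrap/images.tar

      - name: Start Kurtosis engine
        # With the images already present, no registry pull should be needed.
        run: kurtosis engine start
```

Putting the image references in the key means a tag bump invalidates the cache automatically, which matches the PR's stated intent of extending the cache keys to cover the new images.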
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| .github/workflows/test-kurtosis-gloas.yml | Cache Kurtosis engine-bootstrap helper images and extend cache keys/restore/load steps accordingly. |
| .github/workflows/test-kurtosis-assertoor.yml | Cache Kurtosis engine-bootstrap helper images and extend cache keys/restore/load steps accordingly. |
Addresses Copilot review: cached `latest` is intentionally frozen at warm time and refreshed by any cache-key change, not tracked live.
taratorio approved these changes on Jun 10, 2026
lystopad approved these changes on Jun 10, 2026
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request on Jun 10, 2026
In merge-queue run [27280175556](https://github.com/erigontech/erigon/actions/runs/27280175556/job/80572198019) `kurtosis / build-erigon-image` failed in the "Set up Docker Buildx" step: booting BuildKit pulls `moby/buildkit:buildx-stable-1` from Docker Hub, and the pull timed out (`Get "https://registry-1.docker.io/v2/": context deadline exceeded`), failing the CI Gate for erigontech#21723.

The job already retries the erigon image build on transient failures, but the buildx setup step runs before that retry and wasn't covered. This applies the same pattern: the first setup attempt is `continue-on-error`, and a second attempt runs only if the first failed. Both attempts failing still fails the job.

Co-authored-by: noop <noop@noop>
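The two-attempt pattern described in this commit message can be sketched as below. This is an illustrative fragment under assumptions: the step names, the `buildx_first` id and the `@v3` action version are hypothetical, and the actual commit may structure the steps differently.

```yaml
# Sketch only: step names, ids and the action version are hypothetical.
jobs:
  build-erigon-image:
    runs-on: ubuntu-latest
    steps:
      - name: Set up Docker Buildx (attempt 1)
        id: buildx_first
        # A failure here is recorded but does not fail the job.
        continue-on-error: true
        uses: docker/setup-buildx-action@v3

      - name: Set up Docker Buildx (attempt 2)
        # `outcome` is the raw result before continue-on-error is applied,
        # so this runs only when attempt 1 actually failed.
        if: steps.buildx_first.outcome == 'failure'
        # No continue-on-error: a second failure fails the job.
        uses: docker/setup-buildx-action@v3
```

The condition checks `outcome` rather than `conclusion` because `conclusion` reports `success` for a failed step that has `continue-on-error: true`.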
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request on Jun 11, 2026
…rigontech#21741)

## Why

`docker/setup-buildx-action` boots BuildKit by pulling `moby/buildkit` from Docker Hub on every run, which made it the last uncached Docker Hub dependency in the kurtosis jobs. In merge-queue run [27280175556](https://github.com/erigontech/erigon/actions/runs/27280175556/job/80572198019) that pull timed out (`Get "https://registry-1.docker.io/v2/": context deadline exceeded`), failing the CI Gate for erigontech#21723, ironically the PR closing the equivalent gap for the kurtosis engine-bootstrap images. The same Docker Hub connectivity window took out three other CI Gate runs that morning at the `kurtosis engine start` step.

erigontech#21735 added a retry around buildx setup, but a retry doesn't survive an outage longer than its window. This removes the hard dependency the same way as the rest of the image-caching series (erigontech#21695, erigontech#21703, erigontech#21723).

## What

- Pin the BuildKit image as `BUILDKIT_IMAGE: moby/buildkit:v0.30.0` (what the moving `buildx-stable-1` tag currently resolves to) and pass it to `setup-buildx-action` via `driver-opts: image=...`, in both `test-kurtosis-assertoor.yml` (`build-erigon-image`) and `test-kurtosis-gloas.yml` (`gloas_test`). Bump alongside the other pinned images.
- Cache it under a single shared key (`docker-buildkit-<image>`) with the established docker save/load + actions/cache pattern, and `docker load` it before buildx setup. buildx's docker-container driver falls back to a locally present image when its pull fails ("pulling failed, using local image"), so with a warm cache the builder boots even while Docker Hub is fully unreachable. Pinning via driver-opts is what makes the fallback engage: the local image name must match what buildx wants to boot.
- The cache-fill pull in the test jobs is best-effort (`continue-on-error`, save gated on pull success): buildx pulls the image itself either way, so a failed seed must not fail an otherwise-good run, and a failed pull never poisons the cache key with an empty archive.
- Warm jobs (`warm-third-party-images` in both files) pull strictly and save the same key, since producing the cache is their purpose.

Both cache-warming workflows already path-filter on the edited files, so the cache is created on main right after this merges and refreshed daily against LRU eviction.

In gloas, buildx setup previously ran before any caching; the buildkit cache steps are inserted ahead of it.

Not covered (non-gating, can follow up if wanted): `ci-cd-main-branch-docker-images.yml` and `release.yml` also use `setup-buildx-action` but don't block PRs or the merge queue.

actionlint is clean (the two SC2086 infos it reports pre-exist on main in the kurtosis CLI install step).
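A rough sketch of the pinned, cache-backed BuildKit setup this commit message describes is shown below. The `BUILDKIT_IMAGE` value, the `docker-buildkit-<image>` key shape and the `driver-opts: image=...` input come from the message; the step names, ids, cache path and action versions are assumptions for illustration and not the commit's actual steps.

```yaml
# Sketch only: step names, ids, cache path and action versions are hypothetical.
env:
  BUILDKIT_IMAGE: moby/buildkit:v0.30.0

jobs:
  build-erigon-image:
    runs-on: ubuntu-latest
    steps:
      - name: Restore BuildKit image cache
        id: buildkit-cache
        uses: actions/cache@v4
        with:
          path: /tmp/docker-buildkit
          key: docker-buildkit-${{ env.BUILDKIT_IMAGE }}

      - name: Seed cache (best-effort pull)
        id: buildkit-pull
        if: steps.buildkit-cache.outputs.cache-hit != 'true'
        # A failed seed must not fail an otherwise-good run.
        continue-on-error: true
        run: docker pull "$BUILDKIT_IMAGE"

      - name: Save image only if the pull succeeded
        # Skipped on a cache hit (outcome is 'skipped') and on a failed pull,
        # so no empty archive is ever written for the cache key.
        if: steps.buildkit-pull.outcome == 'success'
        run: |
          mkdir -p /tmp/docker-buildkit
          docker save -o /tmp/docker-buildkit/buildkit.tar "$BUILDKIT_IMAGE"

      - name: Load cached BuildKit image
        if: steps.buildkit-cache.outputs.cache-hit == 'true'
        run: docker load -i /tmp/docker-buildkit/buildkit.tar

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          # The pinned name must match the locally loaded image for the
          # local-image fallback described above to apply.
          driver-opts: image=${{ env.BUILDKIT_IMAGE }}
```

A strict warm job would use the same key but drop `continue-on-error`, so that a failed pull fails the warm run instead of silently leaving the cache empty.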
Problem
`kurtosis / assertoor_regular_serial_test` failed on #21646 (job link): the runner lost Docker Hub connectivity for the entire job (even the Docker login step timed out against `registry-1.docker.io`), and `kurtosis engine start` exhausted all 3 retries failing to pull `badouralix/curl-jq:latest`, the logs-aggregator healthcheck container. A sibling job on a different runner passed at the same time, so this was per-runner registry egress, i.e. exactly the class of flake the cached-image setup exists to absorb.
Root cause
The `docker-cl-*` cache covers the CL images plus kurtosis engine/core/expander/vector/fluent-bit, but `kurtosis engine start` launches three more helper containers, none of them cached:
- `badouralix/curl-jq:latest`: logs-aggregator healthcheck (`logs_aggregator_functions/shared_helpers.go` in kurtosis-tech/kurtosis)
- `traefik:2.10.6`: reverse proxy (`reverse_proxy_functions/implementations/traefik/consts.go`)
- `alpine:3.17`: volume-init helper (`engine_functions/logs_collector_functions`)

A passing run's log confirms these are the only three images pulled at engine start (every cached image shows no pull line), so the engine-start step, whose stated purpose is to keep registry blips out of the test step, still had a hard Docker Hub dependency.
Fix
Add the three images to the existing pull → `docker save` → cache → `docker load` pipeline and cache keys in both `test-kurtosis-assertoor.yml` and `test-kurtosis-gloas.yml`, following the established vector/fluent-bit pattern. Kurtosis uses image download mode "missing", so pre-loaded images are used without contacting the registry (verified in the passing run: vector/fluent-bit are started without pull lines). After this, `kurtosis engine start` is fully cache-served.
Rollout
Merging triggers `cache-warming-kurtosis-cl-images.yml` and `cache-warming-kurtosis-gloas-images.yml` (via their `paths` filters), which re-warm the main-scope cache under the new key. No manual action needed.
Residual exposure (out of scope)
Enclave-time images with intentionally mutable tags (`ethpandaops/ethereum-genesis-generator`, `rpc-snooper:latest`, `spamoor:master`, suite-specific CL devnet tags, …) are still pulled from Docker Hub during `kurtosis run`. Pinning/caching those would change test semantics (they deliberately track moving tags), so they stay as-is.
Validation
- actionlint: no new findings (the two pre-existing SC2086 infos in the apt-get line are unchanged)
- make lint: 0 issues
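For readers unfamiliar with the rollout mechanism, a path-filtered cache-warming trigger could look roughly like this. It is a hypothetical sketch and not the content of either cache-warming workflow: the workflow name, job name, cron time and the exact set of filtered paths are assumptions, and the daily schedule reflects the "refreshed daily" remark in the follow-up commit message rather than anything stated in this PR.

```yaml
# Sketch only: names, cron time and the filtered path list are hypothetical.
name: cache-warming-example
on:
  push:
    branches: [main]
    # Re-warm whenever a workflow that defines the cached image set changes.
    paths:
      - .github/workflows/test-kurtosis-assertoor.yml
      - .github/workflows/test-kurtosis-gloas.yml
  schedule:
    # Periodic refresh so the cache entry is not evicted for inactivity.
    - cron: "0 4 * * *"

jobs:
  warm:
    runs-on: ubuntu-latest
    steps:
      - name: Warm image cache
        run: echo "pull images, docker save them, and store under the new cache key"
```

Because the push trigger is scoped to `main`, the cache entry is created in the main-branch scope, which is the scope the rollout note relies on for PR and merge-queue runs to restore from.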