
ci: cache kurtosis engine-bootstrap images (curl-jq, traefik, alpine) #21723

Merged: yperbasis merged 2 commits into main from yperbasis/kurtosis-engine-image-cache on Jun 10, 2026

@yperbasis


Problem

kurtosis / assertoor_regular_serial_test failed on #21646 (job link): the runner lost Docker Hub connectivity for the entire job (even the Docker login step timed out against registry-1.docker.io), and kurtosis engine start exhausted all 3 retries failing to pull badouralix/curl-jq:latest — the logs-aggregator healthcheck container. A sibling job on a different runner passed at the same time, so this was per-runner registry egress, i.e. exactly the class of flake the cached-image setup exists to absorb.

Root cause

The docker-cl-* cache covers the CL images plus kurtosis engine/core/expander/vector/fluent-bit, but kurtosis engine start launches three more helper containers, none of them cached:

  • badouralix/curl-jq:latest — logs-aggregator healthcheck (logs_aggregator_functions/shared_helpers.go in kurtosis-tech/kurtosis)
  • traefik:2.10.6 — reverse proxy (reverse_proxy_functions/implementations/traefik/consts.go)
  • alpine:3.17 — volume-init helper (engine_functions/logs_collector_functions)

A passing run's log confirms these are the only three images pulled at engine start (every cached image shows no pull line), so the engine-start step — whose stated purpose is to keep registry blips out of the test step — still had a hard Docker Hub dependency.
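One way to spot such a gap before engine start is to ask the local Docker daemon which required images are absent. The sketch below is illustrative only: `missing_images` is a hypothetical helper name, not something from the workflows, and it relies on `docker image inspect` returning a non-zero status for an image that is not present locally.

```shell
# Hypothetical helper: print every image reference that is NOT present locally.
# Anything printed here would have to be pulled from the registry at engine start.
missing_images() {
  local image
  for image in "$@"; do
    docker image inspect "$image" >/dev/null 2>&1 || printf '%s\n' "$image"
  done
}

# Example call (not executed here):
#   missing_images badouralix/curl-jq:latest traefik:2.10.6 alpine:3.17
```

An empty result after the cache-load step would suggest that engine start no longer needs the registry for these images.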

Fix

Add the three images to the existing pull → docker save → cache → docker load pipeline and cache keys in both test-kurtosis-assertoor.yml and test-kurtosis-gloas.yml, following the established vector/fluent-bit pattern. Kurtosis uses image download mode "missing", so pre-loaded images are used without contacting the registry (verified in the passing run: vector/fluent-bit are started without pull lines). After this, kurtosis engine start is fully cache-served.
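The pull → docker save → docker load part of that pipeline can be sketched roughly as follows. This is a minimal illustration, not the workflow's actual steps: the function names, the archive naming scheme and the `ENGINE_BOOTSTRAP_IMAGES` variable are assumptions, and the cache step itself (which stores and restores the archive directory) is left out.

```shell
# Image list taken from the PR description; the variable name is hypothetical.
ENGINE_BOOTSTRAP_IMAGES="badouralix/curl-jq:latest traefik:2.10.6 alpine:3.17"

# Turn an image reference into a file-system-safe archive name.
image_archive_name() {
  printf '%s.tar\n' "$(printf '%s' "$1" | tr '/:' '__')"
}

# Cache-miss path: pull each image and save it into the cache directory.
save_images() {
  local dir="$1"; shift
  mkdir -p "$dir"
  local image
  for image in "$@"; do
    docker pull "$image" || return 1
    docker save -o "$dir/$(image_archive_name "$image")" "$image" || return 1
  done
}

# Cache-hit path: load every archive found in the cache directory.
load_images() {
  local dir="$1" archive
  for archive in "$dir"/*.tar; do
    [ -e "$archive" ] || continue   # directory may be empty
    docker load -i "$archive" || return 1
  done
}
```

On a cache miss the job would call `save_images` and let the cache step persist the directory; on a hit it would call only `load_images`, so no registry access is needed for these images.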

Rollout

  • The cache key changes, so this PR's own kurtosis runs cold-pull once and exercise the new pull/save path.
  • On merge, the push touching these files triggers cache-warming-kurtosis-cl-images.yml and cache-warming-kurtosis-gloas-images.yml (paths filters) which re-warm the main-scope cache under the new key. No manual action needed.
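To illustrate why adding images changes the cache key, here is a small sketch that derives a key from the image list. The real key format used by the workflows is not shown in this PR, so the `cache_key` function and its naming scheme are purely hypothetical.

```shell
# Hypothetical key builder: prefix plus every image reference, made key-safe.
# Adding an image or bumping a tag produces a different key, which forces a re-warm.
cache_key() {
  local prefix="$1"; shift
  local key="$prefix" image
  for image in "$@"; do
    key="$key-$(printf '%s' "$image" | tr '/:' '__')"
  done
  printf '%s\n' "$key"
}

# Example call (not executed here):
#   cache_key docker-cl traefik:2.10.6 alpine:3.17
```

A key built this way grows with the image list, which is why checking the resolved length against the 512-character limit is worthwhile.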

Residual exposure (out of scope)

Enclave-time images with intentionally mutable tags (ethpandaops/ethereum-genesis-generator, rpc-snooper:latest, spamoor:master, suite-specific CL devnet tags, …) are still pulled from Docker Hub during kurtosis run. Pinning/caching those would change test semantics (they deliberately track moving tags), so they stay as-is.

Validation

  • actionlint: no new findings (the two pre-existing SC2086 infos in the apt-get line are unchanged)
  • YAML parse clean; resolved cache key ≈ 209 chars (limit 512)
  • make lint: 0 issues

kurtosis engine start launches three helper containers beyond the
engine image itself: the logs-aggregator healthcheck
(badouralix/curl-jq), the reverse proxy (traefik) and a volume-init
container (alpine). These were the only engine-start images missing
from the docker-cl-* cache, so an engine start still required Docker
Hub and died when registry egress was down — exhausting all three
retries, e.g.
https://github.com/erigontech/erigon/actions/runs/27271494252/job/80543276225

Pre-loaded images are used without pulling (image download mode
"missing"), so caching these three makes engine start fully
cache-served. The cache-warming workflows path-filter on these files,
so merging this re-warms the new key on main automatically.

Copilot AI left a comment


Pull request overview

This PR improves the robustness of the Kurtosis-based CI workflows by extending the existing Docker image cache to include additional Kurtosis engine-bootstrap helper images, reducing test flakiness when Docker Hub connectivity is unreliable.

Changes:

  • Add env vars for three Kurtosis engine-bootstrap helper images (curl-jq, traefik, alpine).
  • Include these images in the pull → docker save → cache → docker load pipeline.
  • Extend the docker-cl-* cache keys to incorporate the additional images.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| .github/workflows/test-kurtosis-gloas.yml | Cache Kurtosis engine-bootstrap helper images and extend cache keys/restore/load steps accordingly. |
| .github/workflows/test-kurtosis-assertoor.yml | Cache Kurtosis engine-bootstrap helper images and extend cache keys/restore/load steps accordingly. |


Comment thread .github/workflows/test-kurtosis-gloas.yml
Comment thread .github/workflows/test-kurtosis-assertoor.yml
Addresses Copilot review: cached `latest` is intentionally frozen at
warm time and refreshed by any cache-key change, not tracked live.
@yperbasis yperbasis enabled auto-merge June 10, 2026 12:25
@yperbasis yperbasis added this pull request to the merge queue Jun 10, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 10, 2026
@yperbasis yperbasis added this pull request to the merge queue Jun 10, 2026
Merged via the queue into main with commit cfd0de5 Jun 10, 2026
179 of 181 checks passed
@yperbasis yperbasis deleted the yperbasis/kurtosis-engine-image-cache branch June 10, 2026 15:25
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request Jun 10, 2026
In merge-queue run
[27280175556](https://github.com/erigontech/erigon/actions/runs/27280175556/job/80572198019)
`kurtosis / build-erigon-image` failed in the "Set up Docker Buildx"
step: booting BuildKit pulls `moby/buildkit:buildx-stable-1` from Docker
Hub, and the pull timed out (`Get "https://registry-1.docker.io/v2/":
context deadline exceeded`), failing the CI Gate for erigontech#21723.

The job already retries the erigon image build on transient failures,
but the buildx setup step runs before that retry and wasn't covered.
This applies the same pattern: the first setup attempt is
`continue-on-error`, and a second attempt runs only if the first failed.
Both attempts failing still fails the job.
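
The control flow of that pattern can be shown with a small shell analogue. The commit itself uses a `continue-on-error` step followed by a conditional second step, not a shell function, so `run_with_one_retry` is a hypothetical stand-in for the same logic.

```shell
# Run a command; if it fails, run it exactly once more.
# The exit status is that of the last attempt, so two failures still fail the caller.
run_with_one_retry() {
  "$@" && return 0
  echo "first attempt failed, retrying: $*" >&2
  "$@"
}

# Example call (not executed here):
#   run_with_one_retry docker pull moby/buildkit:buildx-stable-1
```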

Co-authored-by: noop <noop@noop>
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request Jun 11, 2026
…rigontech#21741)

## Why

`docker/setup-buildx-action` boots BuildKit by pulling `moby/buildkit`
from Docker Hub on every run — the last uncached Docker Hub dependency
in the kurtosis jobs. In merge-queue run
[27280175556](https://github.com/erigontech/erigon/actions/runs/27280175556/job/80572198019)
that pull timed out (`Get "https://registry-1.docker.io/v2/": context
deadline exceeded`), failing the CI Gate for erigontech#21723 — ironically the PR
closing the equivalent gap for the kurtosis engine-bootstrap images. The
same Docker Hub connectivity window took out three other CI Gate runs
that morning at the `kurtosis engine start` step.

erigontech#21735 added a retry around buildx setup, but a retry doesn't survive an
outage longer than its window. This removes the hard dependency the same
way as the rest of the image-caching series (erigontech#21695, erigontech#21703, erigontech#21723).

## What

- Pin the BuildKit image as `BUILDKIT_IMAGE: moby/buildkit:v0.30.0`
(what the moving `buildx-stable-1` tag currently resolves to) and pass
it to `setup-buildx-action` via `driver-opts: image=...`, in both
`test-kurtosis-assertoor.yml` (`build-erigon-image`) and
`test-kurtosis-gloas.yml` (`gloas_test`). Bump alongside the other
pinned images.
- Cache it under a single shared key (`docker-buildkit-<image>`) with
the established docker save/load + actions/cache pattern, and `docker
load` it before buildx setup. buildx's docker-container driver falls
back to a locally present image when its pull fails ("pulling failed,
using local image"), so with a warm cache the builder boots even while
Docker Hub is fully unreachable. Pinning via driver-opts is what makes
the fallback engage — the local image name must match what buildx wants
to boot.
- The cache-fill pull in the test jobs is best-effort
(`continue-on-error`, save gated on pull success): buildx pulls the
image itself either way, so a failed seed must not fail an
otherwise-good run, and a failed pull never poisons the cache key with
an empty archive.
- Warm jobs (`warm-third-party-images` in both files) pull strictly and
save the same key — producing the cache is their purpose. Both
cache-warming workflows already path-filter on the edited files, so the
cache is created on main right after this merges and refreshed daily
against LRU eviction.
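
The best-effort seeding described for the test jobs might look roughly like this in shell form. It is a sketch under assumptions: `seed_buildkit_cache` and the archive path are hypothetical, and the real jobs express the same idea with `continue-on-error` and a save step gated on pull success.

```shell
# Pinned tag from the PR description.
BUILDKIT_IMAGE="moby/buildkit:v0.30.0"

# Best-effort seed: never fails the caller, and leaves an archive behind
# only when both the pull and the save succeeded.
seed_buildkit_cache() {
  local archive="$1"
  if docker pull "$BUILDKIT_IMAGE"; then
    # Remove a partial archive so a failed save cannot poison the cache.
    docker save -o "$archive" "$BUILDKIT_IMAGE" || rm -f "$archive"
  else
    echo "pull failed; leaving cache unseeded" >&2
  fi
  return 0
}
```

A warm job would skip the leniency and let a failed pull or save fail the job, since producing the archive is its whole purpose.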

In gloas, buildx setup previously ran before any caching; the buildkit
cache steps are inserted ahead of it.

Not covered (non-gating, can follow up if wanted):
`ci-cd-main-branch-docker-images.yml` and `release.yml` also use
`setup-buildx-action` but don't block PRs or the merge queue.

actionlint is clean (the two SC2086 infos it reports pre-exist on main
in the kurtosis CLI install step).