ci: warm kurtosis third-party image cache on default branch for merge-queue runs#21695

Merged
taratorio merged 3 commits into main from taratorio/kurtosis-warm-cl-image-cache
Jun 10, 2026

@taratorio taratorio commented Jun 9, 2026

Member

Problem

The kurtosis matrix's third-party images (lighthouse, teku, assertoor, kurtosis engine/core/expander, vector, fluent-bit) are cached via actions/cache, but that cache is only ever written in PR scope, never on the default branch — so the merge queue gets no protection from it.

Why: cache-warming.yml calls this workflow with cache-warming-only: true, which skips the assertoor_test matrix job — the only job that pulls + saves those images. So docker-cl-* is never saved to refs/heads/main. (Every existing docker-cl-* cache entry is scoped to refs/pull/NNNNN/merge.) Merge-queue runs execute on ephemeral gh-readonly-queue/main/* branches and can only restore caches from the default branch — so every merge-queue run misses and pulls all 8 images from Docker Hub, fully exposed to Docker Hub flakes on the path that gates merges.

Example failure

https://github.com/erigontech/erigon/actions/runs/27067265745/job/79891001617 — a merge-queue run for #21659. The assertoor_caplin-minimal_parallel_test shard missed the cache, tried to pull sigp/lighthouse:v7.0.1, hit registry-1.docker.io … context deadline exceeded on all 3 retry attempts, and fast-cancelled the whole merge-group run.

Fix

Warm the docker-cl-* cache on the default branch via a dedicated workflow (warm-kurtosis-cl-images.yml):

  • Triggers: push to main/release/** filtered on paths: [test-kurtosis-assertoor.yml], plus a daily schedule. The cache key is derived from the pinned image versions, which live in that file, so a push only needs to re-warm the cache on a version bump. The daily schedule repopulates the cache if it is LRU-evicted between bumps (the repo sits at the 500 GB cache ceiling, so eviction is active).
  • It calls test-kurtosis-assertoor.yml with a new cl-images-only input that runs only the warm job — build-erigon-image and the test matrix are gated off, so the schedule/paths runs don't rebuild the image or run tests.
  • The warm job uses actions/cache/restore@v5 with lookup-only: true + an explicit actions/cache/save@v5 on miss: when the cache already exists it's a ~10 s no-op (no download), and it only pulls + saves on a genuine miss.
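
As a rough illustration of the trigger shape described in the list above, a caller workflow along these lines would do the job. This is a minimal sketch, not the merged file: the workflow name, the job id, the cron time, and the full `.github/workflows/` path are assumptions; only the file names, the branch and path filters, the daily schedule, and the `cl-images-only` input come from the description.

```yaml
# Hypothetical sketch of warm-kurtosis-cl-images.yml.
name: Warm kurtosis CL images   # assumed display name

on:
  push:
    branches:
      - main
      - 'release/**'
    paths:
      # Assumed full path; the description abbreviates it to the file name.
      - '.github/workflows/test-kurtosis-assertoor.yml'
  schedule:
    # Assumed time of day; the description only says "daily".
    - cron: '0 3 * * *'

jobs:
  warm:   # assumed job id
    # Reusable-workflow call; only the warm job runs under this input.
    uses: ./.github/workflows/test-kurtosis-assertoor.yml
    with:
      cl-images-only: true
```

Because push and schedule runs execute on the default branch (or a release branch), any cache entry the called job saves lands in a scope that merge-queue branches are allowed to restore from.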

Net effect: merge-queue and first-PR runs restore the CL cache from the default branch instead of pulling from Docker Hub.
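
The lookup-then-save behaviour of the warm job can be sketched roughly as follows. This is an illustration under stated assumptions, not the job as merged: the key shape, the `LIGHTHOUSE_VERSION` variable, the runner label, and the archive file name are hypothetical, and only one of the eight images is shown. The `/tmp/docker-cache` path, the `lookup-only` restore, and the explicit save on miss are taken from the PR text.

```yaml
jobs:
  warm-third-party-images:
    if: ${{ inputs.cl-images-only }}
    runs-on: ubuntu-latest            # assumed runner
    env:
      LIGHTHOUSE_VERSION: v7.0.1      # hypothetical variable; tag taken from the example failure
    steps:
      - name: Check whether the cache entry already exists
        id: lookup
        uses: actions/cache/restore@v5
        with:
          path: /tmp/docker-cache
          key: docker-cl-lighthouse-${{ env.LIGHTHOUSE_VERSION }}   # assumed key shape
          lookup-only: true           # report hit or miss without downloading

      - name: Pull and archive images on miss
        if: steps.lookup.outputs.cache-hit != 'true'
        run: |
          mkdir -p /tmp/docker-cache
          docker pull "sigp/lighthouse:${LIGHTHOUSE_VERSION}"
          docker save "sigp/lighthouse:${LIGHTHOUSE_VERSION}" \
            -o /tmp/docker-cache/lighthouse.tar

      - name: Save cache on miss
        if: steps.lookup.outputs.cache-hit != 'true'
        uses: actions/cache/save@v5
        with:
          path: /tmp/docker-cache
          key: docker-cl-lighthouse-${{ env.LIGHTHOUSE_VERSION }}
```

Comparing `cache-hit` against `'true'` rather than `'false'` matters here, since the output is an empty string when no entry matches at all.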

Scope

  • build-erigon-image is intentionally left on its every-push cadence. Its BuildKit layer cache is source-dependent (the base + go mod download layers track Dockerfile/go.mod/go.sum, the compile layer changes every commit), a different concern from these static version-pinned images. Optimizing its warming (paths on Dockerfile/go.mod/go.sum + a deps-stage split so the warm skips the compile) is a possible follow-up, not in this PR.
  • Complementary to ci: retry kurtosis erigon image build on transient registry/cache failures #21693 (retry on the erigon image build) — a different Docker Hub touchpoint.

Validation

  • actionlint clean on both workflows (the one SC2086 info is pre-existing, in an untouched run: block).
  • Gating verified: with cl-images-only, only warm-third-party-images runs; cache-warming.yml (cache-warming-only) still warms build-erigon-image every push; PR/merge_group still run the full matrix and now restore the default-branch CL cache.

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the Kurtosis Assertoor reusable workflow to ensure the docker-cl-* third-party image cache is written on the default branch during cache-warming runs, so merge-queue runs can reliably restore it and avoid pulling these images from Docker Hub.

Changes:

  • Add a warm-third-party-images job that runs only when inputs.cache-warming-only is true.
  • Restore (and, on miss, populate) the docker-cl-* cache by pulling and docker save-ing the pinned third-party images into /tmp/docker-cache.

@yperbasis yperbasis added the QA label Jun 9, 2026
lystopad commented Jun 9, 2026

Member

LGTM.

@lystopad lystopad added this pull request to the merge queue Jun 9, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 9, 2026
@taratorio taratorio added this pull request to the merge queue Jun 10, 2026
@taratorio taratorio requested a review from AskAlexSharov June 10, 2026 02:57
Merged via the queue into main with commit b81f8f2 Jun 10, 2026
90 of 91 checks passed
@taratorio taratorio deleted the taratorio/kurtosis-warm-cl-image-cache branch June 10, 2026 03:33
Lord1Egypt pushed a commit to Lord1Egypt/erigon that referenced this pull request Jun 10, 2026
…orkflow (erigontech#21703)

## Problem

`test-kurtosis-gloas.yml` restores the `docker-cl-*` third-party image
cache but nothing ever warms it on the default branch:

- It triggers on `pull_request` + `workflow_dispatch` only — no `push` /
`merge_group` / `schedule`.
- Its cache `save` is gated `github.event_name != 'pull_request'` (since
erigontech#21602), and its only non-PR trigger is manual `workflow_dispatch`.

So the cache is never systematically populated on the default branch,
and gloas PR runs cold-pull all 7 third-party images from Docker Hub on
~every run — exposed to the same Docker Hub flakes as the assertoor
workflow, protected only by the 3-attempt retry.

This is a **different cache key** from the assertoor warmer (erigontech#21695) —
gloas's key omits `TEKU` — so that warmer doesn't cover it. And unlike
assertoor, gloas does **not** run in `merge_group`, so a flake here
fails a PR check rather than bouncing the merge queue — lower blast
radius, same root gap.

## Fix

Mirror erigontech#21695 for gloas:

- Make `test-kurtosis-gloas.yml` callable (`workflow_call`) with a new
**`cl-images-only`** input that runs *only* a new
`warm-third-party-images` job; the `gloas_test` matrix is gated off
under it.
- The warm job uses **`actions/cache/restore@v5` (`lookup-only`)** + an
explicit **`actions/cache/save@v5`** on miss — a ~10 s no-op when the
cache already exists, pull + save only on a genuine miss.
- New **`cache-warming-kurtosis-gloas-images.yml`** drives it on `push`
to `main`/`release/**` filtered on `paths: [test-kurtosis-gloas.yml]`
(re-warm on version bumps) + a daily `schedule` (repopulate after LRU
eviction at the 500 GB cache ceiling) + `workflow_dispatch`.
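
The gating described in the list above might look roughly like this inside `test-kurtosis-gloas.yml`. It is a sketch under assumptions: the input description, the runner label, and the placeholder steps are hypothetical; the trigger set, the `cl-images-only` input, and the two job names come from the commit message.

```yaml
on:
  pull_request:
  workflow_dispatch:
  workflow_call:
    inputs:
      cl-images-only:
        description: Run only the third-party image cache warm job   # assumed wording
        type: boolean
        required: false
        default: false

jobs:
  warm-third-party-images:
    # Runs only when a caller explicitly asks for cache warming.
    if: ${{ inputs.cl-images-only }}
    runs-on: ubuntu-latest   # assumed runner
    steps:
      - run: echo "placeholder for lookup-only restore, then pull + save on miss"

  gloas_test:
    # On pull_request the input is unset, so this evaluates to true
    # and the matrix runs as before.
    if: ${{ !inputs.cl-images-only }}
    runs-on: ubuntu-latest   # assumed runner
    steps:
      - run: echo "placeholder for the existing matrix test steps"
```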

Net effect: gloas PR runs restore the cache from the default branch
instead of pulling from Docker Hub.

## Notes

- Follow-up to erigontech#21695 (assertoor CL cache); same pattern, gloas's own
key/image set (no teku).
- The pinned image versions are duplicated across
`test-kurtosis-assertoor.yml` and `test-kurtosis-gloas.yml` (and now
their two warmers). Unifying them into a single source so one warmer
covers both is a sensible future cleanup — called out, not done here.

## Validation

- `actionlint` clean on both workflows (the one `SC2086` info is
pre-existing, in an untouched `run:` block).
- Gating: with `cl-images-only`, only `warm-third-party-images` runs; on
`pull_request` / `workflow_dispatch` (no input), `gloas_test` runs as
before. The PR-merge touches `test-kurtosis-gloas.yml`, so the `paths`
trigger warms the cache on merge — no cold-start gap.

Co-authored-by: lystopad <oleksandr.lystopad@erigon.tech>
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request Jun 11, 2026
…rigontech#21741)

## Why

`docker/setup-buildx-action` boots BuildKit by pulling `moby/buildkit`
from Docker Hub on every run — the last uncached Docker Hub dependency
in the kurtosis jobs. In merge-queue run
[27280175556](https://github.com/erigontech/erigon/actions/runs/27280175556/job/80572198019)
that pull timed out (`Get "https://registry-1.docker.io/v2/": context
deadline exceeded`), failing the CI Gate for erigontech#21723 — ironically the PR
closing the equivalent gap for the kurtosis engine-bootstrap images. The
same Docker Hub connectivity window took out three other CI Gate runs
that morning at the `kurtosis engine start` step.

erigontech#21735 added a retry around buildx setup, but a retry doesn't survive an
outage longer than its window. This removes the hard dependency the same
way as the rest of the image-caching series (erigontech#21695, erigontech#21703, erigontech#21723).

## What

- Pin the BuildKit image as `BUILDKIT_IMAGE: moby/buildkit:v0.30.0`
(what the moving `buildx-stable-1` tag currently resolves to) and pass
it to `setup-buildx-action` via `driver-opts: image=...`, in both
`test-kurtosis-assertoor.yml` (`build-erigon-image`) and
`test-kurtosis-gloas.yml` (`gloas_test`). Bump alongside the other
pinned images.
- Cache it under a single shared key (`docker-buildkit-<image>`) with
the established docker save/load + actions/cache pattern, and `docker
load` it before buildx setup. buildx's docker-container driver falls
back to a locally present image when its pull fails ("pulling failed,
using local image"), so with a warm cache the builder boots even while
Docker Hub is fully unreachable. Pinning via driver-opts is what makes
the fallback engage — the local image name must match what buildx wants
to boot.
- The cache-fill pull in the test jobs is best-effort
(`continue-on-error`, save gated on pull success): buildx pulls the
image itself either way, so a failed seed must not fail an
otherwise-good run, and a failed pull never poisons the cache key with
an empty archive.
- Warm jobs (`warm-third-party-images` in both files) pull strictly and
save the same key — producing the cache is their purpose. Both
cache-warming workflows already path-filter on the edited files, so the
cache is created on main right after this merges and refreshed daily
against LRU eviction.
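
A rough sketch of the test-job side of the pattern in the list above (pin, restore, load, best-effort seed, then buildx setup) follows. Treat it as an illustration only: the cache path, the archive file name, the step ids, the literal rendering of the `docker-buildkit-<image>` key, the runner label, and the `setup-buildx-action` major version are assumptions; the pinned image, the `driver-opts: image=...` wiring, `continue-on-error`, and the save gated on pull success are taken from the description.

```yaml
env:
  BUILDKIT_IMAGE: moby/buildkit:v0.30.0

jobs:
  build-erigon-image:
    runs-on: ubuntu-latest   # assumed runner
    steps:
      - name: Restore BuildKit image cache
        id: buildkit-cache
        uses: actions/cache/restore@v5
        with:
          path: /tmp/buildkit-cache        # assumed path
          key: docker-buildkit-v0.30.0     # assumed rendering of docker-buildkit-<image>

      - name: Load cached BuildKit image
        if: steps.buildkit-cache.outputs.cache-hit == 'true'
        run: docker load -i /tmp/buildkit-cache/buildkit.tar

      - name: Seed BuildKit cache (best-effort)
        id: buildkit-pull
        if: steps.buildkit-cache.outputs.cache-hit != 'true'
        continue-on-error: true            # a failed seed must not fail the run
        run: |
          mkdir -p /tmp/buildkit-cache
          docker pull "$BUILDKIT_IMAGE"
          docker save "$BUILDKIT_IMAGE" -o /tmp/buildkit-cache/buildkit.tar

      - name: Save BuildKit image cache
        # outcome reflects the step result before continue-on-error is applied,
        # so a failed pull never writes an empty archive under the key.
        if: steps.buildkit-cache.outputs.cache-hit != 'true' && steps.buildkit-pull.outcome == 'success'
        uses: actions/cache/save@v5
        with:
          path: /tmp/buildkit-cache
          key: docker-buildkit-v0.30.0

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3   # assumed major version
        with:
          # Pin the builder image so the locally loaded image matches
          # what buildx wants to boot.
          driver-opts: image=${{ env.BUILDKIT_IMAGE }}
```

In a warm job the seed step would drop `continue-on-error`, since producing the cache entry is the whole point there.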

In gloas, buildx setup previously ran before any caching; the buildkit
cache steps are inserted ahead of it.

Not covered (non-gating, can follow up if wanted):
`ci-cd-main-branch-docker-images.yml` and `release.yml` also use
`setup-buildx-action` but don't block PRs or the merge queue.

actionlint is clean (the two SC2086 infos it reports pre-exist on main
in the kurtosis CLI install step).