
ci: warm kurtosis GLOAS image cache via dedicated paths + scheduled workflow #21703

Merged
taratorio merged 2 commits into main from taratorio/kurtosis-warm-gloas-cl-images
Jun 10, 2026


@taratorio

Member

Problem

test-kurtosis-gloas.yml restores the docker-cl-* third-party image cache, but nothing ever warms it on the default branch.

As a result, the cache is never systematically populated there, and gloas PR runs cold-pull all 7 third-party images from Docker Hub on nearly every run. That leaves them exposed to the same Docker Hub flakes as the assertoor workflow, protected only by the 3-attempt retry.

This is a different cache key from the assertoor warmer (#21695) — gloas's key omits TEKU — so that warmer doesn't cover it. And unlike assertoor, gloas does not run in merge_group, so a flake here fails a PR check rather than bouncing the merge queue — lower blast radius, same root gap.

Fix

Mirror #21695 for gloas:

  • Make test-kurtosis-gloas.yml callable (workflow_call) with a new cl-images-only input that runs only a new warm-third-party-images job; the gloas_test matrix is gated off under it.
  • The warm job uses actions/cache/restore@v5 (lookup-only) + an explicit actions/cache/save@v5 on miss — a ~10 s no-op when the cache already exists, pull + save only on a genuine miss.
  • New cache-warming-kurtosis-gloas-images.yml drives it on push to main/release/** filtered on paths: [test-kurtosis-gloas.yml] (re-warm on version bumps) + a daily schedule (repopulate after LRU eviction at the 500 GB cache ceiling) + workflow_dispatch.
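The probe-then-save pattern in the second bullet can be sketched roughly as follows. This is an illustration, not the PR's actual job: the image names, cache key, and archive path are placeholders, and the cl-images-only gating is left out for brevity.

```yaml
# Sketch only. IMAGES, CACHE_KEY and the archive path are hypothetical
# placeholders, not the values used in test-kurtosis-gloas.yml.
jobs:
  warm-third-party-images:
    runs-on: ubuntu-latest
    env:
      IMAGES: "example/cl-client-a:1.0.0 example/cl-client-b:2.0.0"
      CACHE_KEY: docker-cl-example-key
    steps:
      # lookup-only checks whether the key exists without downloading the
      # archive, which is what keeps the warm job cheap on a cache hit.
      - name: Probe the cache
        id: probe
        uses: actions/cache/restore@v5
        with:
          path: /tmp/cl-images.tar
          key: ${{ env.CACHE_KEY }}
          lookup-only: true

      - name: Pull and archive images (cache miss only)
        if: steps.probe.outputs.cache-hit != 'true'
        run: |
          for image in $IMAGES; do
            docker pull "$image"
          done
          # IMAGES is left unquoted on purpose: it is a space-separated list.
          docker save -o /tmp/cl-images.tar $IMAGES

      - name: Save the archive (cache miss only)
        if: steps.probe.outputs.cache-hit != 'true'
        uses: actions/cache/save@v5
        with:
          path: /tmp/cl-images.tar
          key: ${{ env.CACHE_KEY }}
```

Splitting restore and save into separate actions means the save happens only when the job explicitly asks for it, rather than as an implicit post-step.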

Net effect: gloas PR runs restore the cache from the default branch instead of pulling from Docker Hub.

Notes

Validation

  • actionlint clean on both workflows (the one SC2086 info is pre-existing, in an untouched run: block).
  • Gating: with cl-images-only, only warm-third-party-images runs; on pull_request / workflow_dispatch (no input), gloas_test runs as before. The PR-merge touches test-kurtosis-gloas.yml, so the paths trigger warms the cache on merge — no cold-start gap.
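A minimal sketch of the gating described above, assuming the input is declared as a boolean and that an unset input evaluates as falsy on pull_request and on workflow_dispatch without inputs. The step bodies are stand-ins, and the real gloas_test job is a matrix.

```yaml
# Sketch only: job bodies are stand-ins for the real steps.
on:
  pull_request:
  workflow_dispatch:
  workflow_call:
    inputs:
      cl-images-only:
        description: Run only the third-party image warm job
        type: boolean
        default: false

jobs:
  warm-third-party-images:
    # Runs only when a caller passes cl-images-only: true.
    if: ${{ inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "warm the docker-cl-* cache here"

  gloas_test:
    # Gated off under cl-images-only; runs as before otherwise.
    if: ${{ !inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "run the gloas tests here"
```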

@lystopad left a comment

Member


LGTM. Thanks.

Copilot AI left a comment

Contributor


Pull request overview

This PR addresses a CI gap where test-kurtosis-gloas.yml restores a docker-cl-* image cache but (due to triggers + save-gating) the cache is not systematically populated on the default branch, causing PR runs to frequently cold-pull third-party images from Docker Hub.

Changes:

  • Adds a reusable-workflow entrypoint to test-kurtosis-gloas.yml (workflow_call) with a cl-images-only input that runs a dedicated warm-third-party-images job and gates off the main GLOAS test matrix.
  • Implements cache probing via actions/cache/restore@v5 with lookup-only: true, pulling + saving images only when the cache key is missing.
  • Introduces a new scheduled/path-filtered workflow to warm the GLOAS cache on main/release/** and daily via schedule.
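The driver workflow from the last bullet might look roughly like this. The workflow name, job name, and cron time are assumptions (the PR only says the schedule is daily), and the sketch assumes test-kurtosis-gloas.yml lives under .github/workflows/ as listed in the reviewed files.

```yaml
# .github/workflows/cache-warming-kurtosis-gloas-images.yml (sketch)
name: Cache warming - kurtosis gloas images   # hypothetical name

on:
  push:
    branches:
      - main
      - "release/**"
    # Re-warm only when the file that pins the image versions changes.
    paths:
      - .github/workflows/test-kurtosis-gloas.yml
  schedule:
    - cron: "0 3 * * *"   # hypothetical time of day
  workflow_dispatch:

jobs:
  warm:
    # Calls the reusable entrypoint with the warm-only input set.
    uses: ./.github/workflows/test-kurtosis-gloas.yml
    with:
      cl-images-only: true
```

Because the warm job runs on the default branch, the cache it saves is restorable from PR runs, which is the point of driving it from push and schedule rather than from pull_request.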

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| .github/workflows/test-kurtosis-gloas.yml | Adds workflow_call + cl-images-only gating and a dedicated cache-warming job that only pulls/saves on genuine cache misses. |
| .github/workflows/cache-warming-kurtosis-gloas-images.yml | New driver workflow to populate the default-branch cache via push (paths-filtered), schedule, and manual dispatch. |


@taratorio taratorio added this pull request to the merge queue Jun 10, 2026
Merged via the queue into main with commit d99c5e3 Jun 10, 2026
92 checks passed
@taratorio taratorio deleted the taratorio/kurtosis-warm-gloas-cl-images branch June 10, 2026 03:52
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request Jun 11, 2026
…rigontech#21741)

## Why

`docker/setup-buildx-action` boots BuildKit by pulling `moby/buildkit`
from Docker Hub on every run — the last uncached Docker Hub dependency
in the kurtosis jobs. In merge-queue run
[27280175556](https://github.com/erigontech/erigon/actions/runs/27280175556/job/80572198019)
that pull timed out (`Get "https://registry-1.docker.io/v2/": context
deadline exceeded`), failing the CI Gate for erigontech#21723 — ironically the PR
closing the equivalent gap for the kurtosis engine-bootstrap images. The
same Docker Hub connectivity window took out three other CI Gate runs
that morning at the `kurtosis engine start` step.

erigontech#21735 added a retry around buildx setup, but a retry doesn't survive an
outage longer than its window. This removes the hard dependency the same
way as the rest of the image-caching series (erigontech#21695, erigontech#21703, erigontech#21723).

## What

- Pin the BuildKit image as `BUILDKIT_IMAGE: moby/buildkit:v0.30.0`
(what the moving `buildx-stable-1` tag currently resolves to) and pass
it to `setup-buildx-action` via `driver-opts: image=...`, in both
`test-kurtosis-assertoor.yml` (`build-erigon-image`) and
`test-kurtosis-gloas.yml` (`gloas_test`). Bump alongside the other
pinned images.
- Cache it under a single shared key (`docker-buildkit-<image>`) with
the established docker save/load + actions/cache pattern, and `docker
load` it before buildx setup. buildx's docker-container driver falls
back to a locally present image when its pull fails ("pulling failed,
using local image"), so with a warm cache the builder boots even while
Docker Hub is fully unreachable. Pinning via driver-opts is what makes
the fallback engage — the local image name must match what buildx wants
to boot.
- The cache-fill pull in the test jobs is best-effort
(`continue-on-error`, save gated on pull success): buildx pulls the
image itself either way, so a failed seed must not fail an
otherwise-good run, and a failed pull never poisons the cache key with
an empty archive.
- Warm jobs (`warm-third-party-images` in both files) pull strictly and
save the same key — producing the cache is their purpose. Both
cache-warming workflows already path-filter on the edited files, so the
cache is created on main right after this merges and refreshed daily
against LRU eviction.
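
As a rough illustration of the test-job side of the bullets above, the steps could be arranged as below. This is a sketch under assumptions, not the merged workflow: the archive path, step ids, the exact key formatting, and the `@v3` tag on `setup-buildx-action` are all hypothetical.

```yaml
# Sketch only: path, step ids, and action version are assumptions.
env:
  BUILDKIT_IMAGE: moby/buildkit:v0.30.0

jobs:
  gloas_test:
    runs-on: ubuntu-latest
    steps:
      - name: Restore BuildKit image archive
        id: buildkit-cache
        uses: actions/cache/restore@v5
        with:
          path: /tmp/buildkit.tar
          key: docker-buildkit-${{ env.BUILDKIT_IMAGE }}

      - name: Load cached BuildKit image
        if: steps.buildkit-cache.outputs.cache-hit == 'true'
        run: docker load -i /tmp/buildkit.tar

      # Best-effort seed: a failed pull must not fail the run.
      - name: Pull and archive BuildKit image
        id: seed
        if: steps.buildkit-cache.outputs.cache-hit != 'true'
        continue-on-error: true
        run: |
          docker pull "$BUILDKIT_IMAGE"
          docker save -o /tmp/buildkit.tar "$BUILDKIT_IMAGE"

      # Save is gated on the pull having succeeded, so a failed pull
      # never writes an empty archive under the key.
      - name: Save BuildKit image archive
        if: steps.seed.outcome == 'success'
        uses: actions/cache/save@v5
        with:
          path: /tmp/buildkit.tar
          key: docker-buildkit-${{ env.BUILDKIT_IMAGE }}

      # Pinning the image makes the locally loaded copy match what
      # buildx tries to boot.
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
        with:
          driver-opts: image=${{ env.BUILDKIT_IMAGE }}
```

The warm jobs would differ only in dropping `continue-on-error`, since producing the cache is their purpose.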

In gloas, buildx setup previously ran before any caching; the buildkit
cache steps are inserted ahead of it.

Not covered (non-gating, can follow up if wanted):
`ci-cd-main-branch-docker-images.yml` and `release.yml` also use
`setup-buildx-action` but don't block PRs or the merge queue.

actionlint is clean (the two SC2086 infos it reports pre-exist on main
in the kurtosis CLI install step).