
ci: cache kurtosis infra images and retry engine bootstrap #21602

Merged
taratorio merged 1 commit into main from yperbasis/kurtosis-infra-image-cache
Jun 3, 2026

@yperbasis

Problem

In this caplin-minimal kurtosis job, Docker Hub was unreachable from the runner. The Kurtosis engine bootstrap, which happens inside kurtosis run (mid-action, where it cannot be retried), tried to pull timberio/vector:0.45.0-debian and timed out on registry-1.docker.io, failing the job before any test ran. The workflows already cache CL images (docker save / actions/cache / docker load) precisely to avoid Docker Hub exposure, but Kurtosis's own infrastructure images weren't covered.

Fix

Applied to test-kurtosis-assertoor.yml and test-kurtosis-gloas.yml.

Take Docker Hub off the critical path

  • Pin KURTOSIS_VERSION: 1.15.2 and pass it to the assertoor action via its kurtosis_version input. Previously the action installed whatever apt.fury.io serves (its default is latest), so the CLI version — and with it the engine image tag — could drift silently.
  • Extend the cached image set with the five Kurtosis infra images, verified against the kurtosis 1.15.2 source: kurtosistech/engine, kurtosistech/core (APIC) and kurtosistech/files-artifacts-expander (all tagged with the CLI version), timberio/vector:0.45.0-debian (logs aggregator — the pull that failed) and fluent/fluent-bit:4.0.0 (logs collector). Kurtosis uses the missing image-download mode, so pre-loaded images are used without any registry call.
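
The image set described above can be sketched as a small shell helper. This is a minimal illustration, not the workflow's actual step: the function names `kurtosis_infra_images` and `restore_or_pull_images` and the tarball argument are hypothetical, and the vector and fluent-bit tags are the ones this PR reports for kurtosis 1.15.2, so another CLI version may need different tags.

```shell
# Print the five Kurtosis infra image references for a given CLI version.
# Tags for vector and fluent-bit are those reported for kurtosis 1.15.2.
kurtosis_infra_images() {
  local version=$1
  printf '%s\n' \
    "kurtosistech/engine:${version}" \
    "kurtosistech/core:${version}" \
    "kurtosistech/files-artifacts-expander:${version}" \
    "timberio/vector:0.45.0-debian" \
    "fluent/fluent-bit:4.0.0"
}

# Hypothetical helper: on a cache hit, load the tarball so no registry call
# is needed; on a miss, pull every image and save them for the next run.
restore_or_pull_images() {
  local version=$1 tarball=$2 image
  if [ -f "$tarball" ]; then
    docker load -i "$tarball"
    return
  fi
  for image in $(kurtosis_infra_images "$version"); do
    docker pull "$image" || return 1
  done
  # Word splitting is intended: one argument per image reference.
  # shellcheck disable=SC2046
  docker save -o "$tarball" $(kurtosis_infra_images "$version")
}
```

A call such as `restore_or_pull_images 1.15.2 /tmp/kurtosis-infra.tar` (path assumed) would run before the engine starts. In the real workflows the tarball would be persisted by actions/cache rather than left on local disk.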

Retry the cheap part

  • New "Install Kurtosis CLI and start engine" step before the assertoor action: 3 attempts with backoff, kurtosis engine stop between attempts. The action reuses a running engine, so engine bootstrap moves out of the un-retryable composite action into a retryable ~15 s step — a registry blip no longer costs a 20+ minute test step.
  • The cache-miss pull step now retries each docker pull 3× with backoff.
  • Added the Conditional Docker Login to the assertoor matrix job — it was the only job pulling from Docker Hub anonymously on cache misses (the build job and gloas already log in).
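
The retry behaviour in the bullets above might look roughly like the following shell function. It is a sketch under stated assumptions: the name `retry_with_backoff`, the optional `retry_cleanup` hook, and the doubling delay are illustrative choices, since the PR text only states "3 attempts with backoff" and does not give the delays or the exact script.

```shell
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached, doubling the
# delay after each failure. If a function named retry_cleanup is defined
# (hypothetical hook), it is called between attempts.
retry_with_backoff() {
  local max_attempts=$1 delay=$2 attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "failed after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    if command -v retry_cleanup >/dev/null 2>&1; then
      retry_cleanup || true
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

For the engine bootstrap, one could define `retry_cleanup() { kurtosis engine stop; }` and call `retry_with_backoff 3 5 kurtosis engine start`; for cache-miss pulls, `retry_with_backoff 3 5 docker pull "$image"` without a cleanup hook. The 5-second initial delay is an assumed value.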

Notes

  • First run after merge is a one-time cache miss on the new key (now with retries + authenticated pulls). The gloas workflow saves its cache only on non-pull_request events, so one workflow_dispatch run after merge warms it.
  • Out of scope: images pulled by ethereum-package itself (e.g. ethereum-genesis-generator — version owned by the package branch, falls back to a normal pull), and qa-txpool-performance-test.yml (erigontech fork of the action on self-hosted runners with persistent local image caches).
  • No TDD cycle: mechanical CI workflow change. Validated with actionlint (same flags as the lint workflow) and shellcheck on the new run blocks; image names/tags verified against the kurtosis 1.15.2 sources.

Copilot AI left a comment

Pull request overview

This PR hardens the Kurtosis-based CI workflows against transient Docker Hub outages by pre-caching Kurtosis infrastructure images and moving Kurtosis engine bootstrap into an explicit, retryable step before the (non-retryable) composite action runs.

Changes:

  • Pin the Kurtosis CLI version and pass it into the assertoor action to avoid silent CLI/engine drift.
  • Extend Docker image caching to include Kurtosis infra images (engine/core/files-artifacts-expander/vector/fluent-bit) and add retry-with-backoff for cache-miss pulls.
  • Add a dedicated “Install Kurtosis CLI and start engine” step with retries so engine bootstrap is no longer hidden mid-action.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| .github/workflows/test-kurtosis-gloas.yml | Adds pinned Kurtosis version + infra image caching, pull retries, and a retryable engine bootstrap step before running the assertoor action. |
| .github/workflows/test-kurtosis-assertoor.yml | Adds conditional Docker Hub login for the matrix job, pins Kurtosis version + infra image caching, pull retries, and a retryable engine bootstrap step before running the assertoor action. |

@taratorio taratorio added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit 222c794 Jun 3, 2026
90 checks passed
@taratorio taratorio deleted the yperbasis/kurtosis-infra-image-cache branch June 3, 2026 13:34
manusw7 pushed a commit to manusw7/erigon that referenced this pull request Jun 9, 2026
…lures (erigontech#21693)

## Problem

The `build-erigon-image` job in the kurtosis workflows
(`test-kurtosis-assertoor.yml`, `test-kurtosis-gloas.yml`)
intermittently fails on **transient infrastructure errors** unrelated to
the code — Docker Hub registry/auth blips and GitHub Actions
cache-backend hiccups. `docker/build-push-action` has no retry input, so
any such blip fails the whole job (and, in the merge queue, the run).

## Evidence (last ~30 days)

Scanning every failed `build-erigon-image` job across Cache Warming + CI
Gate runs, **4 of 15** failures were transient infra (the other 10 were
real compile breaks clustered on feature branches, plus one GitHub-CDN
action-download blip):

| Date | Branch | Signature |
|------|--------|-----------|
| 06-08 | main | Docker Hub `auth.docker.io/token` → `504 Gateway Timeout` |
| 06-02 | main | `registry-1.docker.io … context deadline exceeded` |
| 06-02 | glamsterdam-devnet-4 | `registry-1.docker.io … request canceled` (timeout) |
| 05-28 | main | GHA cache blob write 5xx (`…blob.core.windows.net/actions-cache…`) |

That's ~3–4/month (≈ once a week), and a floor — flakes that someone
re-ran to green don't show as failed runs.

## Change

Retry the `docker/build-push-action` step once: the first attempt gets
an `id` + `continue-on-error: true`, and a second step re-runs the
identical build only `if: steps.build_erigon_image.outcome ==
'failure'`. The BuildKit layer cache (and the in-job builder) make the
retry cheap — it reuses the slow Go compile and only re-attempts
whatever flaked (pull / auth / cache export). Applied to both kurtosis
workflows.

## Why this approach

- `docker/build-push-action@v6` has no `retry` input (verified against
its `action.yml`), so retry must be external.
- Keeping the action — vs. converting to a raw `docker buildx build`
bash loop — preserves the auto-wired GHA cache runtime token and avoids
adding a third-party action (`crazy-max/ghaction-github-runtime`).
- Retrying the whole step covers cache-export failures too, not just
registry pulls.

## Tradeoff

A retry can't distinguish a transient flake from a real compile error,
so genuine build breaks now take 2 attempts before failing (~2× feedback
time on a broken build). The retry step has no `continue-on-error`, so
real breaks still go red — they are not masked.

## Validation

- `actionlint` clean for both files (the two `SC2086` infos it prints
are pre-existing, in untouched `run:` blocks).
- Behaviour: attempt 1 succeeds → retry skipped; attempt 1 fails +
attempt 2 succeeds → job green, image present for the downstream
artifact/test steps; both fail → job red.

## Precedent

Same pattern as erigontech#21604 (retry SonarCloud scan), erigontech#21602 (retry kurtosis
engine bootstrap), and erigontech#21504 (retry hive image-build/registry
failures).
Lord1Egypt pushed a commit to Lord1Egypt/erigon that referenced this pull request Jun 10, 2026
…orkflow (erigontech#21703)

## Problem

`test-kurtosis-gloas.yml` restores the `docker-cl-*` third-party image
cache but nothing ever warms it on the default branch:

- It triggers on `pull_request` + `workflow_dispatch` only — no `push` /
`merge_group` / `schedule`.
- Its cache `save` is gated `github.event_name != 'pull_request'` (since
erigontech#21602), and its only non-PR trigger is manual `workflow_dispatch`.

So the cache is never systematically populated on the default branch,
and gloas PR runs cold-pull all 7 third-party images from Docker Hub on
~every run — exposed to the same Docker Hub flakes as the assertoor
workflow, protected only by the 3-attempt retry.

This is a **different cache key** from the assertoor warmer (erigontech#21695) —
gloas's key omits `TEKU` — so that warmer doesn't cover it. And unlike
assertoor, gloas does **not** run in `merge_group`, so a flake here
fails a PR check rather than bouncing the merge queue — lower blast
radius, same root gap.

## Fix

Mirror erigontech#21695 for gloas:

- Make `test-kurtosis-gloas.yml` callable (`workflow_call`) with a new
**`cl-images-only`** input that runs *only* a new
`warm-third-party-images` job; the `gloas_test` matrix is gated off
under it.
- The warm job uses **`actions/cache/restore@v5` (`lookup-only`)** + an
explicit **`actions/cache/save@v5`** on miss — a ~10 s no-op when the
cache already exists, pull + save only on a genuine miss.
- New **`cache-warming-kurtosis-gloas-images.yml`** drives it on `push`
to `main`/`release/**` filtered on `paths: [test-kurtosis-gloas.yml]`
(re-warm on version bumps) + a daily `schedule` (repopulate after LRU
eviction at the 500 GB cache ceiling) + `workflow_dispatch`.

Net effect: gloas PR runs restore the cache from the default branch
instead of pulling from Docker Hub.

## Notes

- Follow-up to erigontech#21695 (assertoor CL cache); same pattern, gloas's own
key/image set (no teku).
- The pinned image versions are duplicated across
`test-kurtosis-assertoor.yml` and `test-kurtosis-gloas.yml` (and now
their two warmers). Unifying them into a single source so one warmer
covers both is a sensible future cleanup — called out, not done here.

## Validation

- `actionlint` clean on both workflows (the one `SC2086` info is
pre-existing, in an untouched `run:` block).
- Gating: with `cl-images-only`, only `warm-third-party-images` runs; on
`pull_request` / `workflow_dispatch` (no input), `gloas_test` runs as
before. The PR-merge touches `test-kurtosis-gloas.yml`, so the `paths`
trigger warms the cache on merge — no cold-start gap.

Co-authored-by: lystopad <oleksandr.lystopad@erigon.tech>