ci: cache kurtosis infra images and retry engine bootstrap #21602
Merged
Pull request overview
This PR hardens the Kurtosis-based CI workflows against transient Docker Hub outages by pre-caching Kurtosis infrastructure images and moving Kurtosis engine bootstrap into an explicit, retryable step before the (non-retryable) composite action runs.
Changes:
- Pin the Kurtosis CLI version and pass it into the assertoor action to avoid silent CLI/engine drift.
- Extend Docker image caching to include Kurtosis infra images (engine/core/files-artifacts-expander/vector/fluent-bit) and add retry-with-backoff for cache-miss pulls.
- Add a dedicated “Install Kurtosis CLI and start engine” step with retries so engine bootstrap is no longer hidden mid-action.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| .github/workflows/test-kurtosis-gloas.yml | Adds pinned Kurtosis version + infra image caching, pull retries, and a retryable engine bootstrap step before running the assertoor action. |
| .github/workflows/test-kurtosis-assertoor.yml | Adds conditional Docker Hub login for the matrix job, pins Kurtosis version + infra image caching, pull retries, and a retryable engine bootstrap step before running the assertoor action. |
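
As a rough illustration of the caching change summarized above, the workflow below restores an image tarball, falls back to `docker pull` with three attempts and a growing delay on a miss, and saves the cache only outside `pull_request` events. It is a minimal sketch, not the PR's actual diff: the workflow name, job name, step ids, cache key, and `/tmp/docker-images` path are hypothetical, and the backoff intervals are an assumption.

```yaml
name: kurtosis-infra-image-cache-sketch   # hypothetical workflow name
on: workflow_dispatch

env:
  KURTOSIS_VERSION: "1.15.2"

jobs:
  cache-infra-images:                     # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - name: Restore cached Kurtosis infra images
        id: image_cache                   # hypothetical step id
        uses: actions/cache/restore@v5
        with:
          path: /tmp/docker-images        # hypothetical path
          key: docker-kurtosis-infra-${{ env.KURTOSIS_VERSION }}   # hypothetical key

      - name: Load images from cache, or pull with retry
        shell: bash
        run: |
          set -euo pipefail
          images=(
            "kurtosistech/engine:${KURTOSIS_VERSION}"
            "kurtosistech/core:${KURTOSIS_VERSION}"
            "kurtosistech/files-artifacts-expander:${KURTOSIS_VERSION}"
            "timberio/vector:0.45.0-debian"
            "fluent/fluent-bit:4.0.0"
          )
          mkdir -p /tmp/docker-images
          # Cache hit: load the tarball and skip the registry entirely.
          if [ -f /tmp/docker-images/infra.tar ]; then
            docker load -i /tmp/docker-images/infra.tar
            exit 0
          fi
          # Cache miss: pull each image, up to 3 attempts with a growing delay.
          for image in "${images[@]}"; do
            for attempt in 1 2 3; do
              if docker pull "$image"; then
                break
              fi
              if [ "$attempt" -eq 3 ]; then
                echo "::error::failed to pull $image after 3 attempts"
                exit 1
              fi
              sleep $((attempt * 10))     # assumed backoff: 10 s, then 20 s
            done
          done
          docker save -o /tmp/docker-images/infra.tar "${images[@]}"

      - name: Save image cache
        # Not saved on pull_request events, so PR runs cannot populate it.
        if: github.event_name != 'pull_request' && steps.image_cache.outputs.cache-hit != 'true'
        uses: actions/cache/save@v5
        with:
          path: /tmp/docker-images
          key: docker-kurtosis-infra-${{ env.KURTOSIS_VERSION }}
```

Keying the cache on the pinned version means a version bump produces a fresh key and therefore a fresh pull, rather than loading stale engine images.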
taratorio approved these changes on Jun 3, 2026
This was referenced Jun 9, 2026
manusw7 pushed a commit to manusw7/erigon that referenced this pull request on Jun 9, 2026
…lures (erigontech#21693)

## Problem

The `build-erigon-image` job in the kurtosis workflows (`test-kurtosis-assertoor.yml`, `test-kurtosis-gloas.yml`) intermittently fails on **transient infrastructure errors** unrelated to the code: Docker Hub registry/auth blips and GitHub Actions cache-backend hiccups. `docker/build-push-action` has no retry input, so any such blip fails the whole job (and, in the merge queue, the run).

## Evidence (last ~30 days)

Scanning every failed `build-erigon-image` job across Cache Warming + CI Gate runs, **4 of 15** failures were transient infra (the other 10 were real compile breaks clustered on feature branches, plus one GitHub-CDN action-download blip):

| Date | Branch | Signature |
|------|--------|-----------|
| 06-08 | main | Docker Hub `auth.docker.io/token` → `504 Gateway Timeout` |
| 06-02 | main | `registry-1.docker.io … context deadline exceeded` |
| 06-02 | glamsterdam-devnet-4 | `registry-1.docker.io … request canceled` (timeout) |
| 05-28 | main | GHA cache blob write 5xx (`…blob.core.windows.net/actions-cache…`) |

That's ~3–4/month (≈ once a week), and a floor: flakes that someone re-ran to green don't show as failed runs.

## Change

Retry the `docker/build-push-action` step once: the first attempt gets an `id` + `continue-on-error: true`, and a second step re-runs the identical build only `if: steps.build_erigon_image.outcome == 'failure'`. The BuildKit layer cache (and the in-job builder) make the retry cheap: it reuses the slow Go compile and only re-attempts whatever flaked (pull / auth / cache export). Applied to both kurtosis workflows.

## Why this approach

- `docker/build-push-action@v6` has no `retry` input (verified against its `action.yml`), so retry must be external.
- Keeping the action, rather than converting to a raw `docker buildx build` bash loop, preserves the auto-wired GHA cache runtime token and avoids adding a third-party action (`crazy-max/ghaction-github-runtime`).
- Retrying the whole step covers cache-export failures too, not just registry pulls.

## Tradeoff

A retry can't distinguish a transient flake from a real compile error, so genuine build breaks now take 2 attempts before failing (~2× feedback time on a broken build). The retry step has no `continue-on-error`, so real breaks still go red; they are not masked.

## Validation

- `actionlint` clean for both files (the two `SC2086` infos it prints are pre-existing, in untouched `run:` blocks).
- Behaviour: attempt 1 succeeds → retry skipped; attempt 1 fails + attempt 2 succeeds → job green, image present for the downstream artifact/test steps; both fail → job red.

## Precedent

Same pattern as erigontech#21604 (retry SonarCloud scan), erigontech#21602 (retry kurtosis engine bootstrap), and erigontech#21504 (retry hive image-build/registry failures).
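
The retry-once pattern described in this commit message might look roughly like the following. Only the job name `build-erigon-image`, the step id `build_erigon_image`, `continue-on-error: true`, and the `outcome == 'failure'` condition come from the text; the workflow name, trigger, image tag, and the specific `with:` inputs are illustrative assumptions.

```yaml
name: build-image-retry-sketch            # hypothetical workflow name
on: workflow_dispatch

jobs:
  build-erigon-image:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3

      - name: Build erigon image (attempt 1)
        id: build_erigon_image
        # A failure here is recorded in `outcome` but does not fail the job yet.
        continue-on-error: true
        uses: docker/build-push-action@v6
        with:
          context: .
          load: true
          tags: erigon:ci                 # hypothetical tag
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Build erigon image (retry)
        # `outcome` is the result before continue-on-error is applied,
        # so this step runs only when attempt 1 actually failed.
        if: steps.build_erigon_image.outcome == 'failure'
        # No continue-on-error here: a second failure turns the job red.
        uses: docker/build-push-action@v6
        with:
          context: .
          load: true
          tags: erigon:ci                 # hypothetical tag
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

The condition checks `outcome` rather than `conclusion` because `conclusion` reports `success` for a failed step that has `continue-on-error: true`, which would leave the retry permanently skipped.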
Lord1Egypt pushed a commit to Lord1Egypt/erigon that referenced this pull request on Jun 10, 2026
…orkflow (erigontech#21703)

## Problem

`test-kurtosis-gloas.yml` restores the `docker-cl-*` third-party image cache but nothing ever warms it on the default branch:

- It triggers on `pull_request` + `workflow_dispatch` only: no `push` / `merge_group` / `schedule`.
- Its cache `save` is gated `github.event_name != 'pull_request'` (since erigontech#21602), and its only non-PR trigger is manual `workflow_dispatch`.

So the cache is never systematically populated on the default branch, and gloas PR runs cold-pull all 7 third-party images from Docker Hub on ~every run, exposed to the same Docker Hub flakes as the assertoor workflow and protected only by the 3-attempt retry.

This is a **different cache key** from the assertoor warmer (erigontech#21695): gloas's key omits `TEKU`, so that warmer doesn't cover it. And unlike assertoor, gloas does **not** run in `merge_group`, so a flake here fails a PR check rather than bouncing the merge queue: lower blast radius, same root gap.

## Fix

Mirror erigontech#21695 for gloas:

- Make `test-kurtosis-gloas.yml` callable (`workflow_call`) with a new **`cl-images-only`** input that runs *only* a new `warm-third-party-images` job; the `gloas_test` matrix is gated off under it.
- The warm job uses **`actions/cache/restore@v5` (`lookup-only`)** + an explicit **`actions/cache/save@v5`** on miss: a ~10 s no-op when the cache already exists, pull + save only on a genuine miss.
- New **`cache-warming-kurtosis-gloas-images.yml`** drives it on `push` to `main`/`release/**` filtered on `paths: [test-kurtosis-gloas.yml]` (re-warm on version bumps) + a daily `schedule` (repopulate after LRU eviction at the 500 GB cache ceiling) + `workflow_dispatch`.

Net effect: gloas PR runs restore the cache from the default branch instead of pulling from Docker Hub.

## Notes

- Follow-up to erigontech#21695 (assertoor CL cache); same pattern, gloas's own key/image set (no teku).
- The pinned image versions are duplicated across `test-kurtosis-assertoor.yml` and `test-kurtosis-gloas.yml` (and now their two warmers). Unifying them into a single source so one warmer covers both is a sensible future cleanup; called out, not done here.

## Validation

- `actionlint` clean on both workflows (the one `SC2086` info is pre-existing, in an untouched `run:` block).
- Gating: with `cl-images-only`, only `warm-third-party-images` runs; on `pull_request` / `workflow_dispatch` (no input), `gloas_test` runs as before. The PR-merge touches `test-kurtosis-gloas.yml`, so the `paths` trigger warms the cache on merge, leaving no cold-start gap.

Co-authored-by: lystopad <oleksandr.lystopad@erigon.tech>
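
A sketch of how the callable workflow and its warmer could fit together is shown below, as two YAML documents in one listing. The input name `cl-images-only`, the job names `warm-third-party-images` and `gloas_test`, the two file names, and the trigger set come from the commit message. The cache key, image list, paths, and cron time are placeholders, not the real values.

```yaml
# --- .github/workflows/test-kurtosis-gloas.yml (excerpt, illustrative) ---
name: test-kurtosis-gloas
on:
  pull_request:
  workflow_dispatch:
  workflow_call:
    inputs:
      cl-images-only:
        type: boolean
        default: false

env:
  # Placeholders: the real key and the 7 pinned images live in the workflow.
  CL_CACHE_KEY: docker-cl-example
  CL_IMAGES: "example/cl-client-a:1.0 example/cl-client-b:2.0"

jobs:
  warm-third-party-images:
    if: ${{ inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - name: Check whether the cache already exists
        id: lookup
        uses: actions/cache/restore@v5
        with:
          path: /tmp/cl-images            # hypothetical path
          key: ${{ env.CL_CACHE_KEY }}
          lookup-only: true               # report hit/miss without downloading
      - name: Pull and export images on a miss
        if: steps.lookup.outputs.cache-hit != 'true'
        shell: bash
        run: |
          set -euo pipefail
          mkdir -p /tmp/cl-images
          read -r -a images <<< "$CL_IMAGES"
          for image in "${images[@]}"; do
            docker pull "$image"
          done
          docker save -o /tmp/cl-images/images.tar "${images[@]}"
      - name: Save the cache on a miss
        if: steps.lookup.outputs.cache-hit != 'true'
        uses: actions/cache/save@v5
        with:
          path: /tmp/cl-images
          key: ${{ env.CL_CACHE_KEY }}

  gloas_test:
    # Gated off when the workflow is called only to warm the cache.
    if: ${{ !inputs.cl-images-only }}
    runs-on: ubuntu-latest
    steps:
      - run: echo "existing test matrix runs here"
---
# --- .github/workflows/cache-warming-kurtosis-gloas-images.yml (illustrative) ---
name: cache-warming-kurtosis-gloas-images
on:
  push:
    branches: [main, "release/**"]
    paths: [".github/workflows/test-kurtosis-gloas.yml"]
  schedule:
    - cron: "0 3 * * *"                   # daily; the exact time is an assumption
  workflow_dispatch:
jobs:
  warm:
    uses: ./.github/workflows/test-kurtosis-gloas.yml
    with:
      cl-images-only: true
```

On `pull_request` and plain `workflow_dispatch` runs the `inputs` context carries no `cl-images-only` value, so the warm job is skipped and `gloas_test` runs as before.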
## Problem

In this caplin-minimal kurtosis job, Docker Hub was unreachable from the runner. The Kurtosis engine bootstrap, which happens inside `kurtosis run`, mid-action, where it cannot be retried, tried to pull `timberio/vector:0.45.0-debian` and timed out on `registry-1.docker.io`, failing the job before any test ran. The workflows already cache CL images (`docker save` / `actions/cache` / `docker load`) precisely to avoid Docker Hub exposure, but Kurtosis's own infrastructure images weren't covered.

## Fix

Applied to `test-kurtosis-assertoor.yml` and `test-kurtosis-gloas.yml`.

### Take Docker Hub off the critical path

- Pin `KURTOSIS_VERSION: 1.15.2` and pass it to the assertoor action via its `kurtosis_version` input. Previously the action installed whatever apt.fury.io serves (its default is `latest`), so the CLI version, and with it the engine image tag, could drift silently.
- Cache the Kurtosis infra images alongside the CL images: `kurtosistech/engine`, `kurtosistech/core` (APIC) and `kurtosistech/files-artifacts-expander` (all tagged with the CLI version), `timberio/vector:0.45.0-debian` (logs aggregator, the pull that failed) and `fluent/fluent-bit:4.0.0` (logs collector). Kurtosis uses the `missing` image-download mode, so pre-loaded images are used without any registry call.

### Retry the cheap part

- Add a dedicated "Install Kurtosis CLI and start engine" step that retries, running `kurtosis engine stop` between attempts. The action reuses a running engine, so engine bootstrap moves out of the un-retryable composite action into a retryable ~15 s step; a registry blip no longer costs a 20+ minute test step.
- Cache-miss pulls retry `docker pull` 3× with backoff.

## Notes

- The image cache is not saved on `pull_request` events, so one `workflow_dispatch` run after merge warms it.
- Not covered: `ethereum-genesis-generator` (version owned by the package branch; falls back to a normal pull) and `qa-txpool-performance-test.yml` (erigontech fork of the action on self-hosted runners with persistent local image caches).
- Validated with `actionlint` (same flags as the lint workflow) and `shellcheck` on the new run blocks; image names/tags verified against the kurtosis 1.15.2 sources.
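
For illustration, a retryable bootstrap step of the kind described under "Retry the cheap part" could be sketched as follows. This is not the step from the diff: the workflow and job names are hypothetical, the apt install lines follow the commonly documented Kurtosis install route via apt.fury.io and are an assumption about what the workflow does, and the attempt count and delays for the engine start are assumed as well.

```yaml
name: kurtosis-engine-bootstrap-sketch    # hypothetical workflow name
on: workflow_dispatch

env:
  KURTOSIS_VERSION: "1.15.2"

jobs:
  bootstrap:                              # hypothetical job name
    runs-on: ubuntu-latest
    steps:
      - name: Install Kurtosis CLI and start engine
        shell: bash
        run: |
          set -euo pipefail
          # Assumed install route: pinned CLI package from the apt.fury.io repo.
          echo "deb [trusted=yes] https://apt.fury.io/kurtosis-tech/ /" \
            | sudo tee /etc/apt/sources.list.d/kurtosis.list
          sudo apt-get update
          sudo apt-get install -y "kurtosis-cli=${KURTOSIS_VERSION}"

          # Start the engine with up to 3 attempts (assumed count).
          max_attempts=3
          for attempt in $(seq 1 "$max_attempts"); do
            if kurtosis engine start; then
              echo "engine started on attempt ${attempt}"
              exit 0
            fi
            echo "engine start failed (attempt ${attempt}/${max_attempts})"
            if [ "$attempt" -lt "$max_attempts" ]; then
              # Clear any half-started engine before trying again.
              kurtosis engine stop || true
              sleep $((attempt * 10))     # assumed backoff: 10 s, then 20 s
            fi
          done
          echo "::error::Kurtosis engine failed to start after ${max_attempts} attempts"
          exit 1
```

Because the later composite action reuses an already running engine, a step like this would run before it, so a registry blip is absorbed here rather than inside the long test step.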