ci: retry transient image-build/registry failures in hive CI gate #21504
Merged
Both merge-queue failures of PR #21374 were transient infra blips on network-dependent steps that had no retry: a Docker Hub registry error building hive/hiveproxy (alpine pull), and a github.com 403 during the in-builder erigon clone (the latter already addressed by #21447's local-image build). Wrap the Docker Hub / Go-proxy build steps (docker build of the local erigon image, go get, go build) in a small retry helper, and retry the ./hive run only when too few tests were parsed -- the signature of a setup/image-build failure, never a completed run. A genuine test result (tests>=4) is judged on its first attempt, so the merge-queue contract (a failure means the code is wrong) is preserved, and the retry path only re-runs fast setup failures, not full test executions. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
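A minimal sketch of what such a retry helper might look like, assuming a bash `run:` step. The function signature, the delay values, and the log messages are illustrative assumptions; the source only states that the helper makes 3 attempts with linear backoff.

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry <max_attempts> <base_delay_seconds> <command...>
# Hypothetical helper: runs the command until it succeeds or
# max_attempts is reached. Linear backoff: the sleep before the next
# attempt is base_delay multiplied by the number of attempts so far.
retry() {
  local max_attempts=$1 base_delay=$2
  shift 2
  local attempt=1
  # A command used as the `until` condition does not trigger `set -e`.
  until "$@"; do
    if (( attempt >= max_attempts )); then
      echo "retry: '$*' failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "retry: attempt ${attempt}/${max_attempts} failed; sleeping $(( base_delay * attempt ))s" >&2
    sleep "$(( base_delay * attempt ))"
    attempt=$(( attempt + 1 ))
  done
}

# Example use (image tag and delay are assumptions, not taken from the PR):
#   retry 3 10 docker build -t erigon:local .
#   retry 3 10 go build .
```

Wrapping only the network-dependent commands keeps the retry scope narrow: a command that fails on every attempt still fails the step, just later.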
Pull request overview
This PR improves resilience of the Hive CI gates by retrying transient network/image-build failures without retrying completed Hive test runs that meet the expected parsed-test threshold.
Changes:
- Adds a small shell `retry()` helper around Docker image build and Hive Go dependency/build steps.
- Adds retry loops around Hive execution when too few tests are parsed, preserving existing final failure checks.
- Applies the same retry strategy to both regular Hive and Hive EEST workflows.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `.github/workflows/test-hive.yml` | Adds retries for local Erigon image build, Hive build, and too-few-tests Hive execution failures. |
| `.github/workflows/test-hive-eest.yml` | Mirrors retry behavior for Hive EEST image/build/run steps. |
taratorio approved these changes May 29, 2026
mriccobene approved these changes May 29, 2026
manusw7 pushed a commit to manusw7/erigon that referenced this pull request Jun 9, 2026
…lures (erigontech#21693)

## Problem

The `build-erigon-image` job in the kurtosis workflows (`test-kurtosis-assertoor.yml`, `test-kurtosis-gloas.yml`) intermittently fails on **transient infrastructure errors** unrelated to the code: Docker Hub registry/auth blips and GitHub Actions cache-backend hiccups. `docker/build-push-action` has no retry input, so any such blip fails the whole job (and, in the merge queue, the run).

## Evidence (last ~30 days)

Scanning every failed `build-erigon-image` job across Cache Warming + CI Gate runs, **4 of 15** failures were transient infra (the other 10 were real compile breaks clustered on feature branches, plus one GitHub-CDN action-download blip):

| Date | Branch | Signature |
|------|--------|-----------|
| 06-08 | main | Docker Hub `auth.docker.io/token` → `504 Gateway Timeout` |
| 06-02 | main | `registry-1.docker.io … context deadline exceeded` |
| 06-02 | glamsterdam-devnet-4 | `registry-1.docker.io … request canceled` (timeout) |
| 05-28 | main | GHA cache blob write 5xx (`…blob.core.windows.net/actions-cache…`) |

That's ~3–4/month (≈ once a week), and a floor: flakes that someone re-ran to green don't show as failed runs.

## Change

Retry the `docker/build-push-action` step once: the first attempt gets an `id` + `continue-on-error: true`, and a second step re-runs the identical build only `if: steps.build_erigon_image.outcome == 'failure'`. The BuildKit layer cache (and the in-job builder) make the retry cheap: it reuses the slow Go compile and only re-attempts whatever flaked (pull / auth / cache export). Applied to both kurtosis workflows.

## Why this approach

- `docker/build-push-action@v6` has no `retry` input (verified against its `action.yml`), so retry must be external.
- Keeping the action, rather than converting to a raw `docker buildx build` bash loop, preserves the auto-wired GHA cache runtime token and avoids adding a third-party action (`crazy-max/ghaction-github-runtime`).
- Retrying the whole step covers cache-export failures too, not just registry pulls.

## Tradeoff

A retry can't distinguish a transient flake from a real compile error, so genuine build breaks now take 2 attempts before failing (~2× feedback time on a broken build). The retry step has no `continue-on-error`, so real breaks still go red; they are not masked.

## Validation

- `actionlint` clean for both files (the two `SC2086` infos it prints are pre-existing, in untouched `run:` blocks).
- Behaviour: attempt 1 succeeds → retry skipped; attempt 1 fails + attempt 2 succeeds → job green, image present for the downstream artifact/test steps; both fail → job red.

## Precedent

Same pattern as erigontech#21604 (retry SonarCloud scan), erigontech#21602 (retry kurtosis engine bootstrap), and erigontech#21504 (retry hive image-build/registry failures).
Why

Both merge-queue failures of #21374 were transient CI infrastructure blips on network-dependent steps that had no retry, not problems with the PR's code:

- A Docker Hub registry error building the `hiveproxy` image: `Head https://registry-1.docker.io/v2/library/alpine/manifests/latest: unknown:` → `Tests: , Failed:` (zero tests ran).
- A github.com 403 during the in-builder erigon clone (`fatal: unable to access 'https://github.com/erigontech/erigon/': 403`). That specific path is already largely addressed by #21447, "ci: build hive & hive-eest erigon clients from local source, not an ephemeral merge-queue ref" (build erigon locally instead of cloning inside hive's builder), but the Docker Hub base-image pulls remain in both `docker build` and `./hive`.

The merge-queue contract (per `CI-GUIDELINES.md`) is "a failure means the code is wrong: zero false positives." These infra blips are exactly the false positives that contract forbids, and they re-queue whole batches.

What

Add a small retry to the network-dependent steps in both `test-hive.yml` and `test-hive-eest.yml`:

- Docker Hub / Go-proxy build steps (`docker build` of the local erigon image, `go get`, `go build`): wrapped in a `retry()` helper with 3 attempts and linear backoff.
- `./hive` run: retried only when too few tests were parsed (tests < 4), the signature of a setup/image-build failure. A completed run (tests >= 4) is judged on its first result and never retried.

Why this is safe for the merge queue

[…] `CI-GUIDELINES.md` explicitly discourages as a flake mask).

Testing

- […] `bash -e -o pipefail` (the shell GitHub uses for `run:` steps): infra-fail-then-recover → passes after retry; genuine failure → not retried, fails immediately; persistent infra failure → retries to max then fails.
- `actionlint` clean: no new shellcheck findings (and removes one pre-existing SC2181).
- `make lint` → 0 issues.

🤖 Generated with Claude Code
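The gating rule for the `./hive` run, and the three scenarios it was tested against, could be sketched roughly as follows. This is an illustration under stated assumptions, not the workflow's actual script: the function names, the `MAX_ATTEMPTS` cap, and the `Tests: <N>, Failed: <M>` summary format are hypothetical (the format is inferred from the `Tests: , Failed:` line quoted earlier). Only the threshold of 4 comes from the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

MIN_TESTS=4      # threshold stated in the PR description
MAX_ATTEMPTS=3   # assumption: the source does not give the cap for the run loop

# parsed_tests <log_file>: print the test count found in the log, or 0.
# Assumes a summary line shaped like "Tests: <N>, Failed: <M>".
parsed_tests() {
  local n
  n=$(sed -n 's/.*Tests: *\([0-9][0-9]*\).*/\1/p' "$1" 2>/dev/null | tail -n 1) || true
  echo "${n:-0}"
}

# run_with_setup_retry <log_file> <command...>
# Re-runs the command only while fewer than MIN_TESTS tests were parsed.
# A run that parsed enough tests returns its own exit status unchanged.
run_with_setup_retry() {
  local log=$1
  shift
  local attempt=1 status tests
  while (( attempt <= MAX_ATTEMPTS )); do
    status=0
    "$@" >"$log" 2>&1 || status=$?
    tests=$(parsed_tests "$log")
    if (( tests >= MIN_TESTS )); then
      return "$status"   # genuine result: judged on this attempt, never retried
    fi
    echo "attempt ${attempt}: only ${tests} tests parsed; treating as setup failure" >&2
    attempt=$(( attempt + 1 ))
  done
  return 1               # persistent setup failure
}
```

Because the decision to retry depends only on the parsed test count, a completed run with failing tests is returned immediately with its failing status, which keeps the retry from masking genuine failures.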