ci: retry transient image-build/registry failures in hive CI gate #21504

Merged: yperbasis merged 1 commit into main from yperbasis/hive-ci-retry on May 29, 2026

@yperbasis

Member

Why

Both merge-queue failures of #21374 were transient CI infrastructure blips on network-dependent steps that had no retry — not problems with the PR's code:

  • Docker Hub registry error building hive's hiveproxy image (Head https://registry-1.docker.io/v2/library/alpine/manifests/latest: unknown), which left the result summary empty: "Tests: , Failed: " (zero tests ran).
  • github.com 403 during the in-builder erigon clone (fatal: unable to access 'https://github.com/erigontech/erigon/': 403). That specific path is already largely addressed by #21447 ("ci: build hive & hive-eest erigon clients from local source, not an ephemeral merge-queue ref"), which builds erigon locally instead of cloning inside hive's builder, but the Docker Hub base-image pulls remain in both docker build and ./hive.

The merge-queue contract (per CI-GUIDELINES.md) is "a failure means the code is wrong — zero false positives." These infra blips are exactly the false positives that contract forbids, and they re-queue whole batches.

What

Add a small retry to the network-dependent steps in both test-hive.yml and test-hive-eest.yml:

  • Build steps (docker build of the local erigon image, go get, go build): wrapped in a retry() helper — 3 attempts, linear backoff.
  • ./hive run: retried only when too few tests were parsed (tests < 4) — the signature of a setup/image-build failure. A completed run (tests >= 4) is judged on its first result and never retried.
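
A minimal sketch of what a retry() helper with 3 attempts and linear backoff could look like under bash -e -o pipefail. This is not the code from the workflows; the RETRY_MAX and RETRY_DELAY variables and the 10-second base delay are assumptions made for illustration.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Hypothetical helper: run a command up to RETRY_MAX times with linear backoff.
# RETRY_MAX and RETRY_DELAY are illustrative names, not taken from the workflows.
retry() {
  local max="${RETRY_MAX:-3}" delay="${RETRY_DELAY:-10}" attempt=1
  # Using the command as the "until" condition keeps a failure from tripping "set -e".
  until "$@"; do
    if (( attempt >= max )); then
      echo "retry: '$*' failed after ${attempt} attempts" >&2
      return 1
    fi
    echo "retry: attempt ${attempt}/${max} failed; sleeping $(( attempt * delay ))s" >&2
    sleep "$(( attempt * delay ))"   # linear backoff: delay, 2*delay, ...
    attempt=$(( attempt + 1 ))
  done
}
```

A build step would then be wrapped as, for example, `retry go build .`; the wrapped command's own exit status still decides the step once the attempts are exhausted.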

Why this is safe for the merge queue

  • A genuine test failure is never retried — only the fast infra-setup failure path is, so the retry cannot mask a real regression.
  • Because retries only trigger on the fast-fail (image build dying in seconds), added latency is seconds + backoff, not multiplied test runtime.
  • This is step-level resilience, not reliance on merge-queue re-runs (which CI-GUIDELINES.md explicitly discourages as a flake mask).
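
The gating rule behind these points can be sketched as follows. This is a simplified model, not the workflow code: it assumes the suite command prints the parsed counts as "<tests> <failed>" on one line, and MAX_ATTEMPTS, BACKOFF, and run_gated are hypothetical names. Only the threshold of 4 comes from the description above.

```bash
#!/usr/bin/env bash
set -eo pipefail

MIN_TESTS=4                 # from the PR: fewer parsed tests signals a setup failure
MAX_ATTEMPTS=3              # assumption: same attempt count as the build retry
BACKOFF="${BACKOFF:-5}"     # assumption: base delay in seconds

# run_gated CMD...: CMD must print "<tests> <failed>" (an assumed interface).
run_gated() {
  local attempt tests failed
  for (( attempt = 1; attempt <= MAX_ATTEMPTS; attempt++ )); do
    tests=0 failed=0
    read -r tests failed < <("$@") || true
    tests="${tests:-0}" failed="${failed:-0}"
    if (( tests >= MIN_TESTS )); then
      # Completed run: judged on this result and never retried.
      if (( failed == 0 )); then return 0; fi
      return 1
    fi
    echo "only ${tests} tests parsed (attempt ${attempt}/${MAX_ATTEMPTS}); treating as setup failure" >&2
    if (( attempt < MAX_ATTEMPTS )); then sleep "$(( attempt * BACKOFF ))"; fi
  done
  return 1   # persistent setup failure still fails the step
}
```

Because the retry branch is reachable only when fewer than MIN_TESTS tests were parsed, a run that reports real failures exits on its first attempt.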

Testing

  • Verified the retry logic locally under bash -e -o pipefail (the shell GitHub uses for run: steps): infra-fail-then-recover → passes after retry; genuine failure → not retried, fails immediately; persistent infra failure → retries to max then fails.
  • actionlint clean — no new shellcheck findings (and removes one pre-existing SC2181).
  • make lint → 0 issues.

🤖 Generated with Claude Code

Both merge-queue failures of PR #21374 were transient infra blips on
network-dependent steps that had no retry: a Docker Hub registry error
building hive/hiveproxy (alpine pull), and a github.com 403 during the
in-builder erigon clone (the latter already addressed by #21447's
local-image build).

Wrap the Docker Hub / Go-proxy build steps (docker build of the local
erigon image, go get, go build) in a small retry helper, and retry the
./hive run only when too few tests were parsed -- the signature of a
setup/image-build failure, never a completed run. A genuine test result
(tests>=4) is judged on its first attempt, so the merge-queue contract
(a failure means the code is wrong) is preserved, and the retry path
only re-runs fast setup failures, not full test executions.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Contributor

Pull request overview

This PR improves resilience of the Hive CI gates by retrying transient network/image-build failures without retrying completed Hive test runs that meet the expected parsed-test threshold.

Changes:

  • Adds a small shell retry() helper around Docker image build and Hive Go dependency/build steps.
  • Adds retry loops around Hive execution when too few tests are parsed, preserving existing final failure checks.
  • Applies the same retry strategy to both regular Hive and Hive EEST workflows.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
|------|-------------|
| .github/workflows/test-hive.yml | Adds retries for local Erigon image build, Hive build, and too-few-tests Hive execution failures. |
| .github/workflows/test-hive-eest.yml | Mirrors retry behavior for Hive EEST image/build/run steps. |

@yperbasis yperbasis requested a review from taratorio May 29, 2026 09:05
@yperbasis yperbasis added the QA label May 29, 2026
@yperbasis yperbasis enabled auto-merge May 29, 2026 09:21
@yperbasis yperbasis requested a review from sudeepdino008 May 29, 2026 12:00
@yperbasis yperbasis added this pull request to the merge queue May 29, 2026
Merged via the queue into main with commit 80bba43 May 29, 2026
92 checks passed
@yperbasis yperbasis deleted the yperbasis/hive-ci-retry branch May 29, 2026 15:19
manusw7 pushed a commit to manusw7/erigon that referenced this pull request Jun 9, 2026
…lures (erigontech#21693)

## Problem

The `build-erigon-image` job in the kurtosis workflows
(`test-kurtosis-assertoor.yml`, `test-kurtosis-gloas.yml`)
intermittently fails on **transient infrastructure errors** unrelated to
the code — Docker Hub registry/auth blips and GitHub Actions
cache-backend hiccups. `docker/build-push-action` has no retry input, so
any such blip fails the whole job (and, in the merge queue, the run).

## Evidence (last ~30 days)

Scanning every failed `build-erigon-image` job across Cache Warming + CI
Gate runs, **4 of 15** failures were transient infra. The other 11 were
10 real compile breaks clustered on feature branches and one GitHub-CDN
action-download blip:

| Date | Branch | Signature |
|------|--------|-----------|
| 06-08 | main | Docker Hub `auth.docker.io/token` → `504 Gateway Timeout` |
| 06-02 | main | `registry-1.docker.io … context deadline exceeded` |
| 06-02 | glamsterdam-devnet-4 | `registry-1.docker.io … request canceled` (timeout) |
| 05-28 | main | GHA cache blob write 5xx (`…blob.core.windows.net/actions-cache…`) |

That's ~3–4/month (≈ once a week), and a floor — flakes that someone
re-ran to green don't show as failed runs.

## Change

Retry the `docker/build-push-action` step once: the first attempt gets
an `id` + `continue-on-error: true`, and a second step re-runs the
identical build only `if: steps.build_erigon_image.outcome ==
'failure'`. The BuildKit layer cache (and the in-job builder) make the
retry cheap — it reuses the slow Go compile and only re-attempts
whatever flaked (pull / auth / cache export). Applied to both kurtosis
workflows.

## Why this approach

- `docker/build-push-action@v6` has no `retry` input (verified against
its `action.yml`), so retry must be external.
- Keeping the action — vs. converting to a raw `docker buildx build`
bash loop — preserves the auto-wired GHA cache runtime token and avoids
adding a third-party action (`crazy-max/ghaction-github-runtime`).
- Retrying the whole step covers cache-export failures too, not just
registry pulls.

## Tradeoff

A retry can't distinguish a transient flake from a real compile error,
so genuine build breaks now take 2 attempts before failing (~2× feedback
time on a broken build). The retry step has no `continue-on-error`, so
real breaks still go red — they are not masked.

## Validation

- `actionlint` clean for both files (the two `SC2086` infos it prints
are pre-existing, in untouched `run:` blocks).
- Behaviour: attempt 1 succeeds → retry skipped; attempt 1 fails +
attempt 2 succeeds → job green, image present for the downstream
artifact/test steps; both fail → job red.
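
The three outcomes listed under Behaviour can be modelled in shell as a rough analogy. The real change uses two workflow steps, not a script, and `build_with_one_retry` is a hypothetical name used only for this sketch.

```bash
#!/usr/bin/env bash
set -eo pipefail

# Model of the two-step pattern: the first attempt's failure is tolerated
# (like continue-on-error: true), and the second attempt runs only if the
# first one failed. The second attempt's exit status decides the result.
build_with_one_retry() {
  if "$@"; then
    echo "attempt 1 ok; retry skipped"
    return 0
  fi
  echo "attempt 1 failed; retrying once" >&2
  "$@"   # no tolerance here: a second failure propagates (job red)
}
```

As in the workflow change, a genuine break is not masked: it fails on both attempts and costs roughly twice the time to report.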

## Precedent

Same pattern as erigontech#21604 (retry SonarCloud scan), erigontech#21602 (retry kurtosis
engine bootstrap), and erigontech#21504 (retry hive image-build/registry
failures).