ci: build hive & hive-eest erigon clients from local source, not an ephemeral merge-queue ref #21447
test-hive.yml built the hive erigon client via Dockerfile.git, which runs
`git clone --depth 1 --branch $tag` where $tag is the branch under test.
For merge_group events GITHUB_HEAD_REF is empty, so $tag became the
ephemeral gh-readonly-queue ref. When the queue re-forms the group or a
runner is slow to start, GitHub deletes that ref before the in-container
clone runs — the build fails ("Remote branch ... not found", exit 128),
all hive clients fail to build, and ci-gate fails.
Build the erigon image from the checked-out commit (actions/checkout
reliably provides the merge_group commit) and wrap it with hive's default
prebuilt-image client Dockerfile, mirroring the kurtosis build-erigon-image
job. No remote-ref dependency, so the race is gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
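The ref fallback described in this commit message can be sketched as follows. This is an illustration only, not the workflow's actual code: `resolve_clone_ref` and `is_ephemeral_queue_ref` are hypothetical helper names, and the sketch assumes only the behaviour stated in this PR (an empty `GITHUB_HEAD_REF` falls back to `GITHUB_REF` with the `refs/heads/` prefix stripped).

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the branch name that an in-container
# `git clone --branch` would receive. Uses GITHUB_HEAD_REF when it is set,
# otherwise GITHUB_REF with the refs/heads/ prefix removed.
resolve_clone_ref() {
  local head_ref="${GITHUB_HEAD_REF:-}"
  local ref="${GITHUB_REF:-}"
  if [ -n "$head_ref" ]; then
    printf '%s\n' "$head_ref"
  else
    printf '%s\n' "${ref#refs/heads/}"
  fi
}

# Hypothetical helper: succeeds when the ref lives under the merge-queue
# namespace, i.e. it may be deleted before a later clone runs.
is_ephemeral_queue_ref() {
  case "$1" in
    gh-readonly-queue/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A pre-build step could call `is_ephemeral_queue_ref "$(resolve_clone_ref)"` to detect the risky case; building from the already checked-out commit, as this PR does, removes the need for such a check.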
The hive-group runners reuse the checkout dir, and the previous sparse checkout (go.mod + hive-versions.json) left a stale .git/info/sparse-checkout in `erigon-src`. A full checkout into the same path didn't clear it, so the tree stayed sparse, cmd/erigon was missing, and the local erigon image build failed (cd ./cmd/erigon: No such file or directory). Use a fresh path that no sparse run has touched so every runner gets a complete tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
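As a diagnostic for the stale-sparse-checkout symptom described here, a preflight check along these lines could be run before the image build. It is a sketch under stated assumptions, not something this PR adds: `checkout_is_complete` is a hypothetical name, and treating any non-empty `.git/info/sparse-checkout` file as a sign of a sparse tree is a conservative heuristic.

```shell
#!/usr/bin/env bash
# Hypothetical preflight: returns non-zero when the checkout at $1 looks
# sparse or is missing the cmd/erigon directory the image build needs.
checkout_is_complete() {
  local dir="$1"
  # Heuristic (assumption): a non-empty sparse-checkout file left behind by
  # an earlier sparse run suggests the tree may still be sparse.
  if [ -s "$dir/.git/info/sparse-checkout" ]; then
    echo "stale sparse-checkout file in $dir" >&2
    return 1
  fi
  if [ ! -d "$dir/cmd/erigon" ]; then
    echo "cmd/erigon missing from $dir" >&2
    return 1
  fi
  return 0
}
```

The PR itself sidesteps the problem by checking out into a path no sparse run has touched, which needs no such check.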
test-hive.yml now builds the erigon image with a GHA build cache (same as the already-exempted test-kurtosis-* workflows), so it trips the cache-poisoning audit. The cache is scoped and feeds only ephemeral test images, never a published artifact. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Manually dispatched Test Hive ( Run: https://github.com/erigontech/erigon/actions/runs/26518396881 This exercises the new path: full |
test-hive-eest.yml had the same sparse erigon-src checkout + clone-based Dockerfile.git mechanism as test-hive.yml, so it hit both the ephemeral-merge-queue-ref clone race and the stale-sparse-checkout issue on reused hive runners. Same fix: full erigon-full checkout + local docker build + wrap Hive's prebuilt-image client Dockerfile (the erigontech/hive fork's clients/erigon/Dockerfile carries the same baseimage/tag ARGs). Exempt it from the zizmor cache-poisoning rule like the other artifact-building workflows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extended the same fix to test-hive-eest.yml (identical sparse- Manually dispatched Test hive-eest to validate it the same way (it's also draft-gated in CI Gate): For reference, the hive dispatch already proved the approach — its |
Pull request overview
This PR hardens Hive-based CI jobs against merge-queue flakiness by removing the in-container git clone --branch <ephemeral gh-readonly-queue ref> dependency and instead building an Erigon image from the already checked-out workspace commit, then pointing Hive’s client Dockerfile at that locally built image.
Changes:
- Switch Hive and Hive-EEST workflows to full checkout of the Erigon repo and build `hive/erigon:cilocal` from local source via `docker/build-push-action`.
- Update Hive client Dockerfile patching to reference the locally built base image/tag instead of swapping in `Dockerfile.git` and cloning from GitHub.
- Exempt the updated workflows from zizmor's cache-poisoning rule due to the introduced GHA BuildKit cache usage.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| .github/zizmor.yml | Adds test-hive.yml and test-hive-eest.yml to the cache-poisoning ignore list to accommodate new BuildKit GHA cache usage. |
| .github/workflows/test-hive.yml | Builds Erigon image from local checkout and repoints Hive’s client Dockerfile to the local image to avoid ephemeral merge-queue refs. |
| .github/workflows/test-hive-eest.yml | Applies the same local-image build approach to Hive-EEST and updates fixture sourcing paths accordingly. |
Per Copilot review: docker/build-push-action with cache-to: type=gha made every matrix job a concurrent writer to one cache scope (shared across test-hive and test-hive-eest) — the "many writers to one type=gha scope" 504 failure mode that test-kurtosis-assertoor.yml centralizes its build to avoid. Use a plain `docker build` on the host daemon (persistent layer cache on the reused hive runners), matching the original clone-based build's caching. Drops the now-unneeded zizmor cache-poisoning exemptions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Copilot review: the `sed -i` that repoints clients/erigon/Dockerfile at hive/erigon:cilocal no-ops silently if upstream changes the ARG lines, which would leave hive using the remote erigontech/erigon image. Add a grep guard that exits non-zero if the rewrite didn't take. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
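A minimal sketch of the rewrite-plus-guard pattern this commit describes is shown below. The function name `repoint_client_dockerfile` is hypothetical, and the exact shape of the ARG lines (`ARG baseimage=...` and `ARG tag=...` at the start of a line) is an assumption; the real workflow's patterns may differ. It also assumes GNU `sed`, whose `-i` takes no mandatory suffix.

```shell
#!/usr/bin/env bash
# Hypothetical helper: point a hive client Dockerfile at the locally built
# image, then verify the rewrite actually happened.
repoint_client_dockerfile() {
  local dockerfile="$1"
  # Assumed ARG line shapes; adjust the patterns to the real Dockerfile.
  sed -i \
    -e 's|^ARG baseimage=.*|ARG baseimage=hive/erigon|' \
    -e 's|^ARG tag=.*|ARG tag=cilocal|' \
    "$dockerfile"
  # sed exits 0 even when no line matched, so check the result explicitly.
  if ! grep -q '^ARG baseimage=hive/erigon$' "$dockerfile" ||
     ! grep -q '^ARG tag=cilocal$' "$dockerfile"; then
    echo "failed to repoint $dockerfile at hive/erigon:cilocal" >&2
    return 1
  fi
}
```

Failing loudly here keeps a silent no-op from leaving hive on the remote image while the job still appears to test the local build.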
Giulio2002
left a comment
LGTM: obviously small/trivial change (108 lines).
…igontech#21504)

## Why

Both merge-queue failures of erigontech#21374 were transient CI infrastructure blips on network-dependent steps that had **no retry**, not problems with the PR's code:

- **Docker Hub registry error** building hive's `hiveproxy` image: `Head https://registry-1.docker.io/v2/library/alpine/manifests/latest: unknown:` → `Tests: , Failed:` (zero tests ran).
- **github.com 403** during the in-builder erigon clone (`fatal: unable to access 'https://github.com/erigontech/erigon/': 403`). That specific path is already largely addressed by erigontech#21447 (build erigon locally instead of cloning inside hive's builder), but the Docker Hub base-image pulls remain in both `docker build` and `./hive`.

The merge-queue contract (per `CI-GUIDELINES.md`) is "a failure means the code is wrong: zero false positives." These infra blips are exactly the false positives that contract forbids, and they re-queue whole batches.

## What

Add a small retry to the network-dependent steps in both `test-hive.yml` and `test-hive-eest.yml`:

- **Build steps** (`docker build` of the local erigon image, `go get`, `go build`): wrapped in a `retry()` helper with 3 attempts and linear backoff.
- **`./hive` run**: retried **only when too few tests were parsed** (`tests < 4`), the signature of a setup/image-build failure. A completed run (`tests >= 4`) is judged on its first result and never retried.

## Why this is safe for the merge queue

- A **genuine test failure is never retried**; only the *fast* infra-setup failure path is, so the retry cannot mask a real regression.
- Because retries only trigger on the fast-fail (image build dying in seconds), added latency is seconds + backoff, not multiplied test runtime.
- This is step-level resilience, not reliance on merge-queue re-runs (which `CI-GUIDELINES.md` explicitly discourages as a flake mask).

## Testing

- Verified the retry logic locally under `bash -e -o pipefail` (the shell GitHub uses for `run:` steps): infra-fail-then-recover → passes after retry; genuine failure → not retried, fails immediately; persistent infra failure → retries to max then fails.
- `actionlint` clean: no new shellcheck findings (and removes one pre-existing SC2181).
- `make lint` → 0 issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
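The retry behaviour described in this commit (3 attempts, linear backoff, and a hive re-run only when fewer than 4 tests were parsed) could look roughly like the sketch below. It is not the workflow's actual helper: the `RETRY_MAX` and `RETRY_DELAY` variables and the `is_infra_failure` name are hypothetical, and the 5-second base delay is an assumed value.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper: runs "$@" up to RETRY_MAX times (default 3),
# sleeping attempt * RETRY_DELAY seconds after each failure (linear backoff).
retry() {
  local max="${RETRY_MAX:-3}"
  local delay="${RETRY_DELAY:-5}"   # assumed base delay in seconds
  local attempt=1
  while true; do
    # Running the command as an `if` condition keeps `bash -e` from aborting.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed, retrying: $*" >&2
    sleep $((attempt * delay))
    attempt=$((attempt + 1))
  done
}

# Hypothetical predicate: a hive run counts as an infra failure (and is
# eligible for a retry) only when fewer than 4 tests were parsed.
is_infra_failure() {
  [ "${1:-0}" -lt 4 ]
}
```

A build step would then read `retry docker build -t hive/erigon:cilocal .`, while the hive step would consult `is_infra_failure "$tests"` before deciding to run again, so a completed run with real failures is never repeated.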
Problem
In the merge queue, hive / hive-eest jobs intermittently fail with `all clients failed to build`:

`test-hive.yml` / `test-hive-eest.yml` build the hive erigon client via `clients/erigon/Dockerfile.git`, which runs `git clone --depth 1 --branch $tag https://github.com/$github`. For `merge_group` events `GITHUB_HEAD_REF` is empty, so `$tag` falls back to `${GITHUB_REF#refs/heads/}` = the ephemeral `gh-readonly-queue/...` ref. When the queue re-forms the group or a runner is slow to start, GitHub deletes that ref before the in-container clone runs → the clone fails. This evicts PRs from the merge queue (e.g. #21421), independent of the failing PR's code.

`actions/checkout` doesn't hit this: GitHub reliably provides the `merge_group` commit to the run; only the independent `git clone --branch <ephemeral-ref>` inside hive's builder races the ref's deletion.

Fix
Build the erigon image from the checked-out commit and wrap it with hive's default prebuilt-image `Dockerfile` (`FROM $baseimage:$tag`), in both hive workflows:

- Full checkout into a fresh `erigon-full` path. The old sparse `erigon-src` path leaves a stale `.git/info/sparse-checkout` on the reused self-hosted hive runners; a full checkout into that same path doesn't clear it, so the tree stays sparse, `cmd/erigon` is missing, and the build fails (`cd ./cmd/erigon: No such file or directory`). A fresh path no sparse run has touched avoids that.
- Plain `docker build -t hive/erigon:cilocal` on the host daemon (persistent layer cache on the reused runners), not a shared `type=gha` cache, so no concurrent-writer 504s and no cross-workflow scope contention (the failure mode `test-kurtosis-assertoor.yml` centralizes its build to avoid).
- Repoint hive's default client `Dockerfile` at the local image (`baseimage=hive/erigon`, `tag=cilocal`) instead of `mv`-ing in `Dockerfile.git`.
- Drop the `SOURCE_REPO` / `branch_name` / clone-arg / builder-Go-version plumbing.

`-docker.pull` defaults to false (CI doesn't pass it), so `FROM hive/erigon:cilocal` resolves from the local daemon, never force-pulled. erigon's image is `debian:13-slim` with `erigon` on PATH, so hive's default Dockerfile (`apt-get`, `erigon --version`) layers on top cleanly, on both `ethereum/hive` and the `erigontech/hive` fork used by hive-eest.