ci: build hive & hive-eest erigon clients from local source, not an ephemeral merge-queue ref#21447

Merged
yperbasis merged 9 commits into main from yperbasis/hive-local-build
May 28, 2026
Conversation

@yperbasis yperbasis commented May 27, 2026

Member

Problem

In the merge queue, the hive / hive-eest jobs intermittently fail with "all clients failed to build":

Cloning: erigontech/erigon - gh-readonly-queue/main/pr-XXXXX-<sha>
fatal: Remote branch gh-readonly-queue/main/pr-XXXXX-<sha> not found in upstream origin
→ image build failed (128) → "too few tests" → ci-gate failure

test-hive.yml and test-hive-eest.yml build the hive erigon client via clients/erigon/Dockerfile.git, which runs git clone --depth 1 --branch $tag https://github.com/$github. For merge_group events GITHUB_HEAD_REF is empty, so $tag falls back to ${GITHUB_REF#refs/heads/}, which is the ephemeral gh-readonly-queue/... ref. When the queue re-forms the group or a runner is slow to start, GitHub deletes that ref before the in-container clone runs, and the clone fails. This evicts PRs from the merge queue (e.g. #21421) regardless of whether the evicted PR's code is at fault.

actions/checkout doesn't hit this — GitHub reliably provides the merge_group commit to the run; only the independent git clone --branch <ephemeral-ref> inside hive's builder races the ref's deletion.
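
The fallback described above can be sketched in shell. This is a minimal illustration, not the actual workflow code: `resolve_tag` is a hypothetical helper, and the ref values passed to it are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper mimicking the $tag fallback described above.
# $1 = value of GITHUB_HEAD_REF (empty for merge_group events)
# $2 = value of GITHUB_REF
resolve_tag() {
  local head_ref="$1" ref="$2"
  if [ -n "$head_ref" ]; then
    # pull_request events: the PR branch name is available directly.
    printf '%s\n' "$head_ref"
  else
    # merge_group events: strip the refs/heads/ prefix from GITHUB_REF.
    printf '%s\n' "${ref#refs/heads/}"
  fi
}

# pull_request: resolves to the PR branch, which exists on the remote.
resolve_tag "yperbasis/hive-local-build" "refs/pull/21447/merge"
# prints: yperbasis/hive-local-build

# merge_group: resolves to the ephemeral queue ref, which GitHub may delete
# before the in-container clone runs.
resolve_tag "" "refs/heads/gh-readonly-queue/main/pr-XXXXX-<sha>"
# prints: gh-readonly-queue/main/pr-XXXXX-<sha>
```

The second call shows why only merge-queue runs are affected: the resolved name points at a branch whose lifetime is controlled by the queue, not by the PR author.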

Fix

Build the erigon image from the checked-out commit and wrap it with hive's default prebuilt-image Dockerfile (FROM $baseimage:$tag), in both hive workflows:

  • check out the full erigon source into a fresh erigon-full path. The old sparse erigon-src path leaves a stale .git/info/sparse-checkout on the reused self-hosted hive runners; a full checkout into that same path doesn't clear it, so the tree stays sparse, cmd/erigon is missing, and the build fails (cd ./cmd/erigon: No such file or directory). A fresh path no sparse run has touched avoids that.
  • plain docker build -t hive/erigon:cilocal on the host daemon (persistent layer cache on the reused runners) — not a shared type=gha cache, so no concurrent-writer 504s and no cross-workflow scope contention (the failure mode test-kurtosis-assertoor.yml centralizes its build to avoid).
  • point hive's default client Dockerfile at that local image (baseimage=hive/erigon, tag=cilocal) instead of mv-ing in Dockerfile.git.
  • drop the now-unused SOURCE_REPO / branch_name / clone-arg / builder-Go-version plumbing.

-docker.pull defaults to false (CI doesn't pass it), so FROM hive/erigon:cilocal resolves from the local daemon, never force-pulled. erigon's image is debian:13-slim with erigon on PATH, so hive's default Dockerfile (apt-get, erigon --version) layers on top cleanly — on both ethereum/hive and the erigontech/hive fork used by hive-eest.

test-hive.yml built the hive erigon client via Dockerfile.git, which runs
`git clone --depth 1 --branch $tag` where $tag is the branch under test.
For merge_group events GITHUB_HEAD_REF is empty, so $tag became the
ephemeral gh-readonly-queue ref. When the queue re-forms the group or a
runner is slow to start, GitHub deletes that ref before the in-container
clone runs — the build fails ("Remote branch ... not found", exit 128),
all hive clients fail to build, and ci-gate fails.

Build the erigon image from the checked-out commit (actions/checkout
reliably provides the merge_group commit) and wrap it with hive's default
prebuilt-image client Dockerfile, mirroring the kurtosis build-erigon-image
job. No remote-ref dependency, so the race is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis marked this pull request as draft May 27, 2026 12:13
yperbasis and others added 3 commits May 27, 2026 14:15
The hive-group runners reuse the checkout dir, and the previous sparse
checkout (go.mod + hive-versions.json) left a stale .git/info/sparse-checkout
in `erigon-src`. A full checkout into the same path didn't clear it, so the
tree stayed sparse, cmd/erigon was missing, and the local erigon image build
failed (cd ./cmd/erigon: No such file or directory). Use a fresh path that
no sparse run has touched so every runner gets a complete tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
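
The stale-sparse-checkout condition in the commit message above can be detected with a small check. This is a sketch under assumptions: `has_sparse_spec` is a hypothetical helper, and it only tests for a non-empty `.git/info/sparse-checkout` file, which is a heuristic rather than a full query of git's sparse-checkout state.

```shell
#!/usr/bin/env bash
# Hypothetical check: succeeds if the directory carries a non-empty
# sparse-checkout spec left behind by an earlier sparse run.
has_sparse_spec() {
  [ -s "$1/.git/info/sparse-checkout" ]
}

# Report on the old and the new checkout paths named in the commit message.
for dir in erigon-src erigon-full; do
  if has_sparse_spec "$dir"; then
    echo "$dir: leftover sparse-checkout spec found"
  else
    echo "$dir: no sparse-checkout spec"
  fi
done
```

On a reused runner, a hit for `erigon-src` would explain a missing `cmd/erigon` directory even after a nominally full checkout into that path.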
test-hive.yml now builds the erigon image with a GHA build cache (same as
the already-exempted test-kurtosis-* workflows), so it trips the
cache-poisoning audit. The cache is scoped and feeds only ephemeral test
images, never a published artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis added the QA label May 27, 2026
@yperbasis

Member Author

Manually dispatched Test Hive (test-hive.yml) on this branch to validate the local-image build, since the PR is in draft and the hive job is draft-gated (so it doesn't run in CI Gate here).

Run: https://github.com/erigontech/erigon/actions/runs/26518396881

This exercises the new path: full erigon-full checkout → local docker build of the erigon image → Hive wraps it via the prebuilt-image Dockerfile (no clone of the ephemeral merge-queue ref).

test-hive-eest.yml had the same sparse erigon-src checkout + clone-based
Dockerfile.git mechanism as test-hive.yml, so it hit both the
ephemeral-merge-queue-ref clone race and the stale-sparse-checkout issue on
reused hive runners. Same fix: full erigon-full checkout + local docker
build + wrap Hive's prebuilt-image client Dockerfile (the erigontech/hive
fork's clients/erigon/Dockerfile carries the same baseimage/tag ARGs).
Exempt it from the zizmor cache-poisoning rule like the other
artifact-building workflows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis changed the title from "ci: build hive erigon client from local source, not an ephemeral merge-queue ref" to "ci: build hive & hive-eest erigon clients from local source, not an ephemeral merge-queue ref" May 27, 2026
@yperbasis

Member Author

Extended the same fix to test-hive-eest.yml (identical sparse-erigon-src + clone-Dockerfile.git mechanism → same ephemeral-ref + stale-sparse-checkout bugs).

Manually dispatched Test hive-eest to validate it the same way (it's also draft-gated in CI Gate):
Run: https://github.com/erigontech/erigon/actions/runs/26519286240

For reference, the hive dispatch already proved the approach — its Build erigon image from local source step succeeded and the suites are running: https://github.com/erigontech/erigon/actions/runs/26518396881

Copilot AI left a comment

Contributor


Pull request overview

This PR hardens Hive-based CI jobs against merge-queue flakiness by removing the in-container git clone --branch <ephemeral gh-readonly-queue ref> dependency and instead building an Erigon image from the already checked-out workspace commit, then pointing Hive’s client Dockerfile at that locally built image.

Changes:

  • Switch Hive and Hive-EEST workflows to full checkout of the Erigon repo and build hive/erigon:cilocal from local source via docker/build-push-action.
  • Update Hive client Dockerfile patching to reference the locally built base image/tag instead of swapping in Dockerfile.git and cloning from GitHub.
  • Exempt the updated workflows from zizmor’s cache-poisoning rule due to the introduced GHA BuildKit cache usage.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| .github/zizmor.yml | Adds test-hive.yml and test-hive-eest.yml to the cache-poisoning ignore list to accommodate new BuildKit GHA cache usage. |
| .github/workflows/test-hive.yml | Builds Erigon image from local checkout and repoints Hive’s client Dockerfile to the local image to avoid ephemeral merge-queue refs. |
| .github/workflows/test-hive-eest.yml | Applies the same local-image build approach to Hive-EEST and updates fixture sourcing paths accordingly. |


Comment thread .github/workflows/test-hive.yml Outdated
Comment thread .github/workflows/test-hive-eest.yml Outdated
Per Copilot review: docker/build-push-action with cache-to: type=gha made
every matrix job a concurrent writer to one cache scope (shared across
test-hive and test-hive-eest) — the "many writers to one type=gha scope"
504 failure mode that test-kurtosis-assertoor.yml centralizes its build to
avoid. Use a plain `docker build` on the host daemon (persistent layer cache
on the reused hive runners), matching the original clone-based build's
caching. Drops the now-unneeded zizmor cache-poisoning exemptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Contributor


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment thread .github/workflows/test-hive.yml
Comment thread .github/workflows/test-hive.yml
Comment thread .github/workflows/test-hive-eest.yml
Per Copilot review: the `sed -i` that repoints clients/erigon/Dockerfile at
hive/erigon:cilocal no-ops silently if upstream changes the ARG lines, which
would leave hive using the remote erigontech/erigon image. Add a grep guard
that exits non-zero if the rewrite didn't take.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
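
The rewrite-plus-guard pattern from the commit message above can be sketched as follows. This is an illustration, not the workflow's actual step: `repoint_dockerfile` is a hypothetical helper, the `ARG baseimage=` / `ARG tag=` line shapes and the `latest` default tag are assumptions about the upstream Dockerfile, and `sed -i` is used in its GNU form (as on Linux runners).

```shell
#!/usr/bin/env bash
# Hypothetical helper: repoint a hive client Dockerfile at a locally built
# image, then verify the rewrite took effect.
# $1 = Dockerfile path, $2 = base image, $3 = tag
repoint_dockerfile() {
  local dockerfile="$1" image="$2" tag="$3"
  # Assumed line shapes: "ARG baseimage=..." and "ARG tag=...".
  sed -i \
    -e "s|^ARG baseimage=.*|ARG baseimage=${image}|" \
    -e "s|^ARG tag=.*|ARG tag=${tag}|" \
    "$dockerfile" || return 1
  # sed exits 0 even when no line matched, so check the result explicitly.
  if ! grep -qxF "ARG baseimage=${image}" "$dockerfile" ||
     ! grep -qxF "ARG tag=${tag}" "$dockerfile"; then
    echo "error: ${dockerfile} was not repointed at ${image}:${tag}" >&2
    return 1
  fi
}

# Usage on a throwaway file standing in for clients/erigon/Dockerfile.
tmp="$(mktemp)"
printf 'ARG baseimage=erigontech/erigon\nARG tag=latest\nFROM $baseimage:$tag\n' > "$tmp"
repoint_dockerfile "$tmp" hive/erigon cilocal && cat "$tmp"
rm -f "$tmp"
```

Matching the whole line with `grep -x` keeps the guard strict: if upstream renames or restructures the ARG lines, the step exits non-zero instead of silently building against the remote image.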
@yperbasis yperbasis marked this pull request as ready for review May 27, 2026 18:30
@yperbasis yperbasis requested a review from taratorio May 27, 2026 18:38

@Giulio2002 Giulio2002 left a comment

Contributor


LGTM: obviously small/trivial change (108 lines).

@taratorio taratorio requested a review from anacrolix May 28, 2026 07:33
@yperbasis yperbasis added this pull request to the merge queue May 28, 2026
Merged via the queue into main with commit 76bd1e1 May 28, 2026
91 checks passed
@yperbasis yperbasis deleted the yperbasis/hive-local-build branch May 28, 2026 16:15
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request May 29, 2026
…igontech#21504)

## Why

Both merge-queue failures of erigontech#21374 were transient CI infrastructure
blips on network-dependent steps that had **no retry** — not problems
with the PR's code:

- **Docker Hub registry error** building hive's `hiveproxy` image: `Head
https://registry-1.docker.io/v2/library/alpine/manifests/latest:
unknown:` → `Tests: , Failed:` (zero tests ran).
- **github.com 403** during the in-builder erigon clone (`fatal: unable
to access 'https://github.com/erigontech/erigon/': 403`). That specific
path is already largely addressed by erigontech#21447 (build erigon locally
instead of cloning inside hive's builder), but the Docker Hub base-image
pulls remain in both `docker build` and `./hive`.

The merge-queue contract (per `CI-GUIDELINES.md`) is "a failure means
the code is wrong — zero false positives." These infra blips are exactly
the false positives that contract forbids, and they re-queue whole
batches.

## What

Add a small retry to the network-dependent steps in both `test-hive.yml`
and `test-hive-eest.yml`:

- **Build steps** (`docker build` of the local erigon image, `go get`,
`go build`): wrapped in a `retry()` helper — 3 attempts, linear backoff.
- **`./hive` run**: retried **only when too few tests were parsed**
(`tests < 4`) — the signature of a setup/image-build failure. A
completed run (`tests >= 4`) is judged on its first result and never
retried.
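
A minimal sketch of the two pieces described above, assuming a 3-attempt linear backoff and the `tests < 4` threshold from the text. The function names `retry` arguments and `should_retry_hive` are hypothetical and are not taken from the actual workflow files.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper: run a command up to $1 times, sleeping
# attempt * $2 seconds between failed attempts (linear backoff).
retry() {
  local max="$1" step="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt ${attempt} failed, sleeping $((attempt * step))s" >&2
    sleep "$((attempt * step))"
    attempt=$((attempt + 1))
  done
}

# Hypothetical gate for the ./hive run: retry only when too few tests were
# parsed (an empty count is treated as 0), never after a completed run.
should_retry_hive() {
  local tests="${1:-0}" min_tests="${2:-4}"
  [ "$tests" -lt "$min_tests" ]
}

# Example shape of a build step (not executed here):
#   retry 3 5 docker build -t hive/erigon:cilocal .
```

Keeping the gate separate from the helper means a run that parsed enough tests is judged on its first result, so a genuine test failure does not pass through the retry path.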

## Why this is safe for the merge queue

- A **genuine test failure is never retried** — only the *fast*
infra-setup failure path is, so the retry cannot mask a real regression.
- Because retries only trigger on the fast-fail (image build dying in
seconds), added latency is seconds + backoff, not multiplied test
runtime.
- This is step-level resilience, not reliance on merge-queue re-runs
(which `CI-GUIDELINES.md` explicitly discourages as a flake mask).

## Testing

- Verified the retry logic locally under `bash -e -o pipefail` (the
shell GitHub uses for `run:` steps): infra-fail-then-recover → passes
after retry; genuine failure → not retried, fails immediately;
persistent infra failure → retries to max then fails.
- `actionlint` clean — no new shellcheck findings (and removes one
pre-existing SC2181).
- `make lint` → 0 issues.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>