ci: surface merge-queue root cause when fail-fast cancels the run by yperbasis · Pull Request #21445 · erigontech/erigon

yperbasis · 2026-05-27T08:01:35Z

Problem

The merge-queue fail-fast optimization (#20789) cancels the entire CI Gate run as soon as one leaf job fails, so the gate doesn't stall ~30 min waiting for the heavy jobs (hive/kurtosis/eest) to finish. The downside: gh run cancel cancels the whole run including the leaf that called it, so the real culprit's conclusion flips failure → cancelled — visually identical to every innocent sibling. The ci-gate aggregator then lumps failure and cancelled together and prints only job names, so finding the actual failure means drilling into step-level conclusions by hand.

This surfaced when #21426 (a green, approved one-line PR) was silently evicted from the merge queue: a flaky data race in TestHistoryVerification_SimpleBlocks tripped the fast-cancel, but the run showed a sea of "cancelled" jobs with no indication which one failed.

Fix

Keep the latency win; make the root cause prominent.

- Each leaf emits a GitHub ::error annotation right before gh run cancel. Only the true trigger reaches this step — collateral jobs have it skipped by the in-progress cancellation — so the annotation is attributed to the actual failing job and shows at the top of the run + in the PR Checks tab.
- The ci-gate aggregator now names the root cause instead of dumping ambiguous job names. It identifies the trigger precisely — the job whose "Cancel workflow run on failure" step actually ran (success) — and annotates it plus its failing step. It falls back to listing all genuinely-failed jobs on pull_request runs (where no fast-cancel fires).
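As a rough illustration of the first point, a leaf job's final step might look like the sketch below. The step name, the `gh run cancel` call, and the annotation title appear elsewhere in this PR; the `if:` condition, the message wording, and the use of `github.job` are assumptions, not the merged code.

```yaml
# Hypothetical final step of a leaf job (a sketch, not the actual workflow).
- name: Cancel workflow run on failure
  # Assumed guard: only fire for failures inside merge-queue runs.
  if: ${{ failure() && github.event_name == 'merge_group' }}
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # Emit the annotation first, so it is recorded before this job is cancelled too.
    echo "::error title=Merge-queue root-cause failure::${{ github.job }} failed and triggered the fast-cancel"
    gh run cancel ${{ github.run_id }} --repo ${{ github.repository }}
```

The ordering is the point: because `gh run cancel` also cancels the job that calls it, anything emitted after the cancel may never run. The token needs `actions: write` permission for the cancel call to succeed.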

Validation

actionlint clean on all changed files (no new findings; the pre-existing shellcheck infos are untouched).
Replayed the aggregator's query against the real failed run that evicted #21426 (node/cli: register --rpc.logs.maxresults in DefaultFlags so it takes effect via CLI). It now outputs exactly
::error … race-tests … (execution-other, serial) — failed step: Run execution-other tests,
isolating the real culprit and excluding the 4 collateral hive-eest failures.

No behavior change to the fast-cancel itself — only added visibility.

🤖 Generated with Claude Code

The merge-queue fail-fast (#20789) cancels the whole CI Gate run on the first leaf failure. That flips the failing leaf's conclusion to "cancelled" — indistinguishable from the innocent siblings — and the ci-gate aggregator lumps failure and cancelled together, so the real culprit is invisible without digging into step-level logs.

Surface it instead:

- each leaf emits an ::error annotation before cancelling (only the true trigger runs this step; collateral jobs are skipped mid-cancellation)
- the ci-gate aggregator names the root-cause leaf via the cancel-step-succeeded discriminator plus its failing step, falling back to all failed-step jobs on PR runs

Keeps the fail-fast latency win intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR improves CI failure diagnosis for merge-queue runs that use fail-fast cancellation by surfacing the true first-failing job (the one that triggered gh run cancel) via GitHub ::error annotations and more targeted ci-gate output.

Changes:

Add a ::error annotation to each leaf workflow immediately before it fast-cancels the run in merge_group failures.
Enhance ci-gate to query the workflow jobs API and emit an annotation identifying the fast-cancel trigger job (and its failing step), with a fallback that reports genuinely failing jobs when no fast-cancel trigger is present.
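The aggregator behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the merged ci-gate step: the step name `Report root cause`, the neutral title `CI failure`, the `failed_step` helper, and the exact jq filter are all hypothetical. Only the discriminator (a "Cancel workflow run on failure" step that concluded `success`) and the fallback to jobs with a failed step come from the PR text.

```yaml
# Hypothetical ci-gate step (a sketch, not the actual workflow).
- name: Report root cause
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # List every job of this run, one JSON object per line, then slurp into an array.
    gh api --paginate \
      "repos/${{ github.repository }}/actions/runs/${{ github.run_id }}/jobs?per_page=100" \
      --jq '.jobs[]' | jq -rs '
        # Name of the first failed step in a job, or "unknown" if none is recorded.
        def failed_step:
          ([.steps[]? | select(.conclusion == "failure") | .name] | first) // "unknown";
        # Trigger: the job whose cancel step actually ran to success.
        [ .[] | select(any(.steps[]?;
            .name == "Cancel workflow run on failure" and .conclusion == "success")) ] as $trigger
        # Fallback (e.g. pull_request runs): every job with a genuinely failed step.
        | (if ($trigger | length) > 0 then $trigger
           else [ .[] | select(any(.steps[]?; .conclusion == "failure")) ] end)
        | .[] | "::error title=CI failure::\(.name) - failed step: \(failed_step)"
      '
```

Matching on step-level conclusions rather than job-level ones is what separates the trigger from collateral jobs: after a run-wide cancel, both kinds of job report `cancelled`, but only the trigger has a cancel step that completed.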

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
.github/workflows/test-kurtosis-assertoor.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.
.github/workflows/test-integration-caplin.yml	Emit an `::error` annotation before fast-cancelling (Linux + Windows jobs).
.github/workflows/test-hive.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.
.github/workflows/test-hive-eest.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.
.github/workflows/test-eest-spec.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.
.github/workflows/test-bench.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.
.github/workflows/test-all-erigon.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.
.github/workflows/test-all-erigon-race.yml	Emit an `::error` annotation before fast-cancelling (multiple jobs).
.github/workflows/sonar.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.
.github/workflows/reproducible-build.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.
.github/workflows/lint.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.
.github/workflows/ci-gate.yml	Query run jobs to identify/annotate the true fast-cancel root cause (and failing step).
.github/workflows/check-large-files.yml	Emit an `::error` annotation before fast-cancelling merge-queue runs.

Giulio2002

LGTM — small, obviously safe CI workflow tweak that adds root-cause visibility before fail-fast cancellation in merge-queue runs.

The ci-gate aggregator's fallback fires on every pull_request failure (no fast-cancel there), so titling its annotation "Merge-queue root cause" mislabeled plain PR failures. Use a neutral title in both jq branches; merge-queue context is still carried by the leaf "Merge-queue root-cause failure" annotations, which only fire in merge_group.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.

…ckly (erigontech#21483)

## Problem

When a merge-queue run has a hive-eest shard fail, the failing job calls `gh run cancel ${{ github.run_id }}` (added in erigontech#21445). That sends SIGTERM to all in-flight matrix siblings, but the Docker-bound hive simulators take ~20 minutes to actually drain. `ci-gate` is `if: always()` and waits for every `needs` job to reach a terminal state, so the broken PR sits at `AWAITING_CHECKS` for the full drain time — blocking the head of the merge queue.

Concrete example from today (PR erigontech#21470 at position #1):

- 08:29:57 — `hive-eest / test-hive-eest (paris+shanghai, serial)` fails, calls `gh run cancel 26562610423`, emits the "Merge-queue root-cause failure" annotation from erigontech#21445.
- 08:48 (~19 min later) — paris+shanghai-parallel, prague-serial/parallel, cancun-serial/parallel, osaka-parallel, rlp-serial/parallel, and glamsterdam-devnet-parallel were all still `in_progress`.

Every other ci-gate child (tests, race-tests, eest-spec-tests, kurtosis, hive, lint, bench, repro, sonar, caplin) had already completed. The bottleneck was specifically the hive-eest matrix siblings.

## Fix

```yaml
strategy:
  fail-fast: ${{ github.event_name == 'merge_group' }}
```

- **In `merge_group`**: first failed shard immediately cancels all siblings at the GitHub API layer — much faster than the `gh run cancel` → SIGTERM → runner-drain path. ci-gate's `needs` reach terminal state in seconds, ci-gate fails, the broken PR is evicted.
- **In PR runs**: stays `false`, so authors still see the full failure breakdown across every shard. No regression in PR feedback.

## What's left in place and why

The per-job `gh run cancel` step (test-hive-eest.yml lines 311-317) stays. Two reasons:

- Matrix `fail-fast` only cancels siblings **within the same matrix** — it doesn't cancel sibling reusable workflows. If a future failure pattern leaks across workflows, `gh run cancel` still covers it.
- ci-gate.yml's root-cause annotator (line 188) keys off "the leaf that ran `gh run cancel` successfully" to single out the true root cause among collateral cancellations. Removing the step would silently regress erigontech#21445's attribution.

## Scope choice

Only `test-hive-eest.yml` is changed. Other matrix-bearing reusable workflows (`test-all-erigon.yml`, `test-all-erigon-race.yml`, `test-eest-spec.yml`, `test-kurtosis-assertoor.yml`, `test-hive.yml`, `test-bench.yml`) all use `fail-fast: false` too, but none of them were the queue-blocking long pole in this incident. Keeping the patch minimal; we can generalize if another workflow becomes the bottleneck.

## Tradeoff to be aware of

Queue runs will now show siblings as `cancelled` instead of `failed` whenever any one shard fails. That's the correct tradeoff in `merge_group` — the goal is fast eviction, not detailed diagnostics; full per-shard breakdown remains available on the PR run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

yperbasis requested review from lystopad and mriccobene as code owners May 27, 2026 08:01

yperbasis requested review from anacrolix, Copilot and taratorio May 27, 2026 09:49

Copilot started reviewing on behalf of yperbasis May 27, 2026 09:50

Copilot AI reviewed May 27, 2026

Comment thread .github/workflows/ci-gate.yml

Merge remote-tracking branch 'origin/main' into yperbasis/ci-gate-roo…

cac50cf

…tcause-visibility

Giulio2002 approved these changes May 27, 2026

yperbasis requested a review from Copilot May 27, 2026 10:28

yperbasis enabled auto-merge May 27, 2026 10:28

Copilot started reviewing on behalf of yperbasis May 27, 2026 10:28

Copilot AI reviewed May 27, 2026

taratorio approved these changes May 27, 2026

yperbasis and others added 2 commits May 27, 2026 13:47

Merge branch 'main' into yperbasis/ci-gate-rootcause-visibility

8d71f19

Merge branch 'main' into yperbasis/ci-gate-rootcause-visibility

5d0f9ce

yperbasis added this pull request to the merge queue May 27, 2026

Merged via the queue into main with commit d97c3be May 27, 2026
92 checks passed

yperbasis deleted the yperbasis/ci-gate-rootcause-visibility branch May 27, 2026 14:47

yperbasis mentioned this pull request May 28, 2026

ci: fail-fast hive-eest matrix on merge_group so broken PRs evict quickly #21483

Merged

yperbasis merged 5 commits into main from yperbasis/ci-gate-rootcause-visibility

4 participants
