ci: surface merge-queue root cause when fail-fast cancels the run #21445
Merged
The merge-queue fail-fast (#20789) cancels the whole CI Gate run on the first leaf failure. That flips the failing leaf's conclusion to "cancelled", indistinguishable from the innocent siblings, and the ci-gate aggregator lumps failure and cancelled together, so the real culprit is invisible without digging into step-level logs.

Surface it instead:
- each leaf emits an ::error annotation before cancelling (only the true trigger runs this step; collateral jobs are skipped mid-cancellation)
- the ci-gate aggregator names the root-cause leaf via the cancel-step-succeeded discriminator plus its failing step, falling back to all failed-step jobs on PR runs

Keeps the fail-fast latency win intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR improves CI failure diagnosis for merge-queue runs that use fail-fast cancellation by surfacing the true first-failing job (the one that triggered gh run cancel) via GitHub ::error annotations and more targeted ci-gate output.
Changes:
- Add a `::error` annotation to each leaf workflow immediately before it fast-cancels the run in `merge_group` failures.
- Enhance `ci-gate` to query the workflow jobs API and emit an annotation identifying the fast-cancel trigger job (and its failing step), with a fallback that reports genuinely failing jobs when no fast-cancel trigger is present.
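The first change can be pictured with a minimal leaf-job sketch. The step name "Cancel workflow run on failure" and the annotation title "Merge-queue root-cause failure" are taken from this PR's discussion; the job id, the test command, the message wording, and the permissions block are assumptions for illustration, not the merged workflow.

```yaml
# Hypothetical leaf workflow job; only the last step reflects the pattern described above.
jobs:
  tests:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      actions: write        # assumed: lets the job token cancel the run
    steps:
      - uses: actions/checkout@v4

      - name: Run tests
        run: make test      # placeholder for the real test command

      # Runs only if an earlier step failed AND the run belongs to the merge queue.
      - name: Cancel workflow run on failure
        if: ${{ failure() && github.event_name == 'merge_group' }}
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # Emit the annotation first so it is attached to this job before the cancel lands.
          echo "::error title=Merge-queue root-cause failure::Job '${{ github.job }}' failed and triggered the fast-cancel"
          gh run cancel ${{ github.run_id }} --repo ${{ github.repository }}
```

Because the step is gated on `failure()`, jobs that are merely cancelled as collateral never execute it, which is what makes the annotation a reliable marker of the true trigger.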
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/test-kurtosis-assertoor.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| .github/workflows/test-integration-caplin.yml | Emit an ::error annotation before fast-cancelling (Linux + Windows jobs). |
| .github/workflows/test-hive.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| .github/workflows/test-hive-eest.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| .github/workflows/test-eest-spec.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| .github/workflows/test-bench.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| .github/workflows/test-all-erigon.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| .github/workflows/test-all-erigon-race.yml | Emit an ::error annotation before fast-cancelling (multiple jobs). |
| .github/workflows/sonar.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| .github/workflows/reproducible-build.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| .github/workflows/lint.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| .github/workflows/ci-gate.yml | Query run jobs to identify/annotate the true fast-cancel root cause (and failing step). |
| .github/workflows/check-large-files.yml | Emit an ::error annotation before fast-cancelling merge-queue runs. |
…tcause-visibility
Giulio2002
approved these changes
May 27, 2026
Giulio2002
left a comment
LGTM — small, obviously safe CI workflow tweak that adds root-cause visibility before fail-fast cancellation in merge-queue runs.
The ci-gate aggregator's fallback fires on every pull_request failure (no fast-cancel there), so titling its annotation "Merge-queue root cause" mislabeled plain PR failures. Use a neutral title in both jq branches; merge-queue context is still carried by the leaf "Merge-queue root-cause failure" annotations, which only fire in merge_group.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
taratorio
approved these changes
May 27, 2026
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request on May 28, 2026
…ckly (erigontech#21483)

## Problem

When a merge-queue run has a hive-eest shard fail, the failing job calls `gh run cancel ${{ github.run_id }}` (added in erigontech#21445). That sends SIGTERM to all in-flight matrix siblings, but the Docker-bound hive simulators take ~20 minutes to actually drain. `ci-gate` is `if: always()` and waits for every `needs` job to reach a terminal state, so the broken PR sits at `AWAITING_CHECKS` for the full drain time, blocking the head of the merge queue.

Concrete example from today (PR erigontech#21470 at position #1):

- 08:29:57: `hive-eest / test-hive-eest (paris+shanghai, serial)` fails, calls `gh run cancel 26562610423`, emits the "Merge-queue root-cause failure" annotation from erigontech#21445.
- 08:48 (~19 min later): paris+shanghai-parallel, prague-serial/parallel, cancun-serial/parallel, osaka-parallel, rlp-serial/parallel, and glamsterdam-devnet-parallel were all still `in_progress`. Every other ci-gate child (tests, race-tests, eest-spec-tests, kurtosis, hive, lint, bench, repro, sonar, caplin) had already completed.

The bottleneck was specifically the hive-eest matrix siblings.

## Fix

```yaml
strategy:
  fail-fast: ${{ github.event_name == 'merge_group' }}
```

- **In `merge_group`**: first failed shard immediately cancels all siblings at the GitHub API layer, much faster than the `gh run cancel` → SIGTERM → runner-drain path. ci-gate's `needs` reach terminal state in seconds, ci-gate fails, the broken PR is evicted.
- **In PR runs**: stays `false`, so authors still see the full failure breakdown across every shard. No regression in PR feedback.

## What's left in place and why

The per-job `gh run cancel` step (test-hive-eest.yml lines 311-317) stays. Two reasons:

- Matrix `fail-fast` only cancels siblings **within the same matrix**; it doesn't cancel sibling reusable workflows. If a future failure pattern leaks across workflows, `gh run cancel` still covers it.
- ci-gate.yml's root-cause annotator (line 188) keys off "the leaf that ran `gh run cancel` successfully" to single out the true root cause among collateral cancellations. Removing the step would silently regress erigontech#21445's attribution.

## Scope choice

Only `test-hive-eest.yml` is changed. Other matrix-bearing reusable workflows (`test-all-erigon.yml`, `test-all-erigon-race.yml`, `test-eest-spec.yml`, `test-kurtosis-assertoor.yml`, `test-hive.yml`, `test-bench.yml`) all use `fail-fast: false` too, but none of them were the queue-blocking long pole in this incident. Keeping the patch minimal; we can generalize if another workflow becomes the bottleneck.

## Tradeoff to be aware of

Queue runs will now show siblings as `cancelled` instead of `failed` whenever any one shard fails. That's the correct tradeoff in `merge_group`: the goal is fast eviction, not detailed diagnostics; full per-shard breakdown remains available on the PR run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
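The waiting behaviour described in the Problem section above follows from how a gate job combines `needs` with `if: always()`. The following is a rough sketch of that shape, not the actual `ci-gate.yml`: the job ids, the two-child subset, and the error message are illustrative, and the referenced workflows are assumed to be callable via `workflow_call`.

```yaml
# Illustrative gate layout; the real ci-gate.yml has many more children.
name: CI Gate
on:
  pull_request:
  merge_group:

jobs:
  lint:
    uses: ./.github/workflows/lint.yml
  hive-eest:
    uses: ./.github/workflows/test-hive-eest.yml

  ci-gate:
    needs: [lint, hive-eest]
    # always() lets the gate run even after a failure or cancellation upstream,
    # but it still cannot start until every job in `needs` is in a terminal state.
    if: ${{ always() }}
    runs-on: ubuntu-latest
    steps:
      - name: Fail if any child did not succeed
        # failure and cancelled are treated alike here, which is why a
        # cancelled root cause looks the same as a cancelled bystander.
        if: ${{ contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') }}
        run: |
          echo "::error::One or more required jobs failed or were cancelled"
          exit 1
```

With this shape, a slow-draining child delays the gate's verdict for as long as it takes to stop, which is the latency the conditional matrix `fail-fast` is meant to remove.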
Problem
The merge-queue fail-fast optimization (#20789) cancels the entire CI Gate run as soon as one leaf job fails, so the gate doesn't stall ~30 min waiting for the heavy jobs (hive/kurtosis/eest) to finish. The downside:

`gh run cancel` cancels the whole run including the leaf that called it, so the real culprit's conclusion flips `failure → cancelled`, visually identical to every innocent sibling. The `ci-gate` aggregator then lumps `failure` and `cancelled` together and prints only job names, so finding the actual failure means drilling into step-level conclusions by hand.

This surfaced when #21426 (a green, approved one-line PR) was silently evicted from the merge queue: a flaky data race in `TestHistoryVerification_SimpleBlocks` tripped the fast-cancel, but the run showed a sea of "cancelled" jobs with no indication which one failed.

Fix
Keep the latency win; make the root cause prominent.

- Each leaf emits an `::error` annotation right before `gh run cancel`. Only the true trigger reaches this step (collateral jobs have it `skipped` by the in-progress cancellation), so the annotation is attributed to the actual failing job and shows at the top of the run and in the PR Checks tab.
- The `ci-gate` aggregator now names the root cause instead of dumping ambiguous job names. It identifies the trigger precisely, as the job whose "Cancel workflow run on failure" step actually ran (`success`), and annotates it plus its failing step. It falls back to listing all genuinely-failed jobs on `pull_request` runs (where no fast-cancel fires).

Validation
- `actionlint` clean on all changed files (no new findings; the pre-existing shellcheck infos are untouched).
- Output: `::error … race-tests … (execution-other, serial) — failed step: Run execution-other tests`, isolating the real culprit and excluding the 4 collateral `hive-eest` failures.

No behavior change to the fast-cancel itself, only added visibility.
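The aggregator logic described under Fix can be sketched roughly as follows. This is an approximation under stated assumptions, not the merged `ci-gate.yml`: the annotation title "CI root cause", the job layout, and the exact jq program are hypothetical. Only the step name "Cancel workflow run on failure" and the two-branch behaviour (trigger first, failed-step jobs as fallback) come from the description.

```yaml
# Sketch only: title, layout, and jq program are assumptions.
jobs:
  ci-gate:
    if: ${{ always() }}
    runs-on: ubuntu-latest
    permissions:
      actions: read         # assumed: needed to list the run's jobs
    env:
      GH_TOKEN: ${{ github.token }}
    steps:
      - name: Annotate root cause
        run: |
          # Collect every job of this run, with its steps, into one JSON array.
          jobs_json=$(gh api --paginate \
            "repos/${{ github.repository }}/actions/runs/${{ github.run_id }}/jobs" \
            --jq '.jobs[]' | jq -s '.')

          jq -r '
            # Name of the first step that failed in a job, if any.
            def failed_step:
              first(.steps[]? | select(.conclusion == "failure") | .name) // "unknown";
            # True only for the job whose cancel step actually executed.
            def ran_cancel:
              any(.steps[]?; .name == "Cancel workflow run on failure" and .conclusion == "success");
            def has_failed_step:
              any(.steps[]?; .conclusion == "failure");

            map(select(ran_cancel)) as $triggers
            | (if ($triggers | length) > 0
               then $triggers                     # merge-queue run: the fast-cancel trigger
               else map(select(has_failed_step))  # PR run: every job with a failed step
               end)
            | .[]
            | "::error title=CI root cause::\(.name): failed step: \(failed_step)"
          ' <<<"$jobs_json"
```

Keying on the cancel step's `success` conclusion rather than on the job's own conclusion is the design choice that matters: after `gh run cancel`, the job-level conclusion is `cancelled` for trigger and bystanders alike, while step-level conclusions still differ.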
🤖 Generated with Claude Code