ci: fail-fast ci-gate in merge queue by cancelling on leaf failure #20789
Merged
When any required job fails, the failing leaf calls `gh run cancel` to abort the run. Siblings flip to `cancelled`, and ci-gate (still `if: always()`) runs immediately and reports failure, so the merge queue dequeues the broken PR minutes earlier instead of waiting for the slowest sub-workflow.

Gated on `github.event_name == 'merge_group'` so PR runs still execute every leaf to completion and authors see the full failure picture.

Workflow-level `actions: write` on ci-gate.yml is inherited by the reusable leaves so they can call the cancel endpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
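The leaf-side step described above might look roughly like the following. This is a minimal sketch, not the PR's actual diff: the step name and `if:` condition come from the description, while the job id, runner, test command, and the use of `GH_TOKEN` with the default token are assumptions.

```yaml
# Sketch of a reusable leaf workflow. The job id, runner, and test command
# are placeholders, not taken from the PR.
on:
  workflow_call:

jobs:
  tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run tests
        run: make test

      # Runs only if an earlier step in this job failed AND the run was
      # triggered by the merge queue; PR runs skip it.
      - name: Cancel workflow run on failure
        if: failure() && github.event_name == 'merge_group'
        env:
          GH_TOKEN: ${{ github.token }}   # gh needs a token with actions: write
        run: gh run cancel "$GITHUB_RUN_ID" --repo "$GITHUB_REPOSITORY"
```

Because a called workflow sees the caller's `github` context, `github.event_name` here reflects the event that started ci-gate, not `workflow_call`.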
Pull request overview
This PR aims to make merge-queue (merge_group) CI fail fast by having the first failing “leaf” reusable workflow cancel the current run, causing sibling jobs to transition to cancelled quickly so the ci-gate aggregator can finish sooner.
Changes:
- Add a “Cancel workflow run on failure” step to multiple leaf reusable workflows, gated to `failure() && github.event_name == 'merge_group'`.
- Grant `actions: write` (and `contents: read`) permissions in `ci-gate.yml` so called workflows can cancel runs via the GitHub API (through `gh run cancel`).
- Preserve existing PR behavior by only canceling in merge-queue runs (PR runs continue to completion).
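On the caller side, the permission grant listed above could be sketched as follows. The two leaf file names are taken from the changed-files table; the job ids and the trigger list are assumptions made for illustration.

```yaml
# Sketch of the caller side in ci-gate.yml. Only two leaves are shown;
# the job ids are placeholders.
on:
  pull_request:
  merge_group:

# Workflow-level permissions apply to every job, including the jobs of
# called reusable workflows, which cannot be granted more than the caller has.
permissions:
  actions: write   # required to cancel a workflow run
  contents: read   # required for checkout

jobs:
  lint:
    uses: ./.github/workflows/lint.yml
  test-all-erigon:
    uses: ./.github/workflows/test-all-erigon.yml
```

Declaring the permissions once at workflow level avoids repeating them on every `uses:` job.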
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| .github/workflows/test-kurtosis-assertoor.yml | Cancel merge-queue run on leaf failure. |
| .github/workflows/test-integration-caplin.yml | Cancel merge-queue run on leaf failure (Linux + Windows jobs). |
| .github/workflows/test-hive.yml | Cancel merge-queue run on leaf failure (self-hosted hive runner group). |
| .github/workflows/test-hive-eest.yml | Cancel merge-queue run on leaf failure (self-hosted hive runner group). |
| .github/workflows/test-bench.yml | Cancel merge-queue run on leaf failure. |
| .github/workflows/test-all-erigon.yml | Cancel merge-queue run on leaf failure. |
| .github/workflows/test-all-erigon-race.yml | Cancel merge-queue run on leaf failure (matrix loader + test job). |
| .github/workflows/sonar.yml | Cancel merge-queue run on leaf failure. |
| .github/workflows/reproducible-build.yml | Cancel merge-queue run on leaf failure. |
| .github/workflows/lint.yml | Cancel merge-queue run on leaf failure. |
| .github/workflows/ci-gate.yml | Add actions: write permission so leaves can cancel runs. |
| .github/workflows/check-large-files.yml | Cancel merge-queue run on leaf failure. |
mriccobene approved these changes on Apr 24, 2026
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request on May 27, 2026
…igontech#21445)

## Problem

The merge-queue fail-fast optimization (erigontech#20789) cancels the entire CI Gate run as soon as one leaf job fails, so the gate doesn't stall ~30 min waiting for the heavy jobs (hive/kurtosis/eest) to finish.

The downside: `gh run cancel` cancels the *whole* run including the leaf that called it, so the real culprit's conclusion flips `failure → cancelled`, visually identical to every innocent sibling. The `ci-gate` aggregator then lumps `failure` and `cancelled` together and prints only job names, so finding the actual failure means drilling into step-level conclusions by hand.

This surfaced when erigontech#21426 (a green, approved one-line PR) was silently evicted from the merge queue: a flaky data race in `TestHistoryVerification_SimpleBlocks` tripped the fast-cancel, but the run showed a sea of "cancelled" jobs with no indication which one failed.

## Fix

Keep the latency win; make the root cause prominent.

- **Each leaf** emits a GitHub `::error` annotation right before `gh run cancel`. Only the *true trigger* reaches this step (collateral jobs have it `skipped` by the in-progress cancellation), so the annotation is attributed to the actual failing job and shows at the top of the run and in the PR Checks tab.
- **The `ci-gate` aggregator** now names the root cause instead of dumping ambiguous job names. It identifies the trigger precisely, as the job whose "Cancel workflow run on failure" step actually ran (`success`), and annotates it plus its failing step. Falls back to listing all genuinely-failed jobs on `pull_request` runs (where no fast-cancel fires).

## Validation

- `actionlint` clean on all changed files (no new findings; the pre-existing shellcheck infos are untouched).
- Replayed the aggregator's query against the real failed run that evicted erigontech#21426: it now outputs exactly `::error … race-tests … (execution-other, serial) — failed step: Run execution-other tests`, isolating the real culprit and excluding the 4 collateral `hive-eest` failures.

No behavior change to the fast-cancel itself; only added visibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
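The two pieces of that follow-up fix can be sketched in one file. This is an illustration under stated assumptions rather than the follow-up commit's code: the leaf is inlined as an ordinary job (in the repository the leaves are separate reusable workflows), the annotation titles and wording are invented, and the aggregator query is one plausible way to find "the job whose cancel step ran" using the jobs listing of the REST API.

```yaml
# Illustrative only: the leaf is inlined here as a normal job to keep the
# sketch in one file. Job ids, annotation text, and "make test" are placeholders.
on:
  merge_group:

permissions:
  actions: write
  contents: read

jobs:
  leaf:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test
      - name: Cancel workflow run on failure
        if: failure() && github.event_name == 'merge_group'
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # Annotate first, so the message is attached to the job that really failed.
          echo "::error title=Fast-cancel trigger::${GITHUB_JOB} failed; cancelling the run"
          gh run cancel "$GITHUB_RUN_ID" --repo "$GITHUB_REPOSITORY"

  ci-gate:
    needs: [leaf]
    if: always()
    runs-on: ubuntu-latest
    steps:
      - name: Name the root cause
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # Select jobs whose cancel step completed successfully, then report
          # the job name and its first failed step (or "unknown" if none is listed).
          gh api --paginate "repos/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}/jobs" \
            --jq '.jobs[]
              | select(any(.steps[]?; .name == "Cancel workflow run on failure" and .conclusion == "success"))
              | "::error title=Root cause::\(.name), failed step: \([.steps[] | select(.conclusion == "failure") | .name] | first // "unknown")"'
```

Matching on the cancel step's conclusion rather than the job's conclusion is what separates the trigger from collateral jobs, since both end up `cancelled` at job level.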
Summary
- When any required job fails, the failing leaf calls `gh run cancel` on the current run. Siblings flip to `cancelled` within seconds, and the `ci-gate` job (still `if: always()` + `needs: [...]`) runs immediately and reports failure, so the queue dequeues the broken PR instead of waiting for the slowest sub-workflow to finish.
- Gated on `github.event_name == 'merge_group'` so PR runs still execute every leaf to completion; authors see every failure, not just the first one.
- Workflow-level `permissions: { actions: write, contents: read }` on `ci-gate.yml` so the reusable leaves inherit the permission needed to call the cancel endpoint.

Why not a polling / needs-less approach
GitHub's `needs:` semantics require every listed need to reach a terminal state before the dependent job can run, so a `needs: [...]` + `if: always()` gate cannot natively fail fast. The only way to make siblings terminate early is to cancel them, so that's what this does, from whichever leaf fails first. The `ci-gate` job itself is unchanged.

Test plan
- […] `failure` within seconds of the first leaf failing rather than waiting for the slowest sub-workflow.
- […] `merge_group`).
- […] `if` still fires only in `merge_group`, so fork-PR cancel permissions are a non-issue.

🤖 Generated with Claude Code
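For reference, a `needs:` + `if: always()` gate of the kind discussed under "Why not a polling / needs-less approach" can be sketched as below. The leaf file names come from the changed-files table; the job ids, step name, and failure message are assumptions, and the real `ci-gate` job may check results differently.

```yaml
# Sketch of an aggregator gate. It starts only after every job in needs:
# has reached a terminal state (success, failure, cancelled, or skipped).
jobs:
  lint:
    uses: ./.github/workflows/lint.yml
  test-all-erigon:
    uses: ./.github/workflows/test-all-erigon.yml

  ci-gate:
    needs: [lint, test-all-erigon]
    if: always()   # run even when a need failed or was cancelled
    runs-on: ubuntu-latest
    steps:
      - name: Fail the gate unless every need succeeded
        if: contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled')
        run: |
          echo "At least one required job failed or was cancelled"
          exit 1
```

With this shape the gate cannot start while any need is still running, which is why cancelling the siblings is the lever that shortens the wait.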