ci: fail-fast ci-gate in merge queue by cancelling on leaf failure #20789

Merged: yperbasis merged 1 commit into main from yperbasis/ci-gate-fail-fast on Apr 24, 2026

@yperbasis

Member

Summary

  • When any required leaf workflow (lint, tests, race-tests, hive, caplin, etc.) fails in a merge-queue run, the failing leaf now calls `gh run cancel` on the current run. Siblings flip to `cancelled` within seconds, and the `ci-gate` job (still `if: always()` + `needs: [...]`) runs immediately and reports failure, so the queue dequeues the broken PR instead of waiting for the slowest sub-workflow to finish.
  • Gated on `github.event_name == 'merge_group'` so PR runs still execute every leaf to completion: authors see every failure, not just the first one.
  • Workflow-level `permissions: { actions: write, contents: read }` on `ci-gate.yml` so the reusable leaves inherit the permission needed to call the cancel endpoint.
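
A minimal sketch of the two pieces described above, assuming the leaves are invoked from `ci-gate.yml` as reusable workflows. The step name is taken from the review summary further down; the exact `gh` invocation and the `GH_TOKEN` wiring are assumptions, not copied from the diff.

```yaml
# .github/workflows/ci-gate.yml (caller)
# Workflow-level permissions cap what the called leaves' GITHUB_TOKEN may do.
permissions:
  actions: write   # required to cancel a workflow run
  contents: read
---
# Fragment of a job in a leaf reusable workflow: the cancel step goes last.
steps:
  # ... the leaf's real build/test steps come first ...
  - name: Cancel workflow run on failure
    # failure() is true once an earlier step in this job has failed.
    if: failure() && github.event_name == 'merge_group'
    env:
      GH_TOKEN: ${{ github.token }}   # gh reads its token from GH_TOKEN
    run: gh run cancel "${{ github.run_id }}" --repo "${{ github.repository }}"
```

In a called workflow the `github` context reflects the caller's triggering event, so the same leaf file cancels in merge-queue runs and stays passive in PR runs without any extra inputs.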

Why not a polling / needs-less approach

GitHub's `needs:` semantics require every listed need to reach a terminal state before the dependent job can run, so a `needs: [...]` + `if: always()` gate cannot natively fail fast. The only way to make siblings terminate early is to cancel them, so that's what this does, from whichever leaf fails first. The `ci-gate` job itself is unchanged.
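
For reference, the gate pattern under discussion looks roughly like the sketch below. The job names and the result check are illustrative assumptions; only the `needs: [...]` + `if: always()` shape comes from the description above.

```yaml
# Sketch of a needs-based gate (job names hypothetical).
on:
  pull_request:
  merge_group:

jobs:
  lint:
    uses: ./.github/workflows/lint.yml
  tests:
    uses: ./.github/workflows/test-all-erigon.yml

  ci-gate:
    # Starts only after every listed job is terminal
    # (success, failure, cancelled, or skipped).
    needs: [lint, tests]
    if: always()   # run even when a needed job failed or was cancelled
    runs-on: ubuntu-latest
    steps:
      - name: Fail if any needed job failed or was cancelled
        if: contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled')
        run: exit 1
```

Cancelling the run pushes the still-running siblings to `cancelled`, which is a terminal state, so the gate's `needs:` are satisfied early and the gate can report failure right away.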

Test plan

  • After merge, observe a real merge-queue failure and confirm `ci-gate` surfaces failure within seconds of the first leaf failing rather than waiting for the slowest sub-workflow.
  • Confirm PR runs still run every leaf to completion (the cancel step is gated on `merge_group`).
  • Fork PRs: the gated `if` still fires only in `merge_group`, so fork-PR cancel permissions are a non-issue.

🤖 Generated with Claude Code

When any required job fails, the failing leaf calls `gh run cancel` to
abort the run. Siblings flip to `cancelled`, and ci-gate (still `if:
always()`) runs immediately and reports failure — merge queue dequeues
the broken PR minutes earlier instead of waiting for the slowest
sub-workflow.

Gated on `github.event_name == 'merge_group'` so PR runs still execute
every leaf to completion — authors see the full failure picture.
Workflow-level `actions: write` on ci-gate.yml is inherited by the
reusable leaves so they can call the cancel endpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Contributor

Pull request overview

This PR aims to make merge-queue (merge_group) CI fail fast by having the first failing “leaf” reusable workflow cancel the current run, causing sibling jobs to transition to cancelled quickly so the ci-gate aggregator can finish sooner.

Changes:

  • Add a “Cancel workflow run on failure” step to multiple leaf reusable workflows, gated to `failure() && github.event_name == 'merge_group'`.
  • Grant `actions: write` (and `contents: read`) permissions in `ci-gate.yml` so called workflows can cancel runs via the GitHub API (through `gh run cancel`).
  • Preserve existing PR behavior by only canceling in merge-queue runs (PR runs continue to completion).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `.github/workflows/test-kurtosis-assertoor.yml` | Cancel merge-queue run on leaf failure. |
| `.github/workflows/test-integration-caplin.yml` | Cancel merge-queue run on leaf failure (Linux + Windows jobs). |
| `.github/workflows/test-hive.yml` | Cancel merge-queue run on leaf failure (self-hosted hive runner group). |
| `.github/workflows/test-hive-eest.yml` | Cancel merge-queue run on leaf failure (self-hosted hive runner group). |
| `.github/workflows/test-bench.yml` | Cancel merge-queue run on leaf failure. |
| `.github/workflows/test-all-erigon.yml` | Cancel merge-queue run on leaf failure. |
| `.github/workflows/test-all-erigon-race.yml` | Cancel merge-queue run on leaf failure (matrix loader + test job). |
| `.github/workflows/sonar.yml` | Cancel merge-queue run on leaf failure. |
| `.github/workflows/reproducible-build.yml` | Cancel merge-queue run on leaf failure. |
| `.github/workflows/lint.yml` | Cancel merge-queue run on leaf failure. |
| `.github/workflows/ci-gate.yml` | Add `actions: write` permission so leaves can cancel runs. |
| `.github/workflows/check-large-files.yml` | Cancel merge-queue run on leaf failure. |

Comment thread .github/workflows/ci-gate.yml
@yperbasis enabled auto-merge April 24, 2026 13:19
@yperbasis added this pull request to the merge queue Apr 24, 2026
Merged via the queue into main with commit 9baae53 Apr 24, 2026
43 checks passed
@yperbasis deleted the yperbasis/ci-gate-fail-fast branch April 24, 2026 17:11
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 27, 2026
…igontech#21445)

## Problem

The merge-queue fail-fast optimization (erigontech#20789) cancels the entire CI
Gate run as soon as one leaf job fails, so the gate doesn't stall ~30
min waiting for the heavy jobs (hive/kurtosis/eest) to finish. The
downside: `gh run cancel` cancels the *whole* run including the leaf
that called it, so the real culprit's conclusion flips `failure →
cancelled` — visually identical to every innocent sibling. The `ci-gate`
aggregator then lumps `failure` and `cancelled` together and prints only
job names, so finding the actual failure means drilling into step-level
conclusions by hand.

This surfaced when erigontech#21426 (a green, approved one-line PR) was silently
evicted from the merge queue: a flaky data race in
`TestHistoryVerification_SimpleBlocks` tripped the fast-cancel, but the
run showed a sea of "cancelled" jobs with no indication which one
failed.

## Fix

Keep the latency win; make the root cause prominent.

- **Each leaf** emits a GitHub `::error` annotation right before `gh run
cancel`. Only the *true trigger* reaches this step — collateral jobs
have it `skipped` by the in-progress cancellation — so the annotation is
attributed to the actual failing job and shows at the top of the run +
in the PR Checks tab.
- **The `ci-gate` aggregator** now names the root cause instead of
dumping ambiguous job names. It identifies the trigger precisely — the
job whose "Cancel workflow run on failure" step actually ran (`success`)
— and annotates it plus its failing step. Falls back to listing all
genuinely-failed jobs on `pull_request` runs (where no fast-cancel
fires).
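
A sketch of the leaf-side change, assuming the annotation is emitted from the same step that cancels the run. The annotation title and message are hypothetical wording; `::error` is the standard workflow command for error annotations, and the `gh` invocation is an assumption rather than the committed line.

```yaml
steps:
  - name: Cancel workflow run on failure
    if: failure() && github.event_name == 'merge_group'
    env:
      GH_TOKEN: ${{ github.token }}
    run: |
      # Emit the annotation first so it is recorded before the cancellation lands.
      echo "::error title=Merge-queue fast-cancel trigger::Job '${{ github.job }}' failed and cancelled this run"
      gh run cancel "${{ github.run_id }}" --repo "${{ github.repository }}"
```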

## Validation

- `actionlint` clean on all changed files (no new findings; the
pre-existing shellcheck infos are untouched).
- Replayed the aggregator's query against the real failed run that
evicted erigontech#21426: it now outputs exactly
`::error … race-tests … (execution-other, serial) — failed step: Run
execution-other tests`,
isolating the real culprit and excluding the 4 collateral `hive-eest`
failures.
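
The aggregator's lookup could be expressed along these lines. This is a sketch, not the query from the commit: it assumes the REST endpoint that lists the jobs of a workflow run and matches on the step name quoted above. The step name `Report fast-cancel trigger` and the annotation wording are hypothetical.

```yaml
steps:
  - name: Report fast-cancel trigger
    if: always()
    env:
      GH_TOKEN: ${{ github.token }}
    run: |
      # List every job in this run, keep the one whose cancel step actually
      # ran to success, and print an ::error naming it and its failed steps.
      gh api --paginate \
        "repos/${{ github.repository }}/actions/runs/${{ github.run_id }}/jobs" \
        --jq '.jobs[]
              | select(any(.steps[]?; .name == "Cancel workflow run on failure" and .conclusion == "success"))
              | "::error title=Root cause::\(.name): failed step: \([.steps[] | select(.conclusion == "failure") | .name] | join(", "))"'
```

To replay against an earlier run, as in the validation above, the run ID in the path would be replaced with that run's ID.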

No behavior change to the fast-cancel itself — only added visibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>