ci: surface merge-queue root cause when fail-fast cancels the run #21445

Merged
yperbasis merged 5 commits into main from yperbasis/ci-gate-rootcause-visibility
May 27, 2026

@yperbasis


Problem

The merge-queue fail-fast optimization (#20789) cancels the entire CI Gate run as soon as one leaf job fails, so the gate doesn't stall ~30 min waiting for the heavy jobs (hive/kurtosis/eest) to finish. The downside: gh run cancel cancels the whole run including the leaf that called it, so the real culprit's conclusion flips failure → cancelled — visually identical to every innocent sibling. The ci-gate aggregator then lumps failure and cancelled together and prints only job names, so finding the actual failure means drilling into step-level conclusions by hand.

This surfaced when #21426 (a green, approved one-line PR) was silently evicted from the merge queue: a flaky data race in TestHistoryVerification_SimpleBlocks tripped the fast-cancel, but the run showed a sea of "cancelled" jobs with no indication which one failed.

Fix

Keep the latency win; make the root cause prominent.

  • Each leaf emits a GitHub ::error annotation right before gh run cancel. Only the true trigger reaches this step — collateral jobs have it skipped by the in-progress cancellation — so the annotation is attributed to the actual failing job and shows at the top of the run + in the PR Checks tab.
  • The ci-gate aggregator now names the root cause instead of dumping ambiguous job names. It identifies the trigger precisely — the job whose "Cancel workflow run on failure" step actually ran (success) — and annotates it plus its failing step. Falls back to listing all genuinely-failed jobs on pull_request runs (where no fast-cancel fires).
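
As a rough illustration of the first bullet, a leaf job's annotate-then-cancel step might look like the sketch below. The step name and the annotation title are taken from this PR's text; the job id, the test command, the message wording, and the choice to put both commands in one step are assumptions made for illustration, not the actual workflow contents.

```yaml
# Hypothetical leaf workflow fragment (sketch, not the PR's real file).
jobs:
  tests:                          # placeholder job id
    runs-on: ubuntu-latest
    permissions:
      contents: read              # for checkout
      actions: write              # gh run cancel needs write access to Actions
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: make test            # placeholder test command
      - name: Cancel workflow run on failure
        # Only reached by the job that actually failed; collateral jobs are
        # already being cancelled and skip this step.
        if: ${{ failure() && github.event_name == 'merge_group' }}
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # Annotate first, so the message is recorded before the run is torn down.
          echo "::error title=Merge-queue root-cause failure::${GITHUB_JOB} failed and is cancelling run ${GITHUB_RUN_ID}"
          gh run cancel "${GITHUB_RUN_ID}" --repo "${GITHUB_REPOSITORY}"
```

Because the annotation is written by the failing job itself, GitHub attributes it to that job even after its conclusion flips to cancelled.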

Validation

  • actionlint clean on all changed files (no new findings; the pre-existing shellcheck infos are untouched).
  • Replayed the aggregator's query against the real failed run that evicted #21426 (node/cli: register --rpc.logs.maxresults in DefaultFlags so it takes effect via CLI). It now outputs exactly
    ::error … race-tests … (execution-other, serial) — failed step: Run execution-other tests,
    isolating the real culprit and excluding the 4 collateral hive-eest failures.
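
The aggregator query described above could be approximated as follows. This is a sketch under stated assumptions: the `needs` list, the annotation title `CI root cause`, and the exact message format are hypothetical, and the real `ci-gate.yml` may structure its jq differently. It relies only on the workflow jobs REST endpoint and on the discriminator named in this PR (the "Cancel workflow run on failure" step concluding with `success`).

```yaml
# Hypothetical ci-gate fragment (sketch, not the PR's real file).
jobs:
  ci-gate:
    if: ${{ always() }}
    needs: [tests, race-tests]    # placeholder subset of the leaf jobs
    runs-on: ubuntu-latest
    permissions:
      actions: read               # needed to list the run's jobs
    steps:
      - name: Annotate root cause
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          # Stream every job of this run (paginated), then slurp into one array.
          gh api --paginate \
            "repos/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}/jobs" \
            --jq '.jobs[]' |
          jq -rs '
            def failing_step:
              [.steps[]? | select(.conclusion == "failure") | .name] | first // "unknown";
            # Trigger: the job whose cancel step actually ran to success.
            [ .[] | select(any(.steps[]?;
                .name == "Cancel workflow run on failure" and .conclusion == "success")) ] as $triggers
            | (if ($triggers | length) > 0 then $triggers
               # Fallback (e.g. pull_request runs): every job with a failed step.
               else [ .[] | select(any(.steps[]?; .conclusion == "failure")) ] end)
            | .[]
            | "::error title=CI root cause::\(.name): failed step: \(failing_step)"
          '
```

Jobs that were cancelled as collateral never execute the cancel step, so they are excluded from `$triggers` even if one of their own steps was marked failed during teardown.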

No behavior change to the fast-cancel itself — only added visibility.

🤖 Generated with Claude Code

The merge-queue fail-fast (#20789) cancels the whole CI Gate run on the
first leaf failure. That flips the failing leaf's conclusion to
"cancelled" — indistinguishable from the innocent siblings — and the
ci-gate aggregator lumps failure and cancelled together, so the real
culprit is invisible without digging into step-level logs.

Surface it instead:
- each leaf emits an ::error annotation before cancelling (only the true
  trigger runs this step; collateral jobs are skipped mid-cancellation)
- the ci-gate aggregator names the root-cause leaf via the
  cancel-step-succeeded discriminator plus its failing step, falling back
  to all failed-step jobs on PR runs

Keeps the fail-fast latency win intact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR improves CI failure diagnosis for merge-queue runs that use fail-fast cancellation by surfacing the true first-failing job (the one that triggered gh run cancel) via GitHub ::error annotations and more targeted ci-gate output.

Changes:

  • Add a ::error annotation to each leaf workflow immediately before it fast-cancels the run in merge_group failures.
  • Enhance ci-gate to query the workflow jobs API and emit an annotation identifying the fast-cancel trigger job (and its failing step), with a fallback that reports genuinely failing jobs when no fast-cancel trigger is present.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `.github/workflows/test-kurtosis-assertoor.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| `.github/workflows/test-integration-caplin.yml` | Emit an ::error annotation before fast-cancelling (Linux + Windows jobs). |
| `.github/workflows/test-hive.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| `.github/workflows/test-hive-eest.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| `.github/workflows/test-eest-spec.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| `.github/workflows/test-bench.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| `.github/workflows/test-all-erigon.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| `.github/workflows/test-all-erigon-race.yml` | Emit an ::error annotation before fast-cancelling (multiple jobs). |
| `.github/workflows/sonar.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| `.github/workflows/reproducible-build.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| `.github/workflows/lint.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |
| `.github/workflows/ci-gate.yml` | Query run jobs to identify/annotate the true fast-cancel root cause (and failing step). |
| `.github/workflows/check-large-files.yml` | Emit an ::error annotation before fast-cancelling merge-queue runs. |

Comment thread .github/workflows/ci-gate.yml

@Giulio2002 Giulio2002 left a comment


LGTM — small, obviously safe CI workflow tweak that adds root-cause visibility before fail-fast cancellation in merge-queue runs.

The ci-gate aggregator's fallback fires on every pull_request failure
(no fast-cancel there), so titling its annotation "Merge-queue root
cause" mislabeled plain PR failures. Use a neutral title in both jq
branches; merge-queue context is still carried by the leaf
"Merge-queue root-cause failure" annotations, which only fire in
merge_group.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.

@yperbasis yperbasis added this pull request to the merge queue May 27, 2026
Merged via the queue into main with commit d97c3be May 27, 2026
92 checks passed
@yperbasis yperbasis deleted the yperbasis/ci-gate-rootcause-visibility branch May 27, 2026 14:47
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 28, 2026
…ckly (erigontech#21483)

## Problem

When a merge-queue run has a hive-eest shard fail, the failing job calls
`gh run cancel ${{ github.run_id }}` (added in erigontech#21445). That sends
SIGTERM to all in-flight matrix siblings, but the Docker-bound hive
simulators take ~20 minutes to actually drain. `ci-gate` is `if:
always()` and waits for every `needs` job to reach a terminal state, so
the broken PR sits at `AWAITING_CHECKS` for the full drain time —
blocking the head of the merge queue.

Concrete example from today (PR erigontech#21470 at position #1):

- 08:29:57 — `hive-eest / test-hive-eest (paris+shanghai, serial)`
fails, calls `gh run cancel 26562610423`, emits the "Merge-queue
root-cause failure" annotation from erigontech#21445.
- 08:48 (~19 min later) — paris+shanghai-parallel,
prague-serial/parallel, cancun-serial/parallel, osaka-parallel,
rlp-serial/parallel, and glamsterdam-devnet-parallel were all still
`in_progress`. Every other ci-gate child (tests, race-tests,
eest-spec-tests, kurtosis, hive, lint, bench, repro, sonar, caplin) had
already completed.

The bottleneck was specifically the hive-eest matrix siblings.

## Fix

```yaml
strategy:
  fail-fast: ${{ github.event_name == 'merge_group' }}
```

- **In `merge_group`**: first failed shard immediately cancels all
siblings at the GitHub API layer — much faster than the `gh run cancel`
→ SIGTERM → runner-drain path. ci-gate's `needs` reach terminal state in
seconds, ci-gate fails, the broken PR is evicted.
- **In PR runs**: stays `false`, so authors still see the full failure
breakdown across every shard. No regression in PR feedback.

## What's left in place and why

The per-job `gh run cancel` step (test-hive-eest.yml lines 311-317)
stays. Two reasons:

- Matrix `fail-fast` only cancels siblings **within the same matrix** —
it doesn't cancel sibling reusable workflows. If a future failure
pattern leaks across workflows, `gh run cancel` still covers it.
- ci-gate.yml's root-cause annotator (line 188) keys off "the leaf that
ran `gh run cancel` successfully" to single out the true root cause
among collateral cancellations. Removing the step would silently regress
erigontech#21445's attribution.
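
For orientation, the two mechanisms can coexist in one job roughly as sketched here. The `fail-fast` expression and the step name come from the text above; the matrix values, the simulator command, and the message wording are placeholders, not the contents of `test-hive-eest.yml`.

```yaml
# Hypothetical fragment (sketch): matrix fail-fast plus the retained cancel step.
jobs:
  test-hive-eest:
    runs-on: ubuntu-latest
    strategy:
      # Cancels same-matrix siblings at the API layer, merge queue only.
      fail-fast: ${{ github.event_name == 'merge_group' }}
      matrix:
        shard: [shard-a, shard-b]                  # placeholder shard names
    steps:
      - name: Run simulator
        run: ./run-shard.sh "${{ matrix.shard }}"  # placeholder command
      - name: Cancel workflow run on failure
        # Kept: covers sibling workflows and marks this job as the trigger
        # for the ci-gate root-cause annotator.
        if: ${{ failure() && github.event_name == 'merge_group' }}
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          echo "::error title=Merge-queue root-cause failure::${GITHUB_JOB} (${{ matrix.shard }}) failed"
          gh run cancel "${GITHUB_RUN_ID}" --repo "${GITHUB_REPOSITORY}"
```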

## Scope choice

Only `test-hive-eest.yml` is changed. Other matrix-bearing reusable
workflows (`test-all-erigon.yml`, `test-all-erigon-race.yml`,
`test-eest-spec.yml`, `test-kurtosis-assertoor.yml`, `test-hive.yml`,
`test-bench.yml`) all use `fail-fast: false` too, but none of them were
the queue-blocking long pole in this incident. Keeping the patch
minimal; we can generalize if another workflow becomes the bottleneck.

## Tradeoff to be aware of

Queue runs will now show siblings as `cancelled` instead of `failed`
whenever any one shard fails. That's the correct tradeoff in
`merge_group` — the goal is fast eviction, not detailed diagnostics;
full per-shard breakdown remains available on the PR run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

4 participants