tools: fail EEST spec shards on race-detector exit code 66#21640
Merged
Conversation
The Go race runtime reports detected races to stderr and overrides the process exit code to 66 (even through os.Exit(0)), while the run completes normally and the JSON report stays valid. run-eest-spec-test.sh discarded all exit codes with '|| true' and decided pass/fail from the JSON alone, so a *-race-* shard whose binary detected a data race still passed unless the race also corrupted a test result. Capture the exit code and fail the shard explicitly on 66; all other exit-code tolerance is unchanged (crashes still surface as missing/truncated JSON).
Member
Author
AskAlexSharov
approved these changes
Jun 6, 2026
taratorio
added a commit
that referenced
this pull request
Jun 8, 2026
…r branch
SonarCloud rejects a branch analysis dated older than the branch's
latest processed one. All merge-queue entries analyze the target branch,
so when entry N's report uploaded after entry N+1's, SonarCloud rejected
it and flipped the commit's SonarCloud check to cancelled ("Date of
analysis cannot be older than the date of the last known analysis").
Seen on 5027d25, where the scans of three stacked queue entries
(#21648, #21584, #21640) raced and #21648's analysis lost.
Restructure so branch analyses can never be submitted out of order:
- merge_group runs keep `make test-sonar-coverage` as a gate check but
no longer scan; they upload coverage-test-all.out as a short-retention
artifact instead.
- A new push-triggered SonarCloud Branch Scan workflow (main and
release/**, matching the branches SonarCloud tracks) downloads that
artifact — the merge queue fast-forwards the target branch to the
already-tested merge commit, so the pushed SHA equals the merge-queue
run's head SHA and the artifact maps to the commit exactly — and runs
only the scanner (~7 min instead of ~25). If no artifact exists
(direct push, expired retention), it falls back to running the tests.
- Scans are serialized per branch FIFO via `concurrency.queue: max`
(GitHub Actions, 2026-05), so every merged commit is analyzed, in
order, with no runs cancelled in merge bursts.
- PR scans (Sonar PR analyses, no main-branch date ordering) and
cache-warming runs are unchanged.
actionlint (≤ v1.7.12) does not know the `queue` key yet, so lint.yml
gains a narrow --ignore alongside the existing workspace one.
A failed analysis upload also no longer involves the gate: scanner
failures now surface on a post-merge run instead, while PR runs keep
exercising the scanner so breakage is still caught before merge.
Sahil-4555
pushed a commit
to Sahil-4555/erigon
that referenced
this pull request
Jun 8, 2026
…r branch (erigontech#21669) ## Problem SonarCloud rejects a branch analysis dated older than the branch's latest processed one. Every merge-queue entry analyzes the target branch, so when queue entry N's report uploads *after* entry N+1's, SonarCloud rejects N's with *"Date of analysis cannot be older than the date of the last known analysis"* and flips that commit's **SonarCloud Code Analysis** check to `cancelled`. Seen on [`5027d259c2`](erigontech@5027d25): three stacked queue entries (erigontech#21648, erigontech#21584, erigontech#21640) scanned concurrently; erigontech#21584 started 6s later but uploaded 21s earlier, so erigontech#21648's analysis was rejected as out of order. No impact on the gate (the scan step is `continue-on-error` and the merge succeeded), but it leaves a spurious failed check on a merged commit. ## Fix Make it structurally impossible to submit branch analyses out of order: - **`merge_group` runs** keep `make test-sonar-coverage` as a gate check but **no longer scan**. They upload `coverage-test-all.out` as a 7-day artifact. - **A new `SonarCloud Branch Scan` workflow** (`push` on `main` + `release/**`, matching the branches SonarCloud tracks) downloads that artifact and runs **only the scanner** (~7 min vs ~25). The merge queue fast-forwards the target branch to the already-tested merge commit, so the pushed SHA equals the merge-queue run's head SHA and the artifact maps to the commit exactly. If no artifact exists (direct push, or expired retention), it falls back to running the tests itself. - Branch scans are serialized per branch **FIFO** via `concurrency.queue: max` ([GitHub Actions, 2026-05](https://github.blog/changelog/2026-05-07-github-actions-concurrency-groups-now-allow-larger-queues/)), so every merged commit is analyzed, in order, with **no runs cancelled** during merge bursts. - **PR scans** (Sonar PR analyses — separate from main-branch date ordering) and **cache-warming** runs are unchanged. A failed analysis upload no longer touches the gate at all: scanner failures now surface on a post-merge run, while PR runs keep exercising the scanner so config/scanner breakage is still caught before merge. ## Notes - `actionlint` (≤ v1.7.12, the latest release) does not know the `queue` concurrency key yet, so `lint.yml` gets a narrow `--ignore` alongside the existing `workspace` one. Removable once actionlint learns the key. - `concurrency.queue: max` is ~1 month old — first real merge-burst after this lands is where FIFO ordering gets exercised live. ## Testing - `make lint`, `actionlint` (CI flags), and `zizmor` (repo config) all clean; the new workflow is finding-free. - The artifact-lookup script was shellchecked and run against live API data: correctly resolves the merge-queue run by head SHA, filters expired artifacts, and degrades to the test-running fallback on a missing artifact or total API failure (never fails the job).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
tools/run-eest-spec-test.shinvokes the cmd/evm runner with|| trueand decides shard pass/fail from the--jsonoutreport alone. That design is correct for the non-race shards (the runners always exit 0 and report via JSON; crashes surface as missing/truncated JSON, which the script catches via thetotal == 0check andset -euo pipefailon jq).It is wrong for the
*-race-*shards: the Go race runtime reports detected races to stderr and overrides the process exit code to 66 (even throughos.Exit(0)), while the run completes normally and the JSON stays fully valid with every test passing. The|| truediscards that exit code, so a race shard whose binary detected a data race still went green — theWARNING: DATA RACEreport only visible in the job log. The race detector's whole value is catching races that don't happen to corrupt results, so this silently weakened the guarantee of every race shard:(and
zkevm-witness-race-allonce #21629 lands).Pre-existing since the shards were introduced in #21092; surfaced while reviewing #21629, which doubles the corpus running under the race detector.
Fix
Capture the runner's exit code and fail the shard explicitly on 66 with a pointer to the race report above it in the log. All other exit-code tolerance is unchanged.
Verification (local repro)
Built a minimal
-raceGo binary containing a genuine data race that emits valid passing JSON ([{"name":"t1","pass":true}]) and callsos.Exit(0), then ran it through the real script asEVM_BINagainstblocktests-stable-race-cancun-sequential:Found 1 data race(s)and exits 66 → script printsran 1 tests, 0 failedand exits 0 (green despite the race).ERROR: data race detected (race runtime exit code 66); see WARNING: DATA RACE in the log above.Regression matrix with stub binaries through the patched script, confirming only the race case changes behavior:
shellcheck --severity=warning(the lint-workflow invocation) is clean on the modified script;make lintclean.The race shards running on this PR exercise the patched path end-to-end in CI (race-free run →
rc=0→ behavior identical to before).Note on TDD
This is a shell-script CI guard with no test harness; the Red → Green cycle above (reproduce the masked exit code through the unmodified script, then flip it with the fix) stands in for a checked-in failing test.
End-to-end verification with a real race in erigon code
Repeated the verification with the full production pipeline instead of stub binaries: planted a deliberate data race in
IntraBlockState.CommitBlock(two unsynchronized goroutines incrementing a package var, executed once per imported block — never committed), built the realevm.raceviamake evm.race, and ran a single real EEST fixture (cancun/eip1153_tstore/test_basic_tload_after_store.json, onefork_Cancuntest) through theblocktests-stable-race-cancun-sequentialshard:intra_block_state.goinCommitBlock.func1, binary printsFound 1 data race(s)and exits 66 → script printsran 1 tests, 0 failed, exit 0 — green despite the race.ERROR: data race detected (race runtime exit code 66); see WARNING: DATA RACE in the log above.evm.racerebuilt clean, same fixture through this PR's script → 0 race warnings,ran 1 tests, 0 failed, exit 0 — no false positive.