tools: fail EEST spec shards on race-detector exit code 66 by taratorio · Pull Request #21640 · erigontech/erigon

taratorio · 2026-06-05T13:53:30Z

Problem

tools/run-eest-spec-test.sh invokes the cmd/evm runner with || true and decides shard pass/fail from the --jsonout report alone. That design is correct for the non-race shards (the runners always exit 0 and report via JSON; crashes surface as missing/truncated JSON, which the script catches via the total == 0 check and set -euo pipefail on jq).

It is wrong for the *-race-* shards: the Go race runtime reports detected races to stderr and overrides the process exit code to 66 (even through os.Exit(0)), while the run completes normally and the JSON stays fully valid with every test passing. The || true discards that exit code, so a race shard whose binary detected a data race still went green — the WARNING: DATA RACE report only visible in the job log. The race detector's whole value is catching races that don't happen to corrupt results, so this silently weakened the guarantee of every race shard:

blocktests-stable-race-{pre-cancun,cancun,prague,osaka}-{sequential,parallel}
blocktests-devnet-race-amsterdam

(and zkevm-witness-race-all once #21629 lands).

Pre-existing since the shards were introduced in #21092; surfaced while reviewing #21629, which doubles the corpus running under the race detector.

Fix

Capture the runner's exit code and fail the shard explicitly on 66 with a pointer to the race report above it in the log. All other exit-code tolerance is unchanged.

Verification (local repro)

Built a minimal -race Go binary containing a genuine data race that emits valid passing JSON ([{"name":"t1","pass":true}]) and calls os.Exit(0), then ran it through the real script as EVM_BIN against blocktests-stable-race-cancun-sequential:

Before (unmodified script): binary prints Found 1 data race(s) and exits 66 → script prints ran 1 tests, 0 failed and exits 0 (green despite the race).
After: same run exits 1 with ERROR: data race detected (race runtime exit code 66); see WARNING: DATA RACE in the log above.

Regression matrix with stub binaries through the patched script, confirming only the race case changes behavior:

Scenario	Before	After
race detected (exit 66, valid passing JSON)	exit 0	exit 1
clean pass (exit 0)	exit 0	exit 0
non-zero exit ≠ 66 with valid JSON	exit 0 (tolerated by design)	exit 0
test failures over budget	exit 1	exit 1

shellcheck --severity=warning (the lint-workflow invocation) is clean on the modified script; make lint clean.

The race shards running on this PR exercise the patched path end-to-end in CI (race-free run → rc=0 → behavior identical to before).

Note on TDD

This is a shell-script CI guard with no test harness; the Red → Green cycle above (reproduce the masked exit code through the unmodified script, then flip it with the fix) stands in for a checked-in failing test.

End-to-end verification with a real race in erigon code

Repeated the verification with the full production pipeline instead of stub binaries: planted a deliberate data race in IntraBlockState.CommitBlock (two unsynchronized goroutines incrementing a package var, executed once per imported block — never committed), built the real evm.race via make evm.race, and ran a single real EEST fixture (cancun/eip1153_tstore/test_basic_tload_after_store.json, one fork_Cancun test) through the blocktests-stable-race-cancun-sequential shard:

old script (origin/main): race report correctly attributes intra_block_state.go in CommitBlock.func1, binary prints Found 1 data race(s) and exits 66 → script prints ran 1 tests, 0 failed, exit 0 — green despite the race.
this PR's script: same binary, same fixture → exit 1, last line ERROR: data race detected (race runtime exit code 66); see WARNING: DATA RACE in the log above.
control: race reverted, evm.race rebuilt clean, same fixture through this PR's script → 0 race warnings, ran 1 tests, 0 failed, exit 0 — no false positive.

The Go race runtime reports detected races to stderr and overrides the process exit code to 66 (even through os.Exit(0)), while the run completes normally and the JSON report stays valid. run-eest-spec-test.sh discarded all exit codes with '|| true' and decided pass/fail from the JSON alone, so a *-race-* shard whose binary detected a data race still passed unless the race also corrupted a test result. Capture the exit code and fail the shard explicitly on 66; all other exit-code tolerance is unchanged (crashes still surface as missing/truncated JSON).

taratorio · 2026-06-06T03:06:11Z

@awskii this is a bug from before that got spotted while reviewing #21629

…r branch SonarCloud rejects a branch analysis dated older than the branch's latest processed one. All merge-queue entries analyze the target branch, so when entry N's report uploaded after entry N+1's, SonarCloud rejected it and flipped the commit's SonarCloud check to cancelled ("Date of analysis cannot be older than the date of the last known analysis"). Seen on 5027d25, where the scans of three stacked queue entries (#21648, #21584, #21640) raced and #21648's analysis lost. Restructure so branch analyses can never be submitted out of order: - merge_group runs keep `make test-sonar-coverage` as a gate check but no longer scan; they upload coverage-test-all.out as a short-retention artifact instead. - A new push-triggered SonarCloud Branch Scan workflow (main and release/**, matching the branches SonarCloud tracks) downloads that artifact — the merge queue fast-forwards the target branch to the already-tested merge commit, so the pushed SHA equals the merge-queue run's head SHA and the artifact maps to the commit exactly — and runs only the scanner (~7 min instead of ~25). If no artifact exists (direct push, expired retention), it falls back to running the tests. - Scans are serialized per branch FIFO via `concurrency.queue: max` (GitHub Actions, 2026-05), so every merged commit is analyzed, in order, with no runs cancelled in merge bursts. - PR scans (Sonar PR analyses, no main-branch date ordering) and cache-warming runs are unchanged. actionlint (≤ v1.7.12) does not know the `queue` key yet, so lint.yml gains a narrow --ignore alongside the existing workspace one. A failed analysis upload also no longer involves the gate: scanner failures now surface on a post-merge run instead, while PR runs keep exercising the scanner so breakage is still caught before merge.

…r branch (erigontech#21669) ## Problem SonarCloud rejects a branch analysis dated older than the branch's latest processed one. Every merge-queue entry analyzes the target branch, so when queue entry N's report uploads *after* entry N+1's, SonarCloud rejects N's with *"Date of analysis cannot be older than the date of the last known analysis"* and flips that commit's **SonarCloud Code Analysis** check to `cancelled`. Seen on [`5027d259c2`](erigontech@5027d25): three stacked queue entries (erigontech#21648, erigontech#21584, erigontech#21640) scanned concurrently; erigontech#21584 started 6s later but uploaded 21s earlier, so erigontech#21648's analysis was rejected as out of order. No impact on the gate (the scan step is `continue-on-error` and the merge succeeded), but it leaves a spurious failed check on a merged commit. ## Fix Make it structurally impossible to submit branch analyses out of order: - **`merge_group` runs** keep `make test-sonar-coverage` as a gate check but **no longer scan**. They upload `coverage-test-all.out` as a 7-day artifact. - **A new `SonarCloud Branch Scan` workflow** (`push` on `main` + `release/**`, matching the branches SonarCloud tracks) downloads that artifact and runs **only the scanner** (~7 min vs ~25). The merge queue fast-forwards the target branch to the already-tested merge commit, so the pushed SHA equals the merge-queue run's head SHA and the artifact maps to the commit exactly. If no artifact exists (direct push, or expired retention), it falls back to running the tests itself. - Branch scans are serialized per branch **FIFO** via `concurrency.queue: max` ([GitHub Actions, 2026-05](https://github.blog/changelog/2026-05-07-github-actions-concurrency-groups-now-allow-larger-queues/)), so every merged commit is analyzed, in order, with **no runs cancelled** during merge bursts. - **PR scans** (Sonar PR analyses — separate from main-branch date ordering) and **cache-warming** runs are unchanged. A failed analysis upload no longer touches the gate at all: scanner failures now surface on a post-merge run, while PR runs keep exercising the scanner so config/scanner breakage is still caught before merge. ## Notes - `actionlint` (≤ v1.7.12, the latest release) does not know the `queue` concurrency key yet, so `lint.yml` gets a narrow `--ignore` alongside the existing `workspace` one. Removable once actionlint learns the key. - `concurrency.queue: max` is ~1 month old — first real merge-burst after this lands is where FIFO ordering gets exercised live. ## Testing - `make lint`, `actionlint` (CI flags), and `zizmor` (repo config) all clean; the new workflow is finding-free. - The artifact-lookup script was shellchecked and run against live API data: correctly resolves the merge-queue run by head SHA, filters expired artifacts, and degrades to the test-running fallback on a missing artifact or total API failure (never fails the job).

taratorio requested review from AskAlexSharov and yperbasis as code owners June 5, 2026 13:53

taratorio requested review from awskii and mh0lt June 6, 2026 03:04

Merge branch 'main' into taratorio/eest-race-shard-exit-code

33d9587

AskAlexSharov approved these changes Jun 6, 2026

View reviewed changes

AskAlexSharov added this pull request to the merge queue Jun 6, 2026

Merged via the queue into main with commit 64de483 Jun 6, 2026
90 checks passed

AskAlexSharov deleted the taratorio/eest-race-shard-exit-code branch June 6, 2026 06:31

taratorio mentioned this pull request Jun 8, 2026

ci: move SonarCloud branch scans out of the merge queue, serialize per branch #21669

Merged

yperbasis added the QA label Jun 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tools: fail EEST spec shards on race-detector exit code 66#21640

tools: fail EEST spec shards on race-detector exit code 66#21640
AskAlexSharov merged 2 commits into
mainfrom
taratorio/eest-race-shard-exit-code

taratorio commented Jun 5, 2026 •

edited

Loading

Uh oh!

taratorio commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

taratorio commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Verification (local repro)

Note on TDD

End-to-end verification with a real race in erigon code

Uh oh!

taratorio commented Jun 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

taratorio commented Jun 5, 2026 •

edited

Loading