Skip to content

tools: fail EEST spec shards on race-detector exit code 66#21640

Merged
AskAlexSharov merged 2 commits into
mainfrom
taratorio/eest-race-shard-exit-code
Jun 6, 2026
Merged

tools: fail EEST spec shards on race-detector exit code 66#21640
AskAlexSharov merged 2 commits into
mainfrom
taratorio/eest-race-shard-exit-code

Conversation

@taratorio

@taratorio taratorio commented Jun 5, 2026

Copy link
Copy Markdown
Member

Problem

tools/run-eest-spec-test.sh invokes the cmd/evm runner with || true and decides shard pass/fail from the --jsonout report alone. That design is correct for the non-race shards (the runners always exit 0 and report via JSON; crashes surface as missing/truncated JSON, which the script catches via the total == 0 check and set -euo pipefail on jq).

It is wrong for the *-race-* shards: the Go race runtime reports detected races to stderr and overrides the process exit code to 66 (even through os.Exit(0)), while the run completes normally and the JSON stays fully valid with every test passing. The || true discards that exit code, so a race shard whose binary detected a data race still went green — the WARNING: DATA RACE report only visible in the job log. The race detector's whole value is catching races that don't happen to corrupt results, so this silently weakened the guarantee of every race shard:

blocktests-stable-race-{pre-cancun,cancun,prague,osaka}-{sequential,parallel}
blocktests-devnet-race-amsterdam

(and zkevm-witness-race-all once #21629 lands).

Pre-existing since the shards were introduced in #21092; surfaced while reviewing #21629, which doubles the corpus running under the race detector.

Fix

Capture the runner's exit code and fail the shard explicitly on 66 with a pointer to the race report above it in the log. All other exit-code tolerance is unchanged.

Verification (local repro)

Built a minimal -race Go binary containing a genuine data race that emits valid passing JSON ([{"name":"t1","pass":true}]) and calls os.Exit(0), then ran it through the real script as EVM_BIN against blocktests-stable-race-cancun-sequential:

  • Before (unmodified script): binary prints Found 1 data race(s) and exits 66 → script prints ran 1 tests, 0 failed and exits 0 (green despite the race).
  • After: same run exits 1 with ERROR: data race detected (race runtime exit code 66); see WARNING: DATA RACE in the log above.

Regression matrix with stub binaries through the patched script, confirming only the race case changes behavior:

Scenario Before After
race detected (exit 66, valid passing JSON) exit 0 exit 1
clean pass (exit 0) exit 0 exit 0
non-zero exit ≠ 66 with valid JSON exit 0 (tolerated by design) exit 0
test failures over budget exit 1 exit 1

shellcheck --severity=warning (the lint-workflow invocation) is clean on the modified script; make lint clean.

The race shards running on this PR exercise the patched path end-to-end in CI (race-free run → rc=0 → behavior identical to before).

Note on TDD

This is a shell-script CI guard with no test harness; the Red → Green cycle above (reproduce the masked exit code through the unmodified script, then flip it with the fix) stands in for a checked-in failing test.

End-to-end verification with a real race in erigon code

Repeated the verification with the full production pipeline instead of stub binaries: planted a deliberate data race in IntraBlockState.CommitBlock (two unsynchronized goroutines incrementing a package var, executed once per imported block — never committed), built the real evm.race via make evm.race, and ran a single real EEST fixture (cancun/eip1153_tstore/test_basic_tload_after_store.json, one fork_Cancun test) through the blocktests-stable-race-cancun-sequential shard:

  • old script (origin/main): race report correctly attributes intra_block_state.go in CommitBlock.func1, binary prints Found 1 data race(s) and exits 66 → script prints ran 1 tests, 0 failed, exit 0 — green despite the race.
  • this PR's script: same binary, same fixture → exit 1, last line ERROR: data race detected (race runtime exit code 66); see WARNING: DATA RACE in the log above.
  • control: race reverted, evm.race rebuilt clean, same fixture through this PR's script → 0 race warnings, ran 1 tests, 0 failed, exit 0 — no false positive.

The Go race runtime reports detected races to stderr and overrides the
process exit code to 66 (even through os.Exit(0)), while the run completes
normally and the JSON report stays valid. run-eest-spec-test.sh discarded
all exit codes with '|| true' and decided pass/fail from the JSON alone, so
a *-race-* shard whose binary detected a data race still passed unless the
race also corrupted a test result. Capture the exit code and fail the shard
explicitly on 66; all other exit-code tolerance is unchanged (crashes still
surface as missing/truncated JSON).
@taratorio

Copy link
Copy Markdown
Member Author

@awskii this is a bug from before that got spotted while reviewing #21629

@AskAlexSharov AskAlexSharov added this pull request to the merge queue Jun 6, 2026
Merged via the queue into main with commit 64de483 Jun 6, 2026
90 checks passed
@AskAlexSharov AskAlexSharov deleted the taratorio/eest-race-shard-exit-code branch June 6, 2026 06:31
taratorio added a commit that referenced this pull request Jun 8, 2026
…r branch

SonarCloud rejects a branch analysis dated older than the branch's
latest processed one. All merge-queue entries analyze the target branch,
so when entry N's report uploaded after entry N+1's, SonarCloud rejected
it and flipped the commit's SonarCloud check to cancelled ("Date of
analysis cannot be older than the date of the last known analysis").
Seen on 5027d25, where the scans of three stacked queue entries
(#21648, #21584, #21640) raced and #21648's analysis lost.

Restructure so branch analyses can never be submitted out of order:

- merge_group runs keep `make test-sonar-coverage` as a gate check but
  no longer scan; they upload coverage-test-all.out as a short-retention
  artifact instead.
- A new push-triggered SonarCloud Branch Scan workflow (main and
  release/**, matching the branches SonarCloud tracks) downloads that
  artifact — the merge queue fast-forwards the target branch to the
  already-tested merge commit, so the pushed SHA equals the merge-queue
  run's head SHA and the artifact maps to the commit exactly — and runs
  only the scanner (~7 min instead of ~25). If no artifact exists
  (direct push, expired retention), it falls back to running the tests.
- Scans are serialized per branch FIFO via `concurrency.queue: max`
  (GitHub Actions, 2026-05), so every merged commit is analyzed, in
  order, with no runs cancelled in merge bursts.
- PR scans (Sonar PR analyses, no main-branch date ordering) and
  cache-warming runs are unchanged.

actionlint (≤ v1.7.12) does not know the `queue` key yet, so lint.yml
gains a narrow --ignore alongside the existing workspace one.

A failed analysis upload also no longer involves the gate: scanner
failures now surface on a post-merge run instead, while PR runs keep
exercising the scanner so breakage is still caught before merge.
@yperbasis yperbasis added the QA label Jun 8, 2026
Sahil-4555 pushed a commit to Sahil-4555/erigon that referenced this pull request Jun 8, 2026
…r branch (erigontech#21669)

## Problem

SonarCloud rejects a branch analysis dated older than the branch's
latest processed one. Every merge-queue entry analyzes the target
branch, so when queue entry N's report uploads *after* entry N+1's,
SonarCloud rejects N's with *"Date of analysis cannot be older than the
date of the last known analysis"* and flips that commit's **SonarCloud
Code Analysis** check to `cancelled`.

Seen on
[`5027d259c2`](erigontech@5027d25):
three stacked queue entries (erigontech#21648, erigontech#21584, erigontech#21640) scanned
concurrently; erigontech#21584 started 6s later but uploaded 21s earlier, so
erigontech#21648's analysis was rejected as out of order. No impact on the gate
(the scan step is `continue-on-error` and the merge succeeded), but it
leaves a spurious failed check on a merged commit.

## Fix

Make it structurally impossible to submit branch analyses out of order:

- **`merge_group` runs** keep `make test-sonar-coverage` as a gate check
but **no longer scan**. They upload `coverage-test-all.out` as a 7-day
artifact.
- **A new `SonarCloud Branch Scan` workflow** (`push` on `main` +
`release/**`, matching the branches SonarCloud tracks) downloads that
artifact and runs **only the scanner** (~7 min vs ~25). The merge queue
fast-forwards the target branch to the already-tested merge commit, so
the pushed SHA equals the merge-queue run's head SHA and the artifact
maps to the commit exactly. If no artifact exists (direct push, or
expired retention), it falls back to running the tests itself.
- Branch scans are serialized per branch **FIFO** via
`concurrency.queue: max` ([GitHub Actions,
2026-05](https://github.blog/changelog/2026-05-07-github-actions-concurrency-groups-now-allow-larger-queues/)),
so every merged commit is analyzed, in order, with **no runs cancelled**
during merge bursts.
- **PR scans** (Sonar PR analyses — separate from main-branch date
ordering) and **cache-warming** runs are unchanged.

A failed analysis upload no longer touches the gate at all: scanner
failures now surface on a post-merge run, while PR runs keep exercising
the scanner so config/scanner breakage is still caught before merge.

## Notes

- `actionlint` (≤ v1.7.12, the latest release) does not know the `queue`
concurrency key yet, so `lint.yml` gets a narrow `--ignore` alongside
the existing `workspace` one. Removable once actionlint learns the key.
- `concurrency.queue: max` is ~1 month old — first real merge-burst
after this lands is where FIFO ordering gets exercised live.

## Testing

- `make lint`, `actionlint` (CI flags), and `zizmor` (repo config) all
clean; the new workflow is finding-free.
- The artifact-lookup script was shellchecked and run against live API
data: correctly resolves the merge-queue run by head SHA, filters
expired artifacts, and degrades to the test-running fallback on a
missing artifact or total API failure (never fails the job).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants