ci: matrix-test serial vs parallel exec across the test workflows by mh0lt · Pull Request #21017 · erigontech/erigon

mh0lt · 2026-05-06T17:39:42Z

2026-05-15 — #21153 has merged. The companion PR carrying the substantive parallel-exec correctness + perf fix family (mh/parallel-exec-fixes, merged 2026-05-15 02:35Z as 958b2fbb85) is now on main. This PR has been rebased onto fresh main; the trimmed branch contains only the serial/parallel exec-mode CI matrix yamls plus two CI hygiene fixes (a shared build-erigon-image job for the kurtosis matrix, and the setup-erigon apt-mirror flake fix), plus three workflow follow-ups from the 2026-05-13 review.

Landed precursors:

✅ execution: parallel-exec correctness + perf fixes; gate residual EXEC3_PARALLEL flakes (#21136) #21153 (merged 2026-05-15) — parallel-exec correctness/perf stack split from this PR

✅ execution/stagedsync, db/state: parallel-commitment correctness for reorg/unwind + SD recreate #21088 (merged 2026-05-10) — parallel-commitment correctness for reorg/unwind + SD recreate

✅ state, stagedsync: parallel-exec wrong-root fix for SD vs EIP-161 emptyRemoval #21032 (merged 2026-05-08) — wrong-trie-root on Cancun / SD paths

✅ execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit #21039 (merged 2026-05-08) — apply-loop completeness false-flag on partial-batch exit

✅ stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains #21010 (merged 2026-05-07) — commitmentCalculator.asOfReader.txNum=0 lazy-load fix

Open follow-up issues (separate from this PR's scope): #21106 (parallel-exec hardening), #21107 (stage-exec from-0/chaintip), #21108 (residual EXEC3_PARALLEL functional cases), #21136 (gated SkipLoads), #21138 (heuristic-removal structural cleanup).

Summary

Adds an exec_mode: [serial, parallel] matrix axis to every CI test workflow that exercises the EL execution path so divergence between dbg.Exec3Parallel true and false is caught on the PR rather than after release.

The toggle is plumbed via ERIGON_EXEC3_PARALLEL — envLookup in common/dbg/dbg_env.go:38 auto-prepends the ERIGON_ prefix to the source-side EXEC3_PARALLEL flag in common/dbg/experiments.go:81.

Why

Exec3Parallel = false is currently the source default on main. With no CI workflow setting the env var, every CI run today exercises only the serial path. The parallel path lands via --bal chains and tests like experimentalBAL, but the broad correctness signal (unit / race / hive / kurtosis / RPC integration) is single-mode. This PR makes both modes a per-PR signal.

Affected workflows

Always-on (matrix on every PR / dispatch / call):

Workflow	Runner	Parallelism	Cost
test-all-erigon.yml	GitHub-hosted (per-OS)	true parallel	wall-clock unchanged, +1× runner-min
test-all-erigon-race.yml	GitHub-hosted	true parallel	same
test-hive.yml	hive group	parallel if pool has slots	same (group is sized)
test-hive-eest.yml	hive group	parallel if pool has slots	same
test-kurtosis-assertoor.yml	`ubuntu-latest`	true parallel	same

Auto-fire on PRs touching their own YAML, dispatch otherwise (so regular PRs don't pay the cost; this PR triggers each one once for end-to-end validation):

Workflow	Notes
test-bench.yml	`go bench`
qa-rpc-integration-tests-latest.yml	self-hosted single-pool, `max-parallel: 1` (shared testbed state)
qa-rpc-performance-comparison-tests.yml	erigon runs serial+parallel, geth single-mode (placeholder ignored)
qa-txpool-performance-test.yml	kurtosis txpool, `max-parallel: 1`
qa-stage-exec.yml	3 mode_names × 2 exec_modes = 6 entries

Skipped — test-integration-caplin.yml runs cl/spectest only, doesn't exercise the EL exec path; matrix-doubling would just re-run identical CL tests.

Plumbing details

Workflows that build erigon and run go tests directly: env var set on the test step's env: block.
Hive-based workflows: an ENV ERIGON_EXEC3_PARALLEL=... line is appended to clients/erigon/Dockerfile during the existing sed-patch loop so every hive-launched erigon inherits the toggle.
Kurtosis-based workflows: a single upstream build-erigon-image job builds test/erigon:current-base once for the whole matrix and uploads it as a run-scoped artifact; each matrix entry downloads + docker loads and adds a cheap ENV ERIGON_EXEC3_PARALLEL=… layer on top to produce test/erigon:current. Avoids ~12 concurrent buildx jobs all writing the same type=gha cache scope and 504-ing.
Self-hosted single-pool workflows use max-parallel: 1 to serialize matrix entries cleanly when state on the runner box is shared (testbed datadir, reference datadir, etc.).
All artifact / enclave / testbed-dir names are disambiguated by exec_mode so the two matrix entries don't clobber each other's outputs.

Review follow-ups (2026-05-15 rebuild)

After rebasing onto post-#21153 main, three workflow fixes from the 2026-05-13 review are applied as separate commits on top of the 4 trimmed CI commits:

Commit	Fixes
`ci: gate parallel-suffix QA test_name on client==erigon`	yperbasis Blocker 2 + Copilot thread #4 — geth's placeholder `exec_mode: parallel` previously caused its Grafana `test_name` to gain a stray `-parallel-` suffix, breaking historical dashboard queries
`ci: align test-hive devp2p sim-limit between serial and parallel matrix legs`	yperbasis nit 5 + Copilot threads #3/#9 — both legs now run `sim-limit: eth` (discv5 doesn't exercise the EL exec path; matches the long-standing comment)
`ci: fix Targetting typo in test-hive-eest.yml`	Copilot thread #6 — s/Targetting/Targeting/

The yperbasis Blocker 1 (cmd/integration/commands/stages.go:631 ignores ERIGON_ prefix, so qa-stage-exec's serial entry silently runs in parallel mode) is a Go change, raised as a separate small PR to keep this one strictly .github/ only.

Test plan

This PR's CI run is the test. The 5 always-on workflows fire automatically on PR. The 5 perf workflows auto-fire because their pull_request: paths filter matches the workflow's own YAML — so opening this PR triggers all 10 affected workflows.

What to look for in this PR's "Checks" tab:

Each affected workflow has both serial and parallel matrix entries listed.
Job display names show the mode (e.g. tests-mac-linux (ubuntu-24.04, parallel)).
Wall-clock for hosted-runner workflows is essentially unchanged from main baseline (jobs ran concurrently on separate runners).
Self-hosted single-pool workflows show ~2× wall-clock (matrix entries serialize).
Both modes pass on every workflow. If serial fails where parallel passes (or vice versa), that's a real regression we'd want to catch — exactly the point of this change.

After merge, regular PRs only pay the matrix cost on the 5 always-on workflows; the 5 perf workflows go back to being dispatch / schedule-only.

mriccobene · 2026-05-07T09:23:54Z

Here, we need to change the test name to distinguish between different results in our Hive UI.
For sequential execution, it is preferable to leave the name as it is, and change it for parallel execution by adding a suffix. This allows us to use a simple regular expression to filter results; for example, 'rpc-integration-tests-latest-parallel'.

Fixed in 406cb45 — test_name now suffixes -parallel only for the parallel matrix entry, serial keeps the existing name unchanged. Renders as rpc-integration-tests-latest (serial) / rpc-integration-tests-latest-parallel (parallel).

mriccobene · 2026-05-07T09:30:20Z

Here, again, we need to change the test name to distinguish between different results in our Grafana dashboard
For sequential execution, it is preferable to leave the name as it is, and change it for parallel execution by adding a suffix. This allows us to use a simple regular expression to filter results; for example, 'rpc-performance-test-latest-parallel-$method'.

Fixed in 406cb45 — --test_name for upload_test_results.py now expands to rpc-performance-test-latest-$method (serial) / rpc-performance-test-latest-parallel-$method (parallel). The HDR-analysis report's local test_name (line 340) stays as $client-$method since that only labels the PDF, which is already disambiguated by the artifact-name's -${{ matrix.exec_mode }} suffix — let me know if you want that one renamed too for consistency.

mh0lt · 2026-05-07T11:17:20Z

Cross-ref to #21032 — the precursor that fixes the parallel-exec wrong-trie-root bugs this matrix surfaces. Once #21032 lands, rebase this on top and the parallel hive sub-suites should go green.

@mriccobene

Addresses review feedback from @mriccobene on PR #21017: the QA dashboard filters results by test_name regex. Without a suffix the parallel and serial entries land under the same test_name and can't be distinguished in the Hive UI / Grafana dashboard. Per the reviewer's preferred convention, leave the serial entry's name unchanged and add `-parallel` only for the parallel matrix entry. Both `upload_test_results.py` invocations are updated: - qa-rpc-integration-tests-latest.yml:266 → rpc-integration-tests-latest[-parallel] - qa-rpc-performance-comparison-tests.yml:374 → rpc-performance-test-latest[-parallel]-$method The HDR-analysis PDF test_name (line 340 of the perf-comparison file) is intentionally left as `$client-$method` — that name only labels a local report file, which is already disambiguated by the artifact name's `-${{ matrix.exec_mode }}` suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mh0lt · 2026-05-07T12:58:18Z

Second precursor open: #21039 — fixes the parallel-mode BenchmarkFeeHistory failure (Bucket B). Once #21032 and #21039 merge I'll rebase. Bucket C (reorg state-visibility) investigation continuing in a third precursor.

…oop exit (erigontech#21039) ## Summary Drops the fallback clause in `applyLoopMissingBlocks` that turned every legitimate partial-batch exit into a spurious `InvalidBlock` error. When the parallel exec loop hits its size-limit mid-fixture, it returns nil with `reachedMaxBlock=false`. The previous completeness check flagged `maxBlockNum` as "missing" whenever `!reachedMaxBlock` — but the size-limit path is normal: the apply loop returns `ErrLoopExhausted`, the stage loop resumes the next batch from `lastBlockResult+1`, and each block still executes exactly once. No re-execution. ## Update — additional fix in commit `310a094514` (`execution/stagedsync: honor blockResult.Exhausted in exec loop`) Same class of partial-batch orchestration bug, on the **exec-loop side**. `executeBlocks` marks the final dispatched block with `Exhausted` when the per-cycle block limit fires, then exits its goroutine — without closing `pe.execRequests`, without cancelling ctx. The exec loop had no branch to react to that signal: after the last blockResult was applied, the main select parked waiting for `execRequests` / `rws.ResultCh` / `ctx.Done` — none of which ever fire. The apply loop never got the channel-close signal it needs to return `ErrLoopExhausted`, so the original (apply-loop-side) fix in this PR couldn't even run. Reproduced as a chiado parallel-exec **silent stall** at block 150662 with `EXEC3_PARALLEL=true ... --sync.loop.block.limit=10_000` from block 0 (issue erigontech#20711). The hang was masking a wrong-trie-root divergence at the same block — with the exhaust signal honored, the underlying `ErrWrongTrieRoot` now surfaces cleanly through the apply loop and the original-issue symptom appears as expected. The new branch sits between the `maxBlockNum`-reached check and the `StopAfterBlock` debug exit, matching the precedence of the existing partial-batch exits. Tests: `TestExecLoopHonorsBlockResultExhausted` runs **two models** of the exec-loop blockResult decision tree (post-fix and pre-fix) against identical channel orchestration. The pre-fix model must hang past the timeout — proves the test scaffolding genuinely surfaces the bug rather than passing vacuously. `TestExecLoopExhaustedOnlySetOnFinalBlock` locks in the dispatcher convention so future `executeBlocks` changes can't silently drop already-queued work by setting `Exhausted` mid-stream. ## Why Reproduces deterministically as the parallel-mode `BenchmarkFeeHistory` failure on PR erigontech#21017 (the CI matrix that runs every test under both `serial` and `parallel` exec modes): ``` apply loop exited (reachedMaxBlock=false lastBlockResult=114 maxBlockNum=200) but 1 block(s) had tx-results without a blockResult or were never delivered: [200] ``` 200 blocks × 100 simple transfers exceed the test fixture's 5MB batch budget at block 114. Trace confirms: - Batch 1: blocks 1..114, size-limit fires, `ErrLoopExhausted` - Batch 2: blocks 115..200, `reachedMaxBlock=true`, clean nil Sd.mem is not contaminated by in-flight future-block writes — `txResultBlocks` at exit was exactly `[1..114]`. Block 200 was flagged solely from the fallback clause. ## Shutdown sequence verified Tracing the deferred-close in `execLoop`: `commitResults` closes first, calculator drains and closes `rootResults`, apply loop sees both closes, drains, runs the completeness check, returns `ErrLoopExhausted`. `executorCancel` then cancels `execLoopCtx`, `executeBlocks` and workers exit, `pe.wait` joins. Three batches in `BenchmarkFeeHistory` run end-to-end with no goroutine leaks. ## Tests * `TestApplyLoopMissingBlocks` — adds a "partial batch — size-limit hit, no spurious flag for unreached blocks" case * `TestApplyLoopPartialBatchReturnsErrLoopExhausted` — exercises the apply-loop exit decision tree (exhausted vs nil vs InvalidBlock) for partial-batch / full-batch / silent-failure scenarios * `TestApplyLoopChannelCloseOrder` — pins the load-bearing `commitResults`-before-`applyResults` close order * `TestExecLoopHonorsBlockResultExhausted` — orchestration test for the exec-loop side of the partial-batch exit (added in commit `310a094514`) * `TestExecLoopExhaustedOnlySetOnFinalBlock` — pins the executeBlocks dispatcher convention (added in commit `310a094514`) ## Test plan - [x] `BenchmarkFeeHistory/full/blocks=200` passes with `ERIGON_EXEC3_PARALLEL=true` - [x] `make lint` clean - [x] `go test -race ./execution/stagedsync/...` for new tests passes - [x] Existing `TestApplyLoopRootResultsCloseDoesNotRace`, `TestApplyLoopDoesNotHangAfterRootResultsClose`, `TestExecLoopExitCheck*` still pass - [x] Chiado `EXEC3_PARALLEL=true ... --chain=chiado --sync.loop.block.limit=10_000 --prune.mode=archive` from block 0 reaches the underlying `ErrWrongTrieRoot` at block 150662 instead of silently stalling (verifies the partial-batch exec-loop exit fires) ## CI fix linkage Precursor for erigontech#21017 (CI matrix `exec_mode={serial,parallel}`). Without this, `bench / benchmarks (parallel)` and any other workflow exercising a chain large enough to cross the batch-size boundary fails on first crossing. Companion to erigontech#21032 (incarnation/SD differentiator fix).

mh0lt · 2026-05-08T15:21:55Z

Merged origin/main into the branch (commit 7f15ca7cea) to pick up the parallel-exec precursors that landed today: #21039 (apply-loop), #21062 (Set*Workers race), #21042 (stateCache wiring), #21010 (asOfReader.txNum=0 fix), and #21065 (devp2p discv5).

One conflict: .github/workflows/test-hive.yml — main bumped devp2p sim-limit from eth to eth|discv5 (#21065) at the same time this branch was expanding the row into serial+parallel exec_mode variants. Resolved by keeping both serial and parallel devp2p rows and applying eth|discv5 to both.

Still missing: #21032 (SD vs EIP-161 emptyRemoval wrong-root). That's the only remaining precursor; once it lands, this branch will rebase to pick it up and we'll have the full set. Pushing now to see how the parallel-mode hive sub-suites land with the today-precursors but without #21032 — expect SD-related sub-suites (rpc-compat selfdestruct, eest selfdestruct, cancun system-call address) to fail on the exec_mode: parallel rows; the rest should give a useful signal.

mh0lt · 2026-05-08T16:57:35Z

Merged origin/main (now including #21032 — landed at 16:51Z) into the branch as commit 1075b33acd. The full parallel-exec precursor stack is now in: #21039, #21042, #21046, #21062, #21010, #21065, #21032.

No conflicts on the merge — #21032's diff (calc_state.go, calc_state_test.go, exec3_parallel.go, rw_v3.go) is orthogonal to #21017's workflow-yaml-only diff.

Lint clean. CI now exercises the parallel-exec hive sub-suites with the SD/EIP-161 emptyRemoval fix in place — this is the meaningful go/no-go signal. Watching for parallel-mode rows on:

hive / test-hive (ethereum/engine, cancun)
hive / test-hive (ethereum/engine, withdrawals)
hive / test-hive (ethereum/rpc-compat, .*)
race-tests / tests-linux (ubuntu-latest, execution-eest-devnet)

…tyRemoval (erigontech#21032) ## Summary Fixes a wrong trie-root in the parallel commitment calculator when post-tx writesets are indistinguishable between two cases: 1. **SELFDESTRUCT of a pre-existing contract** — serial keeps the leaf with `incarnation>0`, zero balance/nonce/empty codeHash. Parallel was emitting `DeleteUpdate` and removing the leaf. 2. **EIP-161 emptyRemoval** of a touched-empty account (e.g. `0xff…fe` after the EIP-4788 system call) — serial emits `DeleteUpdate`. Parallel was emitting a zero-account UPDATE. The discriminator is the **pre-block incarnation**, which the writeset alone didn't carry. Fix wires it through `LightCollector.DeleteAccount` → `IncarnationPath` write → `calcAccountState.Incarnation` → 3-way branch in `FlushToUpdates`. ## Dependency direction This PR is a **precursor for erigontech#21017** (the CI matrix that runs every test under both `serial` and `parallel` exec modes). Without this fix, parallel-mode tests on hive `rpc-compat`, `engine api/cancun/withdrawals`, and the eest selfdestruct sub-suites all fail with wrong-trie-root errors at SD/empty-removal blocks. **erigontech#21017 cannot land until this PR lands.** The matrix in erigontech#21017 will validate this PR end-to-end via the hive parallel sub-suites — meaning this PR's parallel-exec changes don't run in *this* PR's own CI, only in erigontech#21017's. The rebased erigontech#21017 is the meaningful go/no-go signal. ## Changes * `LightCollector.DeleteAccount` now emits `IncarnationPath` alongside `SelfDestructPath=true` when `original.Incarnation > 0`. Calc receives the pre-deletion incarnation through the same channel as every other write — no exec-side state leakage. * `calc_state.ApplyWrites` consumes `IncarnationPath` into `calcAccountState.Incarnation` via direct type-assertion (panic on mismatch — see *Concern 3* below). * `calc_state.FlushToUpdates` switches on a 3-way `Deleted` branch with `isAllZero` defense-in-depth: * `Deleted && Incarnation>0 && all-zero` → zero-account UPDATE (matches serial's `DomainDel`-leaves-post-CREATE-encoding for SD'd contracts) * `Deleted && all-zero && Incarnation==0` → `DeleteUpdate` (matches serial's `DomainDel`-truly-empties for EIP-161 emptyRemoval) * `Deleted` with retained non-zero values → regular UPDATE — defensive coverage for OOG-CREATE2-with-retained-balance and any future write-ordering race * `SelfDestructPath` also marks all tracked storage slots dirty so `FlushToUpdates` emits per-slot updates alongside the account reset. Load-bearing invariant: `normalizeWriteSet`'s `vm.StorageKeys(addr)` loop emits `StoragePath=0` entries that arrive in `ApplyWrites` after `SelfDestructPath`, overwriting the marked slots' values to zero so they emit `DeleteUpdate` not `StorageUpdate(pre-SD value)`. Inline comment in `calc_state.go` spells this out — see *Concern 2* below. ## Earlier draft snags (resolved) The first draft also added an `IncarnationPath > 0` exclusion to `normalizeWriteSet`'s empty-account check. This was **redundant** (the empty-check already requires `Nonce == 0`, which excludes successful CREATE/CREATE2) and **broke OOG-during-CREATE2 cases** (which leave `Nonce=0/Balance=0/Code=empty/Incarnation=1` and *must* still be deleted). Removed in `9539998f14`. The `exec3_parallel.go` diff in this PR is now comments-only. ## Reviewer concerns addressed ### #1 (yperbasis): PR description was stale This body. ✓ ### #2 (yperbasis + Copilot): SD storage-cascade load-bearing invariant Inline comment in `calc_state.go`'s `SelfDestructPath` case now spells out the dependency on `normalizeWriteSet`'s `vm.StorageKeys(addr)` loop. ✓ ### #3 (yperbasis + Copilot): IncarnationPath guarded type-assertion Changed to direct `w.Val.(uint64)` to match the other paths. Silent zero would route a real SD into the EIP-161 branch and reproduce the very wrong-root bug — better to panic at the source. ✓ ### #4 (yperbasis): `TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate` docstring Updated to clarify: this is **defensive coverage** for the third `FlushToUpdates` branch in isolation, NOT a direct repro of the eest_devnet OOG path. The actual OOG fix is the removal of the redundant `IncarnationPath > 0` clause from `normalizeWriteSet` (the OOG writeset has `Nonce=0` → empty-account → `DeleteUpdate`, not `Deleted+RetainedBalance`). End-to-end coverage of that path lives in the eest_devnet suite, surfaced via erigontech#21017's matrix. ✓ ### #5 (yperbasis): `versionedWriteCollector.DeleteAccount` asymmetry — *intentional non-fix* Decision: **keep the asymmetry, document why.** Inline comment added on `versionedWriteCollector.DeleteAccount` explaining: * It's wired only into `txResult.finalize` (fee calc + post-Cancun system calls). * Neither path SDs a pre-existing contract today, so the SD-with-incarnation differentiator is unreachable from here. * If a future caller ever does emit `DeleteAccount` on a pre-existing contract through this collector, the comment flags that this code should mirror `LightCollector.DeleteAccount`'s `IncarnationPath` emit. Adding the emit speculatively was rejected because: (a) it changes the writeset shape for paths that today don't need it, (b) any test exercising the new emit would be vacuous since no production caller hits the `original.Incarnation > 0` branch, and (c) the comment is enough to attribute the bug at first sight if someone *does* reach that code path in the future. ## Intentional non-fixes * **Concern #5 above** — `versionedWriteCollector.DeleteAccount` left without the `IncarnationPath` emit (rationale above). * **Defensive `TestFlushToUpdates_DeletedWithRetainedBalance` test kept** despite the state being unreachable from real LightCollector writesets today — protects the FlushToUpdates branch in isolation against future ApplyWrites refactors that might drop the `BalancePath`-clears-`Deleted` invariant. ## Test plan - [x] All 6 unit tests in `calc_state_test.go` pass (`TestFlushToUpdates_DeletedWithIncarnation_EmitsZeroAccountUpdate`, `TestFlushToUpdates_DeletedWithoutIncarnation_EmitsDelete`, `TestFlushToUpdates_DeletedWithRetainedBalance_EmitsRegularUpdate`, `TestFlushToUpdates_LiveAccount_EmitsFullUpdate`, `TestApplyWrites_IncarnationPath`, `TestApplyWrites_BalancePathClearsDeleted`) - [x] eest_devnet `for_amsterdam/constantinople/eip1052_extcodehash/extcodehash/extcodehash_subcall_create2_oog` all 6 variants pass locally - [x] Full `for_amsterdam/constantinople` eest_devnet suite passes - [x] `make lint` clean - [x] CI on `9539998f14` was green End-to-end validation comes via erigontech#21017's CI matrix once it rebases on top of this PR.

mh0lt · 2026-05-10T16:07:45Z

Bucket C parallel-commitment correctness fix is up in #21088 — closes the off-by-one cluster (TestBlockchainHeaderchainReorgConsistency, TestLongerForkHeaders, TestLongerForkBlocks, TestCallTraceUnwind) and TestRecreateAndRewind on the parallel-exec matrix. Remaining parallel-mode failures (pipeline-level races in forkchoice/senders/canonical-txnums and from-0 stage-exec) are tracked as separate follow-up PRs.

…ComputeCommitment swap Closes ~93% of the race-detector hits surfaced on #21017's parallel race-test matrix by extending the changesetMu band-aid (added in the prior commit) to cover the deferred-update flush window and the ComputeCommitment internal [state] marker write. Race count delta (parallel race-tests groups): Group Before After Δ ----------------------- ------- ----- ----- execution-other 12 2 -83% core-rpc 158 13 -92% execution-eest-blockchain 4 0 -100% Bucket C cluster 4 0 -100% What changed ------------ * `SharedDomains.FlushPendingUpdates` previously did its inner swap (`switcher.SetChangesetAccumulator(cs)` … restore) directly via the `changesetSwitcher` interface, bypassing changesetMu. The race detector flagged this as the dominant race signature (SetDiff↔SetDiff, SetChangesetAccumulator↔SetChangesetAccumulator, GetChangesetAccumulator↔SetChangesetAccumulator). Split into `FlushPendingUpdates` (acquires changesetMu) and `FlushPendingUpdatesLocked` (caller already holds it). The internal `flushPendingUpdates(lockHeld bool)` is the single implementation. * Added `ComputeCommitmentLocked` for the parallel calculator's per-block compute window. The calc holds changesetMu via `LockChangesetAccumulator` and now calls `ComputeCommitmentLocked` which uses `FlushPendingUpdatesLocked` instead of the public flush. * The `cs == nil` early-return path in `computeWithBlockAccumulator` also takes the lock now — the FlushPendingUpdates inside ComputeCommitment still runs there and needs serialization against the apply-side SetChangesetAccumulator. Without this the cs==nil fast path produces ~73 SetDiff vs PutWithPrev hits per group. * Added `domainPutNoLock` as an internal helper used by FlushPendingUpdates' deferred-branch replay, since `SharedDomains.DomainPut` for non-CommitmentDomain takes the lock and would self-deadlock when the caller already holds it. Bucket C functional tests (no -race) still all pass under EXEC3_PARALLEL=true. make lint clean. Residual races (~2 per execution-other run) are encodeAndStoreCommitmentState's [state] marker write going through the lock-exempt CommitmentDomain branch in DomainPut — closing those needs the [state] marker write to also be deferred (or the exemption removed). Tracked as Variant 1 / lock-free follow-up in the parallel-exec hardening series. This commit is intended to fold into #21088 since it's a related fix on the same band-aid mutex and #21088 hasn't been reviewed yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mh0lt · 2026-05-11T11:56:40Z

Parallel-exec hardening follow-ups (the long tail of making this matrix fully green in parallel mode):

Lock-free parallel execution: derive per-block changesets post-hoc, remove changeset accumulator from the exec path #21106 — lock-free parallel execution refactor (delete the changesetMu band-aid from execution/stagedsync, db/state: parallel-commitment correctness for reorg/unwind + SD recreate #21088)
Parallel-exec hardening: stage-exec-test (from-0 / chaintip / resume-nonchaintip) fails in both serial and parallel modes #21107 — stage-exec-test (from-0 / chaintip / resume-nonchaintip) fails in both modes
Parallel-exec hardening: residual functional test failures under EXEC3_PARALLEL=true unrelated to commitment-accumulator races #21108 — residual functional failures under EXEC3_PARALLEL=true (engine-API / RPC / shutter / BAL buckets)

#21088 is the immediate prerequisite (Bucket C correctness + 227→0 race-detector hits).

@mriccobene

Addresses review feedback from @mriccobene on PR #21017: the QA dashboard filters results by test_name regex. Without a suffix the parallel and serial entries land under the same test_name and can't be distinguished in the Hive UI / Grafana dashboard. Per the reviewer's preferred convention, leave the serial entry's name unchanged and add `-parallel` only for the parallel matrix entry. Both `upload_test_results.py` invocations are updated: - qa-rpc-integration-tests-latest.yml:266 → rpc-integration-tests-latest[-parallel] - qa-rpc-performance-comparison-tests.yml:374 → rpc-performance-test-latest[-parallel]-$method The HDR-analysis PDF test_name (line 340 of the perf-comparison file) is intentionally left as `$client-$method` — that name only labels a local report file, which is already disambiguated by the artifact name's `-${{ matrix.exec_mode }}` suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mh0lt · 2026-05-11T16:13:47Z

Rebased onto current main (now includes #21088, which closed the parallel-commitment correctness cluster + 227→0 race-detector hits across the four parallel race-test groups). Branch is now 2 clean commits on top of main.

Expected remaining parallel-mode failures after this rebase, all tracked as separate follow-ups:

Parallel-exec hardening: stage-exec-test (from-0 / chaintip / resume-nonchaintip) fails in both serial and parallel modes #21107 — stage-exec-test (from-0 / chaintip / resume-nonchaintip) fails in both serial and parallel, so non-Bucket-C
Parallel-exec hardening: residual functional test failures under EXEC3_PARALLEL=true unrelated to commitment-accumulator races #21108 — residual functional fails under EXEC3_PARALLEL=true (engine-API / RPC / shutter / BAL buckets) — not race-detector hits

Will triage the fresh CI run and update here.

…eorg/unwind + SD recreate (erigontech#21088) ## Summary Closes the off-by-one wrong-trie-root cluster, `TestRecreateAndRewind`, and **all** `TemporalMemBatch.currentChangesAccumulator` / `DomainBufferedWriter.diff` race-detector hits surfaced on erigontech#21017 under `EXEC3_PARALLEL=true`. Six commits: 1. `8c9d9c10a7` — Per-block commitment in calculator when generating changesets 2. `acba3279a6` — Lock `pastChangesAccumulator` for parallel-commitment access 3. `bac3ee0c25` — Hash-aware past-changeset lookup for parallel calculator 4. `6dc4e3fe8f` — Serialize accumulator swap; SD-aware writeset normalization 5. `980de89157` — Simplify calc SD `ApplyWrites` — single pass with conditional `Deleted`-clearing 6. `29f5ebc2ed` — Lock-protect `FlushPendingUpdates` + `ComputeCommitment` swap (closes 100% of the dominant race signature) ### Functional fix (commits 1–5) - **Concurrency band-aid on `SharedDomains.changesetMu`.** The parallel commitment calculator briefly swaps the global current-changeset-accumulator pointer to route block N's branch writes into block N's saved CS. During that swap window, the apply goroutine's `DomainPut` for block N+1's account/storage writes was landing in block N's CS, causing missing prev-value entries on unwind (`TestBlockchainHeaderchainReorgConsistency`, `TestLongerForkHeaders/Blocks`, `TestCallTraceUnwind`). - **SD-aware writeset normalization.** `IBS.Selfdestruct` emits three writes (`IncarnationPath`, `SelfDestructPath=true`, `BalancePath=0`). `normalizeWriteSet`'s completion loop was filling missing fields for SD'd addresses via the stateReader, round-tripping pre-SD nonce/codeHash back into the writeset. `applyVersionedWrites` then took the cleanup-before-recreate branch instead of pure-delete, so phoenix stayed in `sd.mem` with non-zero incarnation and the next block's `CREATE2` saw a phantom existing account. Calc-side `ApplyWrites` also had the symmetric bug: the trailing `BalancePath=0` clobbered `Deleted=true` set by `SelfDestructPath`. Fix: drop SD'd addresses' raw account-field writes in normalize; in calc, gate `Deleted`-clearing on a non-zero incoming value, and zero `Balance`/`Nonce`/`CodeHash`/`Incarnation` and storage on `SelfDestructPath`. ### Race fix (commit 6) The band-aid mutex from commit (4) only covered the calculator's outer swap and apply-side `DomainPut`/`DomainDel`. `FlushPendingUpdates` (which runs inside `ComputeCommitment`) was doing its own inner swap via the `changesetSwitcher` interface, bypassing `changesetMu`. That left the dominant race signature in erigontech#21017's parallel race-tests open (227 hits across the four matrix groups, all in the `currentChangesAccumulator` / `DomainBufferedWriter.diff` family). Commit (6) splits `FlushPendingUpdates` into the public locking variant and an unlocked `FlushPendingUpdatesLocked` for the calc, adds `ComputeCommitmentLocked` for the calc's per-block compute window, and extends the calculator's outer lock to the `cs == nil` early-return path so the inner `FlushPendingUpdates` is always serialized against the apply loop. | Race-tests group | Baseline races | After this PR | Δ | |---|---:|---:|---:| | execution-tests | 53 | **0** | -100% | | execution-other | 12 | **0** | -100% | | core-rpc | 158 | **0** | -100% | | execution-eest-blockchain | 4 | **0** | -100% | | Bucket C cluster (under -race) | 4 | **0** | -100% | | **Total** | **231** | **0** | **-100%** | ## Test plan - [x] All seven Bucket C tests pass under `EXEC3_PARALLEL=true` `-count=2`: - `TestBlockchainHeaderchainReorgConsistency` - `TestLongerForkHeaders` / `TestLongerForkBlocks` - `TestCallTraceUnwind` - `TestTxLookupUnwind` - `TestLowDiffLongChain` - `TestRecreateAndRewind` - [x] Bucket C cluster under `-race` shows 0 races (was 4 before this PR). - [x] All four parallel race-tests matrix groups under `-race` show 0 races (was 227 before this PR). - [x] `make lint` clean. - [x] No regressions in the surrounding parallel-exec test groups. ## Known follow-ups (NOT in scope for this PR — separate parallel-exec hardening track) - **Functional test failures unrelated to commitment-accumulator races** remain in some parallel-mode test groups (~20 unique tests across core-rpc, execution-other, etc.). These are NOT race-detector hits — they're pre-existing functional issues unrelated to the commitment-accumulator fixes (engine-API behavior, RPC test fixtures, etc.). Tracked as separate follow-up PRs in the parallel-exec hardening series. - **`stage-exec-test (from-0, parallel | serial)` and `(chaintip, parallel | serial)`** — failures span both modes, indicating a non-Bucket-C concern. Separate PR. - **Lock-free execution refactor** (Variant 1 / Option A0 in the project's session memory) — eventually delete the `changesetMu` band-aid and the swap dance in `committer.go computeWithBlockAccumulator` by deriving per-block changesets post-hoc from sd entries at flush time. Architectural rather than band-aid. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ode-matrix

Reverts b9f6208. This branch is the CI-matrix config and should carry only .github/ changes; the parallel-exec warmup-cache fix moves to a dedicated Go-fix PR.

…this branch" This reverts commit f02c5dd.

…head-extending payloads ValidateChain creates a fresh SharedDomains with no parent. FCU's MergeExtendingFork leaves the latest canonical state in currentContext.mem; MDBX is committed only later under memory pressure. Between an FCU and the next newPayload the fresh doms reads stale MDBX, so parallel fork-validation computes a wrong trie root — nondeterministic, depending on whether a background commit has fired (hive engine-api "Re-Org Back into Canonical Chain ... Execute Side Payload on Re-Org" failed at block 14/15 across runs). Set the doms parent to currentContext when the payload extends the canonical head, so validation sees the fresh in-memory state. Head-extending only: a fork payload needs unwindToCommonCanonical to revert doms to the common ancestor, which the parent link would shadow. Temporary on this CI-matrix branch to keep parallel CI green; cherry-picked from mh/perf-caches-pr (733dad5).

…arnation+1 Parallel exec runs tx bodies with shouldDelayFeeCalc=true, so the worker never credits the coinbase tip / burnt fee — finalizeTxSimple authors those as implicit writes. They were stamped with the worker's own Version, so a later tx that speculatively read the coinbase/burnt before this finalize ran recorded a version the validator could not tell apart from a fresh read (checkVersion compares (TxIndex,Incarnation) only). The stale read passed validation, the dependent tx was never re-executed, and the prior tx's fee was silently dropped — surfacing as an intermittent wrong trie root under the parallel-exec CI matrix. Stamp the implicit coinbase/burnt fee writes at incarnation+1 so they are a distinct version: a dependent that read the pre-fee value now fails validation and is re-executed against the post-fee balance.

The parallel exec path never sets a per-task gas pool, so the worker's preCheck saw st.gp==nil and CheckBlockGasInclusion silently short- circuited (the nil-guard in gaspool.go was added so workers couldn't race on the shared block pool — but that left no in-worker gas-allowance check at all). For a tx whose gas exceeds the block limit, preCheck then advanced to the next check; if the tx also had feeCap < baseFee, the worker returned ErrFeeCapTooLow. Serial preCheck catches the same tx earlier with ErrGasLimitReached. The merge-queue's hive-eest parallel matrix (cancun/prague/paris- shanghai/osaka) was failing on this divergence: eest asserts the serial error variant (GAS_ALLOWANCE_EXCEEDED) and rejects the parallel one (INSUFFICIENT_MAX_FEE_PER_GAS). Make TxTask.GasPool() hand out a fresh per-invocation pool sized to the block gas limit when no pool has been set. Workers still can't share the shared block pool (that one is consumed in the post-execution validation loop), but each worker now has a real pool for in-worker preCheck — CheckBlockGasInclusion fires for "tx alone exceeds the block limit" exactly as in serial. Pool depletion across multiple txs in the same block stays caught by the validation loop's CheckBlockGasInclusion against the shared pool. Serial is untouched: ResetGasPool sets t.gasPool before execution, the new lazy path is skipped.

…lock-validation wins The parallel apply loop pulls block results and per-block commitment results from two channels concurrently. Either side can produce an error: a blockValidator (post-execution receipts/gas/bloom checks) returns ErrInvalidBlock from the applyResults branch; the commitment calculator returns ErrWrongTrieRoot via rootResults. With both running in parallel, the apply-loop select picks whichever fires first, so a trie-root mismatch can win the race against a more specific block-validation failure for the same (or a later) block. Serial processes validation strictly before commitment, so its error ordering is deterministic and aligns with eest's validation-error taxonomy (the test expects the specific block-level error, not the downstream trie consequence). Stash ErrWrongTrieRoot in deferredRootErr and keep draining. Surface it only after applyResults closes (post-drain, post-missing-blocks check) so any block-validation error that fires meanwhile returns first. Other commitment errors (non-trie-root) stay fast-fail. The rotation in handleIncorrectRootHashError still runs at the moment the calculator detects the mismatch — only the *error* is deferred, not the side effect. Precaution rather than a known failure cause; it just enforces an invariant that's currently implicit in the parallel-vs-serial contract.

…as ErrInvalidBlock For txs whose TxTask.Reset rejects the message (e.g. an EIP-7702 SET_CODE tx with an empty authorization list — eest's fork_Prague test_empty_authorization_list), the worker returns a TxResult whose Err is the raw decoding/validation error, not a wrapped ErrExecAbortError. nextResult's "non-ErrExecAbortError" branch was returning (nil, err), which exits the exec loop silently — no blockResult ever reaches the apply loop. The apply-channel-closed branch then sees blks=0 and fabricates ErrLoopExhausted, which the stage loop reports as "unexpected state step has more work" and the engine API mis-reports as a state-machine error instead of the block-validation failure. Serial, by contrast, wraps the same raw err with rules.ErrInvalidBlock and rejects the block cleanly. Match serial: emit the failure through blockResult.Err (the apply loop's canonical block-validity error path). The apply loop sees applyResult.Err, marks the block applied so the completeness check is satisfied, and returns ErrInvalidBlock with the original err preserved in the message. Engine API now reports the real reason.

…ad of returning 'previously known bad block' When an earlier ReportBadHeader populates the badHeaders cache for a hash (or one of its ancestors) without a validation message — the err pointer was nil at the call site — every subsequent newPayload for that hash hit the cache, fell into the `cachedErr == ""` branch, and returned the generic "previously known bad block" string. The cache write at the same site then overwrote any future re-derivation with that same fallback, permanently degrading the entry. eest's ErigonExceptionMapper buckets the response by the validation error string — "previously known bad block" maps to no category, so the test reports "Undefined exception message" even when the underlying block has a deterministic, specific rejection reason (INVALID_BLOCK_ACCESS_LIST, INITCODE_SIZE_EXCEEDED, etc.). Issues #21363 + #21364 Mode A. On a cache hit with empty cachedErr, fall through to the normal validation pipeline so the rejection category is re-derived. The proper string then gets re-cached via the ReportBadHeader at line 1017 on the BadBlock path. Cache hits with a real cached message still take the fast path unchanged.

…ectChange journal.go defined two journal entry types for the same logical "createObject" event with non-symmetric dirtied() returns: func (ch createObjectChange) dirtied() (accounts.Address, bool) { return ch.account, true } func (ch resetObjectChange) dirtied() (accounts.Address, bool) { return accounts.NilAddress, false } createObject in intra_block_state.go picks between them on `previous == nil` — first-creation goes through createObjectChange, recreation (e.g. SD-revival via GetOrNewStateObject) goes through resetObjectChange. Both represent the same operation ("a stateObject was placed at this address"); they differ only in revert behaviour. The asymmetry bites parallel-exec when tx1 selfdestructs an address and tx2 hits CreateAccount or GetOrNewStateObject on the same address: * Serial: tx1's writer already DeleteAccount'd the addr, so getStateObject returns nil → createObject(addr, nil) appends createObjectChange → marks journal.dirties. * Parallel: versionedRead returns the contract's base-state account and reads tx1's SelfDestructPath=true; createObject synthesises a non-nil previous with selfdestructed=true → appends resetObjectChange → with the old return, does NOT mark journal.dirties. At MakeWriteSet the worker IBS computes isDirty from journal.dirties. With no mark, updateAccount falls through both DELETE and UPDATE branches → LightCollector.UpdateAccountData never fires → result.CollectorWrites is missing the empty-account write (test_double_kill / EEST EIP-6780 family on the parallel shard). Symmetric tracking restores the dirty mark for the recreate path without changing first-create behaviour. Verified on TestEngineApiBAL*, TestEIP7708BurnLog*, TestDeleteRecreate* under EXEC3_PARALLEL=true. ## Known regression — see #21217 TestSelfDestructReceive fails under EXEC3_PARALLEL=true after this change with a wrong-trie-root for block 1. The validator's stateObject reconstruction for an SD-then-revived address emits different field values (`nonce=1, codehash=emptyHash`) than the unfixed canonical (`nonce=0, codehash=<nil>`). The empty-touch / CreateAccount paths the fix addresses don't have this issue; the AddBalance(non-zero) on an SD'd address does. Filed as #21217 with full repro and the two failed narrow-fix attempts (unconditional symmetry; conditional SD-revival + SelfDestructPath=false re-emit). Lands as the last commit in this stack so the broader structural direction is visible together; the TestSelfDestructReceive regression is the explicit "more work needed" marker before final merge. Stacks on #21212.

anacrolix · 2026-05-25T04:43:53Z

It might worthwhile to only run serial in the merge queue. Parallel should be the more useful of the tests, just breaking serial is unlikely to happen on a PR. This will keep the highest value, frequently run jobs on PR, and keep the longer, meatier, less fragile ones as the final guardian.

taratorio · 2026-05-25T04:57:38Z

It might worthwhile to only run serial in the merge queue. Parallel should be the more useful of the tests, just breaking serial is unlikely to happen on a PR. This will keep the highest value, frequently run jobs on PR, and keep the longer, meatier, less fragile ones as the final guardian.

my view is that everything should be run on the PR, otherwise you find out about certain issues in the merge queue only which is quite late in the process and slows down development work. even hive-eest can be now run on every PR since they take <30mins

anacrolix · 2026-05-25T05:04:13Z

It might worthwhile to only run serial in the merge queue. Parallel should be the more useful of the tests, just breaking serial is unlikely to happen on a PR. This will keep the highest value, frequently run jobs on PR, and keep the longer, meatier, less fragile ones as the final guardian.

my view is that everything should be run on the PR, otherwise you find out about certain issues in the merge queue only which is quite late in the process and slows down development work. even hive-eest can be now run on every PR since they take <30mins

That's not unreasonable. Take a look at the diagram in https://github.com/erigontech/erigon/blob/main/CI-GUIDELINES.md. In particular, the recommendation is to run things in the merge queue CI if they're tests that can be easily reproduced locally. I think the workflows that have the serial/parallel matrix here are all easily reproduced, so doing the final screening only in the merge queue will reduce CI runner load. 🤷🏻

taratorio · 2026-05-25T05:10:17Z

It might worthwhile to only run serial in the merge queue. Parallel should be the more useful of the tests, just breaking serial is unlikely to happen on a PR. This will keep the highest value, frequently run jobs on PR, and keep the longer, meatier, less fragile ones as the final guardian.

my view is that everything should be run on the PR, otherwise you find out about certain issues in the merge queue only which is quite late in the process and slows down development work. even hive-eest can be now run on every PR since they take <30mins

That's not unreasonable. Take a look at the diagram in https://github.com/erigontech/erigon/blob/main/CI-GUIDELINES.md. In particular, the recommendation is to run things in the merge queue CI if they're tests that can be easily reproduced locally. I think the workflows that have the serial/parallel matrix here are all easily reproduced, so doing the final screening only in the merge queue will reduce CI runner load. 🤷🏻

Yes, but the problem is you would not see the "red CI" until someone reviews your PR and puts it in the merge queue instead of seeing the "red CI" as soon as you open a PR and the checks run automatically.

…rom TxOut, not CollectorWrites Two related bugs caused from-0 parallel re-exec to diverge from canonical mainnet on early-Frontier blocks where the miner is also a tx sender (e.g. block 200606 and block 218957): 1. finalizeTxSimple's coinbase/burnt override loop scanned result.CollectorWrites when it should have scanned result.TxOut. CollectorWrites is the IBS's "net change" set — for sender == coinbase Frontier self-sends the per-tx net balance change is zero in the IBS journal so the Balance entry is suppressed. But the worker, running with shouldDelayFeeCalc=true, still debited gas-used from the coinbase at execution time and that debit lives in result.TxOut. Scanning CollectorWrites meant the finalize missed the worker's debit and added the fee tip onto the pre-tx versionMap value — over-crediting the coinbase by one tip per such tx. Canonical mainnet block 218957 vs parallel: diverge by exactly 1.05e15 wei = one tip. 2. SetAccountBalanceOrDelete re-emitted the full account (Balance + Nonce + Incarnation + CodeHash) from a pre-block snapshot whenever no BalancePath entry existed in the writeset. If the worker had already written a different field (e.g. Nonce bump on a miner self-send), the trailing pre-block Nonce from the full-account emission would land after the worker's bumped Nonce and win under last-wins downstream merge — the trie would see the pre-bump nonce. Both surface on the same class of tx (sender == coinbase) and together produced the wrong trie root the from-0 parallel matrix has been catching intermittently. Verified locally: re-exec 0..300000 in parallel mode now matches serial against mainnet snapshots-only datadir. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…nder==coinbase/burnt only The previous commit's TxOut scan was too broad: it picked up ANY coinbase BalancePath entry in result.TxOut as the worker's post-execution value. TxOut can contain artifact entries from non-execution paths (e.g. SELFDESTRUCT bookkeeping that touches the zero address when the test coinbase happens to be the zero address), so the broad scan misattributed those artifact values as the worker's coinbase debit and produced wrong finalize results. Surfaced as TestSelfDestructReceive failing under EXEC3_PARALLEL=true. The actual class of write that the previous commit was after — the worker's gas-pre-pay debit when sender == coinbase — only happens when sender == coinbase. Gate the TxOut scan on exactly that condition; fall back to the original CollectorWrites scan for all other cases (where CollectorWrites' net-change view is correct because the worker's coinbase write was an explicit balance transfer, not the suppressed-net-zero miner-self-send shape). Same source-selection rule applied symmetrically to the burnt contract override (use TxOut only when sender == burntAddr). Verified local: - TestSelfDestructReceive (the regression) passes, with and without -race - from-0 parallel re-exec to block 300000 passes (still covers 218957's miner-self-send tx[1] via the sender==coinbase branch) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

claude · 2026-05-25T18:24:33Z

Code review finding 1 of 4 — test-hive-eest.yml line 118

The glamsterdam-devnet shard only has exec_mode: parallel; there is no exec_mode: serial counterpart. The matrix comment says each shard runs twice (once serial, once parallel), but this shard only runs once (parallel only). Serial vs parallel divergence for the BAL EIP path (7708, 7778, 7843, 7928, 7954, 7976, 7981, 8024, 8037) is not caught by the serial leg, defeating the stated goal of this PR for this shard.

A serial entry with the same pool-size, fixtures-tarball, extra-hive-flags, erigon-extra-flags, and max-failures should be added before the existing parallel entry.

claude · 2026-05-25T18:24:44Z

Code review finding 2 of 4 — execution/state/journal.go lines 234-240

The comment in resetObjectChange.dirtied() explicitly documents: 'Known regression at the time of this change: TestSelfDestructReceive fails under EXEC3_PARALLEL=true with a wrong-trie-root'. This PR adds ERIGON_EXEC3_PARALLEL=true CI legs to test-all-erigon.yml, test-all-erigon-race.yml, and others. TestSelfDestructReceive lives in execution/tests/statedb_chain_test.go, has no skip guard, and will run under those legs.

The test will fail because dbg.Exec3Parallel=true (from ERIGON_EXEC3_PARALLEL=true) causes stage_execute.go line 395 to pick the parallel path. The fix in this file correctly addresses the MakeWriteSet drop, but the comment acknowledges a narrower follow-up is needed (tracked in #21217) to handle the empty-touch / CreateAccount case without changing the AddBalance writeset.

Until #21217 is resolved, the parallel legs of test-all-erigon and test-all-erigon-race will fail on TestSelfDestructReceive. Per CLAUDE.md test-skip rules, automated agents cannot add a t.Skip here; the fix must land in this PR or the parallel CI leg must be held until the fix is ready.

claude · 2026-05-25T18:24:51Z

Code review finding 3 of 4 — qa-stage-exec.yml line 146

The upload_test_results.py invocation uses --test_name matrix.test_name which does not include exec_mode. Both serial and parallel matrix entries for the same mode_name (e.g., resume-nonchaintip) call upload_test_results.py with --test_name stage_exec_resume_nonchaintip. The second entry to finish overwrites the first entry's result in the QA system, so one exec-mode outcome is silently lost.

Compare: qa-rpc-integration-tests-latest.yml appends -parallel to the test name for the parallel leg (line 263 of that workflow). qa-stage-exec.yml should do the same: --test_name matrix.test_name appended with something like (matrix.exec_mode == 'parallel' ? '-parallel' : '').

claude · 2026-05-25T18:25:05Z

Code review finding 4 of 4 — execution/engineapi/engine_server.go lines 574-591

When bad=true and cachedErr is empty, the new code logs a debug message and falls through to re-validation, but never removes or updates the badHeaders cache entry. The LRU (96 slots) has no Remove/Delete API — the only way to update an entry is ReportBadHeader which overwrites it.

On the re-validation happy path (block turns out to be VALID), ReportBadHeader is not called, so the stale bad=true/cachedErr='' entry persists indefinitely (until the LRU evicts it at slot 97). Every subsequent newPayload or FCU for that hash enters the same code path and incurs a full-validation round, because IsBadHeader returns bad=true on each call.

The comment says 'Drop the useless entry and fall through to re-validation' but nothing in the code drops it. Either (a) ReportBadHeader/the LRU needs a Remove method so the entry can be evicted here, or (b) if re-validation succeeds (returns ValidStatus), mark the entry as clean. As written the block will be re-validated on every newPayload call for that hash until slot 97 pushes it out.

mh0lt requested review from AskAlexSharov and mriccobene as code owners May 6, 2026 17:39

mh0lt requested a review from yperbasis May 6, 2026 17:41

mh0lt mentioned this pull request May 7, 2026

state, stagedsync: parallel-exec wrong-root fix for SD vs EIP-161 emptyRemoval #21032

Merged

5 tasks

mriccobene reviewed May 7, 2026

View reviewed changes

mh0lt marked this pull request as draft May 7, 2026 11:14

mh0lt marked this pull request as ready for review May 7, 2026 11:16

mh0lt mentioned this pull request May 7, 2026

execution/stagedsync: don't flag maxBlockNum on partial-batch apply-loop exit #21039

Merged

5 tasks

mriccobene approved these changes May 7, 2026

View reviewed changes

mh0lt mentioned this pull request May 7, 2026

stagedsync: fix commitmentCalculator asOfReader.txNum=0 lazy-load on snapshot-loaded chains #21010

Merged

3 tasks

mh0lt mentioned this pull request May 10, 2026

execution/stagedsync, db/state: parallel-commitment correctness for reorg/unwind + SD recreate #21088

Merged

5 tasks

mh0lt force-pushed the mh/ci-exec-mode-matrix branch from a4e5dc5 to 9c125aa Compare May 11, 2026 16:13

This was referenced May 12, 2026

Parallel-exec: residual functional failures skipped to land the exec-mode CI matrix #21136

Open

Parallel-exec: route the per-tx writeset through one faithful path; remove normalizeWriteSet/calcState; carry serial-finalize signals on ExecutionResult #21138

Open

taratorio and others added 5 commits May 23, 2026 00:04

Merge branch 'main' of github.com:erigontech/erigon into mh/ci-exec-m…

f8c9b15

…ode-matrix

execution/stagedsync: revert warmup-cache parallel gate from this branch

f02c5dd

Reverts b9f6208. This branch is the CI-matrix config and should carry only .github/ changes; the parallel-exec warmup-cache fix moves to a dedicated Go-fix PR.

Revert "execution/stagedsync: revert warmup-cache parallel gate from …

f4071e7

…this branch" This reverts commit f02c5dd.

mh0lt mentioned this pull request May 22, 2026

execution/stagedsync: tests for finalize coinbase/burnt fee write invalidation #21373

Merged

3 tasks

mh0lt and others added 5 commits May 23, 2026 06:20

Merge branch 'main' into mh/ci-exec-mode-matrix

c26cd6e

This was referenced May 23, 2026

hive eest: flaky EIP-7928 test_bal_invalid_extraneous_entries on glamsterdam-devnet shard #21364

Open

hive eest: flaky EIP-7954 test_max_initcode_size on glamsterdam-devnet shard #21363

Closed

This was referenced May 23, 2026

execution/state: mirror createObjectChange dirty-tracking on resetObjectChange (#21138) #21220

Closed

State Cache Consolidation (PR #1 of the perf stack) #21380

Open

mh0lt and others added 2 commits May 25, 2026 16:29

yperbasis mentioned this pull request May 29, 2026

ci: restore discv5 to devp2p Hive shard #21512

Merged

Conversation

mh0lt commented May 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Affected workflows

Plumbing details

Review follow-ups (2026-05-15 rebuild)

Test plan

Uh oh!

mriccobene May 7, 2026

Choose a reason for hiding this comment

Uh oh!

mh0lt May 7, 2026

Choose a reason for hiding this comment

Uh oh!

mriccobene May 7, 2026

Choose a reason for hiding this comment

Uh oh!

mh0lt May 7, 2026

Choose a reason for hiding this comment

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 7, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 8, 2026

Uh oh!

mh0lt commented May 10, 2026

Uh oh!

mh0lt commented May 11, 2026

Uh oh!

mh0lt commented May 11, 2026

Uh oh!

anacrolix commented May 25, 2026

Uh oh!

taratorio commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anacrolix commented May 25, 2026

Uh oh!

taratorio commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented May 25, 2026

Uh oh!

claude Bot commented May 25, 2026

Uh oh!

claude Bot commented May 25, 2026

Uh oh!

claude Bot commented May 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

mh0lt commented May 6, 2026 •

edited

Loading

taratorio commented May 25, 2026 •

edited

Loading

taratorio commented May 25, 2026 •

edited

Loading