execution/stagedsync: fix parallel-exec BAL race from duplicate coinbase BalancePath write in finalize#21177
Merged
Merged
Conversation
…fee coinbase BalancePath from finalize allWrites finalizeTxSimple builds allWrites = copy(result.TxOut) + append(post-fee coinbase BalancePath). When sender == coinbase the worker's TxOut already contains a (coinbase, BalancePath, nil-key) entry holding the pre-fee balance, and the append adds a second entry at the same key holding the post-fee balance. ibs.ApplyVersionedWrites (called for the EIP-7708 post-apply IBS reconstruction) runs sort.Slice — which is not stable — so the two equal-key entries end up in random order; MergeVersionedWrites in nextResult then iterates the slice and the LAST-visited entry overwrites the earlier one in the merged WriteSet. RecordWrites stores that merged set into be.blockIO and the validator BAL accumulator reads it back at accessIndex = txIndex+1. Whichever of the two entries the unstable sort placed second wins, so the computed BAL coinbase BalancePath flips between pre-fee and post-fee per run. This race surfaced as flaky failures in TestEngineApiBALMultiTxBlock / TestEngineApiBALParallelConsistencyStress / TestEngineApiBALSelfDestruct on the bal-devnet-2 deploy block (sender == genesis coinbase), where the computed BAL coinbase value landed exactly FeeTipped short of the proposer-stored value. Adding a fmt.Printf anywhere in the merge path masked the race by re-ordering the unstable-sort output. Skip the (sender, BalancePath) entry copy from TxOut when sender matches coinbase or burntAddr — the finalize append below is the authoritative post-fee value, and the post-apply IBS reconstruction applies BalancePath via SetBalance which is idempotent w.r.t. the worker's pre-fee write either way. After the fix the failing test passes 10/10 (was 4/10) and the engineapi BAL suite passes 8/8.
…edRead versionedStateReader.ReadAccountData already handles "self-destructed then revived" semantics (versionedio.go:273-293): when versionMap shows SelfDestructPath=true at depIdx, but BalancePath/NoncePath/CodeHashPath has a write at TxIndex > depIdx, the account was re-created and the SD must not short-circuit the read. versionedRead — the generic worker-side read path used by ibs.GetBalance, GetNonce, GetCodeHash etc. — was missing the same check. Workers got the post-revival reads wrong, returning zero for any field on an SD-flagged address regardless of a later revival write. This surfaces under EIP-161 + delayed-fee parallel exec: finalizeTxSimple emits SelfDestructPath=true for the coinbase when a tx with FeeTipped=0 touches it (empty-removal); a later tx whose FeeTipped > 0 writes BalancePath at the higher TxIndex. The serial equivalent is AddBalance(amount>0) on a previously-deleted stateObject, where GetOrNewStateObject implicitly recreates a fresh object — the next tx that does BALANCE(coinbase) sees the post-revival balance. Under parallel exec, the worker's versionedRead short-circuited on the SD entry without consulting the revival writes; the BALANCE returned zero, the contract SSTORE'd zero into a previously-non-zero slot, and EIP-2200's SSTORE_CLEAR_REFUND (4800 gas) fired spuriously — TestLegacyBlockchain/ValidBlocks/bcEIP3675/tipInsideBlock failing with gas exactly 4800 short under ERIGON_EXEC3_PARALLEL=true. The fix mirrors ReadAccountData's revival check (LatestTxIndex on BalancePath, NoncePath, CodeHashPath > destructTxIndex). When any of those have a later write, fall through to the normal versionMap.Read of the requested path — making the worker observe the revived state exactly as serial does. Verified: - bcEIP3675/tipInsideBlock 5/5 parallel (was 0/5) - TestEngineApiBAL* 8/8 parallel (no regression) - SD-family: TestDeleteRecreate*, TestSelfDestruct*, TestEIP161*, TestCVE2020_26265, TestEIP7708 all pass parallel - Full TestLegacyBlockchain pass parallel - make lint clean
taratorio
approved these changes
May 14, 2026
mh0lt
added a commit
that referenced
this pull request
May 14, 2026
Addresses taratorio's review comments on PR #21153 — "these tests should not be skipped" / "this one seems to be the last Skip pending removal". (1) execution/tests/block_test.go: drop the SkipLoad of ValidBlocks/bcEIP3675/tipInsideBlock.json under dbg.Exec3Parallel. The 4800-gas mismatch was the EIP-161 coinbase empty-removal + delayed-fee race: tx 0's finalize emits SelfDestructPath=true for the coinbase (empty-removal: FeeTipped=0 + coinbaseEmptyPre=true); tx 1's finalize then credits FeeTipped > 0, reviving the account; tx 2's worker reads coinbase BALANCE and versionedRead's SD-check branch short-circuits to zero without consulting the later revival write — so the contract SSTOREs 0 over a previously-non-zero slot and EIP-2200's SSTORE_CLEAR_REFUND (4800) fires spuriously. Resolved by the prior commit (cherry-picked from #21177) which mirrors ReadAccountData's existing revival check into versionedRead — the fixture now passes 5/5 under ERIGON_EXEC3_PARALLEL=true. The bug described in #21136 (tipInsideBlock parallel-exec gas mismatch) is fixed; the issue can be closed when this lands. Also drops the unused dbg import. (2) rpc/jsonrpc/gen_traces_test.go: drop the dbg.Exec3Parallel-gated t.Skip in TestGeneratedTraceApiCollision. The skip referenced "intra-block SELFDESTRUCT + CREATE2 reincarnation not yet supported by parallel executor — fixed on exec3/remove-rwtx- threading branch". Verified the test passes 8/8 under ERIGON_EXEC3_PARALLEL=true and 3/3 serial on the current branch — the capability has landed somewhere upstream of this PR's base. The trace output for the collision case matches the expected golden under both execution modes. Also drops the unused dbg import. The remaining unconditional t.Skip in execution/commitment/hex_patricia_hashed_test.go:3366 (Test_ModeUpdate_SiblingConsistency, #20961) is intentionally left as a regression marker — confirmed the underlying ModeUpdate vs ModeDirect sibling-encoding divergence still reproduces; the fix is in commitment cell-cache invalidation, out of scope for the parallel-exec PR stack.
6 tasks
pull Bot
pushed a commit
to Dustin4444/erigon
that referenced
this pull request
May 21, 2026
…rgs) (erigontech#21211) ## Summary First incremental cut toward [erigontech#21138](erigontech#21138 structural goal: **one finalize function per parallel-exec result, with `IntraBlockState` used nowhere outside workers**. This PR removes two finalize variants that are already unreachable from production: | Function | LOC | Production callers on main | |---|---|---| | `finalizeWithIBS` (full IBS reconstruction, BAL-compat path) | ~120 | 0 | | `finalizeTx` (delta-args variant, direct fee-balance path) | ~250 | 0 (only `TestFinalizeTx_AllScenarios`) | Plus the test suite that exclusively exercised the delta-args path (`TestFinalizeTx_*`, fixture builders `coinbaseIsRecipientScenario` / `selfTransferScenario`, helpers `hasCoinbaseDelta` / `adjustForTransferDelta` / `buildWriteMap` / `fmtWriteVal` / `extractBalanceReads`) and one stale comment in `engine_api_bal_test.go`. Net: **-690 lines**, **+1 line**, no semantic change. ## Why now The parallel-exec correctness stack landed in erigontech#21153 (merged 2026-05-15). The combined effect of that PR plus erigontech#21177 routed all production finalize flows through `finalizeTxSimple` — these two functions became unreachable. Removing them shrinks `exec3_parallel.go` from 3640 → 3268 lines, making subsequent IBS-dependency drains easier to review. The next steps in the erigontech#21138 sequence: - **PR 2** — drain IBS dependency #1 (SD address lookup): `LogSelfDestructedAccounts` consumes `result.SelfDestructedWithBalance` only, no `ibs.GetRemovedAccountsWithBalance()` call. - **PR 3** — drain IBS deps #2 (`AddLog` → return logs) and #3 (`AddBalance` bookkeeping → already on `CollectorWrites`); `finalizeTxSimple` becomes IBS-free. - Later — `normalizeWriteSet` → `filterWritesByVersionMap`; `calcState.ApplyWrites` → `VersionedWrites.TouchUpdates`; move EIP-7002/7251 syscall execution into the worker pool. End state: one `finalizeTx`, no IBS outside workers. ## Test plan - [x] `make lint` clean - [x] `make test-short` (full `execution/stagedsync`, `execution/state`, `execution/tests`, `rpc/jsonrpc` packages) green under `EXEC3_PARALLEL=true` - [x] BAL family (`TestEngineApiBAL*`) 8/8 parallel - [x] `TestEIP7708BurnLogWhenCoinbaseSelfDestructs` green - [x] Surviving `TestFinalizeTxSimple_*` family green - [ ] CI: race-tests, kurtosis, hive matrix legs green on both serial and parallel ## Related - erigontech#21017 — serial/parallel CI matrix that surfaces parallel-leg failures (now rebuilt on post-erigontech#21153 main; CI fresh-running) - erigontech#21153 — parallel-exec correctness stack (merged) - erigontech#21138 — heuristic-removal / IBS-dependency-removal tracker (the parent)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a flaky parallel-exec BAL hash mismatch when sender == coinbase (the
genesis-coinbase deploy block case, surfaced by
TestEngineApiBALMultiTxBlock,TestEngineApiBALParallelConsistencyStress,TestEngineApiBALSelfDestruct).finalizeTxSimplewas buildingallWrites = copy(result.TxOut) + append(post-fee coinbase BalancePath). When the sender of a tx happens to be the block'scoinbase,
result.TxOutalready contains a(coinbase, BalancePath, key=nil)entry holding the pre-fee sender balance (from the worker, which runs with
shouldDelayFeeCalc=true), and the append adds a second entry at the same(Address, Path, Key)holding the post-fee value.The post-apply IBS reconstruction calls
ibs.ApplyVersionedWrites(allWrites)which runs
sort.Slice— and that sort is not stable. With two equal-keyentries the sort orders them arbitrarily, so by the time the caller in
nextResultrunsmerged := MergeVersionedWrites(existingWrites, addWrites),the iteration order through
addWritesflips between picking pre-fee andpost-fee as the last-visited entry for the (coinbase, BalancePath) key in the
merged
WriteSet.RecordWritesstores that intobe.blockIO, the validatorBAL accumulator reads it back at
accessIndex = txIndex+1, and the computedBAL coinbase value lands
FeeTippedshort of the proposer-stored value onsome fraction of runs.
This was a real Heisenbug — adding any
fmt.Printfnear the merge pathre-ordered the sort enough to mask the race.
Fix
Skip the worker's
(sender, BalancePath)entries fromresult.TxOutwhensender matches
result.CoinbaseorburntAddrwhile copying intoallWrites.The finalize append below is the authoritative post-fee value; the
post-apply IBS reconstruction applies BalancePath via
SetBalancewhich isidempotent w.r.t. the worker's pre-fee write.
Test plan
TestEngineApiBALMultiTxBlock— 10/10 underERIGON_EXEC3_PARALLEL=true -tags=experimentalBAL(was 4/8 without fix on the same machine)TestEngineApiBAL*full suite — 8/8 in parallel modemake lintclean