Skip to content

execution/stagedsync: fix parallel-exec BAL race from duplicate coinbase BalancePath write in finalize#21177

Merged
taratorio merged 4 commits into
mainfrom
mh/bal-finalize-duplicate-write
May 14, 2026
Merged

execution/stagedsync: fix parallel-exec BAL race from duplicate coinbase BalancePath write in finalize#21177
taratorio merged 4 commits into
mainfrom
mh/bal-finalize-duplicate-write

Conversation

@mh0lt

@mh0lt mh0lt commented May 13, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a flaky parallel-exec BAL hash mismatch when sender == coinbase (the
genesis-coinbase deploy block case, surfaced by TestEngineApiBALMultiTxBlock,
TestEngineApiBALParallelConsistencyStress, TestEngineApiBALSelfDestruct).

finalizeTxSimple was building allWrites = copy(result.TxOut) + append(post-fee coinbase BalancePath). When the sender of a tx happens to be the block's
coinbase, result.TxOut already contains a (coinbase, BalancePath, key=nil)
entry holding the pre-fee sender balance (from the worker, which runs with
shouldDelayFeeCalc=true), and the append adds a second entry at the same
(Address, Path, Key) holding the post-fee value.

The post-apply IBS reconstruction calls ibs.ApplyVersionedWrites(allWrites)
which runs sort.Slice — and that sort is not stable. With two equal-key
entries the sort orders them arbitrarily, so by the time the caller in
nextResult runs merged := MergeVersionedWrites(existingWrites, addWrites),
the iteration order through addWrites flips between picking pre-fee and
post-fee as the last-visited entry for the (coinbase, BalancePath) key in the
merged WriteSet. RecordWrites stores that into be.blockIO, the validator
BAL accumulator reads it back at accessIndex = txIndex+1, and the computed
BAL coinbase value lands FeeTipped short of the proposer-stored value on
some fraction of runs.

This was a real Heisenbug — adding any fmt.Printf near the merge path
re-ordered the sort enough to mask the race.

Fix

Skip the worker's (sender, BalancePath) entries from result.TxOut when
sender matches result.Coinbase or burntAddr while copying into allWrites.
The finalize append below is the authoritative post-fee value; the
post-apply IBS reconstruction applies BalancePath via SetBalance which is
idempotent w.r.t. the worker's pre-fee write.

Test plan

  • TestEngineApiBALMultiTxBlock — 10/10 under ERIGON_EXEC3_PARALLEL=true -tags=experimentalBAL (was 4/8 without fix on the same machine)
  • TestEngineApiBAL* full suite — 8/8 in parallel mode
  • SD / EIP-7708 / state regression tests pass in parallel mode
  • EngineApi suite passes in serial mode (no regression)
  • make lint clean

…fee coinbase BalancePath from finalize allWrites

finalizeTxSimple builds allWrites = copy(result.TxOut) + append(post-fee
coinbase BalancePath). When sender == coinbase the worker's TxOut
already contains a (coinbase, BalancePath, nil-key) entry holding the
pre-fee balance, and the append adds a second entry at the same key
holding the post-fee balance. ibs.ApplyVersionedWrites (called for the
EIP-7708 post-apply IBS reconstruction) runs sort.Slice — which is not
stable — so the two equal-key entries end up in random order;
MergeVersionedWrites in nextResult then iterates the slice and the
LAST-visited entry overwrites the earlier one in the merged WriteSet.
RecordWrites stores that merged set into be.blockIO and the validator
BAL accumulator reads it back at accessIndex = txIndex+1. Whichever of
the two entries the unstable sort placed second wins, so the computed
BAL coinbase BalancePath flips between pre-fee and post-fee per run.

This race surfaced as flaky failures in TestEngineApiBALMultiTxBlock /
TestEngineApiBALParallelConsistencyStress / TestEngineApiBALSelfDestruct
on the bal-devnet-2 deploy block (sender == genesis coinbase), where
the computed BAL coinbase value landed exactly FeeTipped short of the
proposer-stored value. Adding a fmt.Printf anywhere in the merge path
masked the race by re-ordering the unstable-sort output.

Skip the (sender, BalancePath) entry copy from TxOut when sender
matches coinbase or burntAddr — the finalize append below is the
authoritative post-fee value, and the post-apply IBS reconstruction
applies BalancePath via SetBalance which is idempotent w.r.t. the
worker's pre-fee write either way. After the fix the failing test
passes 10/10 (was 4/10) and the engineapi BAL suite passes 8/8.
@mh0lt mh0lt requested a review from yperbasis as a code owner May 13, 2026 20:27
@mh0lt mh0lt requested a review from taratorio May 13, 2026 20:29
mh0lt and others added 3 commits May 13, 2026 22:27
…edRead

versionedStateReader.ReadAccountData already handles "self-destructed
then revived" semantics (versionedio.go:273-293): when versionMap shows
SelfDestructPath=true at depIdx, but BalancePath/NoncePath/CodeHashPath
has a write at TxIndex > depIdx, the account was re-created and the SD
must not short-circuit the read.

versionedRead — the generic worker-side read path used by ibs.GetBalance,
GetNonce, GetCodeHash etc. — was missing the same check. Workers got the
post-revival reads wrong, returning zero for any field on an SD-flagged
address regardless of a later revival write.

This surfaces under EIP-161 + delayed-fee parallel exec:
finalizeTxSimple emits SelfDestructPath=true for the coinbase when a tx
with FeeTipped=0 touches it (empty-removal); a later tx whose FeeTipped
> 0 writes BalancePath at the higher TxIndex. The serial equivalent is
AddBalance(amount>0) on a previously-deleted stateObject, where
GetOrNewStateObject implicitly recreates a fresh object — the next tx
that does BALANCE(coinbase) sees the post-revival balance. Under
parallel exec, the worker's versionedRead short-circuited on the SD
entry without consulting the revival writes; the BALANCE returned
zero, the contract SSTORE'd zero into a previously-non-zero slot, and
EIP-2200's SSTORE_CLEAR_REFUND (4800 gas) fired spuriously —
TestLegacyBlockchain/ValidBlocks/bcEIP3675/tipInsideBlock failing with
gas exactly 4800 short under ERIGON_EXEC3_PARALLEL=true.

The fix mirrors ReadAccountData's revival check (LatestTxIndex on
BalancePath, NoncePath, CodeHashPath > destructTxIndex). When any of
those have a later write, fall through to the normal versionMap.Read
of the requested path — making the worker observe the revived state
exactly as serial does.

Verified:
- bcEIP3675/tipInsideBlock 5/5 parallel (was 0/5)
- TestEngineApiBAL* 8/8 parallel (no regression)
- SD-family: TestDeleteRecreate*, TestSelfDestruct*, TestEIP161*,
  TestCVE2020_26265, TestEIP7708 all pass parallel
- Full TestLegacyBlockchain pass parallel
- make lint clean
@taratorio taratorio enabled auto-merge May 14, 2026 09:32
@taratorio taratorio added this pull request to the merge queue May 14, 2026
mh0lt added a commit that referenced this pull request May 14, 2026
Addresses taratorio's review comments on PR #21153 — "these tests should
not be skipped" / "this one seems to be the last Skip pending removal".

(1) execution/tests/block_test.go: drop the SkipLoad of
ValidBlocks/bcEIP3675/tipInsideBlock.json under dbg.Exec3Parallel.

The 4800-gas mismatch was the EIP-161 coinbase empty-removal +
delayed-fee race: tx 0's finalize emits SelfDestructPath=true for the
coinbase (empty-removal: FeeTipped=0 + coinbaseEmptyPre=true); tx 1's
finalize then credits FeeTipped > 0, reviving the account; tx 2's
worker reads coinbase BALANCE and versionedRead's SD-check branch
short-circuits to zero without consulting the later revival write —
so the contract SSTOREs 0 over a previously-non-zero slot and
EIP-2200's SSTORE_CLEAR_REFUND (4800) fires spuriously.

Resolved by the prior commit (cherry-picked from #21177) which mirrors
ReadAccountData's existing revival check into versionedRead — the
fixture now passes 5/5 under ERIGON_EXEC3_PARALLEL=true. The bug
described in #21136 (tipInsideBlock parallel-exec gas mismatch) is
fixed; the issue can be closed when this lands.

Also drops the unused dbg import.

(2) rpc/jsonrpc/gen_traces_test.go: drop the dbg.Exec3Parallel-gated
t.Skip in TestGeneratedTraceApiCollision.

The skip referenced "intra-block SELFDESTRUCT + CREATE2 reincarnation
not yet supported by parallel executor — fixed on exec3/remove-rwtx-
threading branch". Verified the test passes 8/8 under
ERIGON_EXEC3_PARALLEL=true and 3/3 serial on the current branch —
the capability has landed somewhere upstream of this PR's base. The
trace output for the collision case matches the expected golden under
both execution modes.

Also drops the unused dbg import.

The remaining unconditional t.Skip in
execution/commitment/hex_patricia_hashed_test.go:3366
(Test_ModeUpdate_SiblingConsistency, #20961) is intentionally left as
a regression marker — confirmed the underlying ModeUpdate vs ModeDirect
sibling-encoding divergence still reproduces; the fix is in commitment
cell-cache invalidation, out of scope for the parallel-exec PR stack.
Merged via the queue into main with commit 9912fc5 May 14, 2026
68 checks passed
@taratorio taratorio deleted the mh/bal-finalize-duplicate-write branch May 14, 2026 10:49
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 21, 2026
…rgs) (erigontech#21211)

## Summary

First incremental cut toward
[erigontech#21138](erigontech#21138 structural
goal: **one finalize function per parallel-exec result, with
`IntraBlockState` used nowhere outside workers**.

This PR removes two finalize variants that are already unreachable from
production:

| Function | LOC | Production callers on main |
|---|---|---|
| `finalizeWithIBS` (full IBS reconstruction, BAL-compat path) | ~120 |
0 |
| `finalizeTx` (delta-args variant, direct fee-balance path) | ~250 | 0
(only `TestFinalizeTx_AllScenarios`) |

Plus the test suite that exclusively exercised the delta-args path
(`TestFinalizeTx_*`, fixture builders `coinbaseIsRecipientScenario` /
`selfTransferScenario`, helpers `hasCoinbaseDelta` /
`adjustForTransferDelta` / `buildWriteMap` / `fmtWriteVal` /
`extractBalanceReads`) and one stale comment in
`engine_api_bal_test.go`.

Net: **-690 lines**, **+1 line**, no semantic change.

## Why now

The parallel-exec correctness stack landed in erigontech#21153 (merged
2026-05-15). The combined effect of that PR plus erigontech#21177 routed all
production finalize flows through `finalizeTxSimple` — these two
functions became unreachable. Removing them shrinks `exec3_parallel.go`
from 3640 → 3268 lines, making subsequent IBS-dependency drains easier
to review.

The next steps in the erigontech#21138 sequence:

- **PR 2** — drain IBS dependency #1 (SD address lookup):
`LogSelfDestructedAccounts` consumes `result.SelfDestructedWithBalance`
only, no `ibs.GetRemovedAccountsWithBalance()` call.
- **PR 3** — drain IBS deps #2 (`AddLog` → return logs) and #3
(`AddBalance` bookkeeping → already on `CollectorWrites`);
`finalizeTxSimple` becomes IBS-free.
- Later — `normalizeWriteSet` → `filterWritesByVersionMap`;
`calcState.ApplyWrites` → `VersionedWrites.TouchUpdates`; move
EIP-7002/7251 syscall execution into the worker pool.

End state: one `finalizeTx`, no IBS outside workers.

## Test plan

- [x] `make lint` clean
- [x] `make test-short` (full `execution/stagedsync`, `execution/state`,
`execution/tests`, `rpc/jsonrpc` packages) green under
`EXEC3_PARALLEL=true`
- [x] BAL family (`TestEngineApiBAL*`) 8/8 parallel
- [x] `TestEIP7708BurnLogWhenCoinbaseSelfDestructs` green
- [x] Surviving `TestFinalizeTxSimple_*` family green
- [ ] CI: race-tests, kurtosis, hive matrix legs green on both serial
and parallel

## Related

- erigontech#21017 — serial/parallel CI matrix that surfaces parallel-leg failures
(now rebuilt on post-erigontech#21153 main; CI fresh-running)
- erigontech#21153 — parallel-exec correctness stack (merged)
- erigontech#21138 — heuristic-removal / IBS-dependency-removal tracker (the
parent)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants