execution: fix sporadic bal mismatches due to phantom accesses in system txns #21654
Merged
mh0lt approved these changes Jun 6, 2026
mh0lt added a commit that referenced this pull request Jun 12, 2026

Brings origin/main (through the latest tip) into the typed-vio branch (#21536).

Resolution:
- committer.go: dropped the BAL-ahead-fold (foldedAhead/maybeFoldAhead/foldBlockFromBAL/shadowCrossCheck); it was introduced inside the typed-vio commit and is not on main. Took main's committer.go (the #21659 changeset-window: perBlockFrom/computeTransition) and removed the matching wiring from exec3.go / exec3_parallel.go (blockRequest channel + type, the unused calcMode enum).
- execution/state: kept the typed per-path versioned reads/writes; applied main's #21667 (accept Dependency/Estimate cells), #21590 (SD-revival), #21659 (per-batch changesets), #21654 (phantom accesses). The storage net-zero read-value filter is retained in updateWrite, with new guard tests.

Verified: build, make lint, execution/{state,exec,stagedsync,commitment} unit tests, and eest-spec parallel blocktests (devnet 82941/0, stable 69256/0).
Fixes #21651 — non-deterministic parallel-exec BAL computation rejecting canonical blocks as invalid and wedging the CL.
Root cause
BAL address-access recording is started by `ibs.Prepare` (which allocates a fresh `addressAccess` map and sets `recordAccess = true`) and ended by the `ibs.AccessedAddresses()` harvest at the end of `TxTask.Execute`. The bug is that only regular user transactions call `Prepare` (reached via `ApplyMessage` → `execution/protocol/txn_executor.go:423/590`), while system tasks don't, yet system tasks still perform the harvest:

- block-init (`TxIndex == -1`) runs `engine.Initialize` + syscalls via `SysCallContract`: no `ApplyMessage`, no `Prepare`;
- block-end (`IsBlockEnd()`) executes no EVM code at all.

Both end in `TxTask.Execute`'s success-path harvest (`result.AccessedAddresses = ibs.AccessedAddresses()`, `execution/exec/txtask.go:624`). On a clean worker IBS that harvest returns nil: `recordAccess` is false, so even the Initialize syscalls record nothing, which is the spec-correct behavior (system-call touches don't belong in the BAL). The system tasks therefore implicitly relied on the worker IBS being pristine.

That assumption broke in combination with two other facts:

- `TxTask.Execute` skips the harvest when `result.Err != nil` (`execution/exec/txtask.go:618`), so the aborted tx's touched-address set and `recordAccess = true` survive on the worker's long-lived IBS. `IntraBlockState.Reset()`, called before every task, cleared all other per-task state (versioned reads/writes, journal, EIP-2929 access list, transient storage) but not `addressAccess`/`recordAccess`. Regular txs were still immune because their `Prepare` overwrites the residue; system tasks were not.
- Workers take tasks from a shared queue (the `pe.in` queue), so the system task that lands on the residue-laden worker usually belongs to a different block than the aborted tx. It scoops the residue as its own accesses, and `nextResult` records them into its block's `blockIO` (`execution/stagedsync/exec3_parallel.go:2301`).

`AsBlockAccessList` then creates an entry for every accessed address (`execution/state/versionedio.go:1417`) → phantom empty, address-only entries in the computed BAL. Which worker holds residue and which block's system task lands on it is schedule-dependent → the computed BAL hash differs across re-execution attempts of the same block. This explains all field symptoms: extra empty accounts, precompile-range extras, extras belonging to neighboring blocks' canonical BALs, and computed BALs with more accounts than the stored sidecar.

Reproduction (before the fix)
Local glamsterdam-devnet-5, erigon @ `1ca634d4` (`ERIGON_EXEC3_PARALLEL=true`) + prysm `glamsterdam-devnet-5`, full resync from genesis. During catch-up batch execution the node rejected canonical block 6638: `0xd4093C4A57D6F952849D6bf47e9a1F6CDCa79b7a` is address-only (no reads, no changes), absent from 6638's canonical BAL, and legitimately active in a dozen neighboring blocks of the same batch (6600–6619).

Same signature in 9 earlier mismatch dumps (blocks 1405–1430): every extra address was an empty entry appearing canonically in other blocks of the same batch.
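The leak behind this signature can be modelled in a few lines of Go. This is a toy sketch, not erigon code: `toyState`, `Touch`, `resetOld`, `runAbortedUserTx`, `runSystemTask` and the placeholder addresses are hypothetical, and only the names `Prepare`, `AccessedAddresses`, `recordAccess` and `addressAccess` mirror the description above. It assumes the harvest is the only place that ends recording, and that the pre-fix reset leaves the recording fields alone.

```go
package main

import (
	"fmt"
	"sort"
)

// toyState is a hypothetical stand-in for a worker's long-lived state object.
type toyState struct {
	recordAccess  bool
	addressAccess map[string]struct{}
}

// Prepare starts recording with a fresh map (only user transactions call it).
func (s *toyState) Prepare() {
	s.addressAccess = make(map[string]struct{})
	s.recordAccess = true
}

// Touch records an address only while recording is switched on.
func (s *toyState) Touch(addr string) {
	if s.recordAccess {
		s.addressAccess[addr] = struct{}{}
	}
}

// AccessedAddresses harvests the recorded set and ends recording.
func (s *toyState) AccessedAddresses() []string {
	out := make([]string, 0, len(s.addressAccess))
	for a := range s.addressAccess {
		out = append(out, a)
	}
	sort.Strings(out)
	s.addressAccess = nil
	s.recordAccess = false
	return out
}

// resetOld models the pre-fix reset: the recording fields are not cleared.
func (s *toyState) resetOld() {}

// runAbortedUserTx models a user tx that fails: the harvest is skipped.
func runAbortedUserTx(s *toyState, addrs ...string) {
	s.resetOld()
	s.Prepare()
	for _, a := range addrs {
		s.Touch(a)
	}
	// Error path: no AccessedAddresses() call, so the residue stays behind.
}

// runSystemTask models a Prepare-less task that still harvests on success.
func runSystemTask(s *toyState, touches ...string) []string {
	s.resetOld()
	for _, a := range touches {
		s.Touch(a)
	}
	return s.AccessedAddresses()
}

func main() {
	clean := &toyState{}
	fmt.Println(runSystemTask(clean, "0xsys")) // prints []

	dirty := &toyState{}
	runAbortedUserTx(dirty, "0xaaa") // aborted tx from one block
	fmt.Println(runSystemTask(dirty)) // prints [0xaaa]: a phantom entry in another block's task
}
```

In this model the clean worker harvests nothing even though the system task touches an address, while the worker that previously ran the aborted transaction hands that transaction's address to a system task that never touched it.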
Fix
Clear `recordAccess`/`addressAccess` in `Reset()`, i.e. make `Reset()` guarantee for every task what `Prepare` only guaranteed for user transactions, so a task's `AccessedAddresses` can only contain addresses touched during that task's own execution. This also covers any future Prepare-less task type, not just block-init/block-end. Audited the full `IntraBlockState` struct: this was the only accumulating field `Reset()` missed; no caller relies on recording state surviving `Reset()`.

Testing
- `TestAddressAccessResetInIBSReset` written first: red on old code, green with the fix.
- `./execution/state/...` (incl. `-race`) and `./execution/exec/...` pass.
- `TestEngineApiBALParallelConsistencyStress` ×50 with `-race` (the canary for this bug class).
- `make lint` clean (×2).
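The property the fix establishes can be illustrated with a small self-contained check. This is a sketch under assumptions, not the body of `TestAddressAccessResetInIBSReset`: the `recorder` type, `Touch`, `checkResetClearsRecording` and the placeholder addresses are hypothetical, and the real `IntraBlockState` has far more state than the two fields modelled here.

```go
package main

import (
	"errors"
	"fmt"
)

// recorder is a hypothetical model of the two recording fields only.
type recorder struct {
	recordAccess  bool
	addressAccess map[string]struct{}
}

// Prepare starts recording with a fresh map.
func (r *recorder) Prepare() {
	r.addressAccess = make(map[string]struct{})
	r.recordAccess = true
}

// Touch records an address only while recording is switched on.
func (r *recorder) Touch(addr string) {
	if r.recordAccess {
		r.addressAccess[addr] = struct{}{}
	}
}

// AccessedAddresses harvests the recorded set and ends recording.
func (r *recorder) AccessedAddresses() map[string]struct{} {
	out := r.addressAccess
	r.addressAccess = nil
	r.recordAccess = false
	return out
}

// Reset models the fixed behavior: recording state is cleared for every task.
func (r *recorder) Reset() {
	r.recordAccess = false
	r.addressAccess = nil
}

// checkResetClearsRecording replays the failing sequence: an aborted user tx
// (no harvest), then a Prepare-less task on the same object.
func checkResetClearsRecording() error {
	r := &recorder{}
	r.Prepare()
	r.Touch("0xaaa")
	// The tx aborts here, so the harvest never runs.

	r.Reset()
	if r.recordAccess {
		return errors.New("recordAccess survived Reset")
	}
	r.Touch("0xbbb") // a touch without Prepare must not be recorded
	if n := len(r.AccessedAddresses()); n != 0 {
		return fmt.Errorf("harvest returned %d phantom addresses", n)
	}
	return nil
}

func main() {
	if err := checkResetClearsRecording(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok") // prints ok
}
```

With an empty `Reset` body the same check fails at the first assertion, which matches the red-then-green order described above.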