
execution: fix sporadic bal mismatches due to phantom accesses in system txns #21654

Merged
mh0lt merged 65 commits into main from worktree-bal-mismatches-sporadically
Jun 6, 2026

@taratorio

Member

Fixes #21651 — non-deterministic parallel-exec BAL computation rejecting canonical blocks as invalid and wedging the CL.

Root cause

BAL address-access recording is started by ibs.Prepare (which allocates a fresh addressAccess map and sets recordAccess = true) and ended by the ibs.AccessedAddresses() harvest at the end of TxTask.Execute. The bug is that only regular user transactions call Prepare (reached via ApplyMessage, execution/protocol/txn_executor.go:423/590). System tasks never call it, yet they still perform the harvest:

  • block-init (TxIndex == -1) runs engine.Initialize + syscalls via SysCallContract — no ApplyMessage, no Prepare;
  • block-end (IsBlockEnd()) executes no EVM code at all.

Both end in TxTask.Execute's success-path harvest (result.AccessedAddresses = ibs.AccessedAddresses(), execution/exec/txtask.go:624). On a clean worker IBS that harvest returns nil — recordAccess is false, so even the Initialize syscalls record nothing, which is the spec-correct behavior (system-call touches don't belong in the BAL). The system tasks therefore implicitly relied on the worker IBS being pristine.

That assumption broke in combination with two other facts:

  1. A conflict-aborted incarnation never reaches the harvest — TxTask.Execute skips it when result.Err != nil (execution/exec/txtask.go:618) — so the aborted tx's touched-address set and recordAccess = true survive on the worker's long-lived IBS. IntraBlockState.Reset(), called before every task, cleared all other per-task state (versioned reads/writes, journal, EIP-2929 access list, transient storage) but not addressAccess/recordAccess. Regular txs were still immune because their Prepare overwrites the residue; system tasks were not.
  2. Workers are shared by all in-flight block executors of a batch (single pe.in queue), so the system task that lands on the residue-laden worker usually belongs to a different block than the aborted tx. It scoops the residue as its own accesses, and nextResult records them into its block's blockIO (execution/stagedsync/exec3_parallel.go:2301).
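
The interaction of these two facts can be sketched with a toy model. This is not erigon's code: toyIBS, resetOld, runUserTx and runSystemTask are hypothetical names, and the harvest clearing the map and the flag is an assumption made for illustration.

```go
package main

// toyIBS is a hypothetical, heavily simplified stand-in for IntraBlockState.
type toyIBS struct {
	recordAccess  bool
	addressAccess map[string]struct{}
	journal       []string // stands in for the per-task state that Reset did clear
}

// Prepare starts recording into a fresh map (only user transactions call it).
func (s *toyIBS) Prepare() {
	s.addressAccess = make(map[string]struct{})
	s.recordAccess = true
}

// Touch records an address only while recording is active.
func (s *toyIBS) Touch(addr string) {
	s.journal = append(s.journal, addr)
	if s.recordAccess {
		s.addressAccess[addr] = struct{}{}
	}
}

// AccessedAddresses harvests the recorded set and ends recording
// (assumed behavior for this sketch).
func (s *toyIBS) AccessedAddresses() map[string]struct{} {
	out := s.addressAccess
	s.addressAccess = nil
	s.recordAccess = false
	return out
}

// resetOld models the pre-fix Reset: the recording state is left untouched.
func (s *toyIBS) resetOld() {
	s.journal = s.journal[:0]
}

// runUserTx models a user transaction; an aborted incarnation skips the harvest.
func runUserTx(s *toyIBS, addrs []string, aborted bool) map[string]struct{} {
	s.resetOld()
	s.Prepare()
	for _, a := range addrs {
		s.Touch(a)
	}
	if aborted {
		return nil // residue and recordAccess = true stay on the worker
	}
	return s.AccessedAddresses()
}

// runSystemTask models block-init/block-end: no Prepare, but it still harvests.
func runSystemTask(s *toyIBS, addrs []string) map[string]struct{} {
	s.resetOld()
	for _, a := range addrs {
		s.Touch(a)
	}
	return s.AccessedAddresses()
}
```

In this model a system task on a fresh worker harvests nil, while the same task run right after an aborted user transaction returns that transaction's addresses as if they were its own.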

AsBlockAccessList then creates an entry for every accessed address (execution/state/versionedio.go:1417) → phantom empty, address-only entries in the computed BAL. Which worker holds residue and which block's system task lands on it is schedule-dependent → the computed BAL hash differs across re-execution attempts of the same block. This explains all field symptoms: extra empty accounts, precompile-range extras, extras belonging to neighboring blocks' canonical BALs, and computed BALs with more accounts than the stored sidecar.
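
A minimal sketch of why one phantom entry is enough to fail validation. It assumes a simplified entry layout and uses SHA-256 over a text encoding as a stand-in digest; balEntry, buildBAL and balHash are hypothetical names, and the real BAL encoding and hash are different.

```go
package main

import (
	"crypto/sha256"
	"sort"
	"strings"
)

// balEntry is a hypothetical, simplified access-list entry.
type balEntry struct {
	Address string
	Changes []string // empty for an address-only entry
}

// buildBAL creates one entry per accessed address, even when nothing changed,
// mirroring the "entry for every accessed address" behavior described above.
func buildBAL(accessed map[string]struct{}, changes map[string][]string) []balEntry {
	addrs := make([]string, 0, len(accessed))
	for a := range accessed {
		addrs = append(addrs, a)
	}
	sort.Strings(addrs) // deterministic order for a given input set
	bal := make([]balEntry, 0, len(addrs))
	for _, a := range addrs {
		bal = append(bal, balEntry{Address: a, Changes: changes[a]})
	}
	return bal
}

// balHash is a stand-in digest, not the real BAL hash.
func balHash(bal []balEntry) [32]byte {
	var sb strings.Builder
	for _, e := range bal {
		sb.WriteString(e.Address)
		sb.WriteByte('|')
		sb.WriteString(strings.Join(e.Changes, ","))
		sb.WriteByte('\n')
	}
	return sha256.Sum256([]byte(sb.String()))
}
```

Construction is deterministic for a given access set. The non-determinism comes entirely from the input: if scheduling adds a residue address to the set, the list gains an empty entry and the digest changes.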

Reproduction (before the fix)

Local glamsterdam-devnet-5, erigon @ 1ca634d4 (ERIGON_EXEC3_PARALLEL=true) + prysm glamsterdam-devnet-5, full resync from genesis. During catch-up batch execution the node rejected canonical block 6638:

  • computed BAL: 340 accounts; canonical: 339
  • the extra entry 0xd4093C4A57D6F952849D6bf47e9a1F6CDCa79b7a is address-only (no reads, no changes), absent from 6638's canonical BAL, and legitimately active in a dozen neighboring blocks of the same batch (6600–6619)
  • the post-unwind re-execution of 6638 computed 339 accounts and passed — non-determinism confirmed on the same block within one run

Same signature in 9 earlier mismatch dumps (blocks 1405–1430): every extra address was an empty entry appearing canonically in other blocks of the same batch.

Fix

Clear recordAccess/addressAccess in Reset() — i.e. make Reset() guarantee for every task what Prepare only guaranteed for user transactions, so a task's AccessedAddresses can only contain addresses touched during that task's own execution. This also covers any future Prepare-less task type, not just block-init/block-end. Audited the full IntraBlockState struct: this was the only accumulating field Reset() missed; no caller relies on recording state surviving Reset().
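
The shape of the fix can be sketched as follows. This is a simplified model, not the actual patch: toyState is a hypothetical stand-in for IntraBlockState, and the harvest clearing its own state is an assumption.

```go
package main

// toyState is a hypothetical, simplified stand-in for IntraBlockState.
type toyState struct {
	recordAccess  bool
	addressAccess map[string]struct{}
}

// Prepare starts recording into a fresh map, as user transactions do.
func (s *toyState) Prepare() {
	s.addressAccess = make(map[string]struct{})
	s.recordAccess = true
}

// Touch records addr only while recording is active.
func (s *toyState) Touch(addr string) {
	if s.recordAccess {
		s.addressAccess[addr] = struct{}{}
	}
}

// AccessedAddresses harvests the recorded set and ends recording
// (assumed behavior for this sketch).
func (s *toyState) AccessedAddresses() map[string]struct{} {
	out := s.addressAccess
	s.addressAccess = nil
	s.recordAccess = false
	return out
}

// Reset models the fixed behavior: recording state is cleared before every
// task, so a Prepare-less task always starts from a pristine state.
func (s *toyState) Reset() {
	s.recordAccess = false
	s.addressAccess = nil
}
```

With this Reset, a task that never calls Prepare harvests nil even when the previous task on the same worker aborted mid-recording, and user transactions are unaffected because Prepare still runs after Reset.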

Testing

  • TDD: TestAddressAccessResetInIBSReset written first — red on old code, green with the fix.
  • ./execution/state/... (incl. -race) and ./execution/exec/... pass.
  • All engine-api BAL tests pass, incl. TestEngineApiBALParallelConsistencyStress ×50 with -race (the canary for this bug class).
  • Devnet validation: full glamsterdam-devnet-5 resync from genesis with the fixed binary — genesis → tip 11431 (delta 0 vs network), 0 BAL mismatches, 0 invalid blocks, through every previously-failing range (960, 1405–1430, 5545–6320, 6638); at tip NewPayload/FCU return VALID and prysm imports live blocks normally.
  • make lint clean (×2).

taratorio added 30 commits May 16, 2026 09:24
…ontech/erigon into worktree-eest-spec-v7.2.0
taratorio added 18 commits May 25, 2026 13:03
…com:erigontech/erigon into glamsterdam-devnet-4
@taratorio taratorio requested review from mh0lt and yperbasis as code owners June 6, 2026 06:18
@taratorio taratorio changed the title from "execution: fix sporadic bal mismatches due to phantom accesses in sys" to "execution: fix sporadic bal mismatches due to phantom accesses in system txns" Jun 6, 2026
@taratorio taratorio requested a review from AskAlexSharov June 6, 2026 06:18
@taratorio taratorio added the Glamsterdam https://eips.ethereum.org/EIPS/eip-7773 label Jun 6, 2026
@mh0lt mh0lt added this pull request to the merge queue Jun 6, 2026
Merged via the queue into main with commit 348041b Jun 6, 2026
90 checks passed
@mh0lt mh0lt deleted the worktree-bal-mismatches-sporadically branch June 6, 2026 10:36
mh0lt added a commit that referenced this pull request Jun 12, 2026
Brings origin/main (through the latest tip) into the typed-vio branch (#21536).

Resolution:
- committer.go: dropped the BAL-ahead-fold (foldedAhead/maybeFoldAhead/
  foldBlockFromBAL/shadowCrossCheck) — it was introduced inside the typed-vio
  commit and is not on main. Took main's committer.go (the #21659
  changeset-window: perBlockFrom/computeTransition) and removed the matching
  wiring from exec3.go / exec3_parallel.go (blockRequest channel + type, the
  unused calcMode enum).
- execution/state: kept the typed per-path versioned reads/writes; applied
  main's #21667 (accept Dependency/Estimate cells), #21590 (SD-revival),
  #21659 (per-batch changesets), #21654 (phantom accesses). The storage
  net-zero read-value filter is retained in updateWrite, with new guard tests.

Verified: build, make lint, execution/{state,exec,stagedsync,commitment} unit
tests, and eest-spec parallel blocktests (devnet 82941/0, stable 69256/0).

Development

Successfully merging this pull request may close these issues.

[glamsterdam-devnet-5] non-deterministic parallel-exec BAL computation rejects canonical blocks as invalid, wedging the CL

3 participants