Skip to content

execution/state: typed-vio refactor (L2 — typed AddressEntry + WriteCell[T] + sync.Pool)#21536

Open
mh0lt wants to merge 24 commits into
mainfrom
vio-on-main
Open

execution/state: typed-vio refactor (L2 — typed AddressEntry + WriteCell[T] + sync.Pool)#21536
mh0lt wants to merge 24 commits into
mainfrom
vio-on-main

Conversation

@mh0lt

@mh0lt mh0lt commented May 31, 2026

Copy link
Copy Markdown
Contributor

What this is

Lifts the typed-vio refactor out of mh/perf-all-followups (which has it bundled with 31k other commits) and onto current main, on top of #21516's profiling instrumentation.

The vio shape: replace the untyped *WriteCell keyed by map[AccountKey]*btree.Map[int, *WriteCell] with a typed AddressEntry struct of per-AccountPath fields holding btree.Map[int, *WriteCell[T]] per type. Eliminates any-boxing on the OCC write/read path, makes per-T sync.Pool viable, adds PerPathReadSet/PerPathWriteSet to IBS with recordVR/recordVW typed helpers.

Why now

L1 perf review on 1069 NewPayload blocks at tip showed parallel exec runs effectively serial (mean tcpus = 0.95, 5.7% utilisation of 16 logical threads). Per-tx alloc cost on the typed-vio path matters for the tight-benchmark side of perf work even though the dominant tip-perf lever is dispatch — keeping this in flight so it ships independently.

Scope guard

This PR does not pull in the constellation of other refactors that vio is bundled with on mh/perf-all-followups. Those are isolated behind shims:

  • cache.PutCode pass-through (no weak-pointer code-cache)
  • commitment.RecordSstore* / RecordHasStorageMiss no-op (no commitment-metrics)
  • dbg.BALDrivenCommitment / BALShadowCompute = false (no BAL-driven commitment)
  • execution/balcache stub package (no BAL cache)
  • Aggregator.SetMaxCollationTxNum no-op
  • state.PrewarmBlockStateCacheFromBAL no-op
  • state.AccessSet + IBS.AccessedAddresses shim returning nil (access tracking folded into ReadSet under typed vio; api-compat for txtask + block_assembler)
  • types.CodeChange.Hash literal stripped (main's CodeChange has only Index+Bytecode)
  • mdgas.MdGas refund flattened to uint64 (no multi-dim gas)
  • Tracing reasons restored on SetNonce/SetCode signatures (main has them; chain-end dropped them)
  • TouchPlainKeyDirect calls rewritten to main's untyped (string, *Update) signature (no typed-commitment-handle)

Files: 17 modified + 7 new (shim files prefixed _shim.go / _shims.go for grep-ability). Diff ~+5000 / -2900.

Test plan

  • make erigon clean (binary at /home/erigon/erigon-vio-on-main/build/bin/erigon, 153MB)
  • make lint — pending
  • On-tip characterisation: 5-min CPU + heap delta + per-block log analysis against the L1 baseline at /home/erigon/perf_results/L1-baseline/
  • Soak on mainnet for tip-perf A/B vs L1

@mh0lt mh0lt marked this pull request as ready for review May 31, 2026 17:50
mh0lt and others added 4 commits May 31, 2026 17:55
…ctor (L2 on top of #21516)

Lifts the typed-vio refactor (AddressEntry struct with per-AccountPath
typed fields, generic WriteCell[T], typed VersionedRead/Write[T], typed
PerPath ReadSet/WriteSet on IBS, per-T sync.Pool) onto current main on top
of the L1 profiling instrumentation merged as #21516.

Replaces the untyped any-boxed WriteCell/ReadSet shape across:
- execution/state/versionmap.go: VersionMap.s now map[Address]*AddressEntry
  with typed btree.Map[int, *WriteCell[T]] per AccountPath
- execution/state/versionedio.go: typed VersionedRead[T] / VersionedWrite[T]
  primitives + per-T pools; AnyVersionedWrite interface for collections
- execution/state/intra_block_state.go: typed PerPath{Read,Write}Set fields
  on IBS + recordVR/recordVW helpers
- execution/state/state_object.go, journal.go, rw_v3.go, read_paths.go (new):
  consumers updated to typed shape
- execution/stagedsync/exec3_*.go, committer.go, calc_state.go,
  exec3_filter.go: VersionedWrites consumers migrated to AnyVersionedWrite
  iteration; commitment touch path uses VersionedWrites.TouchUpdates over
  the typed shape
- execution/exec/block_assembler.go, txtask.go: PerPathReadSet/WriteSet
  threading
- execution/types/accounts/code.go (new): canonical accounts.Code value
  type used by CodePath WriteCell[T]

Shims used to isolate from the non-vio refactors landed on
mh/perf-all-followups that vio is NOT structurally dependent on:

- execution/cache/state_cache.go: PutCode(hash, code) pass-through (no
  weak-pointer code-cache landing required)
- execution/commitment/vio_shims.go: no-op Record{Sstore*, HasStorageMiss}
  counters (commitment-metrics landing not required)
- common/dbg: BAL{DrivenCommitment, ShadowCompute} stubbed false
  (BAL-driven commitment landing not required)
- execution/balcache: stub package with CachedBlockAccessList returning
  no-cached (BAL cache landing not required)
- db/state/aggregator_vio_shim.go: SetMaxCollationTxNum no-op
- execution/state/prewarm_shim.go: PrewarmBlockStateCacheFromBAL no-op
- execution/state/access_set_shim.go: AccessSet type +
  IBS.AccessedAddresses returning nil (access tracking folded into
  ReadSet under typed vio; api-compat shim for txtask + block_assembler)
- execution/state/vio_test_shim.go: VersionedIO.RecordAccesses no-op
  used only by execution/tests/blockgen chain_makers.go
- types.CodeChange.Hash literal stripped (typed-CodeChange landing not
  required; main's CodeChange has only Index+Bytecode)
- mdgas.MdGas refund flattened back to uint64 across IBS/journal/
  state_object (multi-dim gas landing not required)
- Tracing reasons restored on IBS.SetNonce/SetCode +
  stateObject.SetNonce/SetCode signatures (main has them; chain-end
  dropped them — restored so main's evm/vm/aura/bor call sites work)
- TouchPlainKeyDirect call sites in versionedio.go and calc_state.go
  rewritten to main's untyped (string, *Update) signature
  (typed-commitment-handle landing not required)
- Hash.Bytes() → Hash[:] across stagedsync (no Hash.Bytes method on main)
- td declared *uint256.Int to match main's chainReader.GetTd return type

Foundation for tip-perf alloc reduction in the parallel-exec hot path
(versionedRead[T] CPU + WriteSet.Set/ReadSet.Set allocs). Builds clean
end-to-end; lint and on-tip characterisation follow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…typed-vio shape

Tests were left on the untyped shape by the initial L2 landing — go vet failed
because:
- AnyVersionedWrite is an interface (Header(), ValAny()), tests used direct
  w.Address / w.Path / w.Val field access
- ReadSet / WriteSet are structs, tests compared == nil / used map subscript
- IBS.SetNonce / SetCode and stateObject.SetNonce / SetCode now take a
  tracing.NonceChangeReason / CodeChangeReason — tests called with 2 args
- chain.Config.ChainID is *uint256.Int on main; chain-end test used
  big.NewInt(1)
- SD.Flush(ctx, tx) takes 2 args on main; chain-end test passed a third
  `drain bool`
- common.Address.Value() is array-typed, so direct slicing isn't addressable
  in a composite call — bound to a local var first

Files refreshed from chain-end (db9705dd16) and adapted to main's signatures
with a paren-balanced auto-insert of tracing.NonceChangeUnspecified /
tracing.CodeChangeUnspecified at every 2-arg call site.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…imeline log

Adds env-gated [sched_timeline] log emission so the per-block
concurrency-over-time curve can be reconstructed offline. Useful for
"at which point through the block did we max out cpus" questions —
the per-block tcpus average masks intra-block dynamics.

Stamps SchedStartedNs / SchedFinishedNs on TxResult inside the worker
when ERIGON_SCHED_TIMELINE=true, copies them through into the apply-loop
txResult, and emits one log line per block from blockExecutor right
before the blockResult is constructed.

Cost when disabled: zero (Guard on dbg.SchedTimeline).
Cost when enabled: 2 × time.Now().UnixNano() per tx (~60ns total via VDSO);
~20 bytes per tx in the emitted CSV (~18KB for a 900-tx block) — negligible.

Format:
  [sched_timeline] blk=N base_ns=B cnt=K data="idx:start_delta_ns:dur_ns,..."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion/commitment: tighten vio-shim comments

Drop multi-paragraph shim comments per CLAUDE.md (no forensic detail,
one-line max); the shim files' names are self-describing and the rationale
already lives in the parent commit's message + PR description.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mh0lt and others added 2 commits May 31, 2026 21:48
The SD-revival check in versionedReadCore falls through to a normal
versionMap read when the current tx has SelfDestructPath=false (which a
CREATE/CREATE2 recreate emits). For StoragePath the fall-through surfaces
the stale pre-SD slot value, which then matches a re-written same value
in stateObject.SetState's no-op filter and drops the SSTORE entirely.

A fresh contract's storage is empty by construction. When the current tx
has no own write for the slot in versionedWrites, return zero on the SD
path. Only the StoragePath case needs this — account-field paths
(Balance/Nonce/CodeHash) can legitimately resolve to a post-SD versionMap
write via the same-path dep chain.

Failing test: TestDeleteRecreateSlotsAcrossManyBlocks under
ERIGON_EXEC3_PARALLEL=true. Block 1 destroys+recreates+destroys+recreates
aa: tx 3's CREATE2 reads slot 4 to compute SSTORE no-op, gets tx 1's
pre-SD value 4, sees value==prev, drops the write, slot 4 reads as 0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…dResult

blockExecutor.blockIO.Release() ran in the exec loop right after
sendResult returned, while the apply loop still held the same
applyResult.TxIO reference. The release dropped the per-path pool
maps before ProcessBAL could read them, so the validator-side BAL
came out empty and produced a spurious BlockAccessListHash mismatch
against the producer's BAL (e.g. TestSimulatedBackend_PendingAndCallContractAmsterdamDefaultGas).

Move the release into the apply loop, after ProcessBAL — the last
TxIO consumer — so the slot maps stay populated for it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is red

Mark Holt and others added 8 commits June 1, 2026 13:10
The typed-vio refactor replaced ReadSet.Scan() with explicit per-path
loops in AsBlockAccessList but only iterated balance, storage, writes
and touched — dropping reads recorded against address, code, codeHash,
codeSize, nonce, incarnation, selfDestruct and createContract.

Per EIP-7928, every accessed address must appear in the BAL, even when
the access produced no balance/storage/write payload. The omission
caused ~200 BAL-hash mismatches in eest-spec-blocktests-devnet on the
typed-vio stack; most visibly the EIP-7702 set-code-to-precompile case
drops the precompile address from the BAL (its only "access" was a
code-read via ResolveCode during delegation resolution).

Re-add per-path ensureAccountState passes for the eight missing maps so
every read registers its address.

Verified locally with EXEC3_PARALLEL=true:
  - eip7702 set_code_txs:  301/301
  - eip6780 selfdestruct:  149/149
  - eip7928 BAL:           688/688
  - full for_amsterdam:    21,368/21,368

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both fixes mirror logic that exists on origin/main but was missed by the
typed-vio L2 refactor.

1. MakeWriteSet EIP-6780 hook: when a SAME-TX created contract is
   SELFDESTRUCTed, storage is wiped at end-of-tx. The BAL must report
   the dirty slots as reads, not changes. Zero the storage versionedWrite
   values here so AsBlockAccessList's net-zero check folds them away.

2. SetCode matchesOriginal: skip + delete prior CodePath/CodeHashPath/
   CodeSizePath versionedWrite entries when the new code matches the
   pre-tx original CodeHash. Catches EIP-7702 delegate-then-reset
   cumulative net-zero within a single tx. Guarded against newlyCreated
   stateObjects whose .original is the pre-creation snapshot.

Local eest-spec sweep over the for_amsterdam devnet fixtures:
21368/21368 pass (previously 21368 with 91 eip6780 + 4 eip7702 fails).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The post-block finalize IBS is built with NewCurrentCachedReaderV3 so
syscalls (EIP-4788 beacon root, EIP-7002 withdrawal-request dequeue,
EIP-7251 consolidation-request dequeue, etc.) see the latest per-tx
writes accumulated in BlockStateCache. The L2 refactor had CachedReaderV3
overriding ReadAccountData and ReadAccountStorage but not ReadAccountCode
or ReadAccountCodeSize — code reads fell through to the domain getter,
which doesn't have the just-deployed code yet (Flush hasn't run).

When block 1 of a fork transition deploys a system contract via CREATE,
the post-block syscall's GetCodeSize reads through the cache miss and
sees an empty code, fails "[EIP-7002] Syscall failure: Empty Code",
and the block can't finalize:

    apply loop exited but 1 block(s) had tx-results without a blockResult: [1]

Add GetCurrentCode on BlockStateCache and ReadAccountCode /
ReadAccountCodeSize on CachedReaderV3, matching origin/main's shape.

Fixes the four CancunToPragueAtTime15k system_contract_deployment tests
on parallel exec; local eest-spec stable (69256/69256) + amsterdam
devnet (21368/21368) all green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ways false

exec3_parallel set InMemHistoryReads=true on entry and unconditionally
restored it to false on exit. That broke any post-exec caller that
expected history reads on the shared mem batch: the engine-x runner's
restore-state path calls GetAsOf and fails with

    GetAsOf called on TemporalMemBatch with inMemHistoryReads disabled

Capture the prior flag value and restore it on exit, matching origin/main's
shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
MakeWriteSet runs once per-block on the validator path; the block
assembler's BAL is built per-tx from ibs.TxIO() and never goes through
MakeWriteSet. With the hook only in MakeWriteSet, assembler and
validator diverged on CREATE-then-SD-in-init-code: assembler kept the
SSTORE as a storageChange, validator suppressed it to a storageRead,
engine_newPayload returned INVALID.

Mirror origin/main's pattern by adding the same hook to FinalizeTx so
both paths emit the per-slot zero writes.

Local TestEngineApiBALCreateSSTOREThenSelfdestructInInitCode and the
full cancun eip6780_selfdestruct fixture suite (149/149) both green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The block assembler tracked the single-dim GasPool and restored only
the regular dimension on tx failure. Under EIP-8037 the gas pool is
two-dimensional, so a tx that fails ConsumeState mid-execution would
return a "regular-only" snapshot that refilled the state pool from
scratch — a successor tx could then drain the inflated state pool past
the block limit, producing gas_used > gas_limit.

Switch to NewBlockGasPool + RegularGasAvailable/StateGasAvailable
snapshots, matching origin/main.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… not cell pointer

ReadStorageCell returned a *WriteCell[uint256.Int] for the caller to read
cell.Value AFTER releasing the read lock. flushVWLocked, called under the
write lock, mutates that same cell.Value when re-flushing at the same
(txIdx, incarnation) pair. The race detector flagged this on
TestGetChainConfig and TestStorageRangeAt under -race.

Capture the value under the read lock and return it by value; rename the
readPathResult field accordingly. OCC correctness is preserved — the
returned value is the cell's value at lock-release time, and the
dep-idx/incarnation metadata still drives the conflict check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Test #119 'Re-Org Back into Canonical Chain ... Execute Side Payload on Re-Org'
failed in CI run 26774147115 but passes 4/4 runs locally at the same
parallelism. Suspected CI-environment timing flake.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mh0lt

mh0lt commented Jun 1, 2026

Copy link
Copy Markdown
Contributor Author

CI status summary — 2026-06-01 session

CI: 84 pass / 2 skipping / 3 fail on d4e03a4c69.

Today shipped 9 commits resolving the deterministic CI failures (212 → 0):

Commit Fix
d5d6b319ab Release blockIO after ProcessBAL, not after sendResult (concurrent-map-iteration race)
32c9897540 AsBlockAccessList iterates all 11 read paths (was 4)
f5fc56dbb5 MakeWriteSet EIP-6780 hook + SetCode matchesOriginal cumulative net-zero
ad05720a27 CachedReaderV3.ReadAccountCode/Size reads BlockStateCache first
4c90c06e7a exec3_parallel restores caller's InMemHistoryReads, not always false
38ca57d5b0 FinalizeTx EIP-6780 hook (assembler/validator BAL convergence)
56a72a3931 Block assembler 2D gas pool snapshot/restore (EIP-8037)
3f3a7aade4 Versionmap data race: return storage value, not cell pointer
d4e03a4c69 Empty commit to re-trigger CI

The remaining 3 CI fails are environmental flakes:

eest-spec-blocktests-stable-race-pre-cancun-parallel — 1/8947 fail

  • Always test_recreate on some fork_* variant under --workers 12 --race, but the specific test that fails differs run-to-run (ConstantinopleFix, Istanbul, Paris, Shanghai, etc.)
  • Across prior CI runs: 2 fails in test_dynamic_create2_selfdestruct_collision_two_different_transactions[fork_Paris/Shanghai] (32c9897 run) — pattern matches a race-detector heisenbug under workers=12 contention
  • Local: 8947/8947 pass (reproduced 2 fails once earlier, never since)

hive / test-hive (ethereum/engine, cancun, parallel) — 1/226 fail

ci-gate — meta-aggregator showing the hive fail

Both real fails are CI-runner-environment flakes (likely GH Actions matrix-concurrency CPU/IO/Docker contention) — they consistently hit the same test scenarios in CI but never reproduce locally at the same sim parameters.

Open follow-up

ValAny() / AnyVersionedWrite boxing on the apply/flush/sort paths is still in heavy use despite the L2 refactor's stated goal. Tracked as separate post-merge follow-up — would thread *WriteSet through and add per-path Scan* methods (~5 files, mechanical).

Diagnostic instrumentation for the hive 'Re-Org Back into Canonical
Chain ... Execute Side Payload on Re-Org' test that fails consistently
on PR #21536 CI but does not reproduce locally despite 6000+ test
iterations across stress configurations (parallelism 12, taskset CPU
pin + CPU saturators, focused-test loops).

On wrong-trie-root, log the count of changed accounts/storage/code in
the current change-set plus up to 10 sample addresses so that diverging
runs can be cross-referenced.

Tracking: project_hive_reorg_bug_2026-06-02.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,3 @@
package state

func (a *Aggregator) SetMaxCollationTxNum(n uint64) {}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SetMaxCollationTxNum -- has been removed from main...we don't have use for it anymore.

Mark Holt and others added 4 commits June 2, 2026 08:26
1ad2cd0 introduced a nil-pointer dereference: pe.currentChangeSet
is nil for ValidateChain runs (changeset accumulation only fires for
canonical stage execution, not fork validation), so the diagnostic
panicked on every wrong-trie-root path — including the 18 hive engine
api 'Invalid NewPayload, StateRoot/PrevRandao/...' tests that
legitimately produce a wrong-trie-root error as the expected behavior.

Guard the diagnostic with a non-nil cs check so we still get the dump
when changesets are accumulating, and degrade gracefully otherwise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the changeset-accumulator dump (cs is nil during fork validation,
which is exactly the path hive Re-Org Back into Canonical Chain hits)
with a targeted dump of ZeroAddr (the hive test's hot recipient for 50
txs/block) and the block coinbase. Reading via Domains().GetLatest goes
through sd.mem → tx so we see whatever the parallel-exec apply loop
actually persisted.

If validation diverges from the BlockAssembler's serial build because
of OCC retry contention on those two hot accounts, the dumped balances
will tell us exactly by how much.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the raw bytes dump with a decoded balance + nonce so we can
read the divergence directly. ZeroAddr is the hive Re-Org test's hot
recipient (coinbase=0x0 + 50 txs/block all valued to 0x0), so its
final balance is the cleanest signal for whether OCC retry under load
drops one of the 50 credits during validation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PR #21536 hit a fully-green CI on cd9078e (87 pass, 0 fail). The
chronic hive Re-Org Back into Canonical Chain + eest-spec pre-cancun
race flakes both passed. Empty re-trigger to confirm whether (a) the
diagnostic GetLatest call masks the race by serializing the read
window, or (b) it was just a lucky run.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mh0lt

mh0lt commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

CI fully green on `cd9078e3a0`

87 pass / 2 skipping / 0 fail — first fully-green CI run today. Both chronic flakes (hive Re-Org Back into Canonical Chain + eest-spec pre-cancun race) passed.

Caveat: the Re-Org bug is real and not yet fixed. Prior commits (`1ad2cd0a45`, `2e28d06eb6`, `102a389b0b`) captured genuine build-vs-validate state-root divergence in CI logs (completely different roots, not the 1-byte test-framework flip seen in Invalid* tests). The divergence happens during fork-validation of 50-tx-per-block payloads when running under parallel exec3, hitting the test's hot recipient (`ZeroAddr`, == coinbase in the hive test).

This green run could be (a) a lucky CI run where the timing happened to favor the race, or (b) the diagnostic `Domains().GetLatest` read in my wrong-trie-root handler subtly serializes the window. Empty re-trigger `67f4330268` will gather more data. Tracking memo: `project_hive_reorg_bug_2026-06-02.md`.

PR is mergeable in this state if the team is willing to accept the latent race, but per Mark's directive on flakey tests the right path is to continue investigating until we can repro and root-cause.

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is still red. This is a big refactoring - let's hold off merging it until we've cut release/3.5.

@yperbasis yperbasis added this to the 3.6.0 milestone Jun 2, 2026
Mark Holt and others added 3 commits June 2, 2026 11:27
PR #21536 hive Re-Org Back into Canonical Chain reproduces locally at
~9% under stress (parallelism 12). Erigon hangs in the block builder
after receiving GetPayloadV3 — no FinalizeBlockExecution, no panic,
no shutdown message, the HTTP server stops responding until the test
times out with EOF or connection refused. The maxBuildTime watchdog
eventually fires Stop(), but Stop blocks on syncCond.Wait() forever
because the build goroutine never returns.

Add a half-time goroutine dump so we can see which goroutine the build
is stuck on. Triggers at maxBuildTimeSecs/2 (~half the SecondsPerSlot/4
that the engine API caller uses). Loops on the timer in case the build
is still hung when the max-time fires too. Tracking:
project_hive_reorg_bug_2026-06-02.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sync.Run iterates stages by s.currentStage++ starting from whatever value
currentStage holds on entry. The exit-time defer resets it to 0, which is
correct for a single-threaded caller — but pe.sync is shared between the
forward Run path and the background RunPrune goroutine spawned from
updateForkChoice when --fcu.background.prune is on. RunPrune iterates by
calling SetCurrentStage(stage.ID) inside pruneStage to update the log
prefix. When a forward FCU arrives while the previous FCU's BG prune is
still iterating, the forward Sync.Run picks up at whatever stage position
the prune was at and skips the earlier stages (Snapshots/BlockHashes/
Senders). Execution then no-ops because its prevStageProgress (Senders)
is behind the FCU's target, the head fails to advance, and the
"bad forkchoice" warning fires at the end of updateForkChoice.

Surfaced as the hive "Re-Org Back into Canonical Chain, Depth=5" (test=9)
flake locally repro'd at ~5% under 100-iter stress at parallelism 12.
Same fix also closes the "Re-Org Back into Canonical Chain, Depth=10,
Execute Side Payload on Re-Org" variant which had previously surfaced as
intermittent "Wrong trie root of block N" in CI — same race, different
symptom depending on whether a NewPayload sits between the back-FCU and
the forward-FCU.

Two soaks at 100/100 clean post-fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…o typed-vio

origin/main moved 21 commits forward since the last sync; #21590 (parallel-exec
SD-revival and metamorphic-CREATE2 fixes, da0a479) touches the same files
the typed-vio L2 refactor does and produced 5 conflicts. Resolutions:

- execution/state/intra_block_state.go::CreateAccount — drop the
  `(source, version)` inheritance for synthetic IncarnationPath/BalancePath
  reads (each defaults to (StorageRead, UnknownVersion), only promoted to
  MapRead when the path's own versionMap cell exists). Gate the
  recordWriteIncarnation on contractCreation=true. Use typed ReadIncarnation /
  ReadBalance helpers instead of generic Read.
- execution/state/intra_block_state.go::accountRead — demote a sub-field
  MapRead promotion to (StorageRead, UnknownVersion) when AddressPath has no
  cell, via typed ReadAddress.
- execution/state/intra_block_state.go::getVersionedAccount — already had
  the SD-revival check (>=destructTxIndex for AddressPath, > for sub-fields).
- execution/state/rw_v3.go::LightCollector.UpdateAccountData — emit
  IncarnationPath only on up-revs (`>` not `!=`) so a value-transfer to a
  same-block SD'd address can't clobber the SD-side cell. Retained typed
  &VersionedWrite[uint64]{...} form.
- execution/state/versionedio.go::ReadAccountData — same-tx metamorphic
  recreation check (AddressPath at >=destructTxIndex first, sub-field >
  fallbacks) using revivalLimit = txIndex - 1.
- execution/state/versionedio.go::addStorageUpdate — added the EIP-7928
  no-op filter (skip when value equals slot's last recorded write); kept
  the free-function signature.
- execution/state/versionmap.go — added AnyDoneBoolWriteEquals (scoped to
  SelfDestructPath given the typed-vio per-path WriteCell[T] layout).
- execution/stagedsync/exec3_parallel.go::normalizeWriteSet — added the
  SD-then-revival history-scan branch (latest SelfDestructPath may be false
  after revival; AnyDoneBoolWriteEquals catches the prior true).
- execution/state/versionedio_test.go — ported the 4 #21590 regression
  tests (TestAccountRead_BalancePathPromotion / TestCreateAccount_
  SyntheticIncarnationStamp / TestGetVersionedAccount_PriorTxSelfDestruct /
  TestGetVersionedAccount_SameTxMetamorphicRecreate) and the 2 storage
  no-op/RM-DEP-fallthrough tests, re-typed to VersionedWrite[T]/
  VersionedRead[T] and per-path Set helpers.

Not ported: #21590's calcFees coinbase-emit fix (EIP-161 empty-removal of
zero-balance coinbase). vio-on-main's calcFees uses the typed CollectorWrites
machinery and a different flow; the fix would need a separate semantic merge.
This affects coinbase emptiness on EIP-161 forks but does not affect the
SD/CREATE2 reincarnation class the test_recreate eest-race flake exercises.
Tracked as a follow-up.

Also addresses sudeepdino008's review comment: removed the SetMaxCollationTxNum
shim in db/state/aggregator_vio_shim.go and its one call site in
execution/stagedsync/exec3.go (the upstream method no longer exists on main).

Verification:
- make erigon integration clean
- make lint clean (2 runs)
- make test-short clean (including the 4 ported regression tests)
@mh0lt

mh0lt commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Merge with main + #21590 SD/CREATE2 fixes layered onto typed-vio

Commit 6f1aa5ab16. PR is MERGEABLE and CI is running.

What landed in this push:

  • Merged 21 commits from main (resolved 5 file conflicts).
  • Layered execution/state, stagedsync: parallel-exec SD-revival and metamorphic-CREATE2 fixes #21590's parallel-exec SD-revival / metamorphic-CREATE2 fixes onto the typed-vio APIs:
    • CreateAccount synthetic-read stamping defaults to (StorageRead, UnknownVersion) and only promotes when the specific path's cell exists; recordWriteIncarnation gated on contractCreation=true.
    • accountRead demotes a sub-field MapRead promotion when AddressPath has no cell.
    • getVersionedAccount SD-revival probe (auto-merged).
    • LightCollector.UpdateAccountData emits IncarnationPath only on up-revs (> not !=).
    • versionedStateReader.ReadAccountData same-tx metamorphic check (AddressPath >= destructTxIndex first, sub-field > fallbacks).
    • normalizeWriteSet SD-then-revival history-scan via AnyDoneBoolWriteEquals.
    • addStorageUpdate EIP-7928 no-op filter.
    • Ported execution/state, stagedsync: parallel-exec SD-revival and metamorphic-CREATE2 fixes #21590's 4 regression tests in versionedio_test.go, re-typed to VersionedWrite[T] / VersionedRead[T] / per-path Set* helpers.
  • Not ported: execution/state, stagedsync: parallel-exec SD-revival and metamorphic-CREATE2 fixes #21590's calcFees coinbase EIP-161 empty-removal fix — vio-on-main's typed CollectorWrites flow differs significantly; tracked as a follow-up. Does not affect the SD/CREATE2 reincarnation class.
  • Addressed @sudeepdino008's comment: removed SetMaxCollationTxNum shim in db/state/aggregator_vio_shim.go and its sole call site in execution/stagedsync/exec3.go (method no longer exists on main).

Verification:

  • make erigon integration clean
  • make lint clean (2 runs)
  • make test-short clean, including the 4 ported regression tests:
    • TestAccountRead_BalancePathPromotion_DoesNotInvalidate
    • TestCreateAccount_SyntheticIncarnationStamp_DoesNotInvalidate
    • TestGetVersionedAccount_PriorTxSelfDestruct_ReturnsNil
    • TestGetVersionedAccount_SameTxMetamorphicRecreate_ReturnsAccount

Local stress soaks deferred: my prior local soak scripts (hive-side-payload.sh, eest-race-v2.sh) were wiped from /tmp between sessions. The Re-Org Depth=5 race fix at commit 11d852d (now part of this merge) has prior 100/100 stress validation; the #21590 SD/CREATE2 fix has the regression tests above plus #21590's own multi-chain from-0 soak history. CI's full matrix is the broader signal here.

cc @yperbasis @sudeepdino008

@mh0lt mh0lt requested review from sudeepdino008 and yperbasis June 4, 2026 10:41
…oval to typed finalizeTx

Completes the semantic merge deferred in 6f1aa5a. vio-on-main's finalizeTx
predicate was `coinbaseChanged := newBal != oldBal`; #21590 extends that to
also fire on EIP-161 empty-removal (touch a zero-balance coinbase so the
commitment calculator emits the empty-account delete, mirroring serial's
AddBalance(coinbase, 0) → TouchAccount → SD path).

Layered onto the typed-vio APIs:
- Pre-scan result.TxOut for NoncePath / CodeHashPath worker writes to the
  coinbase address via typed Header() / ValAny() accessors, so the EIP-161
  emptiness check sees the worker's post-write coinbase state (a sender==
  coinbase tx that bumped Nonce isn't mistakenly treated as empty when
  FeeTipped==0).
- coinbaseEmptyPre = pre-tx balance zero & coinbaseNonce 0 & code empty & no
  worker code-hash write; coinbaseEmptyRemoval = emptyRemoval &
  coinbaseEmptyPre & newBalance zero.
- emitCoinbase = newBal != oldBal || coinbaseEmptyRemoval.
- When coinbaseEmptyRemoval fires: emit &VersionedWrite[bool]{SelfDestructPath,
  true}. Otherwise: emit BalancePath plus the AddressPath sibling write
  (mirrors serial's implicit account creation on first credit), as
  &VersionedWrite[*accounts.Account]{AddressPath, addrAcc}.

Verification: make erigon clean, make lint clean (2 runs),
make test-short -count=1 on execution/state/... + execution/stagedsync/... clean.
@mh0lt

mh0lt commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up: #21590 calcFees EIP-161 fix now ported (5bab4f7)

Mark called this one out — the semantic merge that was deferred in 6f1aa5a is now layered in.

vio-on-main's finalizeTx predicate was coinbaseChanged := newBal != oldBal. #21590 extends this so EIP-161 also fires the touch-and-delete path when the coinbase is empty pre-state and the new balance is zero (mirrors serial's AddBalance(coinbase, 0) → TouchAccount → SD emission).

Layered onto typed-vio:

  • Pre-scan result.TxOut for NoncePath / CodeHashPath worker writes to the coinbase via Header() / ValAny() so EIP-161 emptiness sees the worker's post-write state.
  • coinbaseEmptyPre = (acc==nil || balance==0) && coinbaseNonce==0 && code-empty && !worker-codehash-write
  • coinbaseEmptyRemoval = emptyRemoval && coinbaseEmptyPre && newBalance.IsZero()
  • emitCoinbase = newBal != oldBal || coinbaseEmptyRemoval
  • Empty-removal path emits &VersionedWrite[bool]{SelfDestructPath, true}; otherwise emits BalancePath + AddressPath sibling write (so downstream parallel txs see an account record for a freshly-credited coinbase).

make erigon clean, make lint clean (2 runs), make test-short -count=1 ./execution/... clean.

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge latest main and ensure CI is green.

Brings origin/main (through the latest tip) into the typed-vio branch (#21536).

Resolution:
- committer.go: dropped the BAL-ahead-fold (foldedAhead/maybeFoldAhead/
  foldBlockFromBAL/shadowCrossCheck) — it was introduced inside the typed-vio
  commit and is not on main. Took main's committer.go (the #21659
  changeset-window: perBlockFrom/computeTransition) and removed the matching
  wiring from exec3.go / exec3_parallel.go (blockRequest channel + type, the
  unused calcMode enum).
- execution/state: kept the typed per-path versioned reads/writes; applied
  main's #21667 (accept Dependency/Estimate cells), #21590 (SD-revival),
  #21659 (per-batch changesets), #21654 (phantom accesses). The storage
  net-zero read-value filter is retained in updateWrite, with new guard tests.

Verified: build, make lint, execution/{state,exec,stagedsync,commitment} unit
tests, and eest-spec parallel blocktests (devnet 82941/0, stable 69256/0).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants