execution: fix parallel-exec SD-recreate flake + validator false positives#21294
Conversation
…lse positives
TestDeleteRecreateSlotsAcrossManyBlocks under GOMAXPROCS=2 -race flaked
~3-5% with EXEC3_PARALLEL=true, in two modes: a +1602 gas mismatch (a
CREATE2 collision burned inner gas on a stale CodeHash read) and a
runaway validator-invalid retry loop.
Changes:
execution/state/intra_block_state.go
- getStateObject initialised accountVersion to sdb.Version() (the
worker's own incarnation). refreshVersionedAccount's sub-read
updates are gated on `bversion >= readVersion`, so every prior-tx
write at a lower version was silently discarded, leaving
accountSource pinned to StorageRead with a worker-private version
vm never wrote. Switched to UnknownVersion so sub-reads promote.
- Removed the synthetic AddressPath read recorded by accountRead: it
borrowed the version that flowed through refreshVersionedAccount
(often a sub-read's), tripping the validator's path==AddressPath
checks against entries vm never wrote. Sub-reads are recorded by
versionedRead directly.
execution/state/versionedio.go
- Exclude AddressPath from the readStorage==nil MVReadResultNone probe
recording. getStateObject probes vm for AddressPath then falls back
to the stateReader (the account does exist); recording the nil
probe as "we read nothing" made the validator's path==Address
BalancePath-presence check invalidate any pre-existing account a
prior tx wrote BalancePath for.
execution/state/versionmap.go
- validateRead split into validateReadImpl with a `recursive` flag.
Two false-positive paths in the cross-validate recursion produce
spurious Invalid when invoked as a synthetic probe (readVal nil,
source borrowed): the path==AddressPath BalancePath-presence check,
and the source!=MapRead+nil-readVal tiebreaker. Both gated on
!recursive.
- Added an SD-staleness cross-check in the MVReadResultDone branch:
a later tx self-destructing without revival means the read predates
the destruct and the serial path returns zero via the SD-zero
short-circuit; checkVersion alone misses it because SD doesn't
write the read's own path. revivalLimit = txIndex-1 excludes the
validating tx's own writes (validation runs post-flush).
execution/stagedsync/exec3_parallel.go, exec3_status.go
- On ErrDependency with no effective blocker (named blocker already
complete) AND on validator-invalid, push the tx to a new deferred[]
queue instead of immediate pending. scheduleExecution's drain
predicate gates retry on (predecessor validated) AND (no worker at
a lower index in flight) — lower-indexed workers' flushes land at
indices visible to N's reads via vm.Read's floor(N-1). Non-deferred
txs keep dispatching speculatively. A safety net force-drains when
pending is empty and no workers are in flight.
- tooManyRetries helper + cap on validator-invalid retries, mirroring
the worker-error path, so a genuine loop can't spin forever.
Stress: 200/200 on TestDeleteRecreateSlotsAcrossManyBlocks (was
95-97/100 baseline); no regression on the EEST blocktest suite under
the 12-worker parallel runner.
468c98b to
b23da1a
Compare
…r-validator # Conflicts: # execution/state/versionmap.go
There was a problem hiding this comment.
Pull request overview
This PR targets flakiness in parallel execution (ERIGON_EXEC3_PARALLEL=true) by reducing validator/recording false positives around account reads (especially AddressPath) and by improving retry scheduling/limits to avoid runaway invalidation loops. It adjusts both state read validation semantics (including SELFDESTRUCT staleness) and the exec3 scheduler’s retry handling.
Changes:
- Refines
VersionMapread validation by introducing a recursive-aware validation path and adding an SD-staleness cross-check to better match serial semantics. - Adjusts versioned read recording to avoid recording synthetic
AddressPathnil-probes, and fixesgetStateObjectversion propagation/recording behavior to reduce validator false positives. - Adds a deferred retry queue plus a validator-invalid retry cap to reduce immediate re-dispatch races and prevent infinite retry loops.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| execution/state/versionmap.go | Adds validateReadImpl(recursive) and SD-staleness validation; adjusts cross-validation behavior. |
| execution/state/versionedio.go | Avoids recording AddressPath nil-probes when readStorage == nil to prevent validator false positives. |
| execution/state/intra_block_state.go | Fixes account version initialization (uses UnknownVersion) and removes synthetic AddressPath read recording in getStateObject. |
| execution/stagedsync/exec3_status.go | Introduces deferred queue support in exec task status tracking. |
| execution/stagedsync/exec3_parallel.go | Adds tooManyRetries helper, defers certain retries, drains deferred work under a predicate, and caps validator-invalid retry loops. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Covers the SD-staleness cross-check in validateReadImpl: a version- consistent MapRead is invalidated when a later tx self-destructed the account with no revival, and stays valid when a revival write follows the destruct. Also documents the valuesEqual tiebreaker branch.
b23da1a excluded AddressPath from the readStorage==nil MVReadResultNone probe recording to stop the validator's path==AddressPath branch over-invalidating on ordinary BalancePath/NoncePath writes. That branch since switched to the precise IncarnationPath signal, so the exclusion is no longer needed — and it is harmful: the unrecorded probe leaves ValidateVersion unable to detect account create/destruct conflicts, so parallel workers commit against stale account state and the parallel-built block access list diverges from the header. Surfaced on the EEST bal-devnet (Amsterdam EIP-7928) blocktests: ~607 "block access list mismatch" failures under the parallel runner; serial unaffected. Restoring the probe recording clears them. TestDeleteRecreateSlotsAcrossManyBlocks stays 100/100 under GOMAXPROCS=2 -race — the IncarnationPath check carries the validator correctness the exclusion was added to protect.
b23da1a changed getStateObject to (a) seed refreshVersionedAccount with UnknownVersion and (b) drop the synthetic accountRead AddressPath recording, to quell validator path==AddressPath false positives behind the SD-recreate flake fix. Both are now redundant: the merged main carries #21286 (exclude own-tx writes from the SD revival check), which independently covers the +1602 / CREATE2-collision flake mode, and the validator already keys on the precise IncarnationPath signal. They are also harmful — without the accountRead recording the account is missing from the parallel- built block access list, so selfdestruct/create-destruct family blocks fail "block access list mismatch" under the parallel runner (serial unaffected). Restoring the original recording clears ~102 EEST bal-devnet (Amsterdam) failures. TestDeleteRecreateSlotsAcrossManyBlocks stays 200/200 under GOMAXPROCS=2 (100 with -race) — the flake fix is carried by #21286 and b23da1a's validator/scheduler changes.
|
Updated — rebased on current Re-merge: the branch was 39 commits behind; merged
Verification (local, parallel runner):
Also added the SD-staleness |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
execution/stagedsync/exec3_status.go:124
- removeFromList can panic with an out-of-bounds access when
vis not present andsort.SearchIntsreturnslen(l): the conditionx == -1 || l[x] != vwill evaluatel[x]even whenx == len(l).sort.SearchIntsnever returns -1. This should checkx == len(l)before indexing (e.g.,if x == len(l) || l[x] != v { ... }).
func removeFromList(l []int, v int, expect bool) []int {
x := sort.SearchInts(l, v)
if x == -1 || l[x] != v {
if expect {
panic(errors.New("should not happen - element expected in list"))
}
return l
}
execution/state/versionedio.go:1053
- PR description says AddressPath should be excluded from the
readStorage==nilMVReadResultNone probe recording, but the current logic records all paths except CodePath (so AddressPath probes are still recorded) and the updated comment asserts they MUST be recorded. Please align the implementation and/or PR description so reviewers/users know whether AddressPath probes are intended to be recorded.
if readStorage == nil {
// Record reads so that ValidateVersion can detect when a prior
// transaction modifies any account property. Without tracking
// these reads, validation misses conflicts where a prior tx
// changes an account's balance/nonce/etc. — causing later txs
// to execute against stale data, and the parallel-built block
// access list (AsBlockAccessList) to diverge from the header.
//
// AddressPath probes MUST be recorded: getStateObject probes the
// versionMap for AddressPath before falling back to the
// stateReader. The validator's path==AddressPath branch cross-
// checks the precise IncarnationPath signal (account create /
// destruct), so the recorded probe does not over-invalidate on
// ordinary BalancePath/NoncePath writes.
//
// Do NOT cache CodePath: getStateObject calls versionedRead for
// CodePath with readStorage=nil to check if a prior tx wrote
// code (EIP-7702). Caching defaultV (nil) would poison the
// ReadSet, causing subsequent getCode calls (which pass a real
// readStorage callback) to return empty code instead of loading
// it from the DB — breaking deposit contract execution, etc.
if !commited && path != CodePath {
vr.Source = StorageRead
vr.Val = defaultV
if s.versionedReads == nil {
s.versionedReads = ReadSet{}
}
s.versionedReads.Set(vr)
}
return defaultV, UnknownSource, UnknownVersion, nil
| if source != MapRead { | ||
| // When BAL is present, significant writes for BalancePath, | ||
| // NoncePath, CodePath and StoragePath are pre-populated in the | ||
| // VersionMap before execution. If a read of one of those paths | ||
| // was from storage (no VersionMap entry at execution time) but | ||
| // the VersionMap now has an entry from a concurrent worker | ||
| // flush, the entry is a BAL-filtered no-op write and the read | ||
| // value is still correct. | ||
| // | ||
| // AddressPath and other paths are NOT pre-populated by the BAL, | ||
| // so a new VersionMap entry means a real state change from a | ||
| // concurrent worker (e.g. account creation) and must trigger | ||
| // invalidation. | ||
| // VersionMap before execution. | ||
| isBALPrePopulatedPath := path == BalancePath || path == NoncePath || | ||
| path == CodePath || path == StoragePath | ||
| if !vm.HasBAL || !isBALPrePopulatedPath { | ||
| // Value tiebreaker: if the StorageRead value matches the | ||
| // versionMap Done value, the read is still valid despite | ||
| // the source mismatch. This avoids unnecessary invalidation | ||
| // when a prior TX wrote the same value that was in storage. | ||
| if readVal != nil && rr.Value() != nil && valuesEqual(path, readVal, rr.Value()) { | ||
| // Values match — read is valid | ||
| if recursive && readVal == nil { | ||
| // Synthetic probe — outer entry's own validation covers it. | ||
| } else if readVal != nil && rr.Value() != nil && valuesEqual(path, readVal, rr.Value()) { | ||
| // Value tiebreaker: a Done entry now exists where the read | ||
| // saw storage, but it holds the same value — read stays valid. | ||
| } else { | ||
| valid = VersionInvalid | ||
| } |
There was a problem hiding this comment.
Good catch — the recursive cross-validate probes are indeed inert no-ops here. This is currently sound (account-lifecycle detection is carried by the recording-side SelfDestructPath dependency, the explicit ungated IncarnationPath check in the path==AddressPath branch, and the revival-aware SD-staleness check — the SD/Amsterdam-EIP-7928 suites are green). Rather than resurrect the recursive probe in this PR (a naive un-gate over-fires on every in-block-created account), I've filed #21318 to revisit validateReadImpl's cross-validation as part of the parallel-exec validator cleanup.
|
CI is fully green now (68/68 checks passing) — including the previously-red |
yperbasis
left a comment
There was a problem hiding this comment.
Approving, but please take a look at the latest copilot comment.
Two conflicts: 1) execution/state/versionmap.go — main split validateRead into a thin wrapper plus a new validateReadImpl(..., recursive bool) so recursive cross-validates can opt out of the source!=MapRead+nil- readVal tiebreaker (PR #21294). The AddressPath cross-check call conflicted with moksha's expanded comment (CreateContractPath + the deliberate EOA-creation-via-CALL+value gap, post-incarnation removal). Resolution: keep moksha's comment + the CreateContractPath signal, switch the cross-validate call to the new validateReadImpl(..., true) signature. Main also added IncarnationPath to the new "path != ... else cross-check" guard at line ~422; on moksha IncarnationPath does not exist, so the term is dropped. 2) execution/state/state_test.go — main extended SetCode with a tracing.CodeChangeReason argument and added a setIncarnation(1) call to the test scaffold. moksha already adopted the new SetCode signature from main; setIncarnation is gone with the rest of the incarnation removal. Resolution: take main's SetCode call, drop the setIncarnation line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
TestDeleteRecreateSlotsAcrossManyBlocksflaked ~3–5% underGOMAXPROCS=2 -racewithERIGON_EXEC3_PARALLEL=true, in two modes: a +1602 gas mismatch (a CREATE2 collision burned inner gas on a stale CodeHash read) and a runaway validator-invalid retry loop. Several distinct validator/recording false positives contributed; this PR fixes them in turn.getStateObjectinitialisedaccountVersiontosdb.Version()(the worker's own incarnation), gating out every prior-tx sub-read update inrefreshVersionedAccountand pinning source/version to a value vm never wrote. Switched toUnknownVersion.accountReadsynthesized an AddressPath read using the version that flowed throughrefreshVersionedAccount— usually a sub-read's (Balance/Nonce/etc) version. The validator'spath==AddressPathchecks then fired against an entry vm never wrote. Removed the synthetic recording; sub-reads are already recorded byversionedRead.versionedReadMVReadResultNone: exclude AddressPath from thereadStorage==nilprobe recording.getStateObjectprobes vm then falls back to the stateReader (account does exist); the "we read nothing" recording made the validator's path==Address branch invalidate any pre-existing account that a prior tx in the block wrote BalancePath for.validateReadwrapped invalidateReadImplwith arecursiveflag. Two false-positive paths in the cross-validate recursion produce spurious Invalid when invoked as a synthetic probe (readVal nil, source borrowed): the path==AddressPath BalancePath-presence check, and thesource!=MapRead+nil-readValtiebreaker. Both gated on!recursive.validateReadalso gained an SD-staleness cross-check in theMVReadResultDonebranch: a later tx self-destructing without revival means the read predates the destruct and the serial path returns zero via the SD-zero short-circuit;checkVersionalone misses this because SD doesn't write the read's own path.revivalLimit = txIndex - 1excludes our own writes (validation runs post-flush).deferred[]queue instead ofpending.scheduleExecutiondrains a deferred tx only when its predecessor has validated AND no worker at a lower index is in flight. Lower-indexed workers' flushes land at indices visible to N's reads viavm.Read'sfloor(N-1); higher-indexed ones don't. Non-deferred txs keep dispatching speculatively. A safety net force-drains when pending is empty and no workers are in flight.tooManyRetrieshelper + cap on validator-invalid retries (mirrors the worker-error path) so a genuine loop can't spin forever.#21286 (yperbasis) fixes the same
revivalLimit = txIndex - 1insight in the runtime SD short-circuit (versionedRead+versionedStateReader.ReadAccountData); this PR fixes the validator/recording side. The two are complementary — the runtime fix prevents the bad read at source, this PR prevents the validator from generating false-positive invalidations on legitimate reads.Test plan
make lintclean (run multiple times).TestDeleteRecreateSlotsAcrossManyBlocksunderGOMAXPROCS=2 ERIGON_EXEC3_PARALLEL=true go test -race(vs ~95-97/100 baseline).origin/main.cancunblocktests (8389) +eip4844_blobsrace-build (6991) + the previously-affectedtest_sufficient_balance_blob_tx_pre_fund_tx(864) all pass under the 12-worker parallel runner.