Skip to content

execution: fix parallel-exec SD-recreate flake + validator false positives#21294

Merged
mh0lt merged 6 commits into
mainfrom
mh/parallel-exec-defer-validator
May 20, 2026
Merged

execution: fix parallel-exec SD-recreate flake + validator false positives#21294
mh0lt merged 6 commits into
mainfrom
mh/parallel-exec-defer-validator

Conversation

@mh0lt

@mh0lt mh0lt commented May 20, 2026

Copy link
Copy Markdown
Contributor

Summary

TestDeleteRecreateSlotsAcrossManyBlocks flaked ~3–5% under GOMAXPROCS=2 -race with ERIGON_EXEC3_PARALLEL=true, in two modes: a +1602 gas mismatch (a CREATE2 collision burned inner gas on a stale CodeHash read) and a runaway validator-invalid retry loop. Several distinct validator/recording false positives contributed; this PR fixes them in turn.

  • getStateObject initialised accountVersion to sdb.Version() (the worker's own incarnation), gating out every prior-tx sub-read update in refreshVersionedAccount and pinning source/version to a value vm never wrote. Switched to UnknownVersion.
  • accountRead synthesized an AddressPath read using the version that flowed through refreshVersionedAccount — usually a sub-read's (Balance/Nonce/etc) version. The validator's path==AddressPath checks then fired against an entry vm never wrote. Removed the synthetic recording; sub-reads are already recorded by versionedRead.
  • versionedRead MVReadResultNone: exclude AddressPath from the readStorage==nil probe recording. getStateObject probes vm then falls back to the stateReader (account does exist); the "we read nothing" recording made the validator's path==Address branch invalidate any pre-existing account that a prior tx in the block wrote BalancePath for.
  • validateRead wrapped in validateReadImpl with a recursive flag. Two false-positive paths in the cross-validate recursion produce spurious Invalid when invoked as a synthetic probe (readVal nil, source borrowed): the path==AddressPath BalancePath-presence check, and the source!=MapRead+nil-readVal tiebreaker. Both gated on !recursive.
  • validateRead also gained an SD-staleness cross-check in the MVReadResultDone branch: a later tx self-destructing without revival means the read predates the destruct and the serial path returns zero via the SD-zero short-circuit; checkVersion alone misses this because SD doesn't write the read's own path. revivalLimit = txIndex - 1 excludes our own writes (validation runs post-flush).
  • Deferred queue: on ErrDependency with no effective blocker, OR validator-invalid, push to a new deferred[] queue instead of pending. scheduleExecution drains a deferred tx only when its predecessor has validated AND no worker at a lower index is in flight. Lower-indexed workers' flushes land at indices visible to N's reads via vm.Read's floor(N-1); higher-indexed ones don't. Non-deferred txs keep dispatching speculatively. A safety net force-drains when pending is empty and no workers are in flight.
  • tooManyRetries helper + cap on validator-invalid retries (mirrors the worker-error path) so a genuine loop can't spin forever.

#21286 (yperbasis) fixes the same revivalLimit = txIndex - 1 insight in the runtime SD short-circuit (versionedRead + versionedStateReader.ReadAccountData); this PR fixes the validator/recording side. The two are complementary — the runtime fix prevents the bad read at source, this PR prevents the validator from generating false-positive invalidations on legitimate reads.

Test plan

  • make lint clean (run multiple times).
  • 200/200 pass on TestDeleteRecreateSlotsAcrossManyBlocks under GOMAXPROCS=2 ERIGON_EXEC3_PARALLEL=true go test -race (vs ~95-97/100 baseline).
  • 100/100 pass on the same test rebased onto a clean origin/main.
  • No EEST regression: cancun blocktests (8389) + eip4844_blobs race-build (6991) + the previously-affected test_sufficient_balance_blob_tx_pre_fund_tx (864) all pass under the 12-worker parallel runner.
  • CI green on race-tests + EEST parallel shards.

@mh0lt mh0lt requested a review from yperbasis as a code owner May 20, 2026 00:32
…lse positives

TestDeleteRecreateSlotsAcrossManyBlocks under GOMAXPROCS=2 -race flaked
~3-5% with EXEC3_PARALLEL=true, in two modes: a +1602 gas mismatch (a
CREATE2 collision burned inner gas on a stale CodeHash read) and a
runaway validator-invalid retry loop.

Changes:

  execution/state/intra_block_state.go
    - getStateObject initialised accountVersion to sdb.Version() (the
      worker's own incarnation). refreshVersionedAccount's sub-read
      updates are gated on `bversion >= readVersion`, so every prior-tx
      write at a lower version was silently discarded, leaving
      accountSource pinned to StorageRead with a worker-private version
      vm never wrote. Switched to UnknownVersion so sub-reads promote.
    - Removed the synthetic AddressPath read recorded by accountRead: it
      borrowed the version that flowed through refreshVersionedAccount
      (often a sub-read's), tripping the validator's path==AddressPath
      checks against entries vm never wrote. Sub-reads are recorded by
      versionedRead directly.

  execution/state/versionedio.go
    - Exclude AddressPath from the readStorage==nil MVReadResultNone probe
      recording. getStateObject probes vm for AddressPath then falls back
      to the stateReader (the account does exist); recording the nil
      probe as "we read nothing" made the validator's path==Address
      BalancePath-presence check invalidate any pre-existing account a
      prior tx wrote BalancePath for.

  execution/state/versionmap.go
    - validateRead split into validateReadImpl with a `recursive` flag.
      Two false-positive paths in the cross-validate recursion produce
      spurious Invalid when invoked as a synthetic probe (readVal nil,
      source borrowed): the path==AddressPath BalancePath-presence check,
      and the source!=MapRead+nil-readVal tiebreaker. Both gated on
      !recursive.
    - Added an SD-staleness cross-check in the MVReadResultDone branch:
      a later tx self-destructing without revival means the read predates
      the destruct and the serial path returns zero via the SD-zero
      short-circuit; checkVersion alone misses it because SD doesn't
      write the read's own path. revivalLimit = txIndex-1 excludes the
      validating tx's own writes (validation runs post-flush).

  execution/stagedsync/exec3_parallel.go, exec3_status.go
    - On ErrDependency with no effective blocker (named blocker already
      complete) AND on validator-invalid, push the tx to a new deferred[]
      queue instead of immediate pending. scheduleExecution's drain
      predicate gates retry on (predecessor validated) AND (no worker at
      a lower index in flight) — lower-indexed workers' flushes land at
      indices visible to N's reads via vm.Read's floor(N-1). Non-deferred
      txs keep dispatching speculatively. A safety net force-drains when
      pending is empty and no workers are in flight.
    - tooManyRetries helper + cap on validator-invalid retries, mirroring
      the worker-error path, so a genuine loop can't spin forever.

Stress: 200/200 on TestDeleteRecreateSlotsAcrossManyBlocks (was
95-97/100 baseline); no regression on the EEST blocktest suite under
the 12-worker parallel runner.
@mh0lt mh0lt force-pushed the mh/parallel-exec-defer-validator branch from 468c98b to b23da1a Compare May 20, 2026 07:11
@yperbasis yperbasis requested review from Copilot and taratorio May 20, 2026 07:12
…r-validator

# Conflicts:
#	execution/state/versionmap.go

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets flakiness in parallel execution (ERIGON_EXEC3_PARALLEL=true) by reducing validator/recording false positives around account reads (especially AddressPath) and by improving retry scheduling/limits to avoid runaway invalidation loops. It adjusts both state read validation semantics (including SELFDESTRUCT staleness) and the exec3 scheduler’s retry handling.

Changes:

  • Refines VersionMap read validation by introducing a recursive-aware validation path and adding an SD-staleness cross-check to better match serial semantics.
  • Adjusts versioned read recording to avoid recording synthetic AddressPath nil-probes, and fixes getStateObject version propagation/recording behavior to reduce validator false positives.
  • Adds a deferred retry queue plus a validator-invalid retry cap to reduce immediate re-dispatch races and prevent infinite retry loops.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
execution/state/versionmap.go Adds validateReadImpl(recursive) and SD-staleness validation; adjusts cross-validation behavior.
execution/state/versionedio.go Avoids recording AddressPath nil-probes when readStorage == nil to prevent validator false positives.
execution/state/intra_block_state.go Fixes account version initialization (uses UnknownVersion) and removes synthetic AddressPath read recording in getStateObject.
execution/stagedsync/exec3_status.go Introduces deferred queue support in exec task status tracking.
execution/stagedsync/exec3_parallel.go Adds tooManyRetries helper, defers certain retries, drains deferred work under a predicate, and caps validator-invalid retry loops.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread execution/state/versionmap.go
Comment thread execution/state/versionmap.go

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is red

mh0lt added 4 commits May 20, 2026 08:10
Covers the SD-staleness cross-check in validateReadImpl: a version-
consistent MapRead is invalidated when a later tx self-destructed the
account with no revival, and stays valid when a revival write follows
the destruct. Also documents the valuesEqual tiebreaker branch.
b23da1a excluded AddressPath from the readStorage==nil
MVReadResultNone probe recording to stop the validator's
path==AddressPath branch over-invalidating on ordinary
BalancePath/NoncePath writes. That branch since switched to the
precise IncarnationPath signal, so the exclusion is no longer needed
— and it is harmful: the unrecorded probe leaves ValidateVersion
unable to detect account create/destruct conflicts, so parallel
workers commit against stale account state and the parallel-built
block access list diverges from the header.

Surfaced on the EEST bal-devnet (Amsterdam EIP-7928) blocktests:
~607 "block access list mismatch" failures under the parallel
runner; serial unaffected. Restoring the probe recording clears
them. TestDeleteRecreateSlotsAcrossManyBlocks stays 100/100 under
GOMAXPROCS=2 -race — the IncarnationPath check carries the
validator correctness the exclusion was added to protect.
b23da1a changed getStateObject to (a) seed refreshVersionedAccount
with UnknownVersion and (b) drop the synthetic accountRead AddressPath
recording, to quell validator path==AddressPath false positives behind
the SD-recreate flake fix.

Both are now redundant: the merged main carries #21286 (exclude own-tx
writes from the SD revival check), which independently covers the
+1602 / CREATE2-collision flake mode, and the validator already keys
on the precise IncarnationPath signal. They are also harmful — without
the accountRead recording the account is missing from the parallel-
built block access list, so selfdestruct/create-destruct family blocks
fail "block access list mismatch" under the parallel runner (serial
unaffected).

Restoring the original recording clears ~102 EEST bal-devnet
(Amsterdam) failures. TestDeleteRecreateSlotsAcrossManyBlocks stays
200/200 under GOMAXPROCS=2 (100 with -race) — the flake fix is carried
by #21286 and b23da1a's validator/scheduler changes.
@mh0lt

mh0lt commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

Updated — rebased on current main and fixed two real parallel-exec regressions this branch carried.

Re-merge: the branch was 39 commits behind; merged main (picks up #21275 EIP-8037 / v7.2.0 BAL fixtures, #21286 SD-revival fix, etc.).

CI is red was not just stalenessb23da1a90f had two recording changes that broke the parallel-built Block Access List on the Amsterdam (EIP-7928) devnet suite (709 block access list mismatch failures; serial unaffected, main clean). Root-caused by bisection:

  1. versionedio.go excluded AddressPath from the MVReadResultNone probe recording → the validator could no longer detect account create/destruct conflicts → workers committed against stale account state. The exclusion was added to quiet a path==AddressPath over-fire that the validator no longer has (it keys on IncarnationPath now). Fixed in 701b24a789.
  2. getStateObject dropped the accountRead recording → accounts missing from the parallel BAL → selfdestruct/create-destruct family mismatches. Redundant now that merged main carries execution/state: exclude own-tx writes from SD revival check #21286 for the +1602 flake mode. Fixed in 60a14368a2.

Verification (local, parallel runner):

  • devnet blocktests: 82941/82941, 0 failed (was 709) — matches main baseline exactly
  • -race Amsterdam shard: 21368/21368, 0 failed
  • stable blocktests: 69256/69256, 0 failed
  • TestDeleteRecreateSlotsAcrossManyBlocks: 200/200 (incl. 100 -race) — the original flake stays fixed
  • make lint / make erigon integration clean

Also added the SD-staleness validateRead unit tests (0a79dee893) per the Copilot review.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

execution/stagedsync/exec3_status.go:124

  • removeFromList can panic with an out-of-bounds access when v is not present and sort.SearchInts returns len(l): the condition x == -1 || l[x] != v will evaluate l[x] even when x == len(l). sort.SearchInts never returns -1. This should check x == len(l) before indexing (e.g., if x == len(l) || l[x] != v { ... }).
func removeFromList(l []int, v int, expect bool) []int {
	x := sort.SearchInts(l, v)
	if x == -1 || l[x] != v {
		if expect {
			panic(errors.New("should not happen - element expected in list"))
		}

		return l
	}

execution/state/versionedio.go:1053

  • PR description says AddressPath should be excluded from the readStorage==nil MVReadResultNone probe recording, but the current logic records all paths except CodePath (so AddressPath probes are still recorded) and the updated comment asserts they MUST be recorded. Please align the implementation and/or PR description so reviewers/users know whether AddressPath probes are intended to be recorded.
		if readStorage == nil {
			// Record reads so that ValidateVersion can detect when a prior
			// transaction modifies any account property.  Without tracking
			// these reads, validation misses conflicts where a prior tx
			// changes an account's balance/nonce/etc. — causing later txs
			// to execute against stale data, and the parallel-built block
			// access list (AsBlockAccessList) to diverge from the header.
			//
			// AddressPath probes MUST be recorded: getStateObject probes the
			// versionMap for AddressPath before falling back to the
			// stateReader. The validator's path==AddressPath branch cross-
			// checks the precise IncarnationPath signal (account create /
			// destruct), so the recorded probe does not over-invalidate on
			// ordinary BalancePath/NoncePath writes.
			//
			// Do NOT cache CodePath: getStateObject calls versionedRead for
			// CodePath with readStorage=nil to check if a prior tx wrote
			// code (EIP-7702).  Caching defaultV (nil) would poison the
			// ReadSet, causing subsequent getCode calls (which pass a real
			// readStorage callback) to return empty code instead of loading
			// it from the DB — breaking deposit contract execution, etc.
			if !commited && path != CodePath {
				vr.Source = StorageRead
				vr.Val = defaultV
				if s.versionedReads == nil {
					s.versionedReads = ReadSet{}
				}
				s.versionedReads.Set(vr)
			}
			return defaultV, UnknownSource, UnknownVersion, nil

Comment on lines 401 to 415
if source != MapRead {
// When BAL is present, significant writes for BalancePath,
// NoncePath, CodePath and StoragePath are pre-populated in the
// VersionMap before execution. If a read of one of those paths
// was from storage (no VersionMap entry at execution time) but
// the VersionMap now has an entry from a concurrent worker
// flush, the entry is a BAL-filtered no-op write and the read
// value is still correct.
//
// AddressPath and other paths are NOT pre-populated by the BAL,
// so a new VersionMap entry means a real state change from a
// concurrent worker (e.g. account creation) and must trigger
// invalidation.
// VersionMap before execution.
isBALPrePopulatedPath := path == BalancePath || path == NoncePath ||
path == CodePath || path == StoragePath
if !vm.HasBAL || !isBALPrePopulatedPath {
// Value tiebreaker: if the StorageRead value matches the
// versionMap Done value, the read is still valid despite
// the source mismatch. This avoids unnecessary invalidation
// when a prior TX wrote the same value that was in storage.
if readVal != nil && rr.Value() != nil && valuesEqual(path, readVal, rr.Value()) {
// Values match — read is valid
if recursive && readVal == nil {
// Synthetic probe — outer entry's own validation covers it.
} else if readVal != nil && rr.Value() != nil && valuesEqual(path, readVal, rr.Value()) {
// Value tiebreaker: a Done entry now exists where the read
// saw storage, but it holds the same value — read stays valid.
} else {
valid = VersionInvalid
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — the recursive cross-validate probes are indeed inert no-ops here. This is currently sound (account-lifecycle detection is carried by the recording-side SelfDestructPath dependency, the explicit ungated IncarnationPath check in the path==AddressPath branch, and the revival-aware SD-staleness check — the SD/Amsterdam-EIP-7928 suites are green). Rather than resurrect the recursive probe in this PR (a naive un-gate over-fires on every in-block-created account), I've filed #21318 to revisit validateReadImpl's cross-validation as part of the parallel-exec validator cleanup.

@mh0lt

mh0lt commented May 20, 2026

Copy link
Copy Markdown
Contributor Author

CI is fully green now (68/68 checks passing) — including the previously-red eest-spec-blocktests-devnet and -race Amsterdam shards. Ready for re-review.

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, but please take a look at the latest copilot comment.

@mh0lt mh0lt added this pull request to the merge queue May 20, 2026
Merged via the queue into main with commit 8990fc4 May 20, 2026
81 checks passed
@mh0lt mh0lt deleted the mh/parallel-exec-defer-validator branch May 20, 2026 14:53
yperbasis added a commit that referenced this pull request May 21, 2026
Two conflicts:

1) execution/state/versionmap.go — main split validateRead into a
   thin wrapper plus a new validateReadImpl(..., recursive bool) so
   recursive cross-validates can opt out of the source!=MapRead+nil-
   readVal tiebreaker (PR #21294). The AddressPath cross-check call
   conflicted with moksha's expanded comment (CreateContractPath +
   the deliberate EOA-creation-via-CALL+value gap, post-incarnation
   removal). Resolution: keep moksha's comment + the
   CreateContractPath signal, switch the cross-validate call to the
   new validateReadImpl(..., true) signature.

   Main also added IncarnationPath to the new
   "path != ... else cross-check" guard at line ~422; on moksha
   IncarnationPath does not exist, so the term is dropped.

2) execution/state/state_test.go — main extended SetCode with a
   tracing.CodeChangeReason argument and added a setIncarnation(1)
   call to the test scaffold. moksha already adopted the new SetCode
   signature from main; setIncarnation is gone with the rest of the
   incarnation removal. Resolution: take main's SetCode call, drop
   the setIncarnation line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants