Skip to content

execution/stagedsync: parallel-exec - flush invalid-tx writes as Estimate#21667

Merged
yperbasis merged 8 commits into
mainfrom
mh/parallel-exec-flush-invalid-as-estimate
Jun 8, 2026
Merged

execution/stagedsync: parallel-exec - flush invalid-tx writes as Estimate#21667
yperbasis merged 8 commits into
mainfrom
mh/parallel-exec-flush-invalid-as-estimate

Conversation

@mh0lt

@mh0lt mh0lt commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a parallel-exec OCC correctness bug: the apply loop was flushing
writes from txs that failed validation (VersionInvalid) with the
complete=Done flag, leaking phantom committed state into the versionMap
where downstream readers picked it up.

Symptoms before this fix

  • Intermittent wrong-trie-root failures during from-genesis parallel
    sync (PR common/dbg: default EXEC3_PARALLEL=true #21591 made EXEC3_PARALLEL=true the default).
  • Pattern-2 family: gas-mismatch failures on DEX strategy contracts,
    especially in the gnosis 14.8M-18.5M block range.
  • Cascading mis-execution: a phantom write at tx[N] is picked up by
    tx[N+k] via MapRead. Version-only validation passes because
    readVersion == writtenVersion. The downstream tx commits state derived
    from the phantom, and the divergence surfaces as a gas-mismatch tens of
    thousands of blocks later.

Captured deterministically at gnosis block 18,483,405: tx[3] inc=0
was flagged Invalid by ValidateVersionBlock, but its 28 writes were
flushed with complete=true because cntInvalid == 0 (the counter only
tracks prior VersionTooEarly txs). Slot 0x08 on contract
0x18b2b7673c6d661923e9460d592699617828b293 was stored with value
aabS...5981. tx[16] subsequently read that phantom via MapRead and
committed phantom-derived state. The cascade surfaced as a gas-mismatch
roughly 80K blocks downstream.

Fix

New helper applyLoopFlushAsComplete(valid, cntInvalid) = valid && cntInvalid == 0
gates the complete flag. An invalidated tx now flushes as Estimate,
which causes downstream reads of those cells to return
MVReadResultDependency. The validator treats that as VersionInvalid,
forcing the dependent tx to re-execute after the retry settles - restoring
OCC's `Done = committed` invariant.

Companion clarity changes in the same files:

  • `exec3_parallel.go`: read `txVersion` from `txResult.Task.Version()`
    (the `*taskVersion` wrapper that carries the current `Incarnation`)
    instead of bare `TxTask` which always returns `Incarnation=0`. Logs
    and traces now show the correct incarnation on retries.
  • `txtask.go`: drop the dead `TxTask.Incarnation` field. Live source of
    truth is the `txIncarnations` counter plus the `taskVersion` wrapper.

Verification

  • Unit tests (`execution/stagedsync/exec3_parallel_robustness_test.go`):

    • `TestApplyLoopFlushAsComplete` - 4-case truth-table on the helper,
      including the regression guard case "INVALID current tx -> must NOT be Done".
    • `TestApplyLoopFlush_InvalidTxWritesAreEstimate` - drives the
      production helper through a real `VersionMap` using the gnosis
      18,483,405 contract+slot, asserts downstream read returns
      `MVReadResultDependency` (not `MVReadResultDone`).
    • Both confirmed to FAIL with the helper temporarily reverted to
      `cntInvalid == 0`, PASS with the fix.
  • Gnosis oracle CHECK (live): swept blocks 18.46M-18.70M+ end-to-end
    with zero `STATE-ORACLE-DIVERGENCE` events, zero gas-mismatch, zero
    wrong-trie-root.

  • From-genesis resync (live): currently past historical fail blocks
    14,837,280 / 14,845,967 / 16,420,113 without recurrence; still in
    progress through the rest of the 14.8M-18.5M range.

Milestone

Needs to be merged before 3.5 is cut. With PR #21591 making
`EXEC3_PARALLEL=true` the default on main, this bug class affects all
3.5+ users running parallel execution. Marked for the 3.5.0 milestone.

Test plan

  • CI green
  • Reviewer checks the helper's truth-table covers all four
    `(valid, cntInvalid)` cases
  • Reviewer confirms the `MVReadResultDependency` -> validator
    -> re-exec path closes the OCC invariant

…mate

Apply loop's FlushVersionedWrites was called with `cntInvalid == 0` as the
`complete` flag. That counter only tracks *prior* VersionTooEarly txs in the
current iteration - it does NOT include the current tx's own VersionInvalid
verdict. An invalidated tx's writes were flushed as Done and became visible
to downstream OCC readers as committed phantom state.

Symptoms observed before this fix:
- Intermittent wrong-trie-root failures during from-genesis parallel sync
- Gnosis Pattern-2 family gas-mismatches on DEX strategy contracts in the
  14.8M-18.5M block range
- Captured deterministically at gnosis block 18,483,405: tx[3] inc=0
  flagged Invalid, but its slot-write to 0x18b2b767...:0x08 was flushed as
  Done. tx[16] read the phantom via MapRead, validated against matching
  readVersion (version-only check passes), committed phantom-derived state.
  Cascaded gas-mismatch surfaced ~80K blocks later.

Fix: gate the `complete` flag on `valid && cntInvalid == 0` via a new
applyLoopFlushAsComplete helper. An invalidated tx now flushes as Estimate,
which causes downstream reads of those cells to return MVReadResultDependency.
The validator treats those as VersionInvalid, forcing the dependent tx to
re-execute after the retry settles - restoring OCC's "Done = committed"
invariant.

Companion clarity changes:
- exec3_parallel.go: read txVersion from txResult.Task.Version() (the
  *taskVersion wrapper that carries the current Incarnation) instead of
  bare TxTask which always returns Incarnation=0.
- txtask.go: drop dead TxTask.Incarnation field. Live source of truth is
  the txIncarnations counter plus the taskVersion wrapper.

Tests: TestApplyLoopFlushAsComplete covers the helper truth-table including
the regression guard. TestApplyLoopFlush_InvalidTxWritesAreEstimate drives
the production helper through a VersionMap and asserts downstream reads
return MVReadResultDependency (not MVReadResultDone).

Verification: gnosis oracle CHECK swept 18.46M-18.70M+ end-to-end with zero
STATE-ORACLE-DIVERGENCE events. Current from-genesis resync past historical
fail blocks 14.84M, 14.85M, 16.42M without recurrence.

Needs to be merged before 3.5 is cut.
@mh0lt mh0lt requested a review from yperbasis as a code owner June 7, 2026 21:27
@mh0lt mh0lt added this to the 3.5.0 milestone Jun 7, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a correctness issue in parallel execution’s OCC apply/validation loop where writes from transactions that failed validation could be flushed as Done, making them appear committed to downstream readers and causing non-deterministic mis-execution (e.g., wrong trie roots / gas mismatches).

Changes:

  • Gate versionMap.FlushVersionedWrites(..., complete=...) so only validator-approved transactions with no prior invalids in the iteration can flush as Done; otherwise flush as Estimate.
  • Use the scheduled task wrapper’s Version() (with current incarnation) when deriving txVersion during validation, improving correctness/observability on retries.
  • Add regression/unit tests covering the flush gating truth table and a VersionMap-level repro asserting downstream reads become MVReadResultDependency.
  • Remove the dead TxTask.Incarnation field (incarnation is sourced from txIncarnations + the taskVersion wrapper).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
execution/stagedsync/exec3_parallel.go Prevents invalidated tx writes from being flushed as Done; uses result task version to reflect current incarnation.
execution/stagedsync/exec3_parallel_robustness_test.go Adds tests validating the new flush gating behavior and guarding the phantom-write regression.
execution/exec/txtask.go Removes unused TxTask.Incarnation field in favor of the existing incarnation tracking mechanism.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread execution/stagedsync/exec3_parallel.go Outdated
Comment on lines 1187 to 1205
// previously errored despite blocks 1..114 being applied cleanly.
//
// applyLoopFlushAsComplete decides the `complete` flag the apply loop passes
// to versionMap.FlushVersionedWrites. A tx's writes can only be flagged Done
// (vs Estimate) when:
// - this tx itself passed validation (`valid`), AND
// - no prior tx in the same validation iteration failed (`cntInvalid == 0`).
//
// The `valid` term is the regression guard for the gnosis-block-18,483,405
// phantom-write bug: `cntInvalid` counts *prior* invalid txs in the iteration
// only, so a current tx with VersionInvalid verdict would otherwise be flushed
// as Done and become visible to downstream OCC readers as committed state.
// See TestApplyLoopFlush_InvalidTxWritesAreEstimate.
func applyLoopFlushAsComplete(valid bool, cntInvalid int) bool {
return valid && cntInvalid == 0
}

// Pure function — extracted from the apply loop's channel-close branch
// so the invariant is unit-testable. See TestApplyLoopMissingBlocks.
Drop the verbose comment block above the FlushVersionedWrites call site -
the applyLoopFlushAsComplete helper documents the gating itself. Tighten
both function-level comments to focus on the WHY without referencing the
gnosis block number / test name (those belong in the commit message).
Restore applyLoopMissingBlocks back to its original position above
applyLoopFlushAsComplete so its docstring is no longer stranded between
two unrelated functions.

No behavior change.
@mh0lt

mh0lt commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Copilot's misplaced-docstring observation pointed at a real artifact of the in-flight merge: after Merge branch 'main' into mh/parallel-exec-flush-invalid-as-estimate, main's richer applyLoopMissingBlocks docstring landed above applyLoopFlushAsComplete, and an older one-liner ended up stranded directly above applyLoopMissingBlocks — so the file had two comments stacked weirdly.

Cleanup in 7d887cc702:

  • Moved applyLoopMissingBlocks (and its main-side docstring) back into a single block above its own function.
  • Put applyLoopFlushAsComplete immediately after it, with a tightened comment.
  • Dropped the verbose comment above the FlushVersionedWrites call site (the helper documents itself) and tightened the txVersion source-explainer.

Per project comment rules: no behavior change.

@mh0lt

mh0lt commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

CI update — re-run on 7d887cc702 (comment cleanup only, no behavior change):

Previously-failing shards now PASS on this commit:

  • eest-spec-blocktests-devnet
  • eest-spec-enginextests-stable-parallel

That confirms the original Wrong trie root of block 1 failures on 221007e6a1 were intermittent, the same class as the parallel-exec wrong-trie-root we're separately investigating on mainnet from-genesis — not a deterministic regression from this fix.

The 12 currently-failing shards in the latest run are all GitHub releases 504s during fixture download (curl: (22) The requested URL returned error: 504 while fetching fixtures_bal.tar.gz). Retriable infrastructure issue, unrelated to the diff.

Local hive verification (./hive --sim ethereum/eels/consume-engine --sim.limit ".*eip7685_general_purpose_el_requests.*" --sim.buildarg fixtures=…bal@v7.1.1) hit a different failure category (invalid requests root hash, depositContractAddress is 0x0) which is a hive-config / chain-config issue, not OCC-related — test_valid_* cases fail because Erigon computes empty deposit requests when the deposit-contract address is unset.

Net read: the OCC fix is not introducing wrong-trie-root failures. The intermittent wrong-trie-root class is being investigated as a follow-up.

@mh0lt

mh0lt commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

Regression confirmed — blocking ship.

Reproduced locally with a tight retry loop on the specific failing CI test (`tests/prague/eip7685_general_purpose_el_requests/test_multi_type_requests.py::test_valid_multi_type_requests[fork_Prague-blockchain_test-withdrawal_from_eoa+withdrawal_from_contract+deposit_from_contract]`):

Binary Failure rate
origin/main baseline (no OCC fix) 0% (0/30)
this PR 24% (12/50)
this PR, EXEC3_PARALLEL=false 0% (0/20)

So the OCC fix is introducing an intermittent wrong-trie-root for this test under parallel exec.

Root cause (in progress)

Added a per-tx-flush dump probe in `calcState.FlushToUpdates`. Captured `good-flush.tsv` and `bad-flush.tsv` from successive runs — they differ on exactly one address:

Address `0x2adc25665018aa1fe0e6bc666dac8fc2697ff9ba` = the block coinbase:

  • GOOD: `flags=13 (Code|Balance|Nonce), bal=205708998560037, deleted=false`
  • BAD: `flags=2 (Delete), bal=0, deleted=true`

The bad run incorrectly EIP-161-empty-deletes the coinbase. Reading `calcState.ApplyWrites`: when a worker's inc=0 emits `SelfDestructPath=true` for X (EIP-161 emptiness check against a stale balance read), `calcState` marks `X.Deleted=true`. The apply-loop later invalidates the tx; under this PR's Estimate flag, downstream txs re-execute correctly — but the re-exec's inc=1 writeset doesn't touch X (the coinbase is touched via a different tx), so `calcState` keeps the stale `Deleted=true`. Last-write-wins doesn't recover.

Pre-fix, the invalid tx's writes were flushed as Done in versionMap, downstream txs read the phantom and committed phantom-derived state — also wrong but consistent with the calc accumulator's stale view. So the trie root happened to match.

Fix surface

The commit pipeline (`commitResults` channel → `calcState.ApplyWrites`) needs to honor OCC validity. Three options:

  1. Don't send to `commitResults` until apply-loop has validated the tx — adds latency to the calc pipeline.
  2. Have calc consume from `blockIO` (the validated set) rather than raw worker output.
  3. Track per-tx writes in `calcState` and undo when validator marks invalid.

I'll explore option 2 first as it preserves the existing channel topology.

mh0lt and others added 3 commits June 8, 2026 11:02
…imate read

calcFees uses versionedStateReader to fetch coinbase pre-state for the
EIP-161 empty-removal check. versionedStateReader only honours Done
floors and treats MVReadResultDependency (Estimate-flagged prior writes)
as "not found", so coinbaseAcc becomes nil when a prior tx has an
in-flight write to any coinbase account-state path. For zero-tip system
txs that flipped coinbaseEmptyPre to true, calcFees emitted a bogus
SelfDestructPath=true into the validated write set, which then
propagated into the commit channel and the commitment trie -> wrong
trie root.

Pre-21591 this was masked: invalid-tx writes were flushed as Done, so
the vsReader saw the phantom balance and coinbaseEmptyPre stayed false.
With the OCC flush fix in 221007e, invalid-tx writes are correctly
flushed as Estimate, and this latent calcFees defect surfaced as an
intermittent EEST regression on
test_valid_multi_type_requests[fork_Prague-blockchain_test-withdrawal_from_eoa+withdrawal_from_contract+deposit_from_contract]
(24% fail rate on this branch vs 0% on main).

Fix: when coinbaseEmptyPre would otherwise hold, probe the versionMap
directly for an Estimate floor on Address/Balance/Nonce/CodeHash paths.
If any exists, the read is indeterminate and the SD-emit path is
suppressed; the validator will invalidate the current tx on the prior
write's recorded version mismatch and re-execute against the Done value.

Local repro of the failing fixture: 50/50 pass after fix (was 12/50).
…t Done

versionedUpdate, used by versionedStateReader for finalize / system-tx
reconstruction, only returned a versionMap value when the floor cell was
Done, falling back to the pre-block DB value (recording no dependency) on an
Estimate (MVReadResultDependency) floor. Since 221007e flushes
invalidated txs' writes as Estimate, the finalize path resurrects stale
pre-block state for any in-flight prior write, committing a wrong write set
and producing nondeterministic wrong state roots in EL-requests blocks under
parallel execution.

Consume Estimate floors too: they hold the same latest in-block write a Done
floor does, and finalize reads are re-validated (the tx re-executes) if the
writer's value later changes. The worker read-path, which aborts on an
Estimate read, is unchanged, so the phantom-write fix is preserved.
…D on Estimate read"

This reverts commit d7f7a40.

That commit worked around a single instance of the finalize reader's
Done-only defect — the coinbase EIP-161 empty-removal check in calcFees —
by probing the versionMap for an Estimate floor and suppressing the SD-emit.
The previous commit fixes the defect at its source in versionedUpdate, so
vsReader now returns the coinbase's in-flight value and coinbaseEmptyPre is
computed from the real balance rather than a spurious nil. The targeted
probe is therefore redundant.
…t Done (#21678)

## Summary

Root-cause fix for the nondeterministic wrong-state-root regression on
this branch, generalizing the Estimate-read handling that `d7f7a40c5d`
patched for one case.

`versionedUpdate` (the `versionedStateReader` used for finalize /
system-tx reconstruction) only consumed a versionMap floor flagged
`Done`, treating an `Estimate` (`MVReadResultDependency`) floor as "not
found" and falling back to the pre-block DB value. Since `221007e6a1`
flushes invalidated txs' writes as `Estimate`, the finalize path
resurrects stale pre-block state whenever a prior in-flight write exists
— committing a wrong write set, producing **nondeterministic wrong state
roots** in EL-requests blocks under parallel execution.

`d7f7a40c5d` fixed one manifestation (the coinbase EIP-161 SD check in
`calcFees`) with a targeted probe; that commit's own message names the
general defect ("versionedStateReader only honours Done floors and
treats MVReadResultDependency … as not found"). This PR fixes it at the
source, so every finalize-path read (storage, all account fields, code)
is covered.

## Commits

1. **finalize reader must consume Estimate cells** — `versionedUpdate`
now consumes a floor that is `Done` *or* `Estimate`. An Estimate floor
holds the same latest in-block write a Done floor does; finalize reads
are re-validated (the tx re-executes) if the writer's value later
changes — the same OCC safety net the rest of the apply loop relies on.
The worker read-path, which aborts on an Estimate read, is unchanged, so
the phantom-write fix this branch targets is preserved.
2. **Revert the `calcFees` coinbase probe** (`d7f7a40c5d`) — now
redundant: `vsReader` returns the coinbase's in-flight value, so
`coinbaseEmptyPre` is computed from the real balance rather than a
spurious nil.

## Verification

The net change is `versionedUpdate` only (the revert cancels
`d7f7a40c5d`), so the branch is code-identical to "this branch before
the coinbase probe, plus the versionedUpdate fix" — the configuration
that was load-tested:

- **eip7685 EL-requests**, 16-worker load, interleaved on one machine:
unfixed branch ~4/30 (blocktest) and 9/25 (engine-x) → **0** with this
fix.
- **Full prague** (2,490 tests) × 4 passes: **0** failures, **0**
regressions vs the serial baseline.
- **Full-prague sensitivity** (5 passes): unfixed 1/5 → **0/5**.
- `execution/stagedsync` + `execution/state` unit tests green (incl.
this branch's phantom-write guard); `go vet` + `make lint` clean.

Single-test runs pass; the bug only surfaces under load (CPU contention
perturbs exec3 scheduling), which is why CI caught it but a one-off run
does not.

Targets `mh/parallel-exec-flush-invalid-as-estimate` so it lands before
#21667 merges.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment on lines +959 to +967
invalidTxWrites := state.VersionedWrites{
{
Address: addr,
Path: state.StoragePath,
Key: slot,
Version: state.Version{TxIndex: invalidTxIdx, Incarnation: invalidTxInc},
Val: phantomVal,
},
}
Comment thread execution/stagedsync/exec3_parallel.go Outdated
Comment on lines +1199 to +1203
// versionMap.FlushVersionedWrites. cntInvalid counts only prior
// VersionTooEarly txs in the current iteration — not the current tx's own
// VersionInvalid verdict — so the `valid` term is required to prevent an
// invalidated tx's writes being flushed as Done and read as committed by
// downstream OCC consumers.
…omment

Per Copilot review on #21667: cntInvalid is incremented on both
VersionTooEarly (apply loop line 2444) and VersionInvalid (line 2602)
within the same iteration, not only on VersionTooEarly. The
load-bearing property is that the current tx's own verdict is not
included when applyLoopFlushAsComplete is called.

No behavior change.
@yperbasis yperbasis added this pull request to the merge queue Jun 8, 2026
Merged via the queue into main with commit 17a45ce Jun 8, 2026
90 checks passed
@yperbasis yperbasis deleted the mh/parallel-exec-flush-invalid-as-estimate branch June 8, 2026 13:27
mh0lt added a commit that referenced this pull request Jun 12, 2026
Brings origin/main (through the latest tip) into the typed-vio branch (#21536).

Resolution:
- committer.go: dropped the BAL-ahead-fold (foldedAhead/maybeFoldAhead/
  foldBlockFromBAL/shadowCrossCheck) — it was introduced inside the typed-vio
  commit and is not on main. Took main's committer.go (the #21659
  changeset-window: perBlockFrom/computeTransition) and removed the matching
  wiring from exec3.go / exec3_parallel.go (blockRequest channel + type, the
  unused calcMode enum).
- execution/state: kept the typed per-path versioned reads/writes; applied
  main's #21667 (accept Dependency/Estimate cells), #21590 (SD-revival),
  #21659 (per-batch changesets), #21654 (phantom accesses). The storage
  net-zero read-value filter is retained in updateWrite, with new guard tests.

Verified: build, make lint, execution/{state,exec,stagedsync,commitment} unit
tests, and eest-spec parallel blocktests (devnet 82941/0, stable 69256/0).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants