execution/stagedsync: parallel-exec - flush invalid-tx writes as Estimate#21667
Conversation
…mate Apply loop's FlushVersionedWrites was called with `cntInvalid == 0` as the `complete` flag. That counter only tracks *prior* VersionTooEarly txs in the current iteration - it does NOT include the current tx's own VersionInvalid verdict. An invalidated tx's writes were flushed as Done and became visible to downstream OCC readers as committed phantom state. Symptoms observed before this fix: - Intermittent wrong-trie-root failures during from-genesis parallel sync - Gnosis Pattern-2 family gas-mismatches on DEX strategy contracts in the 14.8M-18.5M block range - Captured deterministically at gnosis block 18,483,405: tx[3] inc=0 flagged Invalid, but its slot-write to 0x18b2b767...:0x08 was flushed as Done. tx[16] read the phantom via MapRead, validated against matching readVersion (version-only check passes), committed phantom-derived state. Cascaded gas-mismatch surfaced ~80K blocks later. Fix: gate the `complete` flag on `valid && cntInvalid == 0` via a new applyLoopFlushAsComplete helper. An invalidated tx now flushes as Estimate, which causes downstream reads of those cells to return MVReadResultDependency. The validator treats those as VersionInvalid, forcing the dependent tx to re-execute after the retry settles - restoring OCC's "Done = committed" invariant. Companion clarity changes: - exec3_parallel.go: read txVersion from txResult.Task.Version() (the *taskVersion wrapper that carries the current Incarnation) instead of bare TxTask which always returns Incarnation=0. - txtask.go: drop dead TxTask.Incarnation field. Live source of truth is the txIncarnations counter plus the taskVersion wrapper. Tests: TestApplyLoopFlushAsComplete covers the helper truth-table including the regression guard. TestApplyLoopFlush_InvalidTxWritesAreEstimate drives the production helper through a VersionMap and asserts downstream reads return MVReadResultDependency (not MVReadResultDone). Verification: gnosis oracle CHECK swept 18.46M-18.70M+ end-to-end with zero STATE-ORACLE-DIVERGENCE events. Current from-genesis resync past historical fail blocks 14.84M, 14.85M, 16.42M without recurrence. Needs to be merged before 3.5 is cut.
There was a problem hiding this comment.
Pull request overview
Fixes a correctness issue in parallel execution’s OCC apply/validation loop where writes from transactions that failed validation could be flushed as Done, making them appear committed to downstream readers and causing non-deterministic mis-execution (e.g., wrong trie roots / gas mismatches).
Changes:
- Gate
versionMap.FlushVersionedWrites(..., complete=...)so only validator-approved transactions with no prior invalids in the iteration can flush asDone; otherwise flush asEstimate. - Use the scheduled task wrapper’s
Version()(with current incarnation) when derivingtxVersionduring validation, improving correctness/observability on retries. - Add regression/unit tests covering the flush gating truth table and a VersionMap-level repro asserting downstream reads become
MVReadResultDependency. - Remove the dead
TxTask.Incarnationfield (incarnation is sourced fromtxIncarnations+ thetaskVersionwrapper).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
execution/stagedsync/exec3_parallel.go |
Prevents invalidated tx writes from being flushed as Done; uses result task version to reflect current incarnation. |
execution/stagedsync/exec3_parallel_robustness_test.go |
Adds tests validating the new flush gating behavior and guarding the phantom-write regression. |
execution/exec/txtask.go |
Removes unused TxTask.Incarnation field in favor of the existing incarnation tracking mechanism. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // previously errored despite blocks 1..114 being applied cleanly. | ||
| // | ||
| // applyLoopFlushAsComplete decides the `complete` flag the apply loop passes | ||
| // to versionMap.FlushVersionedWrites. A tx's writes can only be flagged Done | ||
| // (vs Estimate) when: | ||
| // - this tx itself passed validation (`valid`), AND | ||
| // - no prior tx in the same validation iteration failed (`cntInvalid == 0`). | ||
| // | ||
| // The `valid` term is the regression guard for the gnosis-block-18,483,405 | ||
| // phantom-write bug: `cntInvalid` counts *prior* invalid txs in the iteration | ||
| // only, so a current tx with VersionInvalid verdict would otherwise be flushed | ||
| // as Done and become visible to downstream OCC readers as committed state. | ||
| // See TestApplyLoopFlush_InvalidTxWritesAreEstimate. | ||
| func applyLoopFlushAsComplete(valid bool, cntInvalid int) bool { | ||
| return valid && cntInvalid == 0 | ||
| } | ||
|
|
||
| // Pure function — extracted from the apply loop's channel-close branch | ||
| // so the invariant is unit-testable. See TestApplyLoopMissingBlocks. |
Drop the verbose comment block above the FlushVersionedWrites call site - the applyLoopFlushAsComplete helper documents the gating itself. Tighten both function-level comments to focus on the WHY without referencing the gnosis block number / test name (those belong in the commit message). Restore applyLoopMissingBlocks back to its original position above applyLoopFlushAsComplete so its docstring is no longer stranded between two unrelated functions. No behavior change.
|
Copilot's misplaced-docstring observation pointed at a real artifact of the in-flight merge: after Cleanup in
Per project comment rules: no behavior change. |
|
CI update — re-run on Previously-failing shards now PASS on this commit:
That confirms the original The 12 currently-failing shards in the latest run are all GitHub releases 504s during fixture download ( Local hive verification ( Net read: the OCC fix is not introducing wrong-trie-root failures. The intermittent wrong-trie-root class is being investigated as a follow-up. |
|
Regression confirmed — blocking ship. Reproduced locally with a tight retry loop on the specific failing CI test (`tests/prague/eip7685_general_purpose_el_requests/test_multi_type_requests.py::test_valid_multi_type_requests[fork_Prague-blockchain_test-withdrawal_from_eoa+withdrawal_from_contract+deposit_from_contract]`):
So the OCC fix is introducing an intermittent wrong-trie-root for this test under parallel exec. Root cause (in progress)Added a per-tx-flush dump probe in `calcState.FlushToUpdates`. Captured `good-flush.tsv` and `bad-flush.tsv` from successive runs — they differ on exactly one address: Address `0x2adc25665018aa1fe0e6bc666dac8fc2697ff9ba` = the block coinbase:
The bad run incorrectly EIP-161-empty-deletes the coinbase. Reading `calcState.ApplyWrites`: when a worker's inc=0 emits `SelfDestructPath=true` for X (EIP-161 emptiness check against a stale balance read), `calcState` marks `X.Deleted=true`. The apply-loop later invalidates the tx; under this PR's Estimate flag, downstream txs re-execute correctly — but the re-exec's inc=1 writeset doesn't touch X (the coinbase is touched via a different tx), so `calcState` keeps the stale `Deleted=true`. Last-write-wins doesn't recover. Pre-fix, the invalid tx's writes were flushed as Done in versionMap, downstream txs read the phantom and committed phantom-derived state — also wrong but consistent with the calc accumulator's stale view. So the trie root happened to match. Fix surfaceThe commit pipeline (`commitResults` channel → `calcState.ApplyWrites`) needs to honor OCC validity. Three options:
I'll explore option 2 first as it preserves the existing channel topology. |
…imate read calcFees uses versionedStateReader to fetch coinbase pre-state for the EIP-161 empty-removal check. versionedStateReader only honours Done floors and treats MVReadResultDependency (Estimate-flagged prior writes) as "not found", so coinbaseAcc becomes nil when a prior tx has an in-flight write to any coinbase account-state path. For zero-tip system txs that flipped coinbaseEmptyPre to true, calcFees emitted a bogus SelfDestructPath=true into the validated write set, which then propagated into the commit channel and the commitment trie -> wrong trie root. Pre-21591 this was masked: invalid-tx writes were flushed as Done, so the vsReader saw the phantom balance and coinbaseEmptyPre stayed false. With the OCC flush fix in 221007e, invalid-tx writes are correctly flushed as Estimate, and this latent calcFees defect surfaced as an intermittent EEST regression on test_valid_multi_type_requests[fork_Prague-blockchain_test-withdrawal_from_eoa+withdrawal_from_contract+deposit_from_contract] (24% fail rate on this branch vs 0% on main). Fix: when coinbaseEmptyPre would otherwise hold, probe the versionMap directly for an Estimate floor on Address/Balance/Nonce/CodeHash paths. If any exists, the read is indeterminate and the SD-emit path is suppressed; the validator will invalidate the current tx on the prior write's recorded version mismatch and re-execute against the Done value. Local repro of the failing fixture: 50/50 pass after fix (was 12/50).
…t Done versionedUpdate, used by versionedStateReader for finalize / system-tx reconstruction, only returned a versionMap value when the floor cell was Done, falling back to the pre-block DB value (recording no dependency) on an Estimate (MVReadResultDependency) floor. Since 221007e flushes invalidated txs' writes as Estimate, the finalize path resurrects stale pre-block state for any in-flight prior write, committing a wrong write set and producing nondeterministic wrong state roots in EL-requests blocks under parallel execution. Consume Estimate floors too: they hold the same latest in-block write a Done floor does, and finalize reads are re-validated (the tx re-executes) if the writer's value later changes. The worker read-path, which aborts on an Estimate read, is unchanged, so the phantom-write fix is preserved.
…D on Estimate read" This reverts commit d7f7a40. That commit worked around a single instance of the finalize reader's Done-only defect — the coinbase EIP-161 empty-removal check in calcFees — by probing the versionMap for an Estimate floor and suppressing the SD-emit. The previous commit fixes the defect at its source in versionedUpdate, so vsReader now returns the coinbase's in-flight value and coinbaseEmptyPre is computed from the real balance rather than a spurious nil. The targeted probe is therefore redundant.
…t Done (#21678) ## Summary Root-cause fix for the nondeterministic wrong-state-root regression on this branch, generalizing the Estimate-read handling that `d7f7a40c5d` patched for one case. `versionedUpdate` (the `versionedStateReader` used for finalize / system-tx reconstruction) only consumed a versionMap floor flagged `Done`, treating an `Estimate` (`MVReadResultDependency`) floor as "not found" and falling back to the pre-block DB value. Since `221007e6a1` flushes invalidated txs' writes as `Estimate`, the finalize path resurrects stale pre-block state whenever a prior in-flight write exists — committing a wrong write set, producing **nondeterministic wrong state roots** in EL-requests blocks under parallel execution. `d7f7a40c5d` fixed one manifestation (the coinbase EIP-161 SD check in `calcFees`) with a targeted probe; that commit's own message names the general defect ("versionedStateReader only honours Done floors and treats MVReadResultDependency … as not found"). This PR fixes it at the source, so every finalize-path read (storage, all account fields, code) is covered. ## Commits 1. **finalize reader must consume Estimate cells** — `versionedUpdate` now consumes a floor that is `Done` *or* `Estimate`. An Estimate floor holds the same latest in-block write a Done floor does; finalize reads are re-validated (the tx re-executes) if the writer's value later changes — the same OCC safety net the rest of the apply loop relies on. The worker read-path, which aborts on an Estimate read, is unchanged, so the phantom-write fix this branch targets is preserved. 2. **Revert the `calcFees` coinbase probe** (`d7f7a40c5d`) — now redundant: `vsReader` returns the coinbase's in-flight value, so `coinbaseEmptyPre` is computed from the real balance rather than a spurious nil. ## Verification The net change is `versionedUpdate` only (the revert cancels `d7f7a40c5d`), so the branch is code-identical to "this branch before the coinbase probe, plus the versionedUpdate fix" — the configuration that was load-tested: - **eip7685 EL-requests**, 16-worker load, interleaved on one machine: unfixed branch ~4/30 (blocktest) and 9/25 (engine-x) → **0** with this fix. - **Full prague** (2,490 tests) × 4 passes: **0** failures, **0** regressions vs the serial baseline. - **Full-prague sensitivity** (5 passes): unfixed 1/5 → **0/5**. - `execution/stagedsync` + `execution/state` unit tests green (incl. this branch's phantom-write guard); `go vet` + `make lint` clean. Single-test runs pass; the bug only surfaces under load (CPU contention perturbs exec3 scheduling), which is why CI caught it but a one-off run does not. Targets `mh/parallel-exec-flush-invalid-as-estimate` so it lands before #21667 merges.
| invalidTxWrites := state.VersionedWrites{ | ||
| { | ||
| Address: addr, | ||
| Path: state.StoragePath, | ||
| Key: slot, | ||
| Version: state.Version{TxIndex: invalidTxIdx, Incarnation: invalidTxInc}, | ||
| Val: phantomVal, | ||
| }, | ||
| } |
| // versionMap.FlushVersionedWrites. cntInvalid counts only prior | ||
| // VersionTooEarly txs in the current iteration — not the current tx's own | ||
| // VersionInvalid verdict — so the `valid` term is required to prevent an | ||
| // invalidated tx's writes being flushed as Done and read as committed by | ||
| // downstream OCC consumers. |
…omment Per Copilot review on #21667: cntInvalid is incremented on both VersionTooEarly (apply loop line 2444) and VersionInvalid (line 2602) within the same iteration, not only on VersionTooEarly. The load-bearing property is that the current tx's own verdict is not included when applyLoopFlushAsComplete is called. No behavior change.
Brings origin/main (through the latest tip) into the typed-vio branch (#21536). Resolution: - committer.go: dropped the BAL-ahead-fold (foldedAhead/maybeFoldAhead/ foldBlockFromBAL/shadowCrossCheck) — it was introduced inside the typed-vio commit and is not on main. Took main's committer.go (the #21659 changeset-window: perBlockFrom/computeTransition) and removed the matching wiring from exec3.go / exec3_parallel.go (blockRequest channel + type, the unused calcMode enum). - execution/state: kept the typed per-path versioned reads/writes; applied main's #21667 (accept Dependency/Estimate cells), #21590 (SD-revival), #21659 (per-batch changesets), #21654 (phantom accesses). The storage net-zero read-value filter is retained in updateWrite, with new guard tests. Verified: build, make lint, execution/{state,exec,stagedsync,commitment} unit tests, and eest-spec parallel blocktests (devnet 82941/0, stable 69256/0).
Summary
Fixes a parallel-exec OCC correctness bug: the apply loop was flushing
writes from txs that failed validation (
VersionInvalid) with thecomplete=Doneflag, leaking phantom committed state into the versionMapwhere downstream readers picked it up.
Symptoms before this fix
sync (PR common/dbg: default EXEC3_PARALLEL=true #21591 made
EXEC3_PARALLEL=truethe default).especially in the gnosis 14.8M-18.5M block range.
tx[N+k] via
MapRead. Version-only validation passes becausereadVersion == writtenVersion. The downstream tx commits state derivedfrom the phantom, and the divergence surfaces as a gas-mismatch tens of
thousands of blocks later.
Captured deterministically at gnosis block 18,483,405: tx[3] inc=0
was flagged Invalid by
ValidateVersionBlock, but its 28 writes wereflushed with
complete=truebecausecntInvalid == 0(the counter onlytracks prior
VersionTooEarlytxs). Slot 0x08 on contract0x18b2b7673c6d661923e9460d592699617828b293was stored with valueaabS...5981. tx[16] subsequently read that phantom viaMapReadandcommitted phantom-derived state. The cascade surfaced as a gas-mismatch
roughly 80K blocks downstream.
Fix
New helper
applyLoopFlushAsComplete(valid, cntInvalid) = valid && cntInvalid == 0gates the
completeflag. An invalidated tx now flushes as Estimate,which causes downstream reads of those cells to return
MVReadResultDependency. The validator treats that asVersionInvalid,forcing the dependent tx to re-execute after the retry settles - restoring
OCC's `Done = committed` invariant.
Companion clarity changes in the same files:
(the `*taskVersion` wrapper that carries the current `Incarnation`)
instead of bare `TxTask` which always returns `Incarnation=0`. Logs
and traces now show the correct incarnation on retries.
truth is the `txIncarnations` counter plus the `taskVersion` wrapper.
Verification
Unit tests (`execution/stagedsync/exec3_parallel_robustness_test.go`):
including the regression guard case "INVALID current tx -> must NOT be Done".
production helper through a real `VersionMap` using the gnosis
18,483,405 contract+slot, asserts downstream read returns
`MVReadResultDependency` (not `MVReadResultDone`).
`cntInvalid == 0`, PASS with the fix.
Gnosis oracle CHECK (live): swept blocks 18.46M-18.70M+ end-to-end
with zero `STATE-ORACLE-DIVERGENCE` events, zero gas-mismatch, zero
wrong-trie-root.
From-genesis resync (live): currently past historical fail blocks
14,837,280 / 14,845,967 / 16,420,113 without recurrence; still in
progress through the rest of the 14.8M-18.5M range.
Milestone
Needs to be merged before 3.5 is cut. With PR #21591 making
`EXEC3_PARALLEL=true` the default on main, this bug class affects all
3.5+ users running parallel execution. Marked for the 3.5.0 milestone.
Test plan
`(valid, cntInvalid)` cases
-> re-exec path closes the OCC invariant