Skip to content

db/state: preserve per-step unwind entries so Flush reverts every orphan#20710

Merged
yperbasis merged 5 commits into
mainfrom
execution/fix-commitment-unwind-divergence
Apr 21, 2026
Merged

db/state: preserve per-step unwind entries so Flush reverts every orphan#20710
yperbasis merged 5 commits into
mainfrom
execution/fix-commitment-unwind-divergence

Conversation

@yperbasis

@yperbasis yperbasis commented Apr 21, 2026

Copy link
Copy Markdown
Member

Summary

TemporalMemBatch.Unwind stored the incoming changeset in sd.unwindChangeset, a map keyed by change.Key[:len(change.Key)-8] — the real key with the 8-byte step suffix stripped. That shape is what the getLatest fallback needs (one pre-unwind value per real key), but Flush was re-exporting the same collapsed map to tx.UnwindAggregatorRoTx.UnwindDomainRoTx.unwind. Because only one entry per real key survived the collapse, per-step DeleteCurrent calls fired for only one step per key and orphan domain-values entries were left behind at every other step.

On mainnet this manifested as the commitment domain drifting one step ahead of every other domain after a sequence of forkchoice-driven unwinds. SeekCommitment then returned a stale blockNum above TxNums.Last(), execution started past the truly-executed tip, and the next block was rejected with "nonce too high".

Fix: keep a separate unwindChangesetRaw (step-preserving) alongside the collapsed map. Flush replays the raw list against tx.Unwind; getLatest continues to consult the collapsed map. Merge carries (and ClearRam clears) both copies.

Why

On a mainnet node, integration print_stages showed:

accounts    progress(txnum) 3460929701   step 8859
storage     progress(txnum) 3460929699   step 8859
code        progress(txnum) 3460929625   step 8859
commitment  progress(txnum) 3460937500   step 8860   <-- ahead by one full step
receipt     progress(txnum) 3460929700   step 8859

A direct read of the commitment values table turned up 22,038 orphan step-8860 entries plus the "state" key pointing at blockNum=24923992 after execution had been unwound to 24923939. The observable symptom was an endless invalid block, txnIdx=2, nonce too high: address 0x13c12Fe9…, tx: 14 state: 13 retry loop every ~12s.

Repro

New test TestSharedDomain_RepeatedUnwindAcrossStepBoundary writes accounts+commitment across three steps, runs three (execute, flush, unwind-to-earlier-step, flush) cycles and asserts that

  • the commitment "state" key's decoded blockNum ≤ unwindTarget, and
  • the commitment values table has no entries with step > unwindTarget/stepSize.

On main this fails with step=2 bn=29 / step=1 bn=19 dups still present after every unwind; with the fix the only remaining dups are at step 0 (the unwind-target step) and the latest points at the restored block.

TestSharedDomain_UnwindAcrossStepBoundary is a simpler single-unwind regression guard for the same invariant.

Notes

  • The Merge path now merges unwindChangesetRaw with the same newer/older directionality as unwindChangeset. It uses changeset.MergeDiffSets, which already dedupes by full (key, step) — so this doesn't reintroduce the collapse bug.

TemporalMemBatch.Unwind stored the incoming changeset in sd.unwindChangeset,
a map keyed by change.Key[:len(change.Key)-8] — i.e. the real key with the
step suffix stripped. That's the shape getLatest needs for its fallback
lookup (one pre-unwind value per real key), but Flush was re-exporting the
same collapsed map to tx.Unwind and into DomainRoTx.unwind — so only one
entry per real key survived, no matter how many (key, step) pairs the
original diff carried.

Each entry drives a DupSort DeleteCurrent at a specific step. When a key
was touched inside more than one step of the unwound range (e.g. a
commitment branch written to step N and N+1 during forward execution),
only one step's dup got deleted and the rest remained as orphans. For the
commitment domain this left the "state" key pointing at a blockNum beyond
the unwind target, and on the next open SeekCommitment returned that
stale block — so execution started past TxNums.Last() and rejected the
next block with "nonce too high" (observed on mainnet after three
forkchoice-driven ~54-block unwinds: 22k orphan step-8860 entries plus a
state key frozen at blockNum=24923992).

Fix: keep a separate unwindChangesetRaw alongside the collapsed map.
Flush replays the raw (step-preserving) list against tx.Unwind;
getLatest still consults the collapsed map. Merge and ClearRam carry and
clear the raw copy too.

New tests cover single-unwind across a step boundary and three
successive unwind cycles; both now verify the commitment "state" blockNum
and the values-table step tags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an unwind/flush corruption case where per-step domain-value entries were being collapsed to a single entry per “real key”, causing Flush() to miss deleting orphan entries at other steps (observed as the commitment domain drifting ahead of other domains after repeated FCU-driven unwinds).

Changes:

  • Preserve step-specific unwind diffs by storing a step-preserving unwindChangesetRaw alongside the collapsed unwindChangeset.
  • Update Flush() to replay unwinds using the raw (step-preserving) diffs so every orphan (key, step) entry is reverted.
  • Add regression tests covering single and repeated unwinds across a step boundary to ensure commitment state/values don’t remain ahead of the unwind target.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
db/state/temporal_mem_batch.go Adds unwindChangesetRaw, merges/clears it appropriately, and uses it in Flush() to unwind all step entries.
db/state/execctx/domain_shared_test.go Adds new regression tests for commitment/domain unwind correctness across step boundaries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread db/state/execctx/domain_shared_test.go Outdated
Comment thread db/state/execctx/domain_shared_test.go Outdated
yperbasis and others added 2 commits April 21, 2026 10:51
Both post-unwind commitment-state checks were wrapped in `if len(stateVal)
>= 16 { ... }`. If a future bug ever deleted or truncated the state key,
the test would silently pass — exactly the failure mode these tests try
to catch. Switch to `require.GreaterOrEqual(len, 16)` so the assertion
fails loudly instead of being skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dd Merge-path test

- Flush: pass sd.unwindChangesetRaw directly instead of dereferencing into
  a local (sort is in-place on the shared backing array either way).
- Merge: document that MergeDiffSets requires both raw changesets sorted
  by Key, and note where the invariant is established upstream.
- Add TestSharedDomain_MergeUnwindAcrossStepBoundary, a regression test
  that splits a step-boundary-crossing unwind between two SharedDomains
  and calls SharedDomains.Merge before Flush. Verified to fail against
  the pre-fix Flush path (commitment state decoded blockNum=9 against
  unwindTarget=4) and pass with the raw-list replay.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis marked this pull request as ready for review April 21, 2026 09:30
@yperbasis yperbasis requested review from mh0lt and taratorio April 21, 2026 09:33
@yperbasis yperbasis linked an issue Apr 21, 2026 that may be closed by this pull request

@mh0lt mh0lt left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its for a cross step unwind of a live value ? It would probably be worth doing this also for a delete (nil). As these have slightly different behaviour.

@mh0lt mh0lt enabled auto-merge April 21, 2026 10:51
@mh0lt mh0lt self-requested a review April 21, 2026 10:51
@mh0lt mh0lt added this pull request to the merge queue Apr 21, 2026
@yperbasis yperbasis removed this pull request from the merge queue due to a manual request Apr 21, 2026
Addresses PR feedback that the raw-changeset Flush fix was only covered
for live-value writes; deletes use the same DomainUpdate path but emit
a tombstone dup (^step || nil, 8 bytes) whose side effects on the
DupSort sort order are easy to reason about incorrectly.

Scenario: write addr in step 0, overwrite in step 0, delete in step 1,
then unwind to a block inside step 0 via doms.Unwind() + doms.Flush().
The collapsed-map Flush would replay only the step-0 restore diff and
leave the step-1 tombstone in MDBX — and since ^1 sorts before ^0 in
DupSort, SeekExact would return the tombstone, getLatestFromDb would
interpret the 8-byte value as "absent", and the account would look
deleted instead of restored to its block-0 value.

Verified: FAILS against a reverted Flush (uses the collapsed map) with
"post-unwind: addr must be restored (got empty slice — step-1 tombstone
was not cleaned up)", PASSES against the raw-list Flush.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis enabled auto-merge April 21, 2026 11:07
@yperbasis yperbasis added this pull request to the merge queue Apr 21, 2026
Merged via the queue into main with commit 2813e8f Apr 21, 2026
36 checks passed
@yperbasis yperbasis deleted the execution/fix-commitment-unwind-divergence branch April 21, 2026 12:46
AskAlexSharov pushed a commit that referenced this pull request Apr 21, 2026
…ry orphan (#20716)

Cherry-pick of #20710 to release/3.4.

## r3.4-specific adaptations

- `sd1.Merge` call in the new test dropped `ctx` — release/3.4's
signature is `Merge(uint64, *SharedDomains, uint64)`.
- `TestSharedDomain_RepeatedUnwindAcrossStepBoundary` is skipped: it
depends on the overlay-pruning semantics from PR #20625 (not present in
release/3.4). The single-unwind variants
(`TestSharedDomain_UnwindAcrossStepBoundary`,
`TestSharedDomain_MergeUnwindAcrossStepBoundary`,
`TestSharedDomain_UnwindWithDeleteAcrossStepBoundary`) still guard the
per-step orphan-entry invariant this PR fixes.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
AskAlexSharov pushed a commit that referenced this pull request May 31, 2026
…TxNum (#20625) (#21538)

Cherry-pick of #20625 to `release/3.4`.

## Why

Addresses #21515 — the `gas used mismatch` / state-leak after a tip
reorg that recurs on v3.4.2. #21157 (already on `release/3.4`) fixed
only the diffset-lookup-by-wrong-hash part of the bug. This is the
second of the three reported sub-bugs: a stale read in the
`TemporalMemBatch` in-memory overlay. `Unwind()` recorded
`unwindToTxNum` + an `unwindChangeset` but never pruned `sd.domains` /
`sd.storage`, so `getLatest` kept returning a write made *inside* the
unwound `txNum` range — flipping an `SSTORE` from `SET (20000)` to
`RESET (2900)`, i.e. the ~17100-gas-per-slot shortfall users reported
(diffs 71016 / 73638).

The third sub-bug (#20710) is already on `release/3.4` as #20716
(`104e6d1a97`).

## Adaptations

None — clean cherry-pick, no `release/3.4`-specific changes were needed.

## Verification on release/3.4 + this patch

- `go build ./db/state/...` — clean
- `go vet ./db/state/...` — clean
- `go tool golangci-lint run ./db/state/...` — 0 issues
- regression test
`TestSharedDomain_UnwindDoesNotRestoreOverlayForNewKey` — passes
- `go test -short ./db/state/...` — all pass

Co-authored-by: Mark Holt <135143369+mh0lt@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Gas mismatches on mainnet & Hoodi

3 participants