Skip to content

[r34] db/state: preserve per-step unwind entries so Flush reverts every orphan#20716

Merged
AskAlexSharov merged 2 commits into
release/3.4from
cherry-pick-20710-r34
Apr 21, 2026
Merged

[r34] db/state: preserve per-step unwind entries so Flush reverts every orphan#20716
AskAlexSharov merged 2 commits into
release/3.4from
cherry-pick-20710-r34

Conversation

@yperbasis

Copy link
Copy Markdown
Member

Cherry-pick of #20710 to release/3.4.

r3.4-specific adaptations

  • sd1.Merge call in the new test dropped ctx — release/3.4's signature is Merge(uint64, *SharedDomains, uint64).
  • TestSharedDomain_RepeatedUnwindAcrossStepBoundary is skipped: it depends on the overlay-pruning semantics from PR db/state: prune TemporalMemBatch overlay entries past unwindToTxNum #20625 (not present in release/3.4). The single-unwind variants (TestSharedDomain_UnwindAcrossStepBoundary, TestSharedDomain_MergeUnwindAcrossStepBoundary, TestSharedDomain_UnwindWithDeleteAcrossStepBoundary) still guard the per-step orphan-entry invariant this PR fixes.

yperbasis and others added 2 commits April 21, 2026 14:59
…han (#20710)

## Summary

`TemporalMemBatch.Unwind` stored the incoming changeset in
`sd.unwindChangeset`, a map keyed by `change.Key[:len(change.Key)-8]` —
the real key with the 8-byte step suffix stripped. That shape is what
the `getLatest` fallback needs (one pre-unwind value per real key), but
`Flush` was re-exporting the same collapsed map to `tx.Unwind` →
`AggregatorRoTx.Unwind` → `DomainRoTx.unwind`. Because only one entry
per real key survived the collapse, per-step `DeleteCurrent` calls fired
for only one step per key and orphan domain-values entries were left
behind at every other step.

On mainnet this manifested as the commitment domain drifting one step
ahead of every other domain after a sequence of forkchoice-driven
unwinds. `SeekCommitment` then returned a stale `blockNum` above
`TxNums.Last()`, execution started past the truly-executed tip, and the
next block was rejected with `"nonce too high"`.

Fix: keep a separate `unwindChangesetRaw` (step-preserving) alongside
the collapsed map. `Flush` replays the raw list against `tx.Unwind`;
`getLatest` continues to consult the collapsed map. `Merge` carries (and
`ClearRam` clears) both copies.

## Why

On a mainnet node, `integration print_stages` showed:
```
accounts    progress(txnum) 3460929701   step 8859
storage     progress(txnum) 3460929699   step 8859
code        progress(txnum) 3460929625   step 8859
commitment  progress(txnum) 3460937500   step 8860   <-- ahead by one full step
receipt     progress(txnum) 3460929700   step 8859
```

A direct read of the commitment values table turned up 22,038 orphan
step-8860 entries plus the `"state"` key pointing at `blockNum=24923992`
after execution had been unwound to 24923939. The observable symptom was
an endless `invalid block, txnIdx=2, nonce too high: address
0x13c12Fe9…, tx: 14 state: 13` retry loop every ~12s.

## Repro

New test `TestSharedDomain_RepeatedUnwindAcrossStepBoundary` writes
accounts+commitment across three steps, runs three (execute, flush,
unwind-to-earlier-step, flush) cycles and asserts that
- the commitment `"state"` key's decoded `blockNum ≤ unwindTarget`, and
- the commitment values table has no entries with `step >
unwindTarget/stepSize`.

On main this fails with `step=2 bn=29 / step=1 bn=19` dups still present
after every unwind; with the fix the only remaining dups are at step 0
(the unwind-target step) and the latest points at the restored block.

`TestSharedDomain_UnwindAcrossStepBoundary` is a simpler single-unwind
regression guard for the same invariant.

## Notes

- The `Merge` path now merges `unwindChangesetRaw` with the same
newer/older directionality as `unwindChangeset`. It uses
`changeset.MergeDiffSets`, which already dedupes by full `(key, step)` —
so this doesn't reintroduce the collapse bug.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
(cherry picked from commit 2813e8f)
TestSharedDomain_RepeatedUnwindAcrossStepBoundary depends on the
overlay-pruning semantics introduced by PR #20625 (prune TemporalMemBatch
overlay entries past unwindToTxNum), which is only present on main.
release/3.4 uses a different approach (stagedsync.ClearLatestCache +
flush unwind changeset to DB before re-execution), so the test cannot
run unmodified. Skip it; the single-unwind variants still guard the
per-step orphan-entry invariant that #20710 fixes.

Also drop the ctx argument from the sd1.Merge call — release/3.4's
SharedDomains.Merge signature does not take a context.
@AskAlexSharov AskAlexSharov merged commit 104e6d1 into release/3.4 Apr 21, 2026
21 checks passed
@AskAlexSharov AskAlexSharov deleted the cherry-pick-20710-r34 branch April 21, 2026 13:33
AskAlexSharov pushed a commit that referenced this pull request May 31, 2026
…TxNum (#20625) (#21538)

Cherry-pick of #20625 to `release/3.4`.

## Why

Addresses #21515 — the `gas used mismatch` / state-leak after a tip
reorg that recurs on v3.4.2. #21157 (already on `release/3.4`) fixed
only the diffset-lookup-by-wrong-hash part of the bug. This is the
second of the three reported sub-bugs: a stale read in the
`TemporalMemBatch` in-memory overlay. `Unwind()` recorded
`unwindToTxNum` + an `unwindChangeset` but never pruned `sd.domains` /
`sd.storage`, so `getLatest` kept returning a write made *inside* the
unwound `txNum` range — flipping an `SSTORE` from `SET (20000)` to
`RESET (2900)`, i.e. the ~17100-gas-per-slot shortfall users reported
(diffs 71016 / 73638).

The third sub-bug (#20710) is already on `release/3.4` as #20716
(`104e6d1a97`).

## Adaptations

None — clean cherry-pick, no `release/3.4`-specific changes were needed.

## Verification on release/3.4 + this patch

- `go build ./db/state/...` — clean
- `go vet ./db/state/...` — clean
- `go tool golangci-lint run ./db/state/...` — 0 issues
- regression test
`TestSharedDomain_UnwindDoesNotRestoreOverlayForNewKey` — passes
- `go test -short ./db/state/...` — all pass

Co-authored-by: Mark Holt <135143369+mh0lt@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants