
[r3.4] db/state: prune TemporalMemBatch overlay entries past unwindToTxNum (#20625) #21538

Merged: AskAlexSharov merged 1 commit into release/3.4 from jklondon/cp-20625-overlay-unwind-r34 on May 31, 2026

@JkLondon

Cherry-pick of #20625 to release/3.4.

Why

Addresses #21515 — the gas used mismatch / state-leak after a tip reorg that recurs on v3.4.2. #21157 (already on release/3.4) fixed only the diffset-lookup-by-wrong-hash part of the bug. This is the second of the three reported sub-bugs: a stale read in the TemporalMemBatch in-memory overlay. Unwind() recorded unwindToTxNum + an unwindChangeset but never pruned sd.domains / sd.storage, so getLatest kept returning a write made inside the unwound txNum range — flipping an SSTORE from SET (20000) to RESET (2900), i.e. the ~17100-gas-per-slot shortfall users reported (diffs 71016 / 73638).

The third sub-bug (#20710) is already on release/3.4 as #20716 (104e6d1a97).

Adaptations

None — clean cherry-pick, no release/3.4-specific changes were needed.

Verification on release/3.4 + this patch

  • go build ./db/state/... — clean
  • go vet ./db/state/... — clean
  • go tool golangci-lint run ./db/state/... — 0 issues
  • regression test TestSharedDomain_UnwindDoesNotRestoreOverlayForNewKey — passes
  • go test -short ./db/state/... — all pass

…TxNum (#20625)

## Summary

Fix a post-unwind stale-read in the in-memory domain overlay that causes
gas-used mismatches on post-Fusaka mainnet catch-up.

`TemporalMemBatch` stores per-key overlay writes as `[]dataWithTxNum`,
each entry stamped with its write `txNum`. `getLatest` returns
`dataWithTxNums[len-1]` — the most recently appended entry — without
comparing it against `sd.unwindToTxNum`. `Unwind()` only recorded
`unwindToTxNum` + an `unwindChangeset`; it never touched `sd.domains` /
`sd.storage`. Since `unwindChangeset` is consulted only when the overlay
misses, any key still present in the overlay kept returning a pre-unwind
write made *inside* the unwound `txNum` range.
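
To make the stale read concrete, here is a minimal sketch of the pre-fix behaviour described above. It is a toy model, not the real `TemporalMemBatch`: the `overlay` type, its `put` / `unwind` methods and the field names inside `dataWithTxNum` are assumptions made for illustration.

```go
package main

import "fmt"

// dataWithTxNum follows the shape named in the summary: a value stamped
// with the txNum that wrote it. Field names are assumed.
type dataWithTxNum struct {
	data  []byte
	txNum uint64
}

// overlay is a hypothetical stand-in for the in-memory overlay.
type overlay struct {
	entries       map[string][]dataWithTxNum
	unwindToTxNum uint64
}

// put appends a new version for key, stamped with txNum.
func (o *overlay) put(key string, val []byte, txNum uint64) {
	o.entries[key] = append(o.entries[key], dataWithTxNum{data: val, txNum: txNum})
}

// unwind models the pre-fix behaviour: it records the target and leaves
// the stored entries untouched.
func (o *overlay) unwind(txNum uint64) { o.unwindToTxNum = txNum }

// getLatest returns the most recently appended entry without comparing
// its txNum against unwindToTxNum.
func (o *overlay) getLatest(key string) ([]byte, uint64, bool) {
	versions := o.entries[key]
	if len(versions) == 0 {
		return nil, 0, false
	}
	last := versions[len(versions)-1]
	return last.data, last.txNum, true
}

func main() {
	o := &overlay{entries: map[string][]dataWithTxNum{}}
	o.put("slot", []byte{0x01}, 100) // first-time write inside the soon-to-be-unwound range
	o.unwind(50)

	_, txNum, ok := o.getLatest("slot")
	// The overlay still hits, and the hit is newer than the unwind target.
	fmt.Println(ok, txNum > o.unwindToTxNum) // true true
}
```

Because the overlay hit short-circuits the lookup, any changeset fallback is never consulted for this key in the model.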

## Observed symptom

Observed during post-Fusaka mainnet catch-up, after a forkchoice-driven unwind. The
first re-executed block reads a storage slot that was first-written
inside the unwound range. The overlay returns the post-target write,
flipping the SSTORE cost from `SET (20000)` to `RESET (2900)` —
**exactly a 17100-gas shortfall per affected slot**.

- Block 24,898,955: `diff=-34200` (2 slots × 17100)
- Block 24,899,403: `diff=-73829` (compound — several slots affected)
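
The per-slot arithmetic can be checked with a short sketch. The two gas costs are the ones quoted above; the constant and function names are hypothetical.

```go
package main

import "fmt"

const (
	sstoreSetGas   = 20000 // cost charged when the slot is seen as a first-time write
	sstoreResetGas = 2900  // cost charged when the stale overlay read makes the slot look already set
)

// shortfall returns the gas under-charged when `slots` first-time writes
// are priced as resets.
func shortfall(slots int) int {
	return slots * (sstoreSetGas - sstoreResetGas)
}

func main() {
	fmt.Println(shortfall(1)) // 17100
	fmt.Println(shortfall(2)) // 34200, matching the block 24,898,955 diff
}
```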

Live trace instrumentation (not shipped) captured 3,082 `SD_STALE_READ`
events between an unwind at `txNum=3454259398` and the resulting
mismatch at block 24,899,403.

## Fix

On `Unwind`, walk `sd.domains` and `sd.storage` and drop any
`dataWithTxNum` whose `txNum > unwindToTxNum`. If a key's slice empties
out, delete the key so the `unwindChangeset` fallback (or the underlying
tx) supplies the pre-unwind answer. Runs under `sd.latestStateLock` so
the transition is atomic to concurrent reads. Storage-btree mutations
are staged after `Scan` to respect btree iterator rules.
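
The pruning step can be sketched as follows. This is an illustration under stated assumptions rather than the patched code: `overlay`, `pruneAfter` and the map-backed storage are hypothetical, the mutex only stands in for `sd.latestStateLock`, and the two-pass scan-then-apply loop imitates the staging that a btree iterator requires (a plain Go map would tolerate deletion during iteration).

```go
package main

import (
	"fmt"
	"sync"
)

type dataWithTxNum struct {
	data  []byte
	txNum uint64
}

// overlay is a hypothetical stand-in for the in-memory overlay.
type overlay struct {
	mu            sync.RWMutex // stands in for sd.latestStateLock
	entries       map[string][]dataWithTxNum
	unwindToTxNum uint64
}

// pruneAfter drops every version with txNum > target and deletes keys
// whose version list becomes empty, all under the write lock.
func (o *overlay) pruneAfter(target uint64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.unwindToTxNum = target

	// Pass 1 (scan): compute survivors without mutating the container.
	type change struct {
		key  string
		kept []dataWithTxNum
	}
	var staged []change
	for key, versions := range o.entries {
		kept := make([]dataWithTxNum, 0, len(versions))
		for _, v := range versions {
			if v.txNum <= target {
				kept = append(kept, v)
			}
		}
		if len(kept) != len(versions) {
			staged = append(staged, change{key, kept})
		}
	}

	// Pass 2 (apply): mutate only after the scan has finished.
	for _, c := range staged {
		if len(c.kept) == 0 {
			delete(o.entries, c.key) // let the fallback path answer for this key
		} else {
			o.entries[c.key] = c.kept
		}
	}
}

// getLatest returns the newest surviving version, if any.
func (o *overlay) getLatest(key string) ([]byte, bool) {
	o.mu.RLock()
	defer o.mu.RUnlock()
	versions := o.entries[key]
	if len(versions) == 0 {
		return nil, false
	}
	return versions[len(versions)-1].data, true
}

func main() {
	o := &overlay{entries: map[string][]dataWithTxNum{
		"new": {{data: []byte{0xaa}, txNum: 100}},
		"old": {{data: []byte{0x01}, txNum: 10}, {data: []byte{0x02}, txNum: 100}},
	}}
	o.pruneAfter(50)

	_, okNew := o.getLatest("new")
	oldVal, okOld := o.getLatest("old")
	fmt.Println(okNew, okOld, oldVal) // false true [1]
}
```

In this model, the key first written at `txNum=100` misses after `pruneAfter(50)`, while the key with an older version falls back to its `txNum=10` value.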

## Regression test

`TestSharedDomain_UnwindDoesNotRestoreOverlayForNewKey` in
`db/state/execctx/domain_shared_test.go`:
- writes a first-time storage value at `txNum=100`
- calls `Unwind(50)`
- asserts the overlay no longer returns the post-target write

The test fails on pre-fix code with an error that mirrors the mainnet
symptom, and passes with this change.

## Test plan

- [x] `go test -short ./db/state/...` — all pass
- [x] `make lint` — 0 issues
- [x] `make erigon` — builds clean
- [ ] Manual sync verification: post-Fusaka mainnet with `chaindata/`
wiped and `snapshots/` retained (same repro that produced
block-24,899,403 mismatch) — sync progresses past the catch-up /
first-forkchoice-unwind window without a gas mismatch.

## Known adjacent issues, out of scope

DB-layer siblings of this bug exist separately and are *not* addressed
here:

1. `db/state/domain.go:1317` — on-disk unwind currently conflates `nil`
("different step, skip") and `[]byte{}` ("key was absent, write
tombstone") via `if len(value) > 0`, so first-time writes in the unwound
range leave no restoring tombstone.
2. `db/state/domain.go:1665` — `getLatestFromDb` discards deletion
markers at a step within file range, so the caller falls through to
`getLatestFromFiles`, which has no concept of deletions and returns
stale pre-deletion data.
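
As an illustration of the first item, the sketch below shows why a `len(value) > 0` check cannot tell `nil` from `[]byte{}` in Go. It is not the code at `db/state/domain.go`; the function names and the action labels (`skip`, `tombstone`, `restore`) are hypothetical.

```go
package main

import "fmt"

// classifyByLen mirrors the conflated check: nil and empty both have
// length zero, so both take the same branch.
func classifyByLen(value []byte) string {
	if len(value) > 0 {
		return "restore"
	}
	return "skip"
}

// classifyByNil separates the two cases with an explicit nil comparison.
func classifyByNil(value []byte) string {
	switch {
	case value == nil:
		return "skip" // "different step, skip"
	case len(value) == 0:
		return "tombstone" // "key was absent, write tombstone"
	default:
		return "restore"
	}
}

func main() {
	var differentStep []byte // nil
	wasAbsent := []byte{}    // non-nil, zero length

	fmt.Println(classifyByLen(differentStep), classifyByLen(wasAbsent)) // skip skip
	fmt.Println(classifyByNil(differentStep), classifyByNil(wasAbsent)) // skip tombstone
}
```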

Both were previously addressed by #20483 and reverted by #20509 while
regressions were investigated. They need their own narrower fixes with
dedicated regression guards and should be staged as separate PRs so
they're independently revertible.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit 2698b38)
@AskAlexSharov AskAlexSharov merged commit 09c96ff into release/3.4 May 31, 2026
22 of 23 checks passed
@AskAlexSharov AskAlexSharov deleted the jklondon/cp-20625-overlay-unwind-r34 branch May 31, 2026 12:15
yperbasis added a commit that referenced this pull request Jun 1, 2026
Adds the **v3.4.3** section to `ChangeLog.md`, covering the user-facing
changes merged to `release/3.4` since v3.4.2, and sets the v3.4.2 header
to its release date (2026-05-22).

**Bugfixes**
- #21538 — second fix for the post-reorg `gas used mismatch` /
state-leak still hitting v3.4.2 users
- #21507 — `debug_getModifiedAccountsByHash` / `ByNumber` now match Geth
semantics
- #21389 — `--rpc.logs.maxresults` (documented in 3.4.0) now takes
effect via the CLI

**Improvements**
- #21502 — fail-fast on oversized `engine_newPayload` backward download
(less per-slot log spam / wasted fetches)

Docs-only / internal PRs (#21451, #21408) are intentionally omitted.

Version bump tracked separately in #21547.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>