Skip to content

[r3.4] cmd/integration, execution/stagedsync: clear canonical hash above snapshot tip on reset_state#21246

Merged
AskAlexSharov merged 3 commits into
release/3.4from
cp/r34-canonical-pointer-reorg
May 21, 2026
Merged

[r3.4] cmd/integration, execution/stagedsync: clear canonical hash above snapshot tip on reset_state#21246
AskAlexSharov merged 3 commits into
release/3.4from
cp/r34-canonical-pointer-reorg

Conversation

@JkLondon

Copy link
Copy Markdown
Member

Summary

Fixes a stale-canonical-pointer leak in integration reset_state that was reintroducing phantom state on hoodi snapshotters running release/3.4 even after a clean rebuild. Observed at block 2 818 468 on snapshotter-bm-v34-hoodi-n1: local kv.HeaderCanonical[2 818 468] = 0xac2ee57a… (a sidechain block) while the real canonical was 0x27db29e4…. The next canonical block, 2 818 476, then died with insufficient funds for gas * price + value on 0xdD11751cdD3f6EFf01B1f6151B640685bfa5dB4a — a fee-recipient sweep-tx that was short by exactly 17 100 × 1 265 000 000 = 21 631 500 000 000 wei, the EIP-2929 SSTORE-SET-vs-RESET delta for a single storage slot on 0x8684adde… that had been first-written by the surviving sidechain block 8 blocks earlier.

This is not a cherry-pick of a merged PR. The fix originates on release/3.4 and pairs with #21157 (already on release/3.4).

Root cause

reset_state historically called ResetExec, which clears the MDBX state-domain tables (AccountsDomain, StorageDomain, CodeDomain, CommitmentDomain, ReceiptDomain, RCacheDomain), the inverted indexes and ChangeSets3, plus the Execution/Finish/CustomTrace/TxLookup/Witnesses stage progress. It deliberately left kv.HeaderCanonical, kv.HeaderTD and the Headers/BlockHashes/Bodies/Senders stage progress alone, on the assumption that those tables faithfully reflect the consensus-layer view of the chain.

That assumption breaks once a forkchoice update has ever committed a canonical hash whose chain later turned out to be the wrong side of a reorg. Concretely, on the affected snapshotter:

  1. An older forkchoice update for chain A succeeded; WriteCanonicalHash(A, 2 818 468) was committed.
  2. CL switched to chain B. The forkchoice handler walked back, built newCanonicals including B@2 818 468, ran the staged unwind and reached the forward executionPipeline.Run step.
  3. Forward execution failed on a downstream block because the unwind on a pre-[r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind #21157 binary failed to revert CodeDomain / StorageDomain writes made by A (see execution/stagedsync/stage_execute.go:findExecutedDiffsetAtHeight introduced in [r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind #21157). The forkchoice handler returned BadBlock and the whole tx — including the WriteCanonicalHash(B, 2 818 468) issued at execution/execmodule/forkchoice.go:381 — rolled back. kv.HeaderCanonical[2 818 468] remained at A.
  4. The operator upgraded to a binary containing [r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind #21157 and ran integration reset_state. ResetExec cleared the state domains, but the stale kv.HeaderCanonical[2 818 468] = A survived.
  5. On restart the staged sync caught up from the snapshot tip (block 2 810 162, txNum 100 781 249, end of step 258). The execution stage read each block's body using its canonical hash, hit the stale pointer at 2 818 468, re-applied chain A's block as canonical and re-installed the phantom code+slot writes on 0x8684adde…. The subsequent canonical tx 3 of block 2 818 476 then took the SSTORE-RESET path instead of SSTORE-SET — saving exactly 17 100 gas and starving the fee-recipient sweep at tx 14.

seg integrity --check=CommitmentRoot on the same datadir is green (both v2.0-commitment.0-256.kv and v2.0-commitment.256-258.kv recompute matching roots), confirming that no published .kv is contaminated. The phantom lived purely in MDBX, fed by kv.HeaderCanonical whose stale entries reset_state was failing to evict.

Fix

ResetCanonicalAboveTip(ctx, db, frozenBlocks) truncates kv.HeaderCanonical and kv.HeaderTD from frozenBlocks+1 forward, caps Headers/BlockHashes/Bodies/Senders stage progress at frozenBlocks, and re-anchors HeadHeaderHash to the canonical hash recorded at the tip (when one is present). ResetState takes a new frozenBlocks parameter and invokes the helper after ResetExec. cmd/integration/commands/reset_state.go passes br.FrozenBlocks().

Why this is safe to roll into ResetState:

  • The snapshot tip is by construction immutable — the canonical hashes for heights ≤ frozenBlocks live in the frozen blocks themselves, so leaving the table contents below the tip untouched is correct.
  • Stage-progress writes only run when an individual stage is observably above the tip; the helper is a no-op on a clean db.
  • After the truncation the next forkchoice update from the consensus layer drives canonical assignments for the post-tip range fresh, with no remnant of any pre-fix sidechain commit able to mislead a forward catchup.

The forkchoice-handler logic itself (execution/execmodule/forkchoice.go) and the findExecutedDiffsetAtHeight unwind path from #21157 are untouched. This PR addresses the persistent-leftover side of the problem; future reorgs on a node carrying #21157 will unwind cleanly and won't accumulate the leftover.

Test plan

execution/stagedsync/rawdbreset/reset_stages_test.go adds two cases:

  • TestResetCanonicalAboveTip_ClearsStaleSidechainPointers seeds kv.HeaderCanonical entries for heights 100..110 (including an explicit non-zero 0x99… at height 105 to make the assertion distinguish "missing" from "zero"), sets the four stage progresses to 110 and writes a stale HeadHeaderHash. After the call with snapshotTip = 100, the test asserts that canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, HeadHeaderHash has been re-anchored, and each of the four stages has been capped at 100.
  • TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip calls the helper on a db whose only canonical entry sits exactly at the tip and whose Headers stage already equals the tip; the test asserts the tip is preserved and stage progress is not regressed.

Both tests use temporaltest.NewTestDB for a real MDBX/temporal backend, run in well under one second, and exercise the public helper directly without needing a full ExecModule.

  • go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/... — green.
  • go test -count=1 -timeout=300s ./db/test/... — green (covers existing callers of ResetExec).
  • go build ./cmd/integration/... — green (only production caller of ResetState).
  • CI full unit-test + lint pipeline.
  • Manual: run integration reset_state --datadir=<DD> --chain=hoodi on the affected snapshotter; verify kv.HeaderCanonical[2 818 468] is cleared, restart, watch catchup pass 2 818 476 cleanly, run seg integrity --check=CommitmentRoot before re-enabling snapshot publication.

🤖 Generated with Claude Code

…pshot tip on ResetState

reset_state historically wiped the MDBX state-domain tables and reset the
Execution stage progress, but left kv.HeaderCanonical, kv.HeaderTD and the
Headers/BlockHashes/Bodies/Senders stage progress alone. A stale canonical
pointer at a height above the snapshot tip — typically a sidechain hash
deposited by a successful older forkchoice update whose later replacement
reorgs failed on execution and rolled back — survived the reset. After the
restart, the forward catchup read kv.HeaderCanonical[stale_height] and
applied the sidechain block as canonical, which re-introduced the very
phantom state reset_state was meant to clear. Observed on hoodi snapshotters
running release/3.4 at block 2 818 468 (local kv.HeaderCanonical pointed at
0xac2ee57a… while the real canonical was 0x27db29e4…).

Add ResetCanonicalAboveTip that truncates kv.HeaderCanonical and TD from
snapshotTip+1 forward, caps Headers/BlockHashes/Bodies/Senders progress at
the tip and re-anchors HeadHeaderHash. ResetState now takes a frozenBlocks
parameter and invokes the helper; the integration command passes
br.FrozenBlocks() at the call site. Stage-progress writes only run when
above the tip, so the routine is a safe no-op on a clean db.

After the change the next forkchoice update from the consensus layer
re-drives canonical assignments for the post-tip range fresh, with no
chance for stale sidechain pointers to be re-applied on catchup.

Unit tests in reset_stages_test.go cover both the stale-pointer cleanup
and the idempotent no-op-at-tip case.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wiping kv.HeaderTD by-number above the snapshot tip broke the Caplin
BlockCollector's parent-TD lookup after reset_state on hoodi:

  [WARN] [BlockCollector] Failed to insert blocks
    err="parent's total difficulty not found with hash 544f7113… and
         height 2837709: <nil>"

TD records live under (hash, number) keys and exist for both canonical and
sidechain blocks at the same height. They are consulted by Caplin when it
imports a not-yet-canonical block to verify parent.TD. Wiping them
by-number across the post-tip range made every subsequent insert fail
until the headers were re-fetched from peers.

The stale TD records are independently keyed by hash and do not affect
canonical-hash assignment, so removing them was unnecessary for the
stale-canonical-pointer fix in the first place — drop the TruncateTd
call and document why.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
JkLondon pushed a commit that referenced this pull request May 18, 2026
Wiping kv.HeaderTD by-number above the snapshot tip broke the Caplin
BlockCollector's parent-TD lookup after reset_state on hoodi:

  [WARN] [BlockCollector] Failed to insert blocks
    err="parent's total difficulty not found with hash 544f7113… and
         height 2837709: <nil>"

TD records live under (hash, number) keys and exist for both canonical and
sidechain blocks at the same height. They are consulted by Caplin when it
imports a not-yet-canonical block to verify parent.TD. Wiping them
by-number across the post-tip range made every subsequent insert fail
until the headers were re-fetched from peers.

The stale TD records are independently keyed by hash and do not affect
canonical-hash assignment, so removing them was unnecessary for the
stale-canonical-pointer fix in the first place — drop the TruncateTd
call and document why.

Mirrors the corresponding r3.4 fixup at PR #21246.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
return fmt.Errorf("read %s stage progress: %w", st, err)
}
if progress > frozenBlocks {
if err := stages.SaveStageProgress(tx, st, frozenBlocks); err != nil {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is FillDBFromSnapshots which does set correct values of stages.
don't need copy that biz-logic here.

i think most robust solution is - we already doing it (in stage_header --reset and stage_exec --reset):

tx.ClearTable(CanonicalTable)
clearStageProgress(stages)
Call FillDBFromSnapshots()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates integration reset_state so state resets also clear stale post-snapshot canonical pointers that can cause forward execution to replay sidechain blocks after a rebuild.

Changes:

  • Adds ResetCanonicalAboveTip to truncate HeaderCanonical above the frozen block tip, cap related stage progress, and re-anchor HeadHeaderHash.
  • Passes the block snapshot tip from the integration command into ResetState.
  • Adds regression tests for stale canonical pointer cleanup and idempotent no-op behavior at the tip.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
execution/stagedsync/rawdbreset/reset_stages.go Extends reset-state behavior with post-tip canonical cleanup and stage progress capping.
execution/stagedsync/rawdbreset/reset_stages_test.go Adds tests covering stale canonical cleanup and already-at-tip behavior.
cmd/integration/commands/reset_state.go Supplies frozen block count from the block reader to reset-state cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +100 to +109
if frozenBlocks > 0 {
tipHash, err := rawdb.ReadCanonicalHash(tx, frozenBlocks)
if err != nil {
return fmt.Errorf("read canonical hash at snapshot tip: %w", err)
}
if tipHash != (common.Hash{}) {
if err := rawdb.WriteHeadHeaderHash(tx, tipHash); err != nil {
return fmt.Errorf("re-anchor HeadHeaderHash to snapshot tip: %w", err)
}
}
Comment on lines +47 to +51
// forward, truncates TD likewise, caps Headers/BlockHashes/Bodies/Senders
// stage progress at the snapshot tip, and re-anchors HeadHeaderHash to the
// tip's canonical hash. The next forkchoice update from CL then drives
// canonical assignments fresh, with no chance for stale sidechain pointers
// to survive.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

…arTable + FillDBFromSnapshots

Per @AskAlexSharov review on #21246: instead of open-coding a truncate-
by-block-number + per-stage cap, follow the established pattern that
stage_header --reset and stage_exec --reset use to rebuild canonical
markers and stage progress from frozen snapshot files.

Rename ResetCanonicalAboveTip to ResetCanonicalAndRefillFromSnapshots and
collapse its body to three steps:

  tx.ClearTable(kv.HeaderCanonical)
  clearStageProgress(Headers, BlockHashes, Bodies, Senders, Snapshots)
  FillDBFromSnapshots(...)            // only when br.FrozenBlocks() > 0

ClearTable on kv.HeaderCanonical is safe: post-tip lookups have nothing
to fall through to (correct outcome — no canonical above the snapshot
tip), and snapshot-range lookups fall through to the frozen segments
already (see BlockReader.CanonicalHash). FillDBFromSnapshots then
rewrites the snapshot-range markers, TD records and stage progress.

ResetState takes (dirs, blockReader, logger) instead of a precomputed
frozenBlocks count so it can hand them through to FillDBFromSnapshots.
The integration command passes the existing blocksIO(db, logger) reader.

Also addresses Copilot review comments:
  - kv.HeaderTD is not truncated by ResetCanonicalAndRefillFromSnapshots,
    so the test no longer mentions TD.
  - HeadHeaderHash re-anchoring is handled inside FillDBFromSnapshots
    (or naturally on the next forkchoice update when no snapshots exist),
    so the genesis-edge case no longer needs special handling here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
JkLondon pushed a commit that referenced this pull request May 19, 2026
…arTable + FillDBFromSnapshots

Per @AskAlexSharov review on #21246 (the r3.4 mirror of this PR):
instead of open-coding a truncate-by-block-number + per-stage cap,
follow the established pattern that stage_header --reset and
stage_exec --reset use to rebuild canonical markers and stage progress
from frozen snapshot files.

Rename ResetCanonicalAboveTip to ResetCanonicalAndRefillFromSnapshots
and collapse its body to three steps:

  tx.ClearTable(kv.HeaderCanonical)
  clearStageProgress(Headers, BlockHashes, Bodies, Senders, Snapshots)
  FillDBFromSnapshots(...)            // only when br.FrozenBlocks() > 0

ClearTable on kv.HeaderCanonical is safe: post-tip lookups have nothing
to fall through to (correct outcome — no canonical above the snapshot
tip), and snapshot-range lookups fall through to the frozen segments
already (see BlockReader.CanonicalHash). FillDBFromSnapshots then
rewrites the snapshot-range markers, TD records and stage progress.

ResetState takes (dirs, blockReader, logger) instead of a precomputed
frozenBlocks count so it can hand them through to FillDBFromSnapshots.
The integration command passes the existing blocksIO(db, logger) reader.

Mirrors the corresponding r3.4 refresh at PR #21246.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@yperbasis yperbasis requested a review from Copilot May 20, 2026 07:56

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.


func ResetState(db kv.TemporalRwDB, ctx context.Context) error {
func ResetState(db kv.TemporalRwDB, ctx context.Context, dirs datadir.Dirs, br services.FullBlockReader, logger log.Logger) error {
// don't reset senders here
Comment on lines +93 to +100
func ResetCanonicalAndRefillFromSnapshots(ctx context.Context, db kv.TemporalRwDB, dirs datadir.Dirs, br services.FullBlockReader, logger log.Logger) error {
return db.Update(ctx, func(tx kv.RwTx) error {
if err := tx.ClearTable(kv.HeaderCanonical); err != nil {
return fmt.Errorf("clear canonical hash table: %w", err)
}
if err := clearStageProgress(tx, stages.Headers, stages.BlockHashes, stages.Bodies, stages.Senders, stages.Snapshots); err != nil {
return fmt.Errorf("clear chain-data stage progress: %w", err)
}
Comment on lines +101 to +104
if br.FrozenBlocks() > 0 {
if err := FillDBFromSnapshots("reset_state_fill_db_from_snapshots", ctx, tx, dirs, br, logger); err != nil {
return fmt.Errorf("refill canonical markers from snapshots: %w", err)
}
Comment on lines +38 to +48
// newEmptyBlockReader returns a BlockReader backed by an empty RoSnapshots
// instance, so FrozenBlocks() == 0 and the reset helper takes its
// "no-snapshots, skip FillDBFromSnapshots" path. FillDBFromSnapshots itself
// is exercised by stage_header --reset / stage_exec --reset integration
// paths and is out of scope for this unit test.
func newEmptyBlockReader(t *testing.T, dirs datadir.Dirs, logger log.Logger) *freezeblocks.BlockReader {
t.Helper()
snaps := freezeblocks.NewRoSnapshots(ethconfig.BlocksFreezing{ChainName: networkname.Mainnet}, dirs.Snap, logger)
t.Cleanup(snaps.Close)
return freezeblocks.NewBlockReader(snaps, nil)
}
@AskAlexSharov AskAlexSharov merged commit 6c5e1d0 into release/3.4 May 21, 2026
27 checks passed
@AskAlexSharov AskAlexSharov deleted the cp/r34-canonical-pointer-reorg branch May 21, 2026 07:46
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 21, 2026
…pshot tip on reset_state (erigontech#21247)

## Summary

Defense-in-depth cleanup of `integration reset_state` so it also evicts
stale `kv.HeaderCanonical` / `kv.HeaderTD` entries above the snapshot
tip and caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress
at the tip. Without this, a stale canonical-hash pointer survives
`reset_state` and steers the next forward catchup back onto a
non-canonical block.

Mirrors the `release/3.4` backport at erigontech#21246; the file structure of
`execution/stagedsync/rawdbreset/reset_stages.go` and
`cmd/integration/commands/reset_state.go` is identical on `main` and
`release/3.4`, so the patch applies verbatim.

## Why it matters on main even after the recent unwind fixes

`main` already carries the two reorg-correctness fixes that prevent
**new** stale canonical pointers from accumulating:

- `bfa03df625` ("parallel-commitment correctness for reorg/unwind + SD
recreate")
- erigontech#21157 ("execution/stagedsync: find diffset by actually-executed hash
on unwind")

After those changes, a forkchoice update whose forward execution fails
no longer leaks state because the unwind correctly reverts the
previously-applied sidechain. So in steady state on a fresh `main`
build, `kv.HeaderCanonical` should only ever hold canonical hashes.

However, datadirs created on older Erigon versions (including any
`release/3.4` build older than 2026-05-14) can carry leftover sidechain
entries in `kv.HeaderCanonical` from the time when the unwind bug was
active. Today's `reset_state` clears MDBX domain state but leaves those
entries alone — a forward catchup then reads them as authoritative and
re-introduces the very phantom state that `reset_state` was meant to
clear. Operators upgrading such a datadir to a fresh `main` binary will
hit the same symptom that bit the `release/3.4` snapshotter on hoodi at
block 2 818 476 (`insufficient funds` on a fee-recipient sweep tx, off
by exactly the EIP-2929 SSTORE-SET-vs-RESET delta `17 100 × 1 265 000
000 wei`).

This PR is therefore a no-cost forward-compat improvement: it harmlessly
no-ops on a fresh `main` datadir and fully recovers older /
cross-version datadirs without requiring `integration stage_exec
--unwind=N` or any other manual incantation.

## Fix

`ResetCanonicalAboveTip(ctx, db, frozenBlocks)`:

- `rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false)`
- `rawdb.TruncateTd(tx, frozenBlocks+1)`
- `WriteHeadHeaderHash` to canonical hash at the tip (when one is
recorded)
- Caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at
`frozenBlocks`

`ResetState` takes a new `frozenBlocks uint64` parameter and invokes the
helper after `ResetExec`. The only production caller,
`cmd/integration/commands/reset_state.go`, passes `br.FrozenBlocks()`.

Snapshot-tip data is by construction immutable, so leaving the
canonical-hash table contents at-or-below the tip untouched is correct;
stage-progress writes only run when an individual stage is observably
above the tip; the routine is a no-op when nothing is above the tip.

## Test plan

`execution/stagedsync/rawdbreset/reset_stages_test.go` adds two cases
against a real `temporaltest.NewTestDB` backend:

- `TestResetCanonicalAboveTip_ClearsStaleSidechainPointers` — seeds
canonical entries for heights 100..110 (including an explicit non-zero
`0x99…` at 105 to distinguish "missing" from "zero"), sets all four
stage progresses to 110 and writes a stale `HeadHeaderHash`. After the
call with `snapshotTip = 100`, asserts canonical hashes at 101..110 are
gone, the tip's canonical hash at 100 is preserved, `HeadHeaderHash` is
re-anchored to the tip's canonical hash, and each stage has been capped
at 100.
- `TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip` — calls the helper
on a db whose only canonical entry sits exactly at the tip and whose
`Headers` stage already equals the tip; asserts the tip is preserved and
stage progress is not regressed.

- [x] `go test -count=1 -timeout=120s
./execution/stagedsync/rawdbreset/...` — green.
- [x] `go build ./cmd/integration/...` — green (only production caller
of `ResetState`).
- [ ] CI full unit-test + lint pipeline.

## Related

- `release/3.4` backport: erigontech#21246
- Reorg-unwind correctness on r3.4: erigontech#21157
- Reorg-unwind correctness on main: `bfa03df625`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: JkLondon <me@ilyamikheev.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants