[r3.4] cmd/integration, execution/stagedsync: clear canonical hash above snapshot tip on reset_state by JkLondon · Pull Request #21246 · erigontech/erigon

JkLondon · 2026-05-18T08:15:48Z

Summary

Fixes a stale-canonical-pointer leak in integration reset_state that was reintroducing phantom state on hoodi snapshotters running release/3.4 even after a clean rebuild. Observed at block 2 818 468 on snapshotter-bm-v34-hoodi-n1: local kv.HeaderCanonical[2 818 468] = 0xac2ee57a… (a sidechain block) while the real canonical was 0x27db29e4…. The next canonical block, 2 818 476, then died with insufficient funds for gas * price + value on 0xdD11751cdD3f6EFf01B1f6151B640685bfa5dB4a — a fee-recipient sweep-tx that was short by exactly 17 100 × 1 265 000 000 = 21 631 500 000 000 wei, the EIP-2929 SSTORE-SET-vs-RESET delta for a single storage slot on 0x8684adde… that had been first-written by the surviving sidechain block 8 blocks earlier.

This is not a cherry-pick of a merged PR. The fix originates on release/3.4 and pairs with #21157 (already on release/3.4).

Root cause

reset_state historically called ResetExec, which clears the MDBX state-domain tables (AccountsDomain, StorageDomain, CodeDomain, CommitmentDomain, ReceiptDomain, RCacheDomain), the inverted indexes and ChangeSets3, plus the Execution/Finish/CustomTrace/TxLookup/Witnesses stage progress. It deliberately left kv.HeaderCanonical, kv.HeaderTD and the Headers/BlockHashes/Bodies/Senders stage progress alone, on the assumption that those tables faithfully reflect the consensus-layer view of the chain.

That assumption breaks once a forkchoice update has ever committed a canonical hash whose chain later turned out to be the wrong side of a reorg. Concretely, on the affected snapshotter:

An older forkchoice update for chain A succeeded; WriteCanonicalHash(A, 2 818 468) was committed.
CL switched to chain B. The forkchoice handler walked back, built newCanonicals including B@2 818 468, ran the staged unwind and reached the forward executionPipeline.Run step.
Forward execution failed on a downstream block because the unwind on a pre-[r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind #21157 binary failed to revert CodeDomain / StorageDomain writes made by A (see execution/stagedsync/stage_execute.go:findExecutedDiffsetAtHeight introduced in [r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind #21157). The forkchoice handler returned BadBlock and the whole tx — including the WriteCanonicalHash(B, 2 818 468) issued at execution/execmodule/forkchoice.go:381 — rolled back. kv.HeaderCanonical[2 818 468] remained at A.
The operator upgraded to a binary containing [r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind #21157 and ran integration reset_state. ResetExec cleared the state domains, but the stale kv.HeaderCanonical[2 818 468] = A survived.
On restart the staged sync caught up from the snapshot tip (block 2 810 162, txNum 100 781 249, end of step 258). The execution stage read each block's body using its canonical hash, hit the stale pointer at 2 818 468, re-applied chain A's block as canonical and re-installed the phantom code+slot writes on 0x8684adde…. The subsequent canonical tx 3 of block 2 818 476 then took the SSTORE-RESET path instead of SSTORE-SET — saving exactly 17 100 gas and starving the fee-recipient sweep at tx 14.

seg integrity --check=CommitmentRoot on the same datadir is green (both v2.0-commitment.0-256.kv and v2.0-commitment.256-258.kv recompute matching roots), confirming that no published .kv is contaminated. The phantom lived purely in MDBX, fed by kv.HeaderCanonical whose stale entries reset_state was failing to evict.

Fix

ResetCanonicalAboveTip(ctx, db, frozenBlocks) truncates kv.HeaderCanonical and kv.HeaderTD from frozenBlocks+1 forward, caps Headers/BlockHashes/Bodies/Senders stage progress at frozenBlocks, and re-anchors HeadHeaderHash to the canonical hash recorded at the tip (when one is present). ResetState takes a new frozenBlocks parameter and invokes the helper after ResetExec. cmd/integration/commands/reset_state.go passes br.FrozenBlocks().

Why this is safe to roll into ResetState:

The snapshot tip is by construction immutable — the canonical hashes for heights ≤ frozenBlocks live in the frozen blocks themselves, so leaving the table contents below the tip untouched is correct.
Stage-progress writes only run when an individual stage is observably above the tip; the helper is a no-op on a clean db.
After the truncation the next forkchoice update from the consensus layer drives canonical assignments for the post-tip range fresh, with no remnant of any pre-fix sidechain commit able to mislead a forward catchup.

The forkchoice-handler logic itself (execution/execmodule/forkchoice.go) and the findExecutedDiffsetAtHeight unwind path from #21157 are untouched. This PR addresses the persistent-leftover side of the problem; future reorgs on a node carrying #21157 will unwind cleanly and won't accumulate the leftover.

Test plan

execution/stagedsync/rawdbreset/reset_stages_test.go adds two cases:

TestResetCanonicalAboveTip_ClearsStaleSidechainPointers seeds kv.HeaderCanonical entries for heights 100..110 (including an explicit non-zero 0x99… at height 105 to make the assertion distinguish "missing" from "zero"), sets the four stage progresses to 110 and writes a stale HeadHeaderHash. After the call with snapshotTip = 100, the test asserts that canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, HeadHeaderHash has been re-anchored, and each of the four stages has been capped at 100.
TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip calls the helper on a db whose only canonical entry sits exactly at the tip and whose Headers stage already equals the tip; the test asserts the tip is preserved and stage progress is not regressed.

Both tests use temporaltest.NewTestDB for a real MDBX/temporal backend, run in well under one second, and exercise the public helper directly without needing a full ExecModule.

go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/... — green.
go test -count=1 -timeout=300s ./db/test/... — green (covers existing callers of ResetExec).
go build ./cmd/integration/... — green (only production caller of ResetState).
CI full unit-test + lint pipeline.
Manual: run integration reset_state --datadir=<DD> --chain=hoodi on the affected snapshotter; verify kv.HeaderCanonical[2 818 468] is cleared, restart, watch catchup pass 2 818 476 cleanly, run seg integrity --check=CommitmentRoot before re-enabling snapshot publication.

🤖 Generated with Claude Code

…pshot tip on ResetState reset_state historically wiped the MDBX state-domain tables and reset the Execution stage progress, but left kv.HeaderCanonical, kv.HeaderTD and the Headers/BlockHashes/Bodies/Senders stage progress alone. A stale canonical pointer at a height above the snapshot tip — typically a sidechain hash deposited by a successful older forkchoice update whose later replacement reorgs failed on execution and rolled back — survived the reset. After the restart, the forward catchup read kv.HeaderCanonical[stale_height] and applied the sidechain block as canonical, which re-introduced the very phantom state reset_state was meant to clear. Observed on hoodi snapshotters running release/3.4 at block 2 818 468 (local kv.HeaderCanonical pointed at 0xac2ee57a… while the real canonical was 0x27db29e4…). Add ResetCanonicalAboveTip that truncates kv.HeaderCanonical and TD from snapshotTip+1 forward, caps Headers/BlockHashes/Bodies/Senders progress at the tip and re-anchors HeadHeaderHash. ResetState now takes a frozenBlocks parameter and invokes the helper; the integration command passes br.FrozenBlocks() at the call site. Stage-progress writes only run when above the tip, so the routine is a safe no-op on a clean db. After the change the next forkchoice update from the consensus layer re-drives canonical assignments for the post-tip range fresh, with no chance for stale sidechain pointers to be re-applied on catchup. Unit tests in reset_stages_test.go cover both the stale-pointer cleanup and the idempotent no-op-at-tip case. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Wiping kv.HeaderTD by-number above the snapshot tip broke the Caplin BlockCollector's parent-TD lookup after reset_state on hoodi: [WARN] [BlockCollector] Failed to insert blocks err="parent's total difficulty not found with hash 544f7113… and height 2837709: <nil>" TD records live under (hash, number) keys and exist for both canonical and sidechain blocks at the same height. They are consulted by Caplin when it imports a not-yet-canonical block to verify parent.TD. Wiping them by-number across the post-tip range made every subsequent insert fail until the headers were re-fetched from peers. The stale TD records are independently keyed by hash and do not affect canonical-hash assignment, so removing them was unnecessary for the stale-canonical-pointer fix in the first place — drop the TruncateTd call and document why. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Wiping kv.HeaderTD by-number above the snapshot tip broke the Caplin BlockCollector's parent-TD lookup after reset_state on hoodi: [WARN] [BlockCollector] Failed to insert blocks err="parent's total difficulty not found with hash 544f7113… and height 2837709: <nil>" TD records live under (hash, number) keys and exist for both canonical and sidechain blocks at the same height. They are consulted by Caplin when it imports a not-yet-canonical block to verify parent.TD. Wiping them by-number across the post-tip range made every subsequent insert fail until the headers were re-fetched from peers. The stale TD records are independently keyed by hash and do not affect canonical-hash assignment, so removing them was unnecessary for the stale-canonical-pointer fix in the first place — drop the TruncateTd call and document why. Mirrors the corresponding r3.4 fixup at PR #21246. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

AskAlexSharov · 2026-05-18T10:24:45Z

+				return fmt.Errorf("read %s stage progress: %w", st, err)
+			}
+			if progress > frozenBlocks {
+				if err := stages.SaveStageProgress(tx, st, frozenBlocks); err != nil {


there is FillDBFromSnapshots which does set correct values of stages.
don't need copy that biz-logic here.

i think most robust solution is - we already doing it (in stage_header --reset and stage_exec --reset):

tx.ClearTable(CanonicalTable) clearStageProgress(stages) Call FillDBFromSnapshots()

also see https://github.com/erigontech/erigon/pull/21210/changes#diff-6c1cb2bf086bb20a6189c4c0d4bb37bf152ccb86b5389c93342a6412af73ad5cR193

Copilot

Pull request overview

This PR updates integration reset_state so state resets also clear stale post-snapshot canonical pointers that can cause forward execution to replay sidechain blocks after a rebuild.

Changes:

Adds ResetCanonicalAboveTip to truncate HeaderCanonical above the frozen block tip, cap related stage progress, and re-anchor HeadHeaderHash.
Passes the block snapshot tip from the integration command into ResetState.
Adds regression tests for stale canonical pointer cleanup and idempotent no-op behavior at the tip.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`execution/stagedsync/rawdbreset/reset_stages.go`	Extends reset-state behavior with post-tip canonical cleanup and stage progress capping.
`execution/stagedsync/rawdbreset/reset_stages_test.go`	Adds tests covering stale canonical cleanup and already-at-tip behavior.
`cmd/integration/commands/reset_state.go`	Supplies frozen block count from the block reader to reset-state cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+		if frozenBlocks > 0 {
+			tipHash, err := rawdb.ReadCanonicalHash(tx, frozenBlocks)
+			if err != nil {
+				return fmt.Errorf("read canonical hash at snapshot tip: %w", err)
+			}
+			if tipHash != (common.Hash{}) {
+				if err := rawdb.WriteHeadHeaderHash(tx, tipHash); err != nil {
+					return fmt.Errorf("re-anchor HeadHeaderHash to snapshot tip: %w", err)
+				}
+			}


JkLondon · 2026-05-19T10:57:43Z

+// forward, truncates TD likewise, caps Headers/BlockHashes/Bodies/Senders
+// stage progress at the snapshot tip, and re-anchors HeadHeaderHash to the
+// tip's canonical hash. The next forkchoice update from CL then drives
+// canonical assignments fresh, with no chance for stale sidechain pointers
+// to survive.


@AskAlexSharov

…arTable + FillDBFromSnapshots Per @AskAlexSharov review on #21246: instead of open-coding a truncate- by-block-number + per-stage cap, follow the established pattern that stage_header --reset and stage_exec --reset use to rebuild canonical markers and stage progress from frozen snapshot files. Rename ResetCanonicalAboveTip to ResetCanonicalAndRefillFromSnapshots and collapse its body to three steps: tx.ClearTable(kv.HeaderCanonical) clearStageProgress(Headers, BlockHashes, Bodies, Senders, Snapshots) FillDBFromSnapshots(...) // only when br.FrozenBlocks() > 0 ClearTable on kv.HeaderCanonical is safe: post-tip lookups have nothing to fall through to (correct outcome — no canonical above the snapshot tip), and snapshot-range lookups fall through to the frozen segments already (see BlockReader.CanonicalHash). FillDBFromSnapshots then rewrites the snapshot-range markers, TD records and stage progress. ResetState takes (dirs, blockReader, logger) instead of a precomputed frozenBlocks count so it can hand them through to FillDBFromSnapshots. The integration command passes the existing blocksIO(db, logger) reader. Also addresses Copilot review comments: - kv.HeaderTD is not truncated by ResetCanonicalAndRefillFromSnapshots, so the test no longer mentions TD. - HeadHeaderHash re-anchoring is handled inside FillDBFromSnapshots (or naturally on the next forkchoice update when no snapshots exist), so the genesis-edge case no longer needs special handling here. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@AskAlexSharov

…arTable + FillDBFromSnapshots Per @AskAlexSharov review on #21246 (the r3.4 mirror of this PR): instead of open-coding a truncate-by-block-number + per-stage cap, follow the established pattern that stage_header --reset and stage_exec --reset use to rebuild canonical markers and stage progress from frozen snapshot files. Rename ResetCanonicalAboveTip to ResetCanonicalAndRefillFromSnapshots and collapse its body to three steps: tx.ClearTable(kv.HeaderCanonical) clearStageProgress(Headers, BlockHashes, Bodies, Senders, Snapshots) FillDBFromSnapshots(...) // only when br.FrozenBlocks() > 0 ClearTable on kv.HeaderCanonical is safe: post-tip lookups have nothing to fall through to (correct outcome — no canonical above the snapshot tip), and snapshot-range lookups fall through to the frozen segments already (see BlockReader.CanonicalHash). FillDBFromSnapshots then rewrites the snapshot-range markers, TD records and stage progress. ResetState takes (dirs, blockReader, logger) instead of a precomputed frozenBlocks count so it can hand them through to FillDBFromSnapshots. The integration command passes the existing blocksIO(db, logger) reader. Mirrors the corresponding r3.4 refresh at PR #21246. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.


-func ResetState(db kv.TemporalRwDB, ctx context.Context) error {
+func ResetState(db kv.TemporalRwDB, ctx context.Context, dirs datadir.Dirs, br services.FullBlockReader, logger log.Logger) error {
 	// don't reset senders here


+func ResetCanonicalAndRefillFromSnapshots(ctx context.Context, db kv.TemporalRwDB, dirs datadir.Dirs, br services.FullBlockReader, logger log.Logger) error {
+	return db.Update(ctx, func(tx kv.RwTx) error {
+		if err := tx.ClearTable(kv.HeaderCanonical); err != nil {
+			return fmt.Errorf("clear canonical hash table: %w", err)
+		}
+		if err := clearStageProgress(tx, stages.Headers, stages.BlockHashes, stages.Bodies, stages.Senders, stages.Snapshots); err != nil {
+			return fmt.Errorf("clear chain-data stage progress: %w", err)
+		}


+		if br.FrozenBlocks() > 0 {
+			if err := FillDBFromSnapshots("reset_state_fill_db_from_snapshots", ctx, tx, dirs, br, logger); err != nil {
+				return fmt.Errorf("refill canonical markers from snapshots: %w", err)
+			}


+// newEmptyBlockReader returns a BlockReader backed by an empty RoSnapshots
+// instance, so FrozenBlocks() == 0 and the reset helper takes its
+// "no-snapshots, skip FillDBFromSnapshots" path. FillDBFromSnapshots itself
+// is exercised by stage_header --reset / stage_exec --reset integration
+// paths and is out of scope for this unit test.
+func newEmptyBlockReader(t *testing.T, dirs datadir.Dirs, logger log.Logger) *freezeblocks.BlockReader {
+	t.Helper()
+	snaps := freezeblocks.NewRoSnapshots(ethconfig.BlocksFreezing{ChainName: networkname.Mainnet}, dirs.Snap, logger)
+	t.Cleanup(snaps.Close)
+	return freezeblocks.NewBlockReader(snaps, nil)
+}


…pshot tip on reset_state (erigontech#21247) ## Summary Defense-in-depth cleanup of `integration reset_state` so it also evicts stale `kv.HeaderCanonical` / `kv.HeaderTD` entries above the snapshot tip and caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at the tip. Without this, a stale canonical-hash pointer survives `reset_state` and steers the next forward catchup back onto a non-canonical block. Mirrors the `release/3.4` backport at erigontech#21246; the file structure of `execution/stagedsync/rawdbreset/reset_stages.go` and `cmd/integration/commands/reset_state.go` is identical on `main` and `release/3.4`, so the patch applies verbatim. ## Why it matters on main even after the recent unwind fixes `main` already carries the two reorg-correctness fixes that prevent **new** stale canonical pointers from accumulating: - `bfa03df625` ("parallel-commitment correctness for reorg/unwind + SD recreate") - erigontech#21157 ("execution/stagedsync: find diffset by actually-executed hash on unwind") After those changes, a forkchoice update whose forward execution fails no longer leaks state because the unwind correctly reverts the previously-applied sidechain. So in steady state on a fresh `main` build, `kv.HeaderCanonical` should only ever hold canonical hashes. However, datadirs created on older Erigon versions (including any `release/3.4` build older than 2026-05-14) can carry leftover sidechain entries in `kv.HeaderCanonical` from the time when the unwind bug was active. Today's `reset_state` clears MDBX domain state but leaves those entries alone — a forward catchup then reads them as authoritative and re-introduces the very phantom state that `reset_state` was meant to clear. Operators upgrading such a datadir to a fresh `main` binary will hit the same symptom that bit the `release/3.4` snapshotter on hoodi at block 2 818 476 (`insufficient funds` on a fee-recipient sweep tx, off by exactly the EIP-2929 SSTORE-SET-vs-RESET delta `17 100 × 1 265 000 000 wei`). This PR is therefore a no-cost forward-compat improvement: it harmlessly no-ops on a fresh `main` datadir and fully recovers older / cross-version datadirs without requiring `integration stage_exec --unwind=N` or any other manual incantation. ## Fix `ResetCanonicalAboveTip(ctx, db, frozenBlocks)`: - `rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false)` - `rawdb.TruncateTd(tx, frozenBlocks+1)` - `WriteHeadHeaderHash` to canonical hash at the tip (when one is recorded) - Caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at `frozenBlocks` `ResetState` takes a new `frozenBlocks uint64` parameter and invokes the helper after `ResetExec`. The only production caller, `cmd/integration/commands/reset_state.go`, passes `br.FrozenBlocks()`. Snapshot-tip data is by construction immutable, so leaving the canonical-hash table contents at-or-below the tip untouched is correct; stage-progress writes only run when an individual stage is observably above the tip; the routine is a no-op when nothing is above the tip. ## Test plan `execution/stagedsync/rawdbreset/reset_stages_test.go` adds two cases against a real `temporaltest.NewTestDB` backend: - `TestResetCanonicalAboveTip_ClearsStaleSidechainPointers` — seeds canonical entries for heights 100..110 (including an explicit non-zero `0x99…` at 105 to distinguish "missing" from "zero"), sets all four stage progresses to 110 and writes a stale `HeadHeaderHash`. After the call with `snapshotTip = 100`, asserts canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, `HeadHeaderHash` is re-anchored to the tip's canonical hash, and each stage has been capped at 100. - `TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip` — calls the helper on a db whose only canonical entry sits exactly at the tip and whose `Headers` stage already equals the tip; asserts the tip is preserved and stage progress is not regressed. - [x] `go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/...` — green. - [x] `go build ./cmd/integration/...` — green (only production caller of `ResetState`). - [ ] CI full unit-test + lint pipeline. ## Related - `release/3.4` backport: erigontech#21246 - Reorg-unwind correctness on r3.4: erigontech#21157 - Reorg-unwind correctness on main: `bfa03df625` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: JkLondon <me@ilyamikheev.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

JkLondon requested review from AskAlexSharov, Giulio2002, mh0lt and yperbasis as code owners May 18, 2026 08:15

JkLondon mentioned this pull request May 18, 2026

execution/stagedsync, cmd/integration: clear canonical hash above snapshot tip on reset_state #21247

Merged

3 tasks

AskAlexSharov requested changes May 18, 2026

View reviewed changes

AskAlexSharov requested a review from Copilot May 19, 2026 01:38

Copilot started reviewing on behalf of AskAlexSharov May 19, 2026 01:39 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

yperbasis requested a review from Copilot May 20, 2026 07:56

Copilot started reviewing on behalf of yperbasis May 20, 2026 07:56 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

AskAlexSharov approved these changes May 21, 2026

View reviewed changes

AskAlexSharov merged commit 6c5e1d0 into release/3.4 May 21, 2026
27 checks passed

AskAlexSharov deleted the cp/r34-canonical-pointer-reorg branch May 21, 2026 07:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[r3.4] cmd/integration, execution/stagedsync: clear canonical hash above snapshot tip on reset_state#21246

[r3.4] cmd/integration, execution/stagedsync: clear canonical hash above snapshot tip on reset_state#21246
AskAlexSharov merged 3 commits into
release/3.4from
cp/r34-canonical-pointer-reorg

JkLondon commented May 18, 2026

Uh oh!

AskAlexSharov May 18, 2026

Uh oh!

AskAlexSharov May 18, 2026

Uh oh!

JkLondon May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

JkLondon May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JkLondon commented May 18, 2026

Summary

Root cause

Fix

Test plan

Uh oh!

AskAlexSharov May 18, 2026

Choose a reason for hiding this comment

Uh oh!

AskAlexSharov May 18, 2026

Choose a reason for hiding this comment

Uh oh!

JkLondon May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

JkLondon May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants