execution/stagedsync, cmd/integration: clear canonical hash above snapshot tip on reset_state by JkLondon · Pull Request #21247 · erigontech/erigon

JkLondon · 2026-05-18T08:24:37Z

Summary

Defense-in-depth cleanup of integration reset_state so it also evicts stale kv.HeaderCanonical / kv.HeaderTD entries above the snapshot tip and caps Headers/BlockHashes/Bodies/Senders stage progress at the tip. Without this, a stale canonical-hash pointer survives reset_state and steers the next forward catchup back onto a non-canonical block.

Mirrors the release/3.4 backport at #21246; the file structure of execution/stagedsync/rawdbreset/reset_stages.go and cmd/integration/commands/reset_state.go is identical on main and release/3.4, so the patch applies verbatim.

Why it matters on main even after the recent unwind fixes

main already carries the two reorg-correctness fixes that prevent new stale canonical pointers from accumulating:

bfa03df625 ("parallel-commitment correctness for reorg/unwind + SD recreate")
#21157 ("execution/stagedsync: find diffset by actually-executed hash on unwind")

After those changes, a forkchoice update whose forward execution fails no longer leaks state because the unwind correctly reverts the previously-applied sidechain. So in steady state on a fresh main build, kv.HeaderCanonical should only ever hold canonical hashes.

However, datadirs created on older Erigon versions (including any release/3.4 build older than 2026-05-14) can carry leftover sidechain entries in kv.HeaderCanonical from the time when the unwind bug was active. Today's reset_state clears MDBX domain state but leaves those entries alone — a forward catchup then reads them as authoritative and re-introduces the very phantom state that reset_state was meant to clear. Operators upgrading such a datadir to a fresh main binary will hit the same symptom that bit the release/3.4 snapshotter on hoodi at block 2 818 476 (insufficient funds on a fee-recipient sweep tx, off by exactly the EIP-2929 SSTORE-SET-vs-RESET delta 17 100 × 1 265 000 000 wei).
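The size of that shortfall can be checked by hand. The sketch below assumes that 1 265 000 000 wei is the effective gas price of the affected transaction (the description does not say so explicitly) and uses the post-Berlin SSTORE costs: 20 000 gas for a SET, and 5 000 minus the 2 100 cold-load cost for a RESET.

```go
package main

import "fmt"

// SSTORE gas costs after EIP-2929 (Berlin).
const (
	sstoreSetGas   uint64 = 20000                // zero -> non-zero slot write
	coldSloadCost  uint64 = 2100                 // cold storage access cost
	sstoreResetGas uint64 = 5000 - coldSloadCost // 2900: non-zero -> non-zero write
)

// sstoreDeltaWei returns the fee difference, in wei, between one SSTORE
// charged as SET and one charged as RESET at the given gas price.
func sstoreDeltaWei(gasPriceWei uint64) uint64 {
	return (sstoreSetGas - sstoreResetGas) * gasPriceWei
}

func main() {
	// Assumption: the second factor quoted above is the effective gas price.
	const gasPriceWei uint64 = 1_265_000_000
	fmt.Println("gas delta:", sstoreSetGas-sstoreResetGas)
	fmt.Println("wei delta:", sstoreDeltaWei(gasPriceWei))
}
```

Under those assumptions the gas delta is 17 100 and the balance shortfall is 21 631 500 000 000 wei. That is what one would expect if a single storage slot was non-zero in the phantom local state but zero on the real canonical chain, or the other way round.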

This PR is therefore a no-cost forward-compat improvement: it harmlessly no-ops on a fresh main datadir and fully recovers older / cross-version datadirs without requiring integration stage_exec --unwind=N or any other manual incantation.

Fix

ResetCanonicalAboveTip(ctx, db, frozenBlocks):

rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false)
rawdb.TruncateTd(tx, frozenBlocks+1)
WriteHeadHeaderHash to canonical hash at the tip (when one is recorded)
Caps Headers/BlockHashes/Bodies/Senders stage progress at frozenBlocks

ResetState takes a new frozenBlocks uint64 parameter and invokes the helper after ResetExec. The only production caller, cmd/integration/commands/reset_state.go, passes br.FrozenBlocks().

Snapshot-tip data is by construction immutable, so leaving the canonical-hash table contents at-or-below the tip untouched is correct; stage-progress writes only run when an individual stage is observably above the tip; the routine is a no-op when nothing is above the tip.
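The intended semantics can be modelled on plain maps. This is a toy sketch rather than the Erigon code: `chainDB`, `seedStale` and `resetAboveTip` are hypothetical names, hashes are plain strings, and the seeded values follow the test plan (heights 100..110, all four stages at 110).

```go
package main

import "fmt"

// chainDB is a hypothetical in-memory stand-in for the tables this PR touches.
type chainDB struct {
	canonical map[uint64]string // height -> hash, models kv.HeaderCanonical
	head      string            // models HeadHeaderHash
	progress  map[string]uint64 // stage name -> progress
}

var blockStages = []string{"Headers", "BlockHashes", "Bodies", "Senders"}

// seedStale builds a db with canonical entries for 100..110, all four
// stages at 110 and a head pointing at a stale sidechain hash.
func seedStale() *chainDB {
	db := &chainDB{
		canonical: map[uint64]string{},
		progress:  map[string]uint64{},
		head:      "0xstale",
	}
	for h := uint64(100); h <= 110; h++ {
		db.canonical[h] = fmt.Sprintf("0x%02x", h)
	}
	for _, st := range blockStages {
		db.progress[st] = 110
	}
	return db
}

// resetAboveTip drops canonical entries above tip, re-anchors head to the
// tip's hash when one is recorded, and caps stage progress at tip.
func resetAboveTip(db *chainDB, tip uint64) {
	for h := range db.canonical {
		if h > tip {
			delete(db.canonical, h) // deleting during range is safe in Go
		}
	}
	if hash, ok := db.canonical[tip]; ok && tip > 0 {
		db.head = hash
	}
	for _, st := range blockStages {
		if db.progress[st] > tip { // only ever lowers progress
			db.progress[st] = tip
		}
	}
}

func main() {
	db := seedStale()
	resetAboveTip(db, 100)
	fmt.Println(len(db.canonical), db.head, db.progress["Headers"])
}
```

Because every write is guarded by an "above the tip" check, calling `resetAboveTip` a second time changes nothing, which is the no-op property the description relies on.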

Test plan

execution/stagedsync/rawdbreset/reset_stages_test.go adds two cases against a real temporaltest.NewTestDB backend:

TestResetCanonicalAboveTip_ClearsStaleSidechainPointers — seeds canonical entries for heights 100..110 (including an explicit non-zero 0x99… at 105 to distinguish "missing" from "zero"), sets all four stage progresses to 110 and writes a stale HeadHeaderHash. After the call with snapshotTip = 100, asserts canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, HeadHeaderHash is re-anchored to the tip's canonical hash, and each stage has been capped at 100.
TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip — calls the helper on a db whose only canonical entry sits exactly at the tip and whose Headers stage already equals the tip; asserts the tip is preserved and stage progress is not regressed.
go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/... — green.
go build ./cmd/integration/... — green (only production caller of ResetState).
CI full unit-test + lint pipeline.

…pshot tip on ResetState

reset_state historically wiped the MDBX state-domain tables and reset the Execution stage progress, but left kv.HeaderCanonical, kv.HeaderTD and the Headers/BlockHashes/Bodies/Senders stage progress alone. A stale canonical pointer at a height above the snapshot tip (typically a sidechain hash deposited by a successful older forkchoice update whose later replacement reorgs failed on execution and rolled back) survived the reset. After the restart, the forward catchup read kv.HeaderCanonical[stale_height] and applied the sidechain block as canonical, which re-introduced the very phantom state reset_state was meant to clear.

Observed on hoodi snapshotters running release/3.4 at block 2 818 468 (local kv.HeaderCanonical pointed at 0xac2ee57a… while the real canonical was 0x27db29e4…). The fix is being ported to main as defense-in-depth: main's forkchoice unwind path (bfa03df + #21157) no longer accumulates stale pointers from new reorgs, but older datadirs upgraded from buggy versions can still carry leftover entries that survive reset_state.

Add ResetCanonicalAboveTip that truncates kv.HeaderCanonical and TD from snapshotTip+1 forward, caps Headers/BlockHashes/Bodies/Senders progress at the tip and re-anchors HeadHeaderHash. ResetState now takes a frozenBlocks parameter and invokes the helper; the integration command passes br.FrozenBlocks() at the call site. Stage-progress writes only run when above the tip, so the routine is a safe no-op on a clean db.

After the change the next forkchoice update from the consensus layer re-drives canonical assignments for the post-tip range fresh, with no chance for stale sidechain pointers to be re-applied on catchup.

Unit tests in reset_stages_test.go cover both the stale-pointer cleanup and the idempotent no-op-at-tip case.

Mirrors the corresponding r3.4 backport at PR #21246.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Wiping kv.HeaderTD by-number above the snapshot tip broke the Caplin BlockCollector's parent-TD lookup after reset_state on hoodi:

[WARN] [BlockCollector] Failed to insert blocks err="parent's total difficulty not found with hash 544f7113… and height 2837709: <nil>"

TD records live under (hash, number) keys and exist for both canonical and sidechain blocks at the same height. They are consulted by Caplin when it imports a not-yet-canonical block to verify parent.TD. Wiping them by-number across the post-tip range made every subsequent insert fail until the headers were re-fetched from peers.

The stale TD records are independently keyed by hash and do not affect canonical-hash assignment, so removing them was unnecessary for the stale-canonical-pointer fix in the first place. Drop the TruncateTd call and document why.

Mirrors the corresponding r3.4 fixup at PR #21246.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot

Pull request overview

Improves integration reset_state to also evict stale kv.HeaderCanonical entries above the snapshot tip and cap Headers/BlockHashes/Bodies/Senders stage progress to the tip. This protects datadirs created on older Erigon builds (where the pre-#21157 unwind bug could leak sidechain canonical pointers) from re-introducing phantom state on the next forward catchup. It is a no-op on a clean datadir.

Changes:

Add ResetCanonicalAboveTip helper that truncates kv.HeaderCanonical from frozenBlocks+1, re-anchors HeadHeaderHash to the tip's canonical hash, and caps block-import stage progress at the tip.
Thread a new frozenBlocks argument through ResetState, supplied by the sole production caller via br.FrozenBlocks().
Add unit tests covering the stale-pointer-clear and no-op-at-tip cases against temporaltest.NewTestDB.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
execution/stagedsync/rawdbreset/reset_stages.go	Adds `ResetCanonicalAboveTip` and wires it into `ResetState`.
execution/stagedsync/rawdbreset/reset_stages_test.go	New unit tests for the helper.
cmd/integration/commands/reset_state.go	Passes `br.FrozenBlocks()` into the updated `ResetState` signature.

Note: the top-level PR description's "Fix" bullet lists rawdb.TruncateTd(tx, frozenBlocks+1), but the code (and the in-file doc comment) deliberately do not truncate TD; the description should be reconciled with the implementation.


+func ResetCanonicalAboveTip(ctx context.Context, db kv.TemporalRwDB, frozenBlocks uint64) error {
+	return db.Update(ctx, func(tx kv.RwTx) error {
+		if err := rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false /* markChainAsBad */); err != nil {
+			return fmt.Errorf("truncate canonical hash above snapshot tip: %w", err)
+		}
+
+		if frozenBlocks > 0 {
+			tipHash, err := rawdb.ReadCanonicalHash(tx, frozenBlocks)
+			if err != nil {
+				return fmt.Errorf("read canonical hash at snapshot tip: %w", err)
+			}
+			if tipHash != (common.Hash{}) {
+				if err := rawdb.WriteHeadHeaderHash(tx, tipHash); err != nil {
+					return fmt.Errorf("re-anchor HeadHeaderHash to snapshot tip: %w", err)
+				}
+			}
+		}
+
+		for _, st := range []stages.SyncStage{stages.Headers, stages.BlockHashes, stages.Bodies, stages.Senders} {
+			progress, err := stages.GetStageProgress(tx, st)
+			if err != nil {
+				return fmt.Errorf("read %s stage progress: %w", st, err)
+			}
+			if progress > frozenBlocks {
+				if err := stages.SaveStageProgress(tx, st, frozenBlocks); err != nil {
+					return fmt.Errorf("save %s stage progress: %w", st, err)
+				}
+			}
+		}
+		return nil
+	})
+}


+// forward, truncates TD likewise, caps Headers/BlockHashes/Bodies/Senders
+// stage progress at the snapshot tip, and re-anchors HeadHeaderHash to the
+// tip's canonical hash. The next forkchoice update from CL then drives
+// canonical assignments fresh, with no chance for stale sidechain pointers
+// to survive.


@AskAlexSharov

…arTable + FillDBFromSnapshots

Per @AskAlexSharov review on #21246 (the r3.4 mirror of this PR): instead of open-coding a truncate-by-block-number + per-stage cap, follow the established pattern that stage_header --reset and stage_exec --reset use to rebuild canonical markers and stage progress from frozen snapshot files.

Rename ResetCanonicalAboveTip to ResetCanonicalAndRefillFromSnapshots and collapse its body to three steps:

tx.ClearTable(kv.HeaderCanonical)
clearStageProgress(Headers, BlockHashes, Bodies, Senders, Snapshots)
FillDBFromSnapshots(...) // only when br.FrozenBlocks() > 0

ClearTable on kv.HeaderCanonical is safe: post-tip lookups have nothing to fall through to (the correct outcome: no canonical above the snapshot tip), and snapshot-range lookups fall through to the frozen segments already (see BlockReader.CanonicalHash). FillDBFromSnapshots then rewrites the snapshot-range markers, TD records and stage progress.

ResetState takes (dirs, blockReader, logger) instead of a precomputed frozenBlocks count so it can hand them through to FillDBFromSnapshots. The integration command passes the existing blocksIO(db, logger) reader.

Mirrors the corresponding r3.4 refresh at PR #21246.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
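A toy model of the clear-then-refill approach follows. It is a sketch under stated assumptions, not the Erigon implementation: `blockReader`, `canonicalHash` and `resetAndRefill` are hypothetical, the lookup is assumed to check the mutable table first and then fall through to frozen data, and the refill is assumed to set every listed stage to the snapshot tip.

```go
package main

import "fmt"

// blockReader is a hypothetical stand-in combining immutable snapshot
// data with the mutable canonical table and stage progress.
type blockReader struct {
	frozen    map[uint64]string // height -> hash from frozen segments
	canonical map[uint64]string // models kv.HeaderCanonical
	progress  map[string]uint64 // stage name -> progress
}

var resetStages = []string{"Headers", "BlockHashes", "Bodies", "Senders", "Snapshots"}

// frozenBlocks returns the highest height covered by frozen data (0 if none).
func (r *blockReader) frozenBlocks() uint64 {
	var tip uint64
	for h := range r.frozen {
		if h > tip {
			tip = h
		}
	}
	return tip
}

// canonicalHash checks the mutable table first, then falls through to the
// frozen data (assumed lookup order).
func (r *blockReader) canonicalHash(h uint64) (string, bool) {
	if hash, ok := r.canonical[h]; ok {
		return hash, true
	}
	hash, ok := r.frozen[h]
	return hash, ok
}

// resetAndRefill models the three steps: clear the canonical table, clear
// stage progress, then refill both from frozen data when any exists.
func resetAndRefill(r *blockReader) {
	r.canonical = map[uint64]string{}
	for _, st := range resetStages {
		delete(r.progress, st)
	}
	tip := r.frozenBlocks()
	if tip == 0 {
		return // nothing frozen: leave everything empty
	}
	for h, hash := range r.frozen {
		r.canonical[h] = hash
	}
	for _, st := range resetStages {
		r.progress[st] = tip
	}
}

func main() {
	r := &blockReader{
		frozen:    map[uint64]string{1: "0x01", 2: "0x02"},
		canonical: map[uint64]string{2: "0x02", 3: "0xstale"},
		progress:  map[string]uint64{"Headers": 3},
	}
	resetAndRefill(r)
	_, ok := r.canonicalHash(3)
	fmt.Println(ok, r.progress["Headers"])
}
```

In this model the stale entry at height 3 disappears because nothing backs it, while heights 1 and 2 resolve both before and after the refill, which is the property that makes clearing the whole table safe.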

JkLondon requested review from AskAlexSharov, mh0lt and yperbasis as code owners May 18, 2026 08:24

yperbasis requested a review from Copilot May 19, 2026 10:24

Copilot started reviewing on behalf of yperbasis May 19, 2026 10:24

Copilot AI reviewed May 19, 2026


AskAlexSharov approved these changes May 21, 2026


AskAlexSharov added this pull request to the merge queue May 21, 2026

Merged via the queue into main with commit 949950c May 21, 2026
69 checks passed

AskAlexSharov deleted the fix/canonical-pointer-reset-state branch May 21, 2026 08:23

AskAlexSharov merged 3 commits into main from fix/canonical-pointer-reset-state

3 participants
