execution/stagedsync, cmd/integration: clear canonical hash above snapshot tip on reset_state#21247
Conversation
…pshot tip on ResetState reset_state historically wiped the MDBX state-domain tables and reset the Execution stage progress, but left kv.HeaderCanonical, kv.HeaderTD and the Headers/BlockHashes/Bodies/Senders stage progress alone. A stale canonical pointer at a height above the snapshot tip — typically a sidechain hash deposited by a successful older forkchoice update whose later replacement reorgs failed on execution and rolled back — survived the reset. After the restart, the forward catchup read kv.HeaderCanonical[stale_height] and applied the sidechain block as canonical, which re-introduced the very phantom state reset_state was meant to clear. Observed on hoodi snapshotters running release/3.4 at block 2 818 468 (local kv.HeaderCanonical pointed at 0xac2ee57a… while the real canonical was 0x27db29e4…); the fix is being backported back to main as defense-in-depth, since although main's forkchoice unwind path (bfa03df + #21157) no longer accumulates stale pointers from new reorgs, any older datadirs upgraded from buggy versions can still carry leftover entries that survive reset_state. Add ResetCanonicalAboveTip that truncates kv.HeaderCanonical and TD from snapshotTip+1 forward, caps Headers/BlockHashes/Bodies/Senders progress at the tip and re-anchors HeadHeaderHash. ResetState now takes a frozenBlocks parameter and invokes the helper; the integration command passes br.FrozenBlocks() at the call site. Stage-progress writes only run when above the tip, so the routine is a safe no-op on a clean db. After the change the next forkchoice update from the consensus layer re-drives canonical assignments for the post-tip range fresh, with no chance for stale sidechain pointers to be re-applied on catchup. Unit tests in reset_stages_test.go cover both the stale-pointer cleanup and the idempotent no-op-at-tip case. Mirrors the corresponding r3.4 backport at PR #21246. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wiping kv.HeaderTD by-number above the snapshot tip broke the Caplin
BlockCollector's parent-TD lookup after reset_state on hoodi:
[WARN] [BlockCollector] Failed to insert blocks
err="parent's total difficulty not found with hash 544f7113… and
height 2837709: <nil>"
TD records live under (hash, number) keys and exist for both canonical and
sidechain blocks at the same height. They are consulted by Caplin when it
imports a not-yet-canonical block to verify parent.TD. Wiping them
by-number across the post-tip range made every subsequent insert fail
until the headers were re-fetched from peers.
The stale TD records are independently keyed by hash and do not affect
canonical-hash assignment, so removing them was unnecessary for the
stale-canonical-pointer fix in the first place — drop the TruncateTd
call and document why.
Mirrors the corresponding r3.4 fixup at PR #21246.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Improves integration reset_state to also evict stale kv.HeaderCanonical entries above the snapshot tip and cap Headers/BlockHashes/Bodies/Senders stage progress to the tip. This protects datadirs created on older Erigon builds (where the pre-#21157 unwind bug could leak sidechain canonical pointers) from re-introducing phantom state on the next forward catchup. It is a no-op on a clean datadir.
Changes:
- Add
ResetCanonicalAboveTiphelper that truncateskv.HeaderCanonicalfromfrozenBlocks+1, re-anchorsHeadHeaderHashto the tip's canonical hash, and caps block-import stage progress at the tip. - Thread a new
frozenBlocksargument throughResetState, supplied by the sole production caller viabr.FrozenBlocks(). - Add unit tests covering the stale-pointer-clear and no-op-at-tip cases against
temporaltest.NewTestDB.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| execution/stagedsync/rawdbreset/reset_stages.go | Adds ResetCanonicalAboveTip and wires it into ResetState. |
| execution/stagedsync/rawdbreset/reset_stages_test.go | New unit tests for the helper. |
| cmd/integration/commands/reset_state.go | Passes br.FrozenBlocks() into the updated ResetState signature. |
Note: the top-level PR description's "Fix" bullet lists rawdb.TruncateTd(tx, frozenBlocks+1), but the code (and the in-file doc comment) deliberately do not truncate TD; the description should be reconciled with the implementation.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func ResetCanonicalAboveTip(ctx context.Context, db kv.TemporalRwDB, frozenBlocks uint64) error { | ||
| return db.Update(ctx, func(tx kv.RwTx) error { | ||
| if err := rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false /* markChainAsBad */); err != nil { | ||
| return fmt.Errorf("truncate canonical hash above snapshot tip: %w", err) | ||
| } | ||
|
|
||
| if frozenBlocks > 0 { | ||
| tipHash, err := rawdb.ReadCanonicalHash(tx, frozenBlocks) | ||
| if err != nil { | ||
| return fmt.Errorf("read canonical hash at snapshot tip: %w", err) | ||
| } | ||
| if tipHash != (common.Hash{}) { | ||
| if err := rawdb.WriteHeadHeaderHash(tx, tipHash); err != nil { | ||
| return fmt.Errorf("re-anchor HeadHeaderHash to snapshot tip: %w", err) | ||
| } | ||
| } | ||
| } | ||
|
|
||
| for _, st := range []stages.SyncStage{stages.Headers, stages.BlockHashes, stages.Bodies, stages.Senders} { | ||
| progress, err := stages.GetStageProgress(tx, st) | ||
| if err != nil { | ||
| return fmt.Errorf("read %s stage progress: %w", st, err) | ||
| } | ||
| if progress > frozenBlocks { | ||
| if err := stages.SaveStageProgress(tx, st, frozenBlocks); err != nil { | ||
| return fmt.Errorf("save %s stage progress: %w", st, err) | ||
| } | ||
| } | ||
| } | ||
| return nil | ||
| }) | ||
| } |
| // forward, truncates TD likewise, caps Headers/BlockHashes/Bodies/Senders | ||
| // stage progress at the snapshot tip, and re-anchors HeadHeaderHash to the | ||
| // tip's canonical hash. The next forkchoice update from CL then drives | ||
| // canonical assignments fresh, with no chance for stale sidechain pointers | ||
| // to survive. |
…arTable + FillDBFromSnapshots Per @AskAlexSharov review on #21246 (the r3.4 mirror of this PR): instead of open-coding a truncate-by-block-number + per-stage cap, follow the established pattern that stage_header --reset and stage_exec --reset use to rebuild canonical markers and stage progress from frozen snapshot files. Rename ResetCanonicalAboveTip to ResetCanonicalAndRefillFromSnapshots and collapse its body to three steps: tx.ClearTable(kv.HeaderCanonical) clearStageProgress(Headers, BlockHashes, Bodies, Senders, Snapshots) FillDBFromSnapshots(...) // only when br.FrozenBlocks() > 0 ClearTable on kv.HeaderCanonical is safe: post-tip lookups have nothing to fall through to (correct outcome — no canonical above the snapshot tip), and snapshot-range lookups fall through to the frozen segments already (see BlockReader.CanonicalHash). FillDBFromSnapshots then rewrites the snapshot-range markers, TD records and stage progress. ResetState takes (dirs, blockReader, logger) instead of a precomputed frozenBlocks count so it can hand them through to FillDBFromSnapshots. The integration command passes the existing blocksIO(db, logger) reader. Mirrors the corresponding r3.4 refresh at PR #21246. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Defense-in-depth cleanup of
integration reset_stateso it also evicts stalekv.HeaderCanonical/kv.HeaderTDentries above the snapshot tip and capsHeaders/BlockHashes/Bodies/Sendersstage progress at the tip. Without this, a stale canonical-hash pointer survivesreset_stateand steers the next forward catchup back onto a non-canonical block.Mirrors the
release/3.4backport at #21246; the file structure ofexecution/stagedsync/rawdbreset/reset_stages.goandcmd/integration/commands/reset_state.gois identical onmainandrelease/3.4, so the patch applies verbatim.Why it matters on main even after the recent unwind fixes
mainalready carries the two reorg-correctness fixes that prevent new stale canonical pointers from accumulating:bfa03df625("parallel-commitment correctness for reorg/unwind + SD recreate")After those changes, a forkchoice update whose forward execution fails no longer leaks state because the unwind correctly reverts the previously-applied sidechain. So in steady state on a fresh
mainbuild,kv.HeaderCanonicalshould only ever hold canonical hashes.However, datadirs created on older Erigon versions (including any
release/3.4build older than 2026-05-14) can carry leftover sidechain entries inkv.HeaderCanonicalfrom the time when the unwind bug was active. Today'sreset_stateclears MDBX domain state but leaves those entries alone — a forward catchup then reads them as authoritative and re-introduces the very phantom state thatreset_statewas meant to clear. Operators upgrading such a datadir to a freshmainbinary will hit the same symptom that bit therelease/3.4snapshotter on hoodi at block 2 818 476 (insufficient fundson a fee-recipient sweep tx, off by exactly the EIP-2929 SSTORE-SET-vs-RESET delta17 100 × 1 265 000 000 wei).This PR is therefore a no-cost forward-compat improvement: it harmlessly no-ops on a fresh
maindatadir and fully recovers older / cross-version datadirs without requiringintegration stage_exec --unwind=Nor any other manual incantation.Fix
ResetCanonicalAboveTip(ctx, db, frozenBlocks):rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false)rawdb.TruncateTd(tx, frozenBlocks+1)WriteHeadHeaderHashto canonical hash at the tip (when one is recorded)Headers/BlockHashes/Bodies/Sendersstage progress atfrozenBlocksResetStatetakes a newfrozenBlocks uint64parameter and invokes the helper afterResetExec. The only production caller,cmd/integration/commands/reset_state.go, passesbr.FrozenBlocks().Snapshot-tip data is by construction immutable, so leaving the canonical-hash table contents at-or-below the tip untouched is correct; stage-progress writes only run when an individual stage is observably above the tip; the routine is a no-op when nothing is above the tip.
Test plan
execution/stagedsync/rawdbreset/reset_stages_test.goadds two cases against a realtemporaltest.NewTestDBbackend:TestResetCanonicalAboveTip_ClearsStaleSidechainPointers— seeds canonical entries for heights 100..110 (including an explicit non-zero0x99…at 105 to distinguish "missing" from "zero"), sets all four stage progresses to 110 and writes a staleHeadHeaderHash. After the call withsnapshotTip = 100, asserts canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved,HeadHeaderHashis re-anchored to the tip's canonical hash, and each stage has been capped at 100.TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip— calls the helper on a db whose only canonical entry sits exactly at the tip and whoseHeadersstage already equals the tip; asserts the tip is preserved and stage progress is not regressed.go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/...— green.go build ./cmd/integration/...— green (only production caller ofResetState).CI full unit-test + lint pipeline.
Related
release/3.4backport: [r3.4] cmd/integration, execution/stagedsync: clear canonical hash above snapshot tip on reset_state #21246bfa03df625🤖 Generated with Claude Code