execution/stagedsync, cmd/integration: clear canonical hash above snapshot tip on reset_state#21247

Merged
AskAlexSharov merged 3 commits into main from fix/canonical-pointer-reset-state
May 21, 2026

Conversation

@JkLondon

Member

Summary

Defense-in-depth cleanup of integration reset_state so it also evicts stale kv.HeaderCanonical / kv.HeaderTD entries above the snapshot tip and caps Headers/BlockHashes/Bodies/Senders stage progress at the tip. Without this, a stale canonical-hash pointer survives reset_state and steers the next forward catchup back onto a non-canonical block.

Mirrors the release/3.4 backport at #21246; the file structure of execution/stagedsync/rawdbreset/reset_stages.go and cmd/integration/commands/reset_state.go is identical on main and release/3.4, so the patch applies verbatim.

Why it matters on main even after the recent unwind fixes

main already carries the two reorg-correctness fixes that prevent new stale canonical pointers from accumulating (bfa03df and #21157).

After those changes, a forkchoice update whose forward execution fails no longer leaks state because the unwind correctly reverts the previously-applied sidechain. So in steady state on a fresh main build, kv.HeaderCanonical should only ever hold canonical hashes.

However, datadirs created on older Erigon versions (including any release/3.4 build older than 2026-05-14) can carry leftover sidechain entries in kv.HeaderCanonical from the time when the unwind bug was active. Today's reset_state clears MDBX domain state but leaves those entries alone — a forward catchup then reads them as authoritative and re-introduces the very phantom state that reset_state was meant to clear. Operators upgrading such a datadir to a fresh main binary will hit the same symptom that bit the release/3.4 snapshotter on hoodi at block 2 818 476 (insufficient funds on a fee-recipient sweep tx, off by exactly the EIP-2929 SSTORE-SET-vs-RESET delta 17 100 × 1 265 000 000 wei).
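The size of the shortfall quoted above can be checked by hand. The sketch below is only an arithmetic illustration: the gas constants come from EIP-2929 (not from the Erigon source), the gas price is the figure quoted in the description, and reading the symptom as one SSTORE charged as SET instead of RESET is an assumption about what the phantom state changed.

```go
package main

import "fmt"

// Gas constants as defined by EIP-2929; they are not taken from Erigon.
const (
	sstoreSetGas   uint64 = 20_000                // write to an empty slot
	coldSloadCost  uint64 = 2_100                 // cold storage access
	sstoreResetGas uint64 = 5_000 - coldSloadCost // write to a non-empty slot: 2_900
)

// shortfallWei returns the extra cost, in wei, when one SSTORE is charged
// as SET instead of RESET at the given gas price.
func shortfallWei(gasPriceWei uint64) uint64 {
	return (sstoreSetGas - sstoreResetGas) * gasPriceWei
}

func main() {
	const gasPriceWei = 1_265_000_000 // gas price quoted in the PR description

	fmt.Println(sstoreSetGas - sstoreResetGas) // 17100
	fmt.Println(shortfallWei(gasPriceWei))     // 21631500000000
}
```

A balance that is short by exactly this amount is consistent with one storage slot being empty in one view of the state and non-empty in the other.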

This PR is therefore a no-cost forward-compat improvement: it harmlessly no-ops on a fresh main datadir and fully recovers older / cross-version datadirs without requiring integration stage_exec --unwind=N or any other manual incantation.

Fix

ResetCanonicalAboveTip(ctx, db, frozenBlocks):

  • rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false)
  • rawdb.TruncateTd(tx, frozenBlocks+1) (dropped in a follow-up commit; the merged code does not truncate TD)
  • WriteHeadHeaderHash to canonical hash at the tip (when one is recorded)
  • Caps Headers/BlockHashes/Bodies/Senders stage progress at frozenBlocks

ResetState takes a new frozenBlocks uint64 parameter and invokes the helper after ResetExec. The only production caller, cmd/integration/commands/reset_state.go, passes br.FrozenBlocks().

Snapshot-tip data is by construction immutable, so leaving the canonical-hash table contents at-or-below the tip untouched is correct; stage-progress writes only run when an individual stage is observably above the tip; the routine is a no-op when nothing is above the tip.
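The behaviour described above (truncate above the tip, re-anchor the head, cap stage progress, otherwise do nothing) can be sketched against a toy in-memory store. Everything here is hypothetical: `memDB`, `seeded` and `resetAboveTip` are illustrative stand-ins, not Erigon types, and the real helper works on MDBX tables inside a transaction.

```go
package main

import "fmt"

// memDB is a hypothetical in-memory stand-in for the data the helper touches.
type memDB struct {
	canonical map[uint64]string // models kv.HeaderCanonical: number -> hash
	head      string            // models HeadHeaderHash
	progress  map[string]uint64 // models per-stage progress
}

var blockStages = []string{"Headers", "BlockHashes", "Bodies", "Senders"}

// seeded builds a store with canonical entries lo..hi, a stale head and
// every block-import stage parked at hi.
func seeded(lo, hi uint64) *memDB {
	db := &memDB{canonical: map[uint64]string{}, head: "stale", progress: map[string]uint64{}}
	for n := lo; n <= hi; n++ {
		db.canonical[n] = fmt.Sprintf("hash-%d", n)
	}
	for _, st := range blockStages {
		db.progress[st] = hi
	}
	return db
}

// resetAboveTip drops canonical markers above tip, re-anchors head to the
// tip's marker when one exists, and caps stage progress at tip.
func resetAboveTip(db *memDB, tip uint64) {
	for n := range db.canonical {
		if n > tip {
			delete(db.canonical, n)
		}
	}
	if h, ok := db.canonical[tip]; ok && tip > 0 {
		db.head = h
	}
	for _, st := range blockStages {
		if db.progress[st] > tip { // never raise progress, only cap it
			db.progress[st] = tip
		}
	}
}

func main() {
	db := seeded(100, 110)
	resetAboveTip(db, 100)
	fmt.Println(len(db.canonical), db.head, db.progress["Senders"]) // 1 hash-100 100

	resetAboveTip(db, 100) // second call changes nothing
	fmt.Println(len(db.canonical), db.head, db.progress["Senders"]) // 1 hash-100 100
}
```

The second call shows the no-op property: once nothing sits above the tip, repeating the reset leaves every field unchanged.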

Test plan

execution/stagedsync/rawdbreset/reset_stages_test.go adds two cases against a real temporaltest.NewTestDB backend:

  • TestResetCanonicalAboveTip_ClearsStaleSidechainPointers — seeds canonical entries for heights 100..110 (including an explicit non-zero 0x99… at 105 to distinguish "missing" from "zero"), sets all four stage progresses to 110 and writes a stale HeadHeaderHash. After the call with snapshotTip = 100, asserts canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, HeadHeaderHash is re-anchored to the tip's canonical hash, and each stage has been capped at 100.

  • TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip — calls the helper on a db whose only canonical entry sits exactly at the tip and whose Headers stage already equals the tip; asserts the tip is preserved and stage progress is not regressed.

  • go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/... — green.

  • go build ./cmd/integration/... — green (only production caller of ResetState).

  • CI full unit-test + lint pipeline.

🤖 Generated with Claude Code

…pshot tip on ResetState

reset_state historically wiped the MDBX state-domain tables and reset the
Execution stage progress, but left kv.HeaderCanonical, kv.HeaderTD and the
Headers/BlockHashes/Bodies/Senders stage progress alone. A stale canonical
pointer at a height above the snapshot tip — typically a sidechain hash
deposited by a successful older forkchoice update whose later replacement
reorgs failed on execution and rolled back — survived the reset. After the
restart, the forward catchup read kv.HeaderCanonical[stale_height] and
applied the sidechain block as canonical, which re-introduced the very
phantom state reset_state was meant to clear. Observed on hoodi snapshotters
running release/3.4 at block 2 818 468 (local kv.HeaderCanonical pointed at
0xac2ee57a… while the real canonical was 0x27db29e4…). The fix is also
applied to main as defense-in-depth: although main's forkchoice unwind
path (bfa03df + #21157) no longer accumulates stale pointers from new
reorgs, older datadirs upgraded from buggy versions can still carry
leftover entries that survive reset_state.

Add ResetCanonicalAboveTip that truncates kv.HeaderCanonical and TD from
snapshotTip+1 forward, caps Headers/BlockHashes/Bodies/Senders progress at
the tip and re-anchors HeadHeaderHash. ResetState now takes a frozenBlocks
parameter and invokes the helper; the integration command passes
br.FrozenBlocks() at the call site. Stage-progress writes only run when
above the tip, so the routine is a safe no-op on a clean db.

After the change the next forkchoice update from the consensus layer
re-drives canonical assignments for the post-tip range fresh, with no
chance for stale sidechain pointers to be re-applied on catchup.
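The failure mode and its cure can be illustrated with a toy model. This is a sketch under stated assumptions: `canonicalDB`, `nextToExecute` and `truncateAbove` are hypothetical names, the hashes are placeholders, and the real catchup logic is considerably more involved than a single map lookup.

```go
package main

import "fmt"

// canonicalDB is a hypothetical stand-in for kv.HeaderCanonical:
// block number -> hash the node currently treats as canonical.
type canonicalDB map[uint64]string

// nextToExecute mimics a forward catchup that trusts local canonical
// markers: it returns the hash recorded at height, or false when nothing
// is recorded and the node has to wait for a forkchoice update.
func nextToExecute(db canonicalDB, height uint64) (string, bool) {
	h, ok := db[height]
	return h, ok
}

// truncateAbove deletes every canonical marker above tip.
func truncateAbove(db canonicalDB, tip uint64) {
	for n := range db {
		if n > tip {
			delete(db, n)
		}
	}
}

func main() {
	const tip = 100
	db := canonicalDB{100: "tip-hash", 101: "stale-sidechain-hash"}

	// Before the fix: the stale marker is read back as authoritative.
	h, ok := nextToExecute(db, tip+1)
	fmt.Printf("%q %v\n", h, ok) // "stale-sidechain-hash" true

	// After truncation: nothing is recorded above the tip.
	truncateAbove(db, tip)
	h, ok = nextToExecute(db, tip+1)
	fmt.Printf("%q %v\n", h, ok) // "" false
}
```

With no marker left above the tip, the only source of canonical assignments for that range is the next forkchoice update.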

Unit tests in reset_stages_test.go cover both the stale-pointer cleanup
and the idempotent no-op-at-tip case. Mirrors the corresponding r3.4
backport at PR #21246.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Wiping kv.HeaderTD by-number above the snapshot tip broke the Caplin
BlockCollector's parent-TD lookup after reset_state on hoodi:

  [WARN] [BlockCollector] Failed to insert blocks
    err="parent's total difficulty not found with hash 544f7113… and
         height 2837709: <nil>"

TD records live under (hash, number) keys and exist for both canonical and
sidechain blocks at the same height. They are consulted by Caplin when it
imports a not-yet-canonical block to verify parent.TD. Wiping them
by-number across the post-tip range made every subsequent insert fail
until the headers were re-fetched from peers.

The stale TD records are independently keyed by hash and do not affect
canonical-hash assignment, so removing them was unnecessary for the
stale-canonical-pointer fix in the first place — drop the TruncateTd
call and document why.

Mirrors the corresponding r3.4 fixup at PR #21246.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
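Why a by-number wipe breaks sidechain imports can be shown with a toy TD table keyed by (number, hash). The types and functions below are hypothetical illustrations, not Erigon or Caplin APIs, and the error string only imitates the log line quoted in the commit message.

```go
package main

import "fmt"

// tdKey mirrors the (hash, number) keying described above.
type tdKey struct {
	number uint64
	hash   string
}

// tdTable is a hypothetical stand-in for kv.HeaderTD.
type tdTable map[tdKey]uint64

// seed returns a table holding the tip plus a canonical and a sidechain
// record at the same post-tip height.
func seed() tdTable {
	return tdTable{
		{number: 100, hash: "tip"}:     10,
		{number: 101, hash: "canon-a"}: 10,
		{number: 101, hash: "side-b"}:  10,
	}
}

// truncateByNumber models the dropped step: it removes every TD record at
// or above from, canonical and sidechain alike.
func truncateByNumber(t tdTable, from uint64) {
	for k := range t {
		if k.number >= from {
			delete(t, k)
		}
	}
}

// insertBlock models an importer that needs the parent's TD before it can
// store a new, not-yet-canonical block.
func insertBlock(t tdTable, number uint64, hash, parentHash string, difficulty uint64) error {
	parentTD, ok := t[tdKey{number: number - 1, hash: parentHash}]
	if !ok {
		return fmt.Errorf("parent's total difficulty not found with hash %s and height %d", parentHash, number-1)
	}
	t[tdKey{number: number, hash: hash}] = parentTD + difficulty
	return nil
}

func main() {
	td := seed()
	fmt.Println(insertBlock(td, 102, "side-c", "side-b", 0)) // <nil>

	td = seed()
	truncateByNumber(td, 101) // wipe everything above tip 100
	fmt.Println(insertBlock(td, 102, "side-c", "side-b", 0))
}
```

The second insert fails because the parent's record was removed along with everything else at height 101, even though it never influenced which hash is canonical.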

Copilot AI left a comment

Contributor

Pull request overview

Improves integration reset_state to also evict stale kv.HeaderCanonical entries above the snapshot tip and cap Headers/BlockHashes/Bodies/Senders stage progress to the tip. This protects datadirs created on older Erigon builds (where the pre-#21157 unwind bug could leak sidechain canonical pointers) from re-introducing phantom state on the next forward catchup. It is a no-op on a clean datadir.

Changes:

  • Add ResetCanonicalAboveTip helper that truncates kv.HeaderCanonical from frozenBlocks+1, re-anchors HeadHeaderHash to the tip's canonical hash, and caps block-import stage progress at the tip.
  • Thread a new frozenBlocks argument through ResetState, supplied by the sole production caller via br.FrozenBlocks().
  • Add unit tests covering the stale-pointer-clear and no-op-at-tip cases against temporaltest.NewTestDB.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| execution/stagedsync/rawdbreset/reset_stages.go | Adds ResetCanonicalAboveTip and wires it into ResetState. |
| execution/stagedsync/rawdbreset/reset_stages_test.go | New unit tests for the helper. |
| cmd/integration/commands/reset_state.go | Passes br.FrozenBlocks() into the updated ResetState signature. |

Note: the top-level PR description's "Fix" bullet lists rawdb.TruncateTd(tx, frozenBlocks+1), but the code (and the in-file doc comment) deliberately does not truncate TD; the description should be reconciled with the implementation.


Comment on lines +94 to +125
func ResetCanonicalAboveTip(ctx context.Context, db kv.TemporalRwDB, frozenBlocks uint64) error {
	return db.Update(ctx, func(tx kv.RwTx) error {
		if err := rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false /* markChainAsBad */); err != nil {
			return fmt.Errorf("truncate canonical hash above snapshot tip: %w", err)
		}

		if frozenBlocks > 0 {
			tipHash, err := rawdb.ReadCanonicalHash(tx, frozenBlocks)
			if err != nil {
				return fmt.Errorf("read canonical hash at snapshot tip: %w", err)
			}
			if tipHash != (common.Hash{}) {
				if err := rawdb.WriteHeadHeaderHash(tx, tipHash); err != nil {
					return fmt.Errorf("re-anchor HeadHeaderHash to snapshot tip: %w", err)
				}
			}
		}

		for _, st := range []stages.SyncStage{stages.Headers, stages.BlockHashes, stages.Bodies, stages.Senders} {
			progress, err := stages.GetStageProgress(tx, st)
			if err != nil {
				return fmt.Errorf("read %s stage progress: %w", st, err)
			}
			if progress > frozenBlocks {
				if err := stages.SaveStageProgress(tx, st, frozenBlocks); err != nil {
					return fmt.Errorf("save %s stage progress: %w", st, err)
				}
			}
		}
		return nil
	})
}
Comment on lines +47 to +51
// forward, truncates TD likewise, caps Headers/BlockHashes/Bodies/Senders
// stage progress at the snapshot tip, and re-anchors HeadHeaderHash to the
// tip's canonical hash. The next forkchoice update from CL then drives
// canonical assignments fresh, with no chance for stale sidechain pointers
// to survive.
…arTable + FillDBFromSnapshots

Per @AskAlexSharov review on #21246 (the r3.4 mirror of this PR):
instead of open-coding a truncate-by-block-number + per-stage cap,
follow the established pattern that stage_header --reset and
stage_exec --reset use to rebuild canonical markers and stage progress
from frozen snapshot files.

Rename ResetCanonicalAboveTip to ResetCanonicalAndRefillFromSnapshots
and collapse its body to three steps:

  tx.ClearTable(kv.HeaderCanonical)
  clearStageProgress(Headers, BlockHashes, Bodies, Senders, Snapshots)
  FillDBFromSnapshots(...)            // only when br.FrozenBlocks() > 0

ClearTable on kv.HeaderCanonical is safe: post-tip lookups have nothing
to fall through to (correct outcome — no canonical above the snapshot
tip), and snapshot-range lookups fall through to the frozen segments
already (see BlockReader.CanonicalHash). FillDBFromSnapshots then
rewrites the snapshot-range markers, TD records and stage progress.

ResetState takes (dirs, blockReader, logger) instead of a precomputed
frozenBlocks count so it can hand them through to FillDBFromSnapshots.
The integration command passes the existing blocksIO(db, logger) reader.

Mirrors the corresponding r3.4 refresh at PR #21246.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
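The safety argument for clearing the whole table rests on lookups falling through to frozen segments. A minimal sketch of that layering follows; the `reader` type and its methods are hypothetical, the DB-first lookup order is an assumption about BlockReader.CanonicalHash rather than a quotation of it, and stage progress and TD records are left out to keep the model small.

```go
package main

import "fmt"

// reader is a hypothetical model: a mutable DB table layered over
// immutable frozen snapshot segments.
type reader struct {
	db        map[uint64]string // models kv.HeaderCanonical
	frozen    map[uint64]string // models canonical hashes held in snapshot files
	frozenTip uint64            // models br.FrozenBlocks()
}

// newReader builds a store with snapshots for 0..3 and stale DB markers
// at heights 4 and 5.
func newReader() *reader {
	r := &reader{db: map[uint64]string{}, frozen: map[uint64]string{}, frozenTip: 3}
	for n := uint64(0); n <= r.frozenTip; n++ {
		r.frozen[n] = fmt.Sprintf("frozen-%d", n)
		r.db[n] = r.frozen[n]
	}
	r.db[4], r.db[5] = "stale-4", "stale-5"
	return r
}

// canonicalHash checks the DB first (assumed order) and falls through to
// the frozen segments for heights inside the snapshot range.
func (r *reader) canonicalHash(n uint64) (string, bool) {
	if h, ok := r.db[n]; ok {
		return h, true
	}
	if n <= r.frozenTip {
		h, ok := r.frozen[n]
		return h, ok
	}
	return "", false
}

// clearTable models tx.ClearTable(kv.HeaderCanonical).
func (r *reader) clearTable() { r.db = map[uint64]string{} }

// fillFromSnapshots models only the marker-rewriting part of
// FillDBFromSnapshots and is skipped when nothing is frozen.
func (r *reader) fillFromSnapshots() {
	if r.frozenTip == 0 {
		return
	}
	for n := uint64(0); n <= r.frozenTip; n++ {
		r.db[n] = r.frozen[n]
	}
}

func main() {
	r := newReader()
	r.clearTable()

	h, ok := r.canonicalHash(2) // served from frozen segments
	fmt.Printf("%q %v\n", h, ok) // "frozen-2" true
	h, ok = r.canonicalHash(5) // nothing above the tip any more
	fmt.Printf("%q %v\n", h, ok) // "" false

	r.fillFromSnapshots()
	fmt.Println(len(r.db)) // 4
}
```

In this model the refill step does not change what lookups return; it only restores the DB copies of markers that the snapshots already provide.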

AskAlexSharov added this pull request to the merge queue on May 21, 2026.
Merged via the queue into main with commit 949950c on May 21, 2026.
69 checks passed.
AskAlexSharov deleted the fix/canonical-pointer-reset-state branch on May 21, 2026 08:23.