[r3.4] cmd/integration, execution/stagedsync: clear canonical hash above snapshot tip on reset_state#21246
Merged
Merged
Conversation
…pshot tip on ResetState reset_state historically wiped the MDBX state-domain tables and reset the Execution stage progress, but left kv.HeaderCanonical, kv.HeaderTD and the Headers/BlockHashes/Bodies/Senders stage progress alone. A stale canonical pointer at a height above the snapshot tip — typically a sidechain hash deposited by a successful older forkchoice update whose later replacement reorgs failed on execution and rolled back — survived the reset. After the restart, the forward catchup read kv.HeaderCanonical[stale_height] and applied the sidechain block as canonical, which re-introduced the very phantom state reset_state was meant to clear. Observed on hoodi snapshotters running release/3.4 at block 2 818 468 (local kv.HeaderCanonical pointed at 0xac2ee57a… while the real canonical was 0x27db29e4…). Add ResetCanonicalAboveTip that truncates kv.HeaderCanonical and TD from snapshotTip+1 forward, caps Headers/BlockHashes/Bodies/Senders progress at the tip and re-anchors HeadHeaderHash. ResetState now takes a frozenBlocks parameter and invokes the helper; the integration command passes br.FrozenBlocks() at the call site. Stage-progress writes only run when above the tip, so the routine is a safe no-op on a clean db. After the change the next forkchoice update from the consensus layer re-drives canonical assignments for the post-tip range fresh, with no chance for stale sidechain pointers to be re-applied on catchup. Unit tests in reset_stages_test.go cover both the stale-pointer cleanup and the idempotent no-op-at-tip case. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wiping kv.HeaderTD by-number above the snapshot tip broke the Caplin
BlockCollector's parent-TD lookup after reset_state on hoodi:
[WARN] [BlockCollector] Failed to insert blocks
err="parent's total difficulty not found with hash 544f7113… and
height 2837709: <nil>"
TD records live under (hash, number) keys and exist for both canonical and
sidechain blocks at the same height. They are consulted by Caplin when it
imports a not-yet-canonical block to verify parent.TD. Wiping them
by-number across the post-tip range made every subsequent insert fail
until the headers were re-fetched from peers.
The stale TD records are independently keyed by hash and do not affect
canonical-hash assignment, so removing them was unnecessary for the
stale-canonical-pointer fix in the first place — drop the TruncateTd
call and document why.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
JkLondon
pushed a commit
that referenced
this pull request
May 18, 2026
Wiping kv.HeaderTD by-number above the snapshot tip broke the Caplin
BlockCollector's parent-TD lookup after reset_state on hoodi:
[WARN] [BlockCollector] Failed to insert blocks
err="parent's total difficulty not found with hash 544f7113… and
height 2837709: <nil>"
TD records live under (hash, number) keys and exist for both canonical and
sidechain blocks at the same height. They are consulted by Caplin when it
imports a not-yet-canonical block to verify parent.TD. Wiping them
by-number across the post-tip range made every subsequent insert fail
until the headers were re-fetched from peers.
The stale TD records are independently keyed by hash and do not affect
canonical-hash assignment, so removing them was unnecessary for the
stale-canonical-pointer fix in the first place — drop the TruncateTd
call and document why.
Mirrors the corresponding r3.4 fixup at PR #21246.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
AskAlexSharov
requested changes
May 18, 2026
| return fmt.Errorf("read %s stage progress: %w", st, err) | ||
| } | ||
| if progress > frozenBlocks { | ||
| if err := stages.SaveStageProgress(tx, st, frozenBlocks); err != nil { |
Collaborator
There was a problem hiding this comment.
there is FillDBFromSnapshots which does set correct values of stages.
don't need copy that biz-logic here.
i think most robust solution is - we already doing it (in stage_header --reset and stage_exec --reset):
tx.ClearTable(CanonicalTable)
clearStageProgress(stages)
Call FillDBFromSnapshots()
Collaborator
There was a problem hiding this comment.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates integration reset_state so state resets also clear stale post-snapshot canonical pointers that can cause forward execution to replay sidechain blocks after a rebuild.
Changes:
- Adds
ResetCanonicalAboveTipto truncateHeaderCanonicalabove the frozen block tip, cap related stage progress, and re-anchorHeadHeaderHash. - Passes the block snapshot tip from the integration command into
ResetState. - Adds regression tests for stale canonical pointer cleanup and idempotent no-op behavior at the tip.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
execution/stagedsync/rawdbreset/reset_stages.go |
Extends reset-state behavior with post-tip canonical cleanup and stage progress capping. |
execution/stagedsync/rawdbreset/reset_stages_test.go |
Adds tests covering stale canonical cleanup and already-at-tip behavior. |
cmd/integration/commands/reset_state.go |
Supplies frozen block count from the block reader to reset-state cleanup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+100
to
+109
| if frozenBlocks > 0 { | ||
| tipHash, err := rawdb.ReadCanonicalHash(tx, frozenBlocks) | ||
| if err != nil { | ||
| return fmt.Errorf("read canonical hash at snapshot tip: %w", err) | ||
| } | ||
| if tipHash != (common.Hash{}) { | ||
| if err := rawdb.WriteHeadHeaderHash(tx, tipHash); err != nil { | ||
| return fmt.Errorf("re-anchor HeadHeaderHash to snapshot tip: %w", err) | ||
| } | ||
| } |
Comment on lines
+47
to
+51
| // forward, truncates TD likewise, caps Headers/BlockHashes/Bodies/Senders | ||
| // stage progress at the snapshot tip, and re-anchors HeadHeaderHash to the | ||
| // tip's canonical hash. The next forkchoice update from CL then drives | ||
| // canonical assignments fresh, with no chance for stale sidechain pointers | ||
| // to survive. |
…arTable + FillDBFromSnapshots Per @AskAlexSharov review on #21246: instead of open-coding a truncate- by-block-number + per-stage cap, follow the established pattern that stage_header --reset and stage_exec --reset use to rebuild canonical markers and stage progress from frozen snapshot files. Rename ResetCanonicalAboveTip to ResetCanonicalAndRefillFromSnapshots and collapse its body to three steps: tx.ClearTable(kv.HeaderCanonical) clearStageProgress(Headers, BlockHashes, Bodies, Senders, Snapshots) FillDBFromSnapshots(...) // only when br.FrozenBlocks() > 0 ClearTable on kv.HeaderCanonical is safe: post-tip lookups have nothing to fall through to (correct outcome — no canonical above the snapshot tip), and snapshot-range lookups fall through to the frozen segments already (see BlockReader.CanonicalHash). FillDBFromSnapshots then rewrites the snapshot-range markers, TD records and stage progress. ResetState takes (dirs, blockReader, logger) instead of a precomputed frozenBlocks count so it can hand them through to FillDBFromSnapshots. The integration command passes the existing blocksIO(db, logger) reader. Also addresses Copilot review comments: - kv.HeaderTD is not truncated by ResetCanonicalAndRefillFromSnapshots, so the test no longer mentions TD. - HeadHeaderHash re-anchoring is handled inside FillDBFromSnapshots (or naturally on the next forkchoice update when no snapshots exist), so the genesis-edge case no longer needs special handling here. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
JkLondon
pushed a commit
that referenced
this pull request
May 19, 2026
…arTable + FillDBFromSnapshots Per @AskAlexSharov review on #21246 (the r3.4 mirror of this PR): instead of open-coding a truncate-by-block-number + per-stage cap, follow the established pattern that stage_header --reset and stage_exec --reset use to rebuild canonical markers and stage progress from frozen snapshot files. Rename ResetCanonicalAboveTip to ResetCanonicalAndRefillFromSnapshots and collapse its body to three steps: tx.ClearTable(kv.HeaderCanonical) clearStageProgress(Headers, BlockHashes, Bodies, Senders, Snapshots) FillDBFromSnapshots(...) // only when br.FrozenBlocks() > 0 ClearTable on kv.HeaderCanonical is safe: post-tip lookups have nothing to fall through to (correct outcome — no canonical above the snapshot tip), and snapshot-range lookups fall through to the frozen segments already (see BlockReader.CanonicalHash). FillDBFromSnapshots then rewrites the snapshot-range markers, TD records and stage progress. ResetState takes (dirs, blockReader, logger) instead of a precomputed frozenBlocks count so it can hand them through to FillDBFromSnapshots. The integration command passes the existing blocksIO(db, logger) reader. Mirrors the corresponding r3.4 refresh at PR #21246. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
|
||
| func ResetState(db kv.TemporalRwDB, ctx context.Context) error { | ||
| func ResetState(db kv.TemporalRwDB, ctx context.Context, dirs datadir.Dirs, br services.FullBlockReader, logger log.Logger) error { | ||
| // don't reset senders here |
Comment on lines
+93
to
+100
| func ResetCanonicalAndRefillFromSnapshots(ctx context.Context, db kv.TemporalRwDB, dirs datadir.Dirs, br services.FullBlockReader, logger log.Logger) error { | ||
| return db.Update(ctx, func(tx kv.RwTx) error { | ||
| if err := tx.ClearTable(kv.HeaderCanonical); err != nil { | ||
| return fmt.Errorf("clear canonical hash table: %w", err) | ||
| } | ||
| if err := clearStageProgress(tx, stages.Headers, stages.BlockHashes, stages.Bodies, stages.Senders, stages.Snapshots); err != nil { | ||
| return fmt.Errorf("clear chain-data stage progress: %w", err) | ||
| } |
Comment on lines
+101
to
+104
| if br.FrozenBlocks() > 0 { | ||
| if err := FillDBFromSnapshots("reset_state_fill_db_from_snapshots", ctx, tx, dirs, br, logger); err != nil { | ||
| return fmt.Errorf("refill canonical markers from snapshots: %w", err) | ||
| } |
Comment on lines
+38
to
+48
| // newEmptyBlockReader returns a BlockReader backed by an empty RoSnapshots | ||
| // instance, so FrozenBlocks() == 0 and the reset helper takes its | ||
| // "no-snapshots, skip FillDBFromSnapshots" path. FillDBFromSnapshots itself | ||
| // is exercised by stage_header --reset / stage_exec --reset integration | ||
| // paths and is out of scope for this unit test. | ||
| func newEmptyBlockReader(t *testing.T, dirs datadir.Dirs, logger log.Logger) *freezeblocks.BlockReader { | ||
| t.Helper() | ||
| snaps := freezeblocks.NewRoSnapshots(ethconfig.BlocksFreezing{ChainName: networkname.Mainnet}, dirs.Snap, logger) | ||
| t.Cleanup(snaps.Close) | ||
| return freezeblocks.NewBlockReader(snaps, nil) | ||
| } |
AskAlexSharov
approved these changes
May 21, 2026
pull Bot
pushed a commit
to Dustin4444/erigon
that referenced
this pull request
May 21, 2026
…pshot tip on reset_state (erigontech#21247) ## Summary Defense-in-depth cleanup of `integration reset_state` so it also evicts stale `kv.HeaderCanonical` / `kv.HeaderTD` entries above the snapshot tip and caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at the tip. Without this, a stale canonical-hash pointer survives `reset_state` and steers the next forward catchup back onto a non-canonical block. Mirrors the `release/3.4` backport at erigontech#21246; the file structure of `execution/stagedsync/rawdbreset/reset_stages.go` and `cmd/integration/commands/reset_state.go` is identical on `main` and `release/3.4`, so the patch applies verbatim. ## Why it matters on main even after the recent unwind fixes `main` already carries the two reorg-correctness fixes that prevent **new** stale canonical pointers from accumulating: - `bfa03df625` ("parallel-commitment correctness for reorg/unwind + SD recreate") - erigontech#21157 ("execution/stagedsync: find diffset by actually-executed hash on unwind") After those changes, a forkchoice update whose forward execution fails no longer leaks state because the unwind correctly reverts the previously-applied sidechain. So in steady state on a fresh `main` build, `kv.HeaderCanonical` should only ever hold canonical hashes. However, datadirs created on older Erigon versions (including any `release/3.4` build older than 2026-05-14) can carry leftover sidechain entries in `kv.HeaderCanonical` from the time when the unwind bug was active. Today's `reset_state` clears MDBX domain state but leaves those entries alone — a forward catchup then reads them as authoritative and re-introduces the very phantom state that `reset_state` was meant to clear. Operators upgrading such a datadir to a fresh `main` binary will hit the same symptom that bit the `release/3.4` snapshotter on hoodi at block 2 818 476 (`insufficient funds` on a fee-recipient sweep tx, off by exactly the EIP-2929 SSTORE-SET-vs-RESET delta `17 100 × 1 265 000 000 wei`). This PR is therefore a no-cost forward-compat improvement: it harmlessly no-ops on a fresh `main` datadir and fully recovers older / cross-version datadirs without requiring `integration stage_exec --unwind=N` or any other manual incantation. ## Fix `ResetCanonicalAboveTip(ctx, db, frozenBlocks)`: - `rawdb.TruncateCanonicalHash(tx, frozenBlocks+1, false)` - `rawdb.TruncateTd(tx, frozenBlocks+1)` - `WriteHeadHeaderHash` to canonical hash at the tip (when one is recorded) - Caps `Headers`/`BlockHashes`/`Bodies`/`Senders` stage progress at `frozenBlocks` `ResetState` takes a new `frozenBlocks uint64` parameter and invokes the helper after `ResetExec`. The only production caller, `cmd/integration/commands/reset_state.go`, passes `br.FrozenBlocks()`. Snapshot-tip data is by construction immutable, so leaving the canonical-hash table contents at-or-below the tip untouched is correct; stage-progress writes only run when an individual stage is observably above the tip; the routine is a no-op when nothing is above the tip. ## Test plan `execution/stagedsync/rawdbreset/reset_stages_test.go` adds two cases against a real `temporaltest.NewTestDB` backend: - `TestResetCanonicalAboveTip_ClearsStaleSidechainPointers` — seeds canonical entries for heights 100..110 (including an explicit non-zero `0x99…` at 105 to distinguish "missing" from "zero"), sets all four stage progresses to 110 and writes a stale `HeadHeaderHash`. After the call with `snapshotTip = 100`, asserts canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved, `HeadHeaderHash` is re-anchored to the tip's canonical hash, and each stage has been capped at 100. - `TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTip` — calls the helper on a db whose only canonical entry sits exactly at the tip and whose `Headers` stage already equals the tip; asserts the tip is preserved and stage progress is not regressed. - [x] `go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/...` — green. - [x] `go build ./cmd/integration/...` — green (only production caller of `ResetState`). - [ ] CI full unit-test + lint pipeline. ## Related - `release/3.4` backport: erigontech#21246 - Reorg-unwind correctness on r3.4: erigontech#21157 - Reorg-unwind correctness on main: `bfa03df625` 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: JkLondon <me@ilyamikheev.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a stale-canonical-pointer leak in
integration reset_statethat was reintroducing phantom state on hoodi snapshotters runningrelease/3.4even after a clean rebuild. Observed at block 2 818 468 onsnapshotter-bm-v34-hoodi-n1: localkv.HeaderCanonical[2 818 468] = 0xac2ee57a…(a sidechain block) while the real canonical was0x27db29e4…. The next canonical block, 2 818 476, then died withinsufficient funds for gas * price + valueon0xdD11751cdD3f6EFf01B1f6151B640685bfa5dB4a— a fee-recipient sweep-tx that was short by exactly17 100 × 1 265 000 000 = 21 631 500 000 000 wei, the EIP-2929 SSTORE-SET-vs-RESET delta for a single storage slot on0x8684adde…that had been first-written by the surviving sidechain block 8 blocks earlier.This is not a cherry-pick of a merged PR. The fix originates on
release/3.4and pairs with #21157 (already onrelease/3.4).Root cause
reset_statehistorically calledResetExec, which clears the MDBX state-domain tables (AccountsDomain,StorageDomain,CodeDomain,CommitmentDomain,ReceiptDomain,RCacheDomain), the inverted indexes andChangeSets3, plus theExecution/Finish/CustomTrace/TxLookup/Witnessesstage progress. It deliberately leftkv.HeaderCanonical,kv.HeaderTDand theHeaders/BlockHashes/Bodies/Sendersstage progress alone, on the assumption that those tables faithfully reflect the consensus-layer view of the chain.That assumption breaks once a forkchoice update has ever committed a canonical hash whose chain later turned out to be the wrong side of a reorg. Concretely, on the affected snapshotter:
WriteCanonicalHash(A, 2 818 468)was committed.newCanonicalsincludingB@2 818 468, ran the staged unwind and reached the forwardexecutionPipeline.Runstep.CodeDomain/StorageDomainwrites made by A (seeexecution/stagedsync/stage_execute.go:findExecutedDiffsetAtHeightintroduced in [r3.4] execution/stagedsync: find diffset by actually-executed hash on unwind #21157). The forkchoice handler returnedBadBlockand the whole tx — including theWriteCanonicalHash(B, 2 818 468)issued atexecution/execmodule/forkchoice.go:381— rolled back.kv.HeaderCanonical[2 818 468]remained at A.integration reset_state.ResetExeccleared the state domains, but the stalekv.HeaderCanonical[2 818 468] = Asurvived.block 2 810 162, txNum 100 781 249, end of step 258). The execution stage read each block's body using its canonical hash, hit the stale pointer at 2 818 468, re-applied chain A's block as canonical and re-installed the phantom code+slot writes on0x8684adde…. The subsequent canonical tx 3 of block 2 818 476 then took the SSTORE-RESET path instead of SSTORE-SET — saving exactly 17 100 gas and starving the fee-recipient sweep at tx 14.seg integrity --check=CommitmentRooton the same datadir is green (bothv2.0-commitment.0-256.kvandv2.0-commitment.256-258.kvrecompute matching roots), confirming that no published.kvis contaminated. The phantom lived purely in MDBX, fed bykv.HeaderCanonicalwhose stale entriesreset_statewas failing to evict.Fix
ResetCanonicalAboveTip(ctx, db, frozenBlocks)truncateskv.HeaderCanonicalandkv.HeaderTDfromfrozenBlocks+1forward, capsHeaders/BlockHashes/Bodies/Sendersstage progress atfrozenBlocks, and re-anchorsHeadHeaderHashto the canonical hash recorded at the tip (when one is present).ResetStatetakes a newfrozenBlocksparameter and invokes the helper afterResetExec.cmd/integration/commands/reset_state.gopassesbr.FrozenBlocks().Why this is safe to roll into
ResetState:≤ frozenBlockslive in the frozen blocks themselves, so leaving the table contents below the tip untouched is correct.The forkchoice-handler logic itself (
execution/execmodule/forkchoice.go) and thefindExecutedDiffsetAtHeightunwind path from #21157 are untouched. This PR addresses the persistent-leftover side of the problem; future reorgs on a node carrying #21157 will unwind cleanly and won't accumulate the leftover.Test plan
execution/stagedsync/rawdbreset/reset_stages_test.goadds two cases:TestResetCanonicalAboveTip_ClearsStaleSidechainPointersseedskv.HeaderCanonicalentries for heights 100..110 (including an explicit non-zero0x99…at height 105 to make the assertion distinguish "missing" from "zero"), sets the four stage progresses to 110 and writes a staleHeadHeaderHash. After the call withsnapshotTip = 100, the test asserts that canonical hashes at 101..110 are gone, the tip's canonical hash at 100 is preserved,HeadHeaderHashhas been re-anchored, and each of the four stages has been capped at 100.TestResetCanonicalAboveTip_NoOpWhenAlreadyAtTipcalls the helper on a db whose only canonical entry sits exactly at the tip and whoseHeadersstage already equals the tip; the test asserts the tip is preserved and stage progress is not regressed.Both tests use
temporaltest.NewTestDBfor a real MDBX/temporal backend, run in well under one second, and exercise the public helper directly without needing a full ExecModule.go test -count=1 -timeout=120s ./execution/stagedsync/rawdbreset/...— green.go test -count=1 -timeout=300s ./db/test/...— green (covers existing callers ofResetExec).go build ./cmd/integration/...— green (only production caller ofResetState).integration reset_state --datadir=<DD> --chain=hoodion the affected snapshotter; verifykv.HeaderCanonical[2 818 468]is cleared, restart, watch catchup pass2 818 476cleanly, runseg integrity --check=CommitmentRootbefore re-enabling snapshot publication.🤖 Generated with Claude Code