re-introduce block catchup recovery#21415
Merged
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR re-introduces a “block catchup” recovery path by teaching forkchoice to skip unwinding when chaindata canonical (TxNums) lags behind state, and removes prior snapshot-stage/state-collation mechanisms that attempted to handle “state ahead of blocks”.
Changes:
- Update forkchoice (FCU) to skip the unwind path when
unwindTargetis at/above the current canonical tip derived fromTxNums.Last, and proceed with canonicalization + TxNums forward-fill instead. - Remove the “state ahead of blocks” mitigation hooks:
FrozenBlocksProvider/SetFrozenBlocksProvider/MaxCollatableTxNumand the snapshot-stagealignStateToBlockSnapshotsdeletion-based recovery. - Remove associated tests and wiring in node setup / integration tooling that configured the removed cap provider.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| node/eth/backend.go | Stops configuring the aggregator with the removed SetFrozenBlocksProvider hook. |
| execution/stagedsync/stage_snapshots.go | Removes snapshot-stage state/block alignment deletion loop; keeps snapshot indexing/DB fill path. |
| execution/stagedsync/stage_snapshots_test.go | Deletes unit tests that only covered the removed alignment helpers. |
| execution/execmodule/forkchoice.go | Adds FCU recovery logic to skip unwind when TxNums canonical lags and then forward-fill canonicals/TxNums. |
| db/test/aggregator_ext_test.go | Removes tests tied to the removed FrozenBlocksProvider cap behavior. |
| db/state/aggregator.go | Removes FrozenBlocksProvider API and cap logic; adjusts readyForCollation logging/behavior accordingly. |
| db/services/snapshot_progress.go | Removes MaxCollatableTxNum helper that depended on the removed cap design. |
| cmd/utils/app/snapshots_cmd.go | Removes wiring of SetFrozenBlocksProvider in snapshot/retire command path. |
| cmd/integration/commands/stages.go | Removes wiring of SetFrozenBlocksProvider in integration snapshot setup. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
AskAlexSharov
approved these changes
May 27, 2026
taratorio
reviewed
May 27, 2026
taratorio
requested changes
May 27, 2026
taratorio
reviewed
May 27, 2026
taratorio
approved these changes
May 27, 2026
Member
Author
taratorio
reviewed
May 29, 2026
| // FCU's commit defers complete — so do one more idempotent FCU for the | ||
| // last block to ensure commitBlock has settled before the caller reads it. | ||
| if len(blocks) > 0 { | ||
| if _, err := updateForkChoice(ctx, exec, blocks[len(blocks)-1].Header()); err != nil { |
sudeepdino008
added a commit
that referenced
this pull request
Jun 3, 2026
- Merge `origin/main` (up to #21546) into the `performance` branch. - Conflicts resolved by taking main's finalized form where the perf branch was behind (`ExistenceFilterVersion`→1 per #21164, `mdbx-go`→v0.40.1, `merge.go` `findMergeRangeInFiles` refactor, dropped `erigon-snapshot` module dep, fusefilter deferred-close refactor, `Versions.MustSupport`, atomic per-key prune throttle). - Adopted main's collation-at-tip design (`CollateAndPrune` in the FCU path, #21398/#21415) and removed the perf branch's older `frozenBlocks`-gating (`SetFrozenBlocksProvider`/`MaxCollatableTxNum`, `db/services/snapshot_progress.go`, its gating tests, and callers). - Verified: `make erigon integration` build, `make lint` (clean), `make test-short` (green).
awskii
added a commit
that referenced
this pull request
Jun 4, 2026
Resolve 3 conflicts from main's overlapping refactors: - db/state/aggregator.go: drop frozenBlocks/FrozenBlocksProvider (removed repo-wide by #21415 block-catchup-recovery), keep commitmentRefsOverride. - db/state/domain_committed.go: keep both the PR's reshorten gate and #21303's per-merge findShortenedKey caches; the vt closure uses all three. - db/state/squeeze.go: keep the PR rename ForTestReferencesInCommitmentBranches and the wantsReferencesInBranches guard over main's old ForTestReplaceKeysInValues.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Issue: #20701
commitBlockis ahead of chaindata canonical (TxNums.Last) — e.g. preverified snapshots ship state ahead of blocks — an FCU'sunwindTarget(parent of head) lands at-or-above canonical, so there's nothing above to roll back. The old code still entered the unwind branch and was rejected asReorgTooDeep(becauseminUnwindableis bounded by state files past canonical). Forkchoice now skips the unwind path in this case and just canonicalizes the new blocks (WriteCanonicalHash+AppendCanonicalTxNums), soTxNumscatches up to state instead of state being unwound.Remove the two other solutions in code for "state ahead of blocks":
FrozenBlocksProvidercap +SetFrozenBlocksProvider+MaxCollatableTxNumhelper removed.readyForCollationnow uses only thereorgBlockDepthgate.alignStateToBlockSnapshotsrecovery (state-files-deletion whencommitBlock > frozenBlocks) removed from snapshot stage — block catchup via forkchoice replaces it.Tests
Unit (
execution/execmodule/exec_module_test.go)Both reproduce
commitBlock > TxNums.Lastby truncating canonical +TxNumsand clearingChangeSets3(soCanUnwindToBlockNumhitsthe commitment-block fallback), while leaving the committed domain state in place — the same shape as a chaindata wipe with state
still in snapshot files. Both are confirmed red→green (fail as
ReorgTooDeepwith the no-unwind branch reverted).TestUpdateForkChoiceRecoversWhenStateAheadOfTxNums— index repair: state already at the head. Asserts the FCU returnsExecutionStatusTooFarAway(notReorgTooDeep) andTxNumsis re-extended tocommitBlock.TestUpdateForkChoiceForwardExecutesAfterStateAheadRecovery— forward drive: state executed to block 10, chain extends to 15.First FCU repairs the index and returns
TooFarAway; second FCU drives execution forward (execProgress10 → 15) and returnsSuccess.Manual (mainnet-minimal)
state ahead of blocksreproduced by deleting block-snapshot files and wiping chaindata. Three scenarios:Small gap, Caplin. State 570 blocks ahead. Caplin filled the gap via
InsertBlocks, no-unwind path activated,AppendCanonicalTxNumsextendedTxNumspastcommitBlock, execution resumed atcommitBlock+1, block retire rebuilt the deletedsnapshots. 0
ReorgTooDeep.Deep gap (~16k blocks), Caplin. Without the no-unwind fix the system deadlocks (~90+ FCUs rejected
ReorgTooDeep). With it: 0ReorgTooDeep, execution started ~5 min after restart, full saw-tooth recovery.Deep gap (~17k blocks), Lighthouse (
--externalcl). Recovery path goes throughEngineBlockDownloaderinstead ofBackwardBeaconDownloaderbut lands in the same forkchoice. 0ReorgTooDeep, execution started ~8 min after restart.behind commitmentfires 1–3 times during the catch-up window in each case, then stops onceTxNumsovertakescommitBlock.