Skip to content

re-introduce block catchup recovery#21415

Merged
sudeepdino008 merged 10 commits into
mainfrom
collation_cont
May 29, 2026
Merged

re-introduce block catchup recovery#21415
sudeepdino008 merged 10 commits into
mainfrom
collation_cont

Conversation

@sudeepdino008

@sudeepdino008 sudeepdino008 commented May 26, 2026

Copy link
Copy Markdown
Member

Issue: #20701

  • When state's commitBlock is ahead of chaindata canonical (TxNums.Last) — e.g. preverified snapshots ship state ahead of blocks — an FCU's unwindTarget (parent of head) lands at-or-above canonical, so there's nothing above to roll back. The old code still entered the unwind branch and was rejected as ReorgTooDeep (because minUnwindable is bounded by state files past canonical). Forkchoice now skips the unwind path in this case and just canonicalizes the new blocks (WriteCanonicalHash + AppendCanonicalTxNums), so TxNums catches up to state instead of state being unwound.

Remove the two other solutions in code for "state ahead of blocks":

  • FrozenBlocksProvider cap + SetFrozenBlocksProvider + MaxCollatableTxNum helper removed. readyForCollation now uses only the reorgBlockDepth gate.

  • alignStateToBlockSnapshots recovery (state-files-deletion when commitBlock > frozenBlocks) removed from snapshot stage — block catchup via forkchoice replaces it.

    Tests

    Unit (execution/execmodule/exec_module_test.go)

    Both reproduce commitBlock > TxNums.Last by truncating canonical + TxNums and clearing ChangeSets3 (so CanUnwindToBlockNum hits
    the commitment-block fallback), while leaving the committed domain state in place — the same shape as a chaindata wipe with state
    still in snapshot files. Both are confirmed red→green (fail as ReorgTooDeep with the no-unwind branch reverted).

    • TestUpdateForkChoiceRecoversWhenStateAheadOfTxNums — index repair: state already at the head. Asserts the FCU returns
      ExecutionStatusTooFarAway (not ReorgTooDeep) and TxNums is re-extended to commitBlock.
    • TestUpdateForkChoiceForwardExecutesAfterStateAheadRecovery — forward drive: state executed to block 10, chain extends to 15.
      First FCU repairs the index and returns TooFarAway; second FCU drives execution forward (execProgress 10 → 15) and returns
      Success.

    Manual (mainnet-minimal)

    state ahead of blocks reproduced by deleting block-snapshot files and wiping chaindata. Three scenarios:

    1. Small gap, Caplin. State 570 blocks ahead. Caplin filled the gap via InsertBlocks, no-unwind path activated,
      AppendCanonicalTxNums extended TxNums past commitBlock, execution resumed at commitBlock+1, block retire rebuilt the deleted
      snapshots. 0 ReorgTooDeep.

    2. Deep gap (~16k blocks), Caplin. Without the no-unwind fix the system deadlocks (~90+ FCUs rejected ReorgTooDeep). With it: 0
      ReorgTooDeep, execution started ~5 min after restart, full saw-tooth recovery.

    3. Deep gap (~17k blocks), Lighthouse (--externalcl). Recovery path goes through EngineBlockDownloader instead of
      BackwardBeaconDownloader but lands in the same forkchoice. 0 ReorgTooDeep, execution started ~8 min after restart.

    behind commitment fires 1–3 times during the catch-up window in each case, then stops once TxNums overtakes commitBlock.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR re-introduces a “block catchup” recovery path by teaching forkchoice to skip unwinding when chaindata canonical (TxNums) lags behind state, and removes prior snapshot-stage/state-collation mechanisms that attempted to handle “state ahead of blocks”.

Changes:

  • Update forkchoice (FCU) to skip the unwind path when unwindTarget is at/above the current canonical tip derived from TxNums.Last, and proceed with canonicalization + TxNums forward-fill instead.
  • Remove the “state ahead of blocks” mitigation hooks: FrozenBlocksProvider/SetFrozenBlocksProvider/MaxCollatableTxNum and the snapshot-stage alignStateToBlockSnapshots deletion-based recovery.
  • Remove associated tests and wiring in node setup / integration tooling that configured the removed cap provider.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
node/eth/backend.go Stops configuring the aggregator with the removed SetFrozenBlocksProvider hook.
execution/stagedsync/stage_snapshots.go Removes snapshot-stage state/block alignment deletion loop; keeps snapshot indexing/DB fill path.
execution/stagedsync/stage_snapshots_test.go Deletes unit tests that only covered the removed alignment helpers.
execution/execmodule/forkchoice.go Adds FCU recovery logic to skip unwind when TxNums canonical lags and then forward-fill canonicals/TxNums.
db/test/aggregator_ext_test.go Removes tests tied to the removed FrozenBlocksProvider cap behavior.
db/state/aggregator.go Removes FrozenBlocksProvider API and cap logic; adjusts readyForCollation logging/behavior accordingly.
db/services/snapshot_progress.go Removes MaxCollatableTxNum helper that depended on the removed cap design.
cmd/utils/app/snapshots_cmd.go Removes wiring of SetFrozenBlocksProvider in snapshot/retire command path.
cmd/integration/commands/stages.go Removes wiring of SetFrozenBlocksProvider in integration snapshot setup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread execution/execmodule/forkchoice.go
Comment thread db/state/aggregator.go Outdated
@AskAlexSharov AskAlexSharov enabled auto-merge May 27, 2026 01:30
Comment thread db/state/aggregator.go Outdated
Comment thread execution/execmodule/forkchoice.go
@sudeepdino008

sudeepdino008 commented May 28, 2026

Copy link
Copy Markdown
Member Author
Screenshot 2026-05-28 at 6 31 58 AM

seems stable now; db size at 16 gb (3 gb data; 13gb reclaimable space) -- will look at it next.

Will look into ci failure today
EDIT: seems like eest-spec-tests failure is not flake -- it's due to my changes.

@AskAlexSharov AskAlexSharov added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@yperbasis yperbasis added this pull request to the merge queue May 28, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
@sudeepdino008 sudeepdino008 added this pull request to the merge queue May 29, 2026
@sudeepdino008 sudeepdino008 removed this pull request from the merge queue due to a manual request May 29, 2026
sudeepdino008 added a commit that referenced this pull request May 29, 2026
// FCU's commit defers complete — so do one more idempotent FCU for the
// last block to ensure commitBlock has settled before the caller reads it.
if len(blocks) > 0 {
if _, err := updateForkChoice(ctx, exec, blocks[len(blocks)-1].Header()); err != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yperbasis fyi #21444 causing issues

@sudeepdino008 sudeepdino008 added this pull request to the merge queue May 29, 2026
@sudeepdino008 sudeepdino008 removed this pull request from the merge queue due to a manual request May 29, 2026
@sudeepdino008 sudeepdino008 added this pull request to the merge queue May 29, 2026
Merged via the queue into main with commit 6e39d0c May 29, 2026
91 checks passed
@sudeepdino008 sudeepdino008 deleted the collation_cont branch May 29, 2026 14:12
sudeepdino008 added a commit that referenced this pull request Jun 3, 2026
- Merge `origin/main` (up to #21546) into the `performance` branch.
- Conflicts resolved by taking main's finalized form where the perf
branch was behind (`ExistenceFilterVersion`→1 per #21164,
`mdbx-go`→v0.40.1, `merge.go` `findMergeRangeInFiles` refactor, dropped
`erigon-snapshot` module dep, fusefilter deferred-close refactor,
`Versions.MustSupport`, atomic per-key prune throttle).
- Adopted main's collation-at-tip design (`CollateAndPrune` in the FCU
path, #21398/#21415) and removed the perf branch's older
`frozenBlocks`-gating (`SetFrozenBlocksProvider`/`MaxCollatableTxNum`,
`db/services/snapshot_progress.go`, its gating tests, and callers).
- Verified: `make erigon integration` build, `make lint` (clean), `make
test-short` (green).
awskii added a commit that referenced this pull request Jun 4, 2026
Resolve 3 conflicts from main's overlapping refactors:
- db/state/aggregator.go: drop frozenBlocks/FrozenBlocksProvider (removed
  repo-wide by #21415 block-catchup-recovery), keep commitmentRefsOverride.
- db/state/domain_committed.go: keep both the PR's reshorten gate and
  #21303's per-merge findShortenedKey caches; the vt closure uses all three.
- db/state/squeeze.go: keep the PR rename ForTestReferencesInCommitmentBranches
  and the wantsReferencesInBranches guard over main's old ForTestReplaceKeysInValues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants