Skip to content

commitment, execctx: fix InsertBlocks failure during block catch-up#20546

Merged
sudeepdino008 merged 1 commit into
mainfrom
state_ahead_fix
Apr 14, 2026
Merged

commitment, execctx: fix InsertBlocks failure during block catch-up#20546
sudeepdino008 merged 1 commit into
mainfrom
state_ahead_fix

Conversation

@sudeepdino008

@sudeepdino008 sudeepdino008 commented Apr 14, 2026

Copy link
Copy Markdown
Member

Summary

Fixes a chicken-and-egg failure where ExecModule.InsertBlocks refuses to run during block catch-up, because NewSharedDomains returned ErrBehindCommitment when the persisted commitment state was ahead of the TxNums index:

ethereumExecutionModule.InsertBlocks: could not create shared domains:
behind commitment: TxNums index is at block 2579999 and behind commitment 2584312

The node is precisely in this state because it needs to download the missing blocks via InsertBlocks — but the call bails out before any block can be written. InsertBlocks didn't handle ErrBehindCommitment separately, so catch-up couldn't make progress.

Framing

ErrBehindCommitment is a statement about the environment (TxNums index is behind commitment), not about SharedDomains itself. SharedDomains has no trouble initialising in the state-ahead-of-blocks scenario — the patricia trie, txNum, and concurrent-commitment setup can all be restored from the persisted commitment state normally. Arguably, NewSharedDomains returning an error at all for an environmental condition is itself anti-pattern; callers could just probe the environment directly. This PR keeps the error to minimise the change surface — the error stays as a signal for the handful of callers that already use it — but reworks where the check lives so the SD is fully initialised before the error fires.

What changed

  • SeekCommitment no longer performs the state-ahead check — it just restores and returns. The check is an environmental probe, not a restoration concern, so it doesn't belong in that method.
  • NewSharedDomainsWithTrieVariant probes TxNums.Last against the restored commitment block after SeekCommitment and returns (sd, ErrBehindCommitment) when state is ahead. The SD is reliably fully-initialised in that case.
  • InsertBlocks treats ErrBehindCommitment as a non-fatal signal — block metadata writes into the overlay proceed with the fully-initialised SD, letting catch-up drive forward. AppendCanonicalTxNums (in the FCU path) advances TxNums each cycle until it catches up to commitment, after which the signal stops firing.

What didn't change

  • All ~45 non-handler callers of NewSharedDomains (RPC paths, block builder, receipts generator, stage tools, integration commands) still fail-fast on ErrBehindCommitment — we do not silently switch them to operate against a state-ahead SD (would be unsafe: eth_call returning state from beyond canonical head, builder building on advanced state, receipts re-executed against wrong state).
  • The 5 other handler callers (forkchoice.go:209, IsDomainAheadOfBlocks, executor.go:207, stageloop.go:150, sync.go:143) keep the same errors.Is(err, ErrBehindCommitment) idiom. No behaviour change at those sites.

Test plan

  • make lint — 0 issues
  • go test -count=1 ./execution/commitment/... ./db/state/execctx/... ./execution/execmodule/... — all pass
  • New unit test TestNewSharedDomains_StateAheadOfBlocks: asserts NewSharedDomains returns (sd, ErrBehindCommitment) with sd != nil and sd.TxNum() == commitTxNum (fully restored) when commitment block > TxNums.Last()
  • End-to-end on /erigon-data/hoodi_arch_1_1_1: ~35K blocks past the previously-stuck point, 0 behind commitment warnings, no panics
  • End-to-end on /erigon-data/hoodi_arch_1_2: ~19K blocks crossing the same 2580000+ boundary where the bug manifests, 0 warnings
  • make test-all (CI)

@sudeepdino008 sudeepdino008 marked this pull request as draft April 14, 2026 08:59
@sudeepdino008 sudeepdino008 changed the title commitment, execctx: decouple ErrBehindCommitment guard from signal commitment, execctx: fix InsertBlocks failure during block catch-up Apr 14, 2026
@sudeepdino008 sudeepdino008 force-pushed the state_ahead_fix branch 3 times, most recently from d084321 to c2264f1 Compare April 14, 2026 09:18
@sudeepdino008 sudeepdino008 marked this pull request as ready for review April 14, 2026 09:20
@AskAlexSharov AskAlexSharov requested a review from Copilot April 14, 2026 09:21

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a catch-up deadlock where ExecModule.InsertBlocks could not run if the persisted commitment state was ahead of the TxNums index, by allowing NewSharedDomains to return a usable SharedDomains while still signaling ErrBehindCommitment.

Changes:

  • Treat ErrBehindCommitment as non-fatal in ExecModule.InsertBlocks, allowing block catch-up to proceed.
  • Move the “TxNums behind commitment” check out of SeekCommitment and into NewSharedDomainsWithTrieVariant after restoration, so SD initialization completes.
  • Add a unit test ensuring NewSharedDomains returns (sd, ErrBehindCommitment) with sd fully restored in the “state ahead of blocks” scenario.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
execution/execmodule/inserters.go Allows InsertBlocks to continue when NewSharedDomains signals ErrBehindCommitment.
execution/commitment/commitmentdb/commitment_context.go Removes environment validation from SeekCommitment so it focuses on restoring commitment state.
db/state/execctx/domain_shared.go Adds the environment probe after SeekCommitment and returns ErrBehindCommitment while keeping SD initialized.
db/state/execctx/domain_shared_test.go Adds a regression test for the “commitment ahead of TxNums” catch-up condition.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread db/state/execctx/domain_shared.go
Comment thread db/state/execctx/domain_shared.go Outdated
Comment thread execution/execmodule/inserters.go Outdated
ExecModule.InsertBlocks refused to run when the persisted commitment
state was ahead of the TxNums index, because NewSharedDomains returned
ErrBehindCommitment:

  ethereumExecutionModule.InsertBlocks: could not create shared domains:
  behind commitment: TxNums index is at block 2579999 and behind
  commitment 2584312

The node is precisely in this state because it needs InsertBlocks to
download the missing blocks — chicken and egg. InsertBlocks did not
handle ErrBehindCommitment separately, so catch-up could not progress.

ErrBehindCommitment is a statement about the environment (TxNums
index is behind commitment), not about SharedDomains itself. The SD
has no trouble initialising in this scenario — patricia trie, txNum,
and concurrent commitment all restore from the persisted state
normally. Arguably returning an error from NewSharedDomains for an
environmental condition is itself anti-pattern, but this PR keeps it
to minimise the change surface: the error stays as a signal for the
handful of callers that already use it, while the SD becomes reliably
fully-initialised before the error fires.

Changes:

- SeekCommitment no longer performs the state-ahead check; it just
  restores and returns. The check is an environmental probe, not a
  restoration concern.
- NewSharedDomainsWithTrieVariant probes TxNums.Last against the
  restored commitment block after SeekCommitment and returns
  (sd, ErrBehindCommitment) when state is ahead. sd is fully
  initialised in that case.
- InsertBlocks treats ErrBehindCommitment as a non-fatal signal so
  block metadata writes into the overlay proceed; AppendCanonicalTxNums
  in the FCU path advances TxNums each cycle until commitment and
  blocks reconverge.

The ~45 non-handler NewSharedDomains callers keep failing fast on
ErrBehindCommitment — we do not silently switch them to operate
against a state-ahead SD. The 5 other handler sites (forkchoice,
IsDomainAheadOfBlocks, executor, stageloop, sync) keep the same
errors.Is idiom unchanged.

Verified end-to-end on two hoodi archive datadirs: block insertion
progresses cleanly across the state-ahead boundary with zero "behind
commitment" warnings and no panics.
@sudeepdino008 sudeepdino008 enabled auto-merge April 14, 2026 10:04
@sudeepdino008 sudeepdino008 added this pull request to the merge queue Apr 14, 2026
sudeepdino008 added a commit that referenced this pull request Apr 14, 2026
…ch-up

Cherry-pick of #20546 from main.

InsertBlocks on 3.4 uses BeginRw directly (not SharedDomains), so the
inserters.go portion of the main PR is a no-op here. This cherry-pick
keeps the general SharedDomains contract improvement: SeekCommitment
always fully restores the SD, and NewSharedDomains attaches
ErrBehindCommitment as an environmental signal probed via TxNums.Last.
Merged via the queue into main with commit ad7ea1e Apr 14, 2026
36 checks passed
@sudeepdino008 sudeepdino008 deleted the state_ahead_fix branch April 14, 2026 11:24
sudeepdino008 added a commit that referenced this pull request Apr 15, 2026
…ch-up (#20555)

Cherry-pick of #20546.

InsertBlocks on 3.4 uses `BeginRw` directly (not `SharedDomains`), so
the `inserters.go` portion of #20546 is a no-op here. This cherry-pick
keeps the general `SharedDomains` contract improvement: `SeekCommitment`
always fully restores the SD, and `NewSharedDomains` attaches
`ErrBehindCommitment` as an environmental signal probed via
`TxNums.Last` at the construction boundary.
@AskAlexSharov

Copy link
Copy Markdown
Collaborator

from mark:

We can't just insert blocks to catch up because:

1. If we re-execute we'll get state related errors - unless we unwind
2. If we don't re-execute we're inserting unvarified blocks into the database

is it true? smells like true, but "inserting blocks" and "marking them as Canonical" is 2 different things - don't know if this PR can mark bad blocks as canonical?

@sudeepdino008

Copy link
Copy Markdown
Member Author

from mark:

We can't just insert blocks to catch up because:

1. If we re-execute we'll get state related errors - unless we unwind
2. If we don't re-execute we're inserting unvarified blocks into the database

is it true? smells like true, but "inserting blocks" and "marking them as Canonical" is 2 different things - don't know if this PR can mark bad blocks as canonical?

If we don't re-execute we're inserting unvarified blocks into the database

  • inserting blocks without re-exec will get us from "state ahead of blocks" to "blocks ahead of state". In practice, the switching point will be well before finalized checkpoint though.
  • even in caplin's forward sync today, we do insertBlocks first (without executing them) and then do fcu in batch of 5000 blocks (which then executes it) -- so we're already inserting unverified blocks today, and erigon can handle that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants