commitment, execctx: fix InsertBlocks failure during block catch-up#20546
Conversation
d084321 to
c2264f1
Compare
c2264f1 to
112640f
Compare
There was a problem hiding this comment.
Pull request overview
Fixes a catch-up deadlock where ExecModule.InsertBlocks could not run if the persisted commitment state was ahead of the TxNums index, by allowing NewSharedDomains to return a usable SharedDomains while still signaling ErrBehindCommitment.
Changes:
- Treat
ErrBehindCommitmentas non-fatal inExecModule.InsertBlocks, allowing block catch-up to proceed. - Move the “TxNums behind commitment” check out of
SeekCommitmentand intoNewSharedDomainsWithTrieVariantafter restoration, so SD initialization completes. - Add a unit test ensuring
NewSharedDomainsreturns(sd, ErrBehindCommitment)withsdfully restored in the “state ahead of blocks” scenario.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
execution/execmodule/inserters.go |
Allows InsertBlocks to continue when NewSharedDomains signals ErrBehindCommitment. |
execution/commitment/commitmentdb/commitment_context.go |
Removes environment validation from SeekCommitment so it focuses on restoring commitment state. |
db/state/execctx/domain_shared.go |
Adds the environment probe after SeekCommitment and returns ErrBehindCommitment while keeping SD initialized. |
db/state/execctx/domain_shared_test.go |
Adds a regression test for the “commitment ahead of TxNums” catch-up condition. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
112640f to
c30e2bf
Compare
ExecModule.InsertBlocks refused to run when the persisted commitment state was ahead of the TxNums index, because NewSharedDomains returned ErrBehindCommitment: ethereumExecutionModule.InsertBlocks: could not create shared domains: behind commitment: TxNums index is at block 2579999 and behind commitment 2584312 The node is precisely in this state because it needs InsertBlocks to download the missing blocks — chicken and egg. InsertBlocks did not handle ErrBehindCommitment separately, so catch-up could not progress. ErrBehindCommitment is a statement about the environment (TxNums index is behind commitment), not about SharedDomains itself. The SD has no trouble initialising in this scenario — patricia trie, txNum, and concurrent commitment all restore from the persisted state normally. Arguably returning an error from NewSharedDomains for an environmental condition is itself anti-pattern, but this PR keeps it to minimise the change surface: the error stays as a signal for the handful of callers that already use it, while the SD becomes reliably fully-initialised before the error fires. Changes: - SeekCommitment no longer performs the state-ahead check; it just restores and returns. The check is an environmental probe, not a restoration concern. - NewSharedDomainsWithTrieVariant probes TxNums.Last against the restored commitment block after SeekCommitment and returns (sd, ErrBehindCommitment) when state is ahead. sd is fully initialised in that case. - InsertBlocks treats ErrBehindCommitment as a non-fatal signal so block metadata writes into the overlay proceed; AppendCanonicalTxNums in the FCU path advances TxNums each cycle until commitment and blocks reconverge. The ~45 non-handler NewSharedDomains callers keep failing fast on ErrBehindCommitment — we do not silently switch them to operate against a state-ahead SD. The 5 other handler sites (forkchoice, IsDomainAheadOfBlocks, executor, stageloop, sync) keep the same errors.Is idiom unchanged. Verified end-to-end on two hoodi archive datadirs: block insertion progresses cleanly across the state-ahead boundary with zero "behind commitment" warnings and no panics.
c30e2bf to
3cba170
Compare
…ch-up Cherry-pick of #20546 from main. InsertBlocks on 3.4 uses BeginRw directly (not SharedDomains), so the inserters.go portion of the main PR is a no-op here. This cherry-pick keeps the general SharedDomains contract improvement: SeekCommitment always fully restores the SD, and NewSharedDomains attaches ErrBehindCommitment as an environmental signal probed via TxNums.Last.
…ch-up (#20555) Cherry-pick of #20546. InsertBlocks on 3.4 uses `BeginRw` directly (not `SharedDomains`), so the `inserters.go` portion of #20546 is a no-op here. This cherry-pick keeps the general `SharedDomains` contract improvement: `SeekCommitment` always fully restores the SD, and `NewSharedDomains` attaches `ErrBehindCommitment` as an environmental signal probed via `TxNums.Last` at the construction boundary.
|
from mark: is it true? smells like true, but "inserting blocks" and "marking them as Canonical" is 2 different things - don't know if this PR can mark bad blocks as canonical? |
|
Summary
Fixes a chicken-and-egg failure where
ExecModule.InsertBlocksrefuses to run during block catch-up, becauseNewSharedDomainsreturnedErrBehindCommitmentwhen the persisted commitment state was ahead of theTxNumsindex:The node is precisely in this state because it needs to download the missing blocks via
InsertBlocks— but the call bails out before any block can be written.InsertBlocksdidn't handleErrBehindCommitmentseparately, so catch-up couldn't make progress.Framing
ErrBehindCommitmentis a statement about the environment (TxNums index is behind commitment), not about SharedDomains itself.SharedDomainshas no trouble initialising in the state-ahead-of-blocks scenario — the patricia trie,txNum, and concurrent-commitment setup can all be restored from the persisted commitment state normally. Arguably,NewSharedDomainsreturning an error at all for an environmental condition is itself anti-pattern; callers could just probe the environment directly. This PR keeps the error to minimise the change surface — the error stays as a signal for the handful of callers that already use it — but reworks where the check lives so the SD is fully initialised before the error fires.What changed
SeekCommitmentno longer performs the state-ahead check — it just restores and returns. The check is an environmental probe, not a restoration concern, so it doesn't belong in that method.NewSharedDomainsWithTrieVariantprobesTxNums.Lastagainst the restored commitment block afterSeekCommitmentand returns(sd, ErrBehindCommitment)when state is ahead. The SD is reliably fully-initialised in that case.InsertBlockstreatsErrBehindCommitmentas a non-fatal signal — block metadata writes into the overlay proceed with the fully-initialised SD, letting catch-up drive forward.AppendCanonicalTxNums(in the FCU path) advancesTxNumseach cycle until it catches up to commitment, after which the signal stops firing.What didn't change
NewSharedDomains(RPC paths, block builder, receipts generator, stage tools, integration commands) still fail-fast onErrBehindCommitment— we do not silently switch them to operate against a state-ahead SD (would be unsafe:eth_callreturning state from beyond canonical head, builder building on advanced state, receipts re-executed against wrong state).forkchoice.go:209,IsDomainAheadOfBlocks,executor.go:207,stageloop.go:150,sync.go:143) keep the sameerrors.Is(err, ErrBehindCommitment)idiom. No behaviour change at those sites.Test plan
make lint— 0 issuesgo test -count=1 ./execution/commitment/... ./db/state/execctx/... ./execution/execmodule/...— all passTestNewSharedDomains_StateAheadOfBlocks: assertsNewSharedDomainsreturns(sd, ErrBehindCommitment)withsd != nilandsd.TxNum() == commitTxNum(fully restored) when commitment block >TxNums.Last()/erigon-data/hoodi_arch_1_1_1: ~35K blocks past the previously-stuck point, 0behind commitmentwarnings, no panics/erigon-data/hoodi_arch_1_2: ~19K blocks crossing the same 2580000+ boundary where the bug manifests, 0 warningsmake test-all(CI)