cl/phase1/forkchoice: align getFilterBlockTree with consensus spec#21310
Merged
sudeepdino008 merged 1 commit intoMay 21, 2026
Merged
Conversation
d38cfdb to
e43413b
Compare
5 tasks
Contributor
There was a problem hiding this comment.
Pull request overview
Aligns Caplin forkchoice getFilterBlockTree behavior with the Ethereum consensus spec to prevent valid current-epoch leaves from being incorrectly filtered at epoch boundaries, which can cause GetHead() regressions and execution unwinds.
Changes:
- Implements spec-faithful
get_voting_sourceselection by using unrealized checkpoints only for prior-epoch blocks and realized state checkpoints for current/future-epoch blocks. - Adds the spec’s lenient
+2 >= current_epochlookback condition for justified/finalized checkpoint filtering.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
e43413b to
9d225e6
Compare
9d225e6 to
f0172e9
Compare
sudeepdino008
added a commit
that referenced
this pull request
May 21, 2026
The post-GLOAS rewrite (#18956, f3ad70f) restored ancestor-descent for the finalized check but left two justified-side deviations from the spec's filter_block_tree / get_voting_source: 1. Voting source selection used unrealized unconditionally with a realized fallback. Spec get_voting_source branches strictly on current_epoch > block_epoch — unrealized only for prior-epoch blocks; current-epoch blocks must use the block's realized current_justified_checkpoint. 2. correct_justified used voting_source.epoch >= justified.epoch with an is_previous_epoch_justified-gated +2 fallback. Spec is the three flat disjuncts: epoch == GENESIS, voting_source.epoch == justified.epoch, or voting_source.epoch + 2 >= current_epoch. The finalized ancestor-descent check from f3ad70f is preserved unchanged. Observed effect on bloatnet: epoch-boundary head regression that caused ~30-block execution unwinds every ~6 minutes is eliminated. Validated by 1 hour at chain tip on bloatnet covering 5+ epoch boundaries — zero unwinds, zero filterTree rejects, Caplin lagSlots=0 on all 172 forkchoice updates in the window. Refs: #21301, #21310
sudeepdino008
added a commit
that referenced
this pull request
May 21, 2026
The post-GLOAS rewrite (#18956, f3ad70f) restored ancestor-descent for the finalized check but left two justified-side deviations from the spec's filter_block_tree / get_voting_source: 1. Voting source selection used unrealized unconditionally with a realized fallback. Spec get_voting_source branches strictly on current_epoch > block_epoch — unrealized only for prior-epoch blocks; current-epoch blocks must use the block's realized current_justified_checkpoint. 2. correct_justified used voting_source.epoch >= justified.epoch with an is_previous_epoch_justified-gated +2 fallback. Spec is the three flat disjuncts: epoch == GENESIS, voting_source.epoch == justified.epoch, or voting_source.epoch + 2 >= current_epoch. The finalized ancestor-descent check from f3ad70f is preserved unchanged. Observed effect on bloatnet: epoch-boundary head regression that caused ~30-block execution unwinds every ~6 minutes is eliminated. Validated by 1 hour at chain tip on bloatnet covering 5+ epoch boundaries — zero unwinds, zero filterTree rejects, Caplin lagSlots=0 on all 172 forkchoice updates in the window. Refs: #21301, #21310
The previous implementation deviated from the consensus spec's filter_block_tree / get_voting_source in two ways, compounding to reject valid current-epoch leaves at every epoch boundary: 1. Voting source selection used unrealized unconditionally. Spec get_voting_source picks unrealized only for prior-epoch blocks (to pull-up justification view); current-epoch blocks should use the block's realized state checkpoint. 2. The justified/finalized viability check was a strict Equal(). Spec correct_justified allows voting_source.epoch + 2 >= current_epoch as a lenient 2-epoch lookback. Both fixes restore the spec-faithful behavior. Observed effect on bloatnet: epoch-boundary head regression that caused ~30-block execution unwinds every ~6 minutes is eliminated. Refs: #21301
f0172e9 to
2f2f6fb
Compare
domiwei
approved these changes
May 21, 2026
3 tasks
vh0rvath
pushed a commit
to vh0rvath/erigon
that referenced
this pull request
May 21, 2026
…rigontech#21327) Cherry-pick of erigontech#21310 to main. ## main-specific adaptations - Conflict in `cl/phase1/forkchoice/get_head.go` resolved in favour of erigontech#21310's spec-faithful version, replacing main's existing attempt (`>=` check + previous-epoch-justified fallback) with the spec's three-clause `correct_justified` and the `get_voting_source` conditional branching. - Adapted `f.Ancestor(blockRoot, finalizedSlot)` → `f.Ancestor(blockRoot, finalizedSlot).Root` since on main `Ancestor` returns `ForkChoiceNode` (GLOAS / EIP-7732 change) rather than `common.Hash`.
AskAlexSharov
pushed a commit
that referenced
this pull request
May 22, 2026
…heckpoints (#21339) ## Summary Follow-up to #21310. Fixes a remaining spec deviation in `OnBlock`'s prior-epoch `updateCheckpoints` call: per [`compute_pulled_up_tip`](https://github.com/ethereum/consensus-specs/blob/master/specs/phase0/fork-choice.md), the prior-epoch update should use the post-pull-up checkpoints (after \`process_justification_and_finalization\` runs), not the realized values that were restored onto the state for replay purposes. ## The bug In \`cl/phase1/forkchoice/on_block.go\`: \`\`\`go ProcessJustificationBitsAndFinality(lastProcessedState, nil) // pull-up runs here unrealizedJustifications.Store(.., lastProcessedState.CurrentJustifiedCheckpoint()) ... // restore pre-pull-up values onto lastProcessedState (for replay/simulation) lastProcessedState.SetCurrentJustifiedCheckpoint(currentJustifiedCheckpoint) lastProcessedState.SetFinalizedCheckpoint(stateFinalized) ... if blockEpoch < currentEpoch { // BUG: now reads the just-restored pre-pull-up values, not the pulled-up ones updateCheckpoints(lastProcessedState.CurrentJustifiedCheckpoint(), lastProcessedState.FinalizedCheckpoint()) } \`\`\` Effect: \`store.justified_checkpoint\` / \`store.finalized_checkpoint\` get updated with realized values rather than the post-pull-up values for late-arriving prior-epoch blocks. The store ends up one epoch boundary behind what the spec would have. ## Fix Capture the post-pull-up values into local variables right after \`ProcessJustificationBitsAndFinality\` and use them consistently in: 1. \`unrealizedJustifications.Store\` / \`unrealizedFinalizations.Store\` (already pull-up, no behavior change) 2. \`updateUnrealizedCheckpoints\` (already pull-up, no behavior change) 3. \`updateCheckpoints\` in the \`blockEpoch < currentEpoch\` branch (**this is the actual fix**) ## Test plan - [x] Build clean - [x] Runtime: validated alongside #21310 on bloatnet at chain tip — 23+ FCUs all at \`lagSlots=0\`, no large unwinds, no filter rejects after this change is applied - [ ] CI green Refs: #21301, #21310 ## Already validated on main This same change (commit `3cfb67d6ef`) is included in #21327 (`cp/21310-to-main` → `main`), where **all CI checks are passing**: lint, unit tests, race tests, integration tests, kurtosis, mainnet-rpc-integ, hive, reproducible build across linux/mac/windows. See #21327. So this PR is the same patch already exercised by CI on the main-branch flavor of #21310 — applying it to `performance` for parity.
This was referenced May 27, 2026
pull Bot
pushed a commit
to Dustin4444/erigon
that referenced
this pull request
May 28, 2026
…mit (erigontech#21444) ## Summary `ExecModule.UpdateForkChoice` spawns the forkchoice work in a goroutine and waited on a `done` channel that closes only after that goroutine **fully returns** — i.e. after flush + commit + prune. For a merge-extending fork at tip the forkchoice result is already sent on `outcomeCh` *before* the commit (`ParallelStateFlushing` is on by default), so the `<-done` wait threw that early-send away and blocked the consensus client for the entire EL commit. This removes the `<-done` wait: `UpdateForkChoice` returns as soon as the result lands on `outcomeCh`. ## Live-tip A/B (this branch vs main) Probe at the `ExecutionClientDirect.ForkChoiceUpdate` boundary (what Caplin's clstages loop blocks on), live mainnet tip, same datadir/machine/instrumentation: | FCU-response wall (ms) | n | min | p50 | p90 | max | mean | |------------------------|--:|----:|----:|----:|----:|-----:| | **main** | 100 | 96.7 | **260.2** | 386.3 | 12160.8 | 390.1 | | **this branch** | 82 | 0.7 | **1.2** | 3.0 | 4.3 | 1.7 | ~217× lower p50; distributions don't overlap (branch max 4.3 ms ≪ main min 96.7 ms). This matches release/3.4's ~1.4 ms at the same boundary. The `Timings: Forkchoice commit=…ms` line is unchanged — the commit still costs the same, it just no longer blocks the consensus loop. safe/finalized cadence is identical between branch and main, and wall does not correlate with safe/finalized changes, so this is not about forkchoice args. Both A/B binaries are post-erigontech#21327, so the comparison isolates only the `<-done` change — the gap is not confounded by the erigontech#21008/erigontech#20035 regression (both sides already have its fix). ## Scope / relationship to erigontech#21008 This is an **independent FCU-response-latency improvement**, not the fix for the erigontech#21008 in-sync regression. The bisection attributes erigontech#21008 to erigontech#20035, which is entirely CL-side (`cl/phase1/forkchoice/*`) and touches no `execution/execmodule/` code — its `getFilterBlockTree` spec-deviation regressed `GetHead` at epoch boundaries and is itself already fixed on main by erigontech#21310/erigontech#21327. The `<-done` wait predates the bisection's good commit `19f44b15cd` (it was added in erigontech#19981, and that commit measured 100% in-sync), so the EL-side blocking this PR removes is **not** what caused the 100→3% collapse. It is a real, separate cost worth removing on its own merits: at tip it returns the consensus loop to ~1 ms instead of ~260 ms (and avoids multi-second blocks when the commit overlaps a background merge/prune burst). ## Why it's safe The `<-done` was added (erigontech#19981) so a follow-up `AssembleBlock` could safely acquire the EL semaphore. That guarantee is preserved without it: - In `updateForkChoice`, the `defer e.semaphore.Release(1)` is registered **first**, so it runs **last** — after every cleanup defer (`ResetPendingUpdates`, `ClearWithUnwind`, overlay `Close`). Any op that subsequently acquires the semaphore necessarily observes fully-settled state. - `ExecutionClientDirect.ForkChoiceUpdate` already retries `AssembleBlock` (30× 200 ms) on semaphore contention, so a proposer FCU racing an in-flight commit resolves itself. - release/3.4 has no `<-done` and is the proven-good baseline. - Non-merge FCUs send their result at the end of `updateForkChoice` anyway, so their timing is unchanged. ## Test plan - [x] `execution/engineapi` `TestEngineApi*` block-production suite (the FCU→AssembleBlock path `<-done` guarded) - [x] `execution/execmodule/...` - [x] `cl/beacon/handler` `TestCaplinBlockProduction*` - [x] `make lint` clean - [x] Live-tip A/B (this branch vs main): FCU-response wall p50 260 ms → 1.2 ms, 0 gap errors over an epoch 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two spec-deviations in
getFilterBlockTreecompound to reject valid current-epoch leaves at every epoch boundary, causing Caplin'sGetHeadto fall back to the justified checkpoint root (a ~30-50 slot regression) and triggering execution-side unwinds.Observed symptom (from #21301)
On bloatnet, once execution catches up to chain tip, the node enters a steady-state cycle of 22-36 block execution unwinds every ~6 minutes — one per epoch boundary. Histogram from one ~3h run (110 unwind events):
The unwinds are on the same chain (no real reorg) — execution is forced to roll back to an older head because Caplin regresses its
GetHead()return value:The 08:44:00 FCU triggers a 23-block exec unwind:
hashesDiffer=falseconfirms same-chain.executionAhead=truebecause exec raced ahead of Caplin's regressing head.Root cause: filter rejects mid-epoch leaves
Tracing inside
GetHead:Every leaf in the tree gets filtered because its
unrealizedJustifications[blockRoot].epoch = N+1but the store'sjustifiedCheckpoint.epoch = N(store's realized lags by 1 epoch mid-epoch). The walk falls back to the justified root → 50+ slot regression.Spec deviations (the fix)
1. Voting-source selection used unrealized unconditionally
Spec
get_voting_sourcepicks unrealized only for prior-epoch blocks (pull-up justification view); current-epoch blocks should use the block's realized state checkpoint:Erigon's code:
→ Always picks unrealized when available. The "fall back" comment misreads the spec — spec is a conditional branch, not a fallback.
2. Justified/finalized check was strict equality
Spec
filter_block_treehas a lenient+2 >= current_epochlookback:Erigon's code only allowed the first two clauses (strict
Equal()).Regression provenance
The deviations were introduced in #20035 (
caplin: unified Engine API client for standalone mode, Mar 25 2026) by Mark Holt:Before that PR,
getFilterBlockTreeusedf.forkGraph.GetCurrentJustifiedCheckpoint(blockRoot)directly — which returns the block's realized justified checkpoint. Block.realized matched store.realized (both on the same epoch-boundary timeline), so the equality check passed.PR #20035 introduced the per-block unrealized lookup (
store.unrealized_justificationsper the spec), but missed the conditional branching (spec deviation #1) and didn't add the +2 lookback (spec deviation #2). The result: block.unrealized goes ahead by 1 epoch mid-epoch, and store.realized doesn't catch up untilOnTick's epoch-boundary path runs.Fix
Restore spec-faithful behavior in
getFilterBlockTree:Branch the voting source selection on
currentEpoch > blockEpoch:store.unrealized_justifications[block_root]current_justified_checkpointIn both branches, return
false(reject the leaf) if the lookup is absent.OnBlockpopulatesunrealizedJustificationsfor every imported block above the finalized slot, so a missing entry indicates either an invariant violation or a leaf below finalized — neither should produce a viable head.Replace the strict
Equal()check on the justified side with the spec's three flat disjuncts:store.justified_checkpoint.epoch == GENESIS_EPOCH || voting_source.epoch == store.justified_checkpoint.epoch || voting_source.epoch + 2 >= current_epoch.Replace the finalized-side checkpoint-equality check with the spec's ancestor-descent rule:
store.finalized_checkpoint.root == get_checkpoint_block(block_root, finalized.epoch), implemented viaf.Ancestor(blockRoot, finalizedSlot).Snapshot
currentEpochonce at the start of the walk (ingetFilteredBlockTree) and thread it through the recursion so the whole rebuild uses a single store-epoch value —OnTickupdatesf.timewithout holdingf.mu, so per-leaff.Slot()reads can mix epochs across leaves around a slot boundary.Validation
Tested on bloatnet at chain tip, on
performanceHEAD + this fix:[filterTree] rejectno longer fires.Effects:
Unwind Executionlines at tip.Test plan
Caplin is sending forkchoicelogs (lagSlots stays ~0 at tip)make spectest)Refs
Full diagnosis trail: #21301