cl/phase1/forkchoice: align getFilterBlockTree with consensus spec#21327
Merged
Conversation
The previous implementation deviated from the consensus spec's filter_block_tree / get_voting_source in two ways, compounding to reject valid current-epoch leaves at every epoch boundary: 1. Voting source selection used unrealized unconditionally. Spec get_voting_source picks unrealized only for prior-epoch blocks (to pull-up justification view); current-epoch blocks should use the block's realized state checkpoint. 2. The justified/finalized viability check was a strict Equal(). Spec correct_justified allows voting_source.epoch + 2 >= current_epoch as a lenient 2-epoch lookback. Both fixes restore the spec-faithful behavior. Observed effect on bloatnet: epoch-boundary head regression that caused ~30-block execution unwinds every ~6 minutes is eliminated. Refs: #21301
Contributor
There was a problem hiding this comment.
Pull request overview
Aligns Caplin forkchoice head filtering (getFilterBlockTree) with the Ethereum consensus spec to avoid rejecting valid current-epoch leaves around epoch boundaries, which can otherwise cause head regressions and downstream execution unwinds.
Changes:
- Snapshot the store’s
currentEpochonce per tree-walk to avoid epoch-mixing across leaves whenOnTickupdates time concurrently. - Implement spec-faithful
get_voting_sourcebranching (unrealized only for prior-epoch blocks; realized checkpoint for current/future-epoch blocks). - Implement spec-faithful
correct_justified(three-clause) andcorrect_finalized(finalized ancestor) checks.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
domiwei
approved these changes
May 21, 2026
AskAlexSharov
approved these changes
May 21, 2026
3 tasks
AskAlexSharov
pushed a commit
that referenced
this pull request
May 22, 2026
…heckpoints (#21339) ## Summary Follow-up to #21310. Fixes a remaining spec deviation in `OnBlock`'s prior-epoch `updateCheckpoints` call: per [`compute_pulled_up_tip`](https://github.com/ethereum/consensus-specs/blob/master/specs/phase0/fork-choice.md), the prior-epoch update should use the post-pull-up checkpoints (after \`process_justification_and_finalization\` runs), not the realized values that were restored onto the state for replay purposes. ## The bug In \`cl/phase1/forkchoice/on_block.go\`: \`\`\`go ProcessJustificationBitsAndFinality(lastProcessedState, nil) // pull-up runs here unrealizedJustifications.Store(.., lastProcessedState.CurrentJustifiedCheckpoint()) ... // restore pre-pull-up values onto lastProcessedState (for replay/simulation) lastProcessedState.SetCurrentJustifiedCheckpoint(currentJustifiedCheckpoint) lastProcessedState.SetFinalizedCheckpoint(stateFinalized) ... if blockEpoch < currentEpoch { // BUG: now reads the just-restored pre-pull-up values, not the pulled-up ones updateCheckpoints(lastProcessedState.CurrentJustifiedCheckpoint(), lastProcessedState.FinalizedCheckpoint()) } \`\`\` Effect: \`store.justified_checkpoint\` / \`store.finalized_checkpoint\` get updated with realized values rather than the post-pull-up values for late-arriving prior-epoch blocks. The store ends up one epoch boundary behind what the spec would have. ## Fix Capture the post-pull-up values into local variables right after \`ProcessJustificationBitsAndFinality\` and use them consistently in: 1. \`unrealizedJustifications.Store\` / \`unrealizedFinalizations.Store\` (already pull-up, no behavior change) 2. \`updateUnrealizedCheckpoints\` (already pull-up, no behavior change) 3. \`updateCheckpoints\` in the \`blockEpoch < currentEpoch\` branch (**this is the actual fix**) ## Test plan - [x] Build clean - [x] Runtime: validated alongside #21310 on bloatnet at chain tip — 23+ FCUs all at \`lagSlots=0\`, no large unwinds, no filter rejects after this change is applied - [ ] CI green Refs: #21301, #21310 ## Already validated on main This same change (commit `3cfb67d6ef`) is included in #21327 (`cp/21310-to-main` → `main`), where **all CI checks are passing**: lint, unit tests, race tests, integration tests, kurtosis, mainnet-rpc-integ, hive, reproducible build across linux/mac/windows. See #21327. So this PR is the same patch already exercised by CI on the main-branch flavor of #21310 — applying it to `performance` for parity.
This was referenced May 27, 2026
pull Bot
pushed a commit
to Dustin4444/erigon
that referenced
this pull request
May 28, 2026
…mit (erigontech#21444) ## Summary `ExecModule.UpdateForkChoice` spawns the forkchoice work in a goroutine and waited on a `done` channel that closes only after that goroutine **fully returns** — i.e. after flush + commit + prune. For a merge-extending fork at tip the forkchoice result is already sent on `outcomeCh` *before* the commit (`ParallelStateFlushing` is on by default), so the `<-done` wait threw that early-send away and blocked the consensus client for the entire EL commit. This removes the `<-done` wait: `UpdateForkChoice` returns as soon as the result lands on `outcomeCh`. ## Live-tip A/B (this branch vs main) Probe at the `ExecutionClientDirect.ForkChoiceUpdate` boundary (what Caplin's clstages loop blocks on), live mainnet tip, same datadir/machine/instrumentation: | FCU-response wall (ms) | n | min | p50 | p90 | max | mean | |------------------------|--:|----:|----:|----:|----:|-----:| | **main** | 100 | 96.7 | **260.2** | 386.3 | 12160.8 | 390.1 | | **this branch** | 82 | 0.7 | **1.2** | 3.0 | 4.3 | 1.7 | ~217× lower p50; distributions don't overlap (branch max 4.3 ms ≪ main min 96.7 ms). This matches release/3.4's ~1.4 ms at the same boundary. The `Timings: Forkchoice commit=…ms` line is unchanged — the commit still costs the same, it just no longer blocks the consensus loop. safe/finalized cadence is identical between branch and main, and wall does not correlate with safe/finalized changes, so this is not about forkchoice args. Both A/B binaries are post-erigontech#21327, so the comparison isolates only the `<-done` change — the gap is not confounded by the erigontech#21008/erigontech#20035 regression (both sides already have its fix). ## Scope / relationship to erigontech#21008 This is an **independent FCU-response-latency improvement**, not the fix for the erigontech#21008 in-sync regression. The bisection attributes erigontech#21008 to erigontech#20035, which is entirely CL-side (`cl/phase1/forkchoice/*`) and touches no `execution/execmodule/` code — its `getFilterBlockTree` spec-deviation regressed `GetHead` at epoch boundaries and is itself already fixed on main by erigontech#21310/erigontech#21327. The `<-done` wait predates the bisection's good commit `19f44b15cd` (it was added in erigontech#19981, and that commit measured 100% in-sync), so the EL-side blocking this PR removes is **not** what caused the 100→3% collapse. It is a real, separate cost worth removing on its own merits: at tip it returns the consensus loop to ~1 ms instead of ~260 ms (and avoids multi-second blocks when the commit overlaps a background merge/prune burst). ## Why it's safe The `<-done` was added (erigontech#19981) so a follow-up `AssembleBlock` could safely acquire the EL semaphore. That guarantee is preserved without it: - In `updateForkChoice`, the `defer e.semaphore.Release(1)` is registered **first**, so it runs **last** — after every cleanup defer (`ResetPendingUpdates`, `ClearWithUnwind`, overlay `Close`). Any op that subsequently acquires the semaphore necessarily observes fully-settled state. - `ExecutionClientDirect.ForkChoiceUpdate` already retries `AssembleBlock` (30× 200 ms) on semaphore contention, so a proposer FCU racing an in-flight commit resolves itself. - release/3.4 has no `<-done` and is the proven-good baseline. - Non-merge FCUs send their result at the end of `updateForkChoice` anyway, so their timing is unchanged. ## Test plan - [x] `execution/engineapi` `TestEngineApi*` block-production suite (the FCU→AssembleBlock path `<-done` guarded) - [x] `execution/execmodule/...` - [x] `cl/beacon/handler` `TestCaplinBlockProduction*` - [x] `make lint` clean - [x] Live-tip A/B (this branch vs main): FCU-response wall p50 260 ms → 1.2 ms, 0 gap errors over an epoch 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Cherry-pick of #21310 to main.
main-specific adaptations
cl/phase1/forkchoice/get_head.goresolved in favour of cl/phase1/forkchoice: align getFilterBlockTree with consensus spec #21310's spec-faithful version, replacing main's existing attempt (>=check + previous-epoch-justified fallback) with the spec's three-clausecorrect_justifiedand theget_voting_sourceconditional branching.f.Ancestor(blockRoot, finalizedSlot)→f.Ancestor(blockRoot, finalizedSlot).Rootsince on mainAncestorreturnsForkChoiceNode(GLOAS / EIP-7732 change) rather thancommon.Hash.