Skip to content

cl/phase1/forkchoice: align getFilterBlockTree with consensus spec#21310

Merged
sudeepdino008 merged 1 commit into
performancefrom
sudeep/fix-getheadtree-spec-justification
May 21, 2026
Merged

cl/phase1/forkchoice: align getFilterBlockTree with consensus spec#21310
sudeepdino008 merged 1 commit into
performancefrom
sudeep/fix-getheadtree-spec-justification

Conversation

@sudeepdino008

@sudeepdino008 sudeepdino008 commented May 20, 2026

Copy link
Copy Markdown
Member

Summary

Two spec-deviations in getFilterBlockTree compound to reject valid current-epoch leaves at every epoch boundary, causing Caplin's GetHead to fall back to the justified checkpoint root (a ~30-50 slot regression) and triggering execution-side unwinds.

Observed symptom (from #21301)

On bloatnet, once execution catches up to chain tip, the node enters a steady-state cycle of 22-36 block execution unwinds every ~6 minutes — one per epoch boundary. Histogram from one ~3h run (110 unwind events):

   1 size=11
   1 size=14
   4 size=15
   1 size=16
   4 size=18
   2 size=19
   3 size=20
   5 size=21
   9 size=22
  14 size=23     ← most common
   3 size=24
   5 size=25
   4 size=26
   4 size=27
   8 size=28
   1 size=30
   2 size=31
   4 size=32
   3 size=33
   4 size=34
   ...many size=35-36

The unwinds are on the same chain (no real reorg) — execution is forced to roll back to an older head because Caplin regresses its GetHead() return value:

08:38:45 Caplin FCU: headSlot=652928  currentSlot=652988  lagSlots=60   eth1Head=0xe0f3...
08:39:36 Caplin FCU: headSlot=652993  currentSlot=652993  lagSlots=0    eth1Head=0x263b... ← at tip
08:44:00 Caplin FCU: headSlot=652959  currentSlot=653015  lagSlots=56   eth1Head=0x7674... ← regressed 34 slots

The 08:44:00 FCU triggers a 23-block exec unwind:

[forkchoice] entering unwind path  fcuNum=24701328  canonHash=0x7674... fcuHash=0x7674...
                                   hashesDiffer=false  finishProgress=24701350  executionAhead=true
Unwind Execution from=24701350 to=24701327

hashesDiffer=false confirms same-chain. executionAhead=true because exec raced ahead of Caplin's regressing head.

Root cause: filter rejects mid-epoch leaves

Tracing inside GetHead:

[GetHead] cache miss, rebuilding from justifiedCheckpoint  justifiedEpoch=20410
[filterTree] reject leaf: checkpoint mismatch  blockRoot=0xe644... slot=653150
            justifiedOk=false  finalizedOk=false
            storeJustEpoch=20410  blockJustEpoch=20411
            storeFinalEpoch=20409  blockFinalEpoch=20410
[GetHead] filtered tree size  viableBlocks=0  totalHeads=1
[GetHead] rebuilt head  headHash=0x... headSlot=653119  ← fallback to justified root
Caplin is sending forkchoice  headSlot=653119  currentSlot=653174  lagSlots=55

Every leaf in the tree gets filtered because its unrealizedJustifications[blockRoot].epoch = N+1 but the store's justifiedCheckpoint.epoch = N (store's realized lags by 1 epoch mid-epoch). The walk falls back to the justified root → 50+ slot regression.

Spec deviations (the fix)

1. Voting-source selection used unrealized unconditionally

Spec get_voting_source picks unrealized only for prior-epoch blocks (pull-up justification view); current-epoch blocks should use the block's realized state checkpoint:

def get_voting_source(store: Store, block_root: Root) -> Checkpoint:
    block = store.blocks[block_root]
    current_epoch = get_current_store_epoch(store)
    block_epoch = compute_epoch_at_slot(block.slot)
    if current_epoch > block_epoch:
        return store.unrealized_justifications[block_root]
    else:
        return store.block_states[block_root].current_justified_checkpoint

Erigon's code:

// Use per-block unrealized justifications (spec: store.unrealized_justifications[block_root])
// Fall back to realized checkpoints if unrealized not available
currentJustifiedCheckpoint, has := f.getUnrealizedJustification(blockRoot)
if !has {
    currentJustifiedCheckpoint, has = f.forkGraph.GetCurrentJustifiedCheckpoint(blockRoot)
    ...
}

→ Always picks unrealized when available. The "fall back" comment misreads the spec — spec is a conditional branch, not a fallback.

2. Justified/finalized check was strict equality

Spec filter_block_tree has a lenient +2 >= current_epoch lookback:

correct_justified = (
    store.justified_checkpoint.epoch == GENESIS_EPOCH
    or voting_source.epoch == store.justified_checkpoint.epoch
    or voting_source.epoch + 2 >= current_epoch  # ← lenient lookback
)

Erigon's code only allowed the first two clauses (strict Equal()).

Regression provenance

The deviations were introduced in #20035 (caplin: unified Engine API client for standalone mode, Mar 25 2026) by Mark Holt:

082697d2a0  cl/phase1/forkchoice/get_head.go  +Use per-block unrealized justifications

Before that PR, getFilterBlockTree used f.forkGraph.GetCurrentJustifiedCheckpoint(blockRoot) directly — which returns the block's realized justified checkpoint. Block.realized matched store.realized (both on the same epoch-boundary timeline), so the equality check passed.

PR #20035 introduced the per-block unrealized lookup (store.unrealized_justifications per the spec), but missed the conditional branching (spec deviation #1) and didn't add the +2 lookback (spec deviation #2). The result: block.unrealized goes ahead by 1 epoch mid-epoch, and store.realized doesn't catch up until OnTick's epoch-boundary path runs.

Fix

Restore spec-faithful behavior in getFilterBlockTree:

  1. Branch the voting source selection on currentEpoch > blockEpoch:

    • Prior-epoch → store.unrealized_justifications[block_root]
    • Current/future-epoch → block's realized current_justified_checkpoint

    In both branches, return false (reject the leaf) if the lookup is absent. OnBlock populates unrealizedJustifications for every imported block above the finalized slot, so a missing entry indicates either an invariant violation or a leaf below finalized — neither should produce a viable head.

  2. Replace the strict Equal() check on the justified side with the spec's three flat disjuncts: store.justified_checkpoint.epoch == GENESIS_EPOCH || voting_source.epoch == store.justified_checkpoint.epoch || voting_source.epoch + 2 >= current_epoch.

  3. Replace the finalized-side checkpoint-equality check with the spec's ancestor-descent rule: store.finalized_checkpoint.root == get_checkpoint_block(block_root, finalized.epoch), implemented via f.Ancestor(blockRoot, finalizedSlot).

  4. Snapshot currentEpoch once at the start of the walk (in getFilteredBlockTree) and thread it through the recursion so the whole rebuild uses a single store-epoch value — OnTick updates f.time without holding f.mu, so per-leaf f.Slot() reads can mix epochs across leaves around a slot boundary.

Validation

Tested on bloatnet at chain tip, on performance HEAD + this fix:

  • Before fix: 22-36 block unwinds every ~6 min, one per epoch boundary, persisting indefinitely. 110 events in 3h.
  • After fix: zero unwinds at epoch boundaries observed across 30+ minutes (~5 epoch boundaries crossed). [filterTree] reject no longer fires.

Effects:

  • CPU waste eliminated: ~30 blocks of execution work were being discarded and re-done every cycle.
  • db_size: unchanged (was already not affected since rolled-back state was in MemoryMutation overlay, not MDBX).
  • Logs: clean — no more spurious Unwind Execution lines at tip.

Test plan

  • Run on bloatnet to chain tip, observe no large unwinds at epoch boundaries
  • Confirm no head regression in Caplin is sending forkchoice logs (lagSlots stays ~0 at tip)
  • Spectest still passes (make spectest)

Refs

Full diagnosis trail: #21301

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Aligns Caplin forkchoice getFilterBlockTree behavior with the Ethereum consensus spec to prevent valid current-epoch leaves from being incorrectly filtered at epoch boundaries, which can cause GetHead() regressions and execution unwinds.

Changes:

  • Implements spec-faithful get_voting_source selection by using unrealized checkpoints only for prior-epoch blocks and realized state checkpoints for current/future-epoch blocks.
  • Adds the spec’s lenient +2 >= current_epoch lookback condition for justified/finalized checkpoint filtering.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cl/phase1/forkchoice/get_head.go
Comment thread cl/phase1/forkchoice/get_head.go Outdated
Comment thread cl/phase1/forkchoice/get_head.go Outdated
@sudeepdino008 sudeepdino008 force-pushed the sudeep/fix-getheadtree-spec-justification branch from e43413b to 9d225e6 Compare May 21, 2026 06:45
@sudeepdino008 sudeepdino008 requested a review from Copilot May 21, 2026 06:47
@sudeepdino008 sudeepdino008 force-pushed the sudeep/fix-getheadtree-spec-justification branch from 9d225e6 to f0172e9 Compare May 21, 2026 06:50
sudeepdino008 added a commit that referenced this pull request May 21, 2026
The post-GLOAS rewrite (#18956, f3ad70f) restored ancestor-descent
for the finalized check but left two justified-side deviations from
the spec's filter_block_tree / get_voting_source:

1. Voting source selection used unrealized unconditionally with a
   realized fallback. Spec get_voting_source branches strictly on
   current_epoch > block_epoch — unrealized only for prior-epoch
   blocks; current-epoch blocks must use the block's realized
   current_justified_checkpoint.

2. correct_justified used voting_source.epoch >= justified.epoch with
   an is_previous_epoch_justified-gated +2 fallback. Spec is the
   three flat disjuncts: epoch == GENESIS, voting_source.epoch ==
   justified.epoch, or voting_source.epoch + 2 >= current_epoch.

The finalized ancestor-descent check from f3ad70f is preserved
unchanged.

Observed effect on bloatnet: epoch-boundary head regression that
caused ~30-block execution unwinds every ~6 minutes is eliminated.
Validated by 1 hour at chain tip on bloatnet covering 5+ epoch
boundaries — zero unwinds, zero filterTree rejects, Caplin lagSlots=0
on all 172 forkchoice updates in the window.

Refs: #21301, #21310

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread cl/phase1/forkchoice/get_head.go Outdated
sudeepdino008 added a commit that referenced this pull request May 21, 2026
The post-GLOAS rewrite (#18956, f3ad70f) restored ancestor-descent
for the finalized check but left two justified-side deviations from
the spec's filter_block_tree / get_voting_source:

1. Voting source selection used unrealized unconditionally with a
   realized fallback. Spec get_voting_source branches strictly on
   current_epoch > block_epoch — unrealized only for prior-epoch
   blocks; current-epoch blocks must use the block's realized
   current_justified_checkpoint.

2. correct_justified used voting_source.epoch >= justified.epoch with
   an is_previous_epoch_justified-gated +2 fallback. Spec is the
   three flat disjuncts: epoch == GENESIS, voting_source.epoch ==
   justified.epoch, or voting_source.epoch + 2 >= current_epoch.

The finalized ancestor-descent check from f3ad70f is preserved
unchanged.

Observed effect on bloatnet: epoch-boundary head regression that
caused ~30-block execution unwinds every ~6 minutes is eliminated.
Validated by 1 hour at chain tip on bloatnet covering 5+ epoch
boundaries — zero unwinds, zero filterTree rejects, Caplin lagSlots=0
on all 172 forkchoice updates in the window.

Refs: #21301, #21310
The previous implementation deviated from the consensus spec's
filter_block_tree / get_voting_source in two ways, compounding to
reject valid current-epoch leaves at every epoch boundary:

1. Voting source selection used unrealized unconditionally. Spec
   get_voting_source picks unrealized only for prior-epoch blocks (to
   pull-up justification view); current-epoch blocks should use the
   block's realized state checkpoint.

2. The justified/finalized viability check was a strict Equal(). Spec
   correct_justified allows voting_source.epoch + 2 >= current_epoch
   as a lenient 2-epoch lookback.

Both fixes restore the spec-faithful behavior. Observed effect on
bloatnet: epoch-boundary head regression that caused ~30-block
execution unwinds every ~6 minutes is eliminated.

Refs: #21301
@sudeepdino008 sudeepdino008 force-pushed the sudeep/fix-getheadtree-spec-justification branch from f0172e9 to 2f2f6fb Compare May 21, 2026 06:59
@sudeepdino008 sudeepdino008 requested a review from Copilot May 21, 2026 07:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread cl/phase1/forkchoice/get_head.go
@sudeepdino008 sudeepdino008 merged commit f1e024a into performance May 21, 2026
45 checks passed
@sudeepdino008 sudeepdino008 deleted the sudeep/fix-getheadtree-spec-justification branch May 21, 2026 08:07
vh0rvath pushed a commit to vh0rvath/erigon that referenced this pull request May 21, 2026
…rigontech#21327)

Cherry-pick of erigontech#21310 to main.

## main-specific adaptations

- Conflict in `cl/phase1/forkchoice/get_head.go` resolved in favour of
erigontech#21310's spec-faithful version, replacing main's existing attempt (`>=`
check + previous-epoch-justified fallback) with the spec's three-clause
`correct_justified` and the `get_voting_source` conditional branching.
- Adapted `f.Ancestor(blockRoot, finalizedSlot)` →
`f.Ancestor(blockRoot, finalizedSlot).Root` since on main `Ancestor`
returns `ForkChoiceNode` (GLOAS / EIP-7732 change) rather than
`common.Hash`.
AskAlexSharov pushed a commit that referenced this pull request May 22, 2026
…heckpoints (#21339)

## Summary

Follow-up to #21310. Fixes a remaining spec deviation in `OnBlock`'s
prior-epoch `updateCheckpoints` call: per
[`compute_pulled_up_tip`](https://github.com/ethereum/consensus-specs/blob/master/specs/phase0/fork-choice.md),
the prior-epoch update should use the post-pull-up checkpoints (after
\`process_justification_and_finalization\` runs), not the realized
values that were restored onto the state for replay purposes.

## The bug

In \`cl/phase1/forkchoice/on_block.go\`:

\`\`\`go
ProcessJustificationBitsAndFinality(lastProcessedState, nil) // pull-up
runs here
unrealizedJustifications.Store(..,
lastProcessedState.CurrentJustifiedCheckpoint())
...
// restore pre-pull-up values onto lastProcessedState (for
replay/simulation)

lastProcessedState.SetCurrentJustifiedCheckpoint(currentJustifiedCheckpoint)
lastProcessedState.SetFinalizedCheckpoint(stateFinalized)
...
if blockEpoch < currentEpoch {
// BUG: now reads the just-restored pre-pull-up values, not the
pulled-up ones
updateCheckpoints(lastProcessedState.CurrentJustifiedCheckpoint(),
lastProcessedState.FinalizedCheckpoint())
}
\`\`\`

Effect: \`store.justified_checkpoint\` / \`store.finalized_checkpoint\`
get updated with realized values rather than the post-pull-up values for
late-arriving prior-epoch blocks. The store ends up one epoch boundary
behind what the spec would have.

## Fix

Capture the post-pull-up values into local variables right after
\`ProcessJustificationBitsAndFinality\` and use them consistently in:

1. \`unrealizedJustifications.Store\` /
\`unrealizedFinalizations.Store\` (already pull-up, no behavior change)
2. \`updateUnrealizedCheckpoints\` (already pull-up, no behavior change)
3. \`updateCheckpoints\` in the \`blockEpoch < currentEpoch\` branch
(**this is the actual fix**)

## Test plan

- [x] Build clean
- [x] Runtime: validated alongside #21310 on bloatnet at chain tip — 23+
FCUs all at \`lagSlots=0\`, no large unwinds, no filter rejects after
this change is applied
- [ ] CI green

Refs: #21301, #21310

## Already validated on main

This same change (commit `3cfb67d6ef`) is included in #21327
(`cp/21310-to-main` → `main`), where **all CI checks are passing**:
lint, unit tests, race tests, integration tests, kurtosis,
mainnet-rpc-integ, hive, reproducible build across linux/mac/windows.
See #21327.

So this PR is the same patch already exercised by CI on the main-branch
flavor of #21310 — applying it to `performance` for parity.
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 28, 2026
…mit (erigontech#21444)

## Summary

`ExecModule.UpdateForkChoice` spawns the forkchoice work in a goroutine
and waited on a `done` channel that closes only after that goroutine
**fully returns** — i.e. after flush + commit + prune. For a
merge-extending fork at tip the forkchoice result is already sent on
`outcomeCh` *before* the commit (`ParallelStateFlushing` is on by
default), so the `<-done` wait threw that early-send away and blocked
the consensus client for the entire EL commit.

This removes the `<-done` wait: `UpdateForkChoice` returns as soon as
the result lands on `outcomeCh`.

## Live-tip A/B (this branch vs main)

Probe at the `ExecutionClientDirect.ForkChoiceUpdate` boundary (what
Caplin's clstages loop blocks on), live mainnet tip, same
datadir/machine/instrumentation:

| FCU-response wall (ms) | n | min | p50 | p90 | max | mean |
|------------------------|--:|----:|----:|----:|----:|-----:|
| **main** | 100 | 96.7 | **260.2** | 386.3 | 12160.8 | 390.1 |
| **this branch** | 82 | 0.7 | **1.2** | 3.0 | 4.3 | 1.7 |

~217× lower p50; distributions don't overlap (branch max 4.3 ms ≪ main
min 96.7 ms). This matches release/3.4's ~1.4 ms at the same boundary.
The `Timings: Forkchoice commit=…ms` line is unchanged — the commit
still costs the same, it just no longer blocks the consensus loop.
safe/finalized cadence is identical between branch and main, and wall
does not correlate with safe/finalized changes, so this is not about
forkchoice args.

Both A/B binaries are post-erigontech#21327, so the comparison isolates only the
`<-done` change — the gap is not confounded by the erigontech#21008/erigontech#20035
regression (both sides already have its fix).

## Scope / relationship to erigontech#21008

This is an **independent FCU-response-latency improvement**, not the fix
for the erigontech#21008 in-sync regression. The bisection attributes erigontech#21008 to
erigontech#20035, which is entirely CL-side (`cl/phase1/forkchoice/*`) and touches
no `execution/execmodule/` code — its `getFilterBlockTree`
spec-deviation regressed `GetHead` at epoch boundaries and is itself
already fixed on main by erigontech#21310/erigontech#21327. The `<-done` wait predates the
bisection's good commit `19f44b15cd` (it was added in erigontech#19981, and that
commit measured 100% in-sync), so the EL-side blocking this PR removes
is **not** what caused the 100→3% collapse. It is a real, separate cost
worth removing on its own merits: at tip it returns the consensus loop
to ~1 ms instead of ~260 ms (and avoids multi-second blocks when the
commit overlaps a background merge/prune burst).

## Why it's safe

The `<-done` was added (erigontech#19981) so a follow-up `AssembleBlock` could
safely acquire the EL semaphore. That guarantee is preserved without it:

- In `updateForkChoice`, the `defer e.semaphore.Release(1)` is
registered **first**, so it runs **last** — after every cleanup defer
(`ResetPendingUpdates`, `ClearWithUnwind`, overlay `Close`). Any op that
subsequently acquires the semaphore necessarily observes fully-settled
state.
- `ExecutionClientDirect.ForkChoiceUpdate` already retries
`AssembleBlock` (30× 200 ms) on semaphore contention, so a proposer FCU
racing an in-flight commit resolves itself.
- release/3.4 has no `<-done` and is the proven-good baseline.
- Non-merge FCUs send their result at the end of `updateForkChoice`
anyway, so their timing is unchanged.

## Test plan

- [x] `execution/engineapi` `TestEngineApi*` block-production suite (the
FCU→AssembleBlock path `<-done` guarded)
- [x] `execution/execmodule/...`
- [x] `cl/beacon/handler` `TestCaplinBlockProduction*`
- [x] `make lint` clean
- [x] Live-tip A/B (this branch vs main): FCU-response wall p50 260 ms →
1.2 ms, 0 gap errors over an epoch

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants