Skip to content

cl/phase1/forkchoice: align getFilterBlockTree with consensus spec#21327

Merged
domiwei merged 3 commits into
mainfrom
cp/21310-to-main
May 21, 2026
Merged

cl/phase1/forkchoice: align getFilterBlockTree with consensus spec#21327
domiwei merged 3 commits into
mainfrom
cp/21310-to-main

Conversation

@sudeepdino008

Copy link
Copy Markdown
Member

Cherry-pick of #21310 to main.

main-specific adaptations

  • Conflict in cl/phase1/forkchoice/get_head.go resolved in favour of cl/phase1/forkchoice: align getFilterBlockTree with consensus spec #21310's spec-faithful version, replacing main's existing attempt (>= check + previous-epoch-justified fallback) with the spec's three-clause correct_justified and the get_voting_source conditional branching.
  • Adapted f.Ancestor(blockRoot, finalizedSlot)f.Ancestor(blockRoot, finalizedSlot).Root since on main Ancestor returns ForkChoiceNode (GLOAS / EIP-7732 change) rather than common.Hash.

The previous implementation deviated from the consensus spec's
filter_block_tree / get_voting_source in two ways, compounding to
reject valid current-epoch leaves at every epoch boundary:

1. Voting source selection used unrealized unconditionally. Spec
   get_voting_source picks unrealized only for prior-epoch blocks (to
   pull-up justification view); current-epoch blocks should use the
   block's realized state checkpoint.

2. The justified/finalized viability check was a strict Equal(). Spec
   correct_justified allows voting_source.epoch + 2 >= current_epoch
   as a lenient 2-epoch lookback.

Both fixes restore the spec-faithful behavior. Observed effect on
bloatnet: epoch-boundary head regression that caused ~30-block
execution unwinds every ~6 minutes is eliminated.

Refs: #21301

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Aligns Caplin forkchoice head filtering (getFilterBlockTree) with the Ethereum consensus spec to avoid rejecting valid current-epoch leaves around epoch boundaries, which can otherwise cause head regressions and downstream execution unwinds.

Changes:

  • Snapshot the store’s currentEpoch once per tree-walk to avoid epoch-mixing across leaves when OnTick updates time concurrently.
  • Implement spec-faithful get_voting_source branching (unrealized only for prior-epoch blocks; realized checkpoint for current/future-epoch blocks).
  • Implement spec-faithful correct_justified (three-clause) and correct_finalized (finalized ancestor) checks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@domiwei domiwei enabled auto-merge May 21, 2026 08:00
@domiwei domiwei added this pull request to the merge queue May 21, 2026
Merged via the queue into main with commit 419521c May 21, 2026
70 checks passed
@domiwei domiwei deleted the cp/21310-to-main branch May 21, 2026 12:28
AskAlexSharov pushed a commit that referenced this pull request May 22, 2026
…heckpoints (#21339)

## Summary

Follow-up to #21310. Fixes a remaining spec deviation in `OnBlock`'s
prior-epoch `updateCheckpoints` call: per
[`compute_pulled_up_tip`](https://github.com/ethereum/consensus-specs/blob/master/specs/phase0/fork-choice.md),
the prior-epoch update should use the post-pull-up checkpoints (after
\`process_justification_and_finalization\` runs), not the realized
values that were restored onto the state for replay purposes.

## The bug

In \`cl/phase1/forkchoice/on_block.go\`:

\`\`\`go
ProcessJustificationBitsAndFinality(lastProcessedState, nil) // pull-up
runs here
unrealizedJustifications.Store(..,
lastProcessedState.CurrentJustifiedCheckpoint())
...
// restore pre-pull-up values onto lastProcessedState (for
replay/simulation)

lastProcessedState.SetCurrentJustifiedCheckpoint(currentJustifiedCheckpoint)
lastProcessedState.SetFinalizedCheckpoint(stateFinalized)
...
if blockEpoch < currentEpoch {
// BUG: now reads the just-restored pre-pull-up values, not the
pulled-up ones
updateCheckpoints(lastProcessedState.CurrentJustifiedCheckpoint(),
lastProcessedState.FinalizedCheckpoint())
}
\`\`\`

Effect: \`store.justified_checkpoint\` / \`store.finalized_checkpoint\`
get updated with realized values rather than the post-pull-up values for
late-arriving prior-epoch blocks. The store ends up one epoch boundary
behind what the spec would have.

## Fix

Capture the post-pull-up values into local variables right after
\`ProcessJustificationBitsAndFinality\` and use them consistently in:

1. \`unrealizedJustifications.Store\` /
\`unrealizedFinalizations.Store\` (already pull-up, no behavior change)
2. \`updateUnrealizedCheckpoints\` (already pull-up, no behavior change)
3. \`updateCheckpoints\` in the \`blockEpoch < currentEpoch\` branch
(**this is the actual fix**)

## Test plan

- [x] Build clean
- [x] Runtime: validated alongside #21310 on bloatnet at chain tip — 23+
FCUs all at \`lagSlots=0\`, no large unwinds, no filter rejects after
this change is applied
- [ ] CI green

Refs: #21301, #21310

## Already validated on main

This same change (commit `3cfb67d6ef`) is included in #21327
(`cp/21310-to-main` → `main`), where **all CI checks are passing**:
lint, unit tests, race tests, integration tests, kurtosis,
mainnet-rpc-integ, hive, reproducible build across linux/mac/windows.
See #21327.

So this PR is the same patch already exercised by CI on the main-branch
flavor of #21310 — applying it to `performance` for parity.
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 28, 2026
…mit (erigontech#21444)

## Summary

`ExecModule.UpdateForkChoice` spawns the forkchoice work in a goroutine
and waited on a `done` channel that closes only after that goroutine
**fully returns** — i.e. after flush + commit + prune. For a
merge-extending fork at tip the forkchoice result is already sent on
`outcomeCh` *before* the commit (`ParallelStateFlushing` is on by
default), so the `<-done` wait threw that early-send away and blocked
the consensus client for the entire EL commit.

This removes the `<-done` wait: `UpdateForkChoice` returns as soon as
the result lands on `outcomeCh`.

## Live-tip A/B (this branch vs main)

Probe at the `ExecutionClientDirect.ForkChoiceUpdate` boundary (what
Caplin's clstages loop blocks on), live mainnet tip, same
datadir/machine/instrumentation:

| FCU-response wall (ms) | n | min | p50 | p90 | max | mean |
|------------------------|--:|----:|----:|----:|----:|-----:|
| **main** | 100 | 96.7 | **260.2** | 386.3 | 12160.8 | 390.1 |
| **this branch** | 82 | 0.7 | **1.2** | 3.0 | 4.3 | 1.7 |

~217× lower p50; distributions don't overlap (branch max 4.3 ms ≪ main
min 96.7 ms). This matches release/3.4's ~1.4 ms at the same boundary.
The `Timings: Forkchoice commit=…ms` line is unchanged — the commit
still costs the same, it just no longer blocks the consensus loop.
safe/finalized cadence is identical between branch and main, and wall
does not correlate with safe/finalized changes, so this is not about
forkchoice args.

Both A/B binaries are post-erigontech#21327, so the comparison isolates only the
`<-done` change — the gap is not confounded by the erigontech#21008/erigontech#20035
regression (both sides already have its fix).

## Scope / relationship to erigontech#21008

This is an **independent FCU-response-latency improvement**, not the fix
for the erigontech#21008 in-sync regression. The bisection attributes erigontech#21008 to
erigontech#20035, which is entirely CL-side (`cl/phase1/forkchoice/*`) and touches
no `execution/execmodule/` code — its `getFilterBlockTree`
spec-deviation regressed `GetHead` at epoch boundaries and is itself
already fixed on main by erigontech#21310/erigontech#21327. The `<-done` wait predates the
bisection's good commit `19f44b15cd` (it was added in erigontech#19981, and that
commit measured 100% in-sync), so the EL-side blocking this PR removes
is **not** what caused the 100→3% collapse. It is a real, separate cost
worth removing on its own merits: at tip it returns the consensus loop
to ~1 ms instead of ~260 ms (and avoids multi-second blocks when the
commit overlaps a background merge/prune burst).

## Why it's safe

The `<-done` was added (erigontech#19981) so a follow-up `AssembleBlock` could
safely acquire the EL semaphore. That guarantee is preserved without it:

- In `updateForkChoice`, the `defer e.semaphore.Release(1)` is
registered **first**, so it runs **last** — after every cleanup defer
(`ResetPendingUpdates`, `ClearWithUnwind`, overlay `Close`). Any op that
subsequently acquires the semaphore necessarily observes fully-settled
state.
- `ExecutionClientDirect.ForkChoiceUpdate` already retries
`AssembleBlock` (30× 200 ms) on semaphore contention, so a proposer FCU
racing an in-flight commit resolves itself.
- release/3.4 has no `<-done` and is the proven-good baseline.
- Non-merge FCUs send their result at the end of `updateForkChoice`
anyway, so their timing is unchanged.

## Test plan

- [x] `execution/engineapi` `TestEngineApi*` block-production suite (the
FCU→AssembleBlock path `<-done` guarded)
- [x] `execution/execmodule/...`
- [x] `cl/beacon/handler` `TestCaplinBlockProduction*`
- [x] `make lint` clean
- [x] Live-tip A/B (this branch vs main): FCU-response wall p50 260 ms →
1.2 ms, 0 gap errors over an epoch

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants