Skip to content

execution/execmodule: return FCU result without waiting for flush+commit#21444

Merged
yperbasis merged 2 commits into
mainfrom
yperbasis/fcu-no-done-wait
May 28, 2026
Merged

execution/execmodule: return FCU result without waiting for flush+commit#21444
yperbasis merged 2 commits into
mainfrom
yperbasis/fcu-no-done-wait

Conversation

@yperbasis

@yperbasis yperbasis commented May 27, 2026

Copy link
Copy Markdown
Member

Summary

ExecModule.UpdateForkChoice spawns the forkchoice work in a goroutine and waited on a done channel that closes only after that goroutine fully returns — i.e. after flush + commit + prune. For a merge-extending fork at tip the forkchoice result is already sent on outcomeCh before the commit (ParallelStateFlushing is on by default), so the <-done wait threw that early-send away and blocked the consensus client for the entire EL commit.

This removes the <-done wait: UpdateForkChoice returns as soon as the result lands on outcomeCh.

Live-tip A/B (this branch vs main)

Probe at the ExecutionClientDirect.ForkChoiceUpdate boundary (what Caplin's clstages loop blocks on), live mainnet tip, same datadir/machine/instrumentation:

FCU-response wall (ms) n min p50 p90 max mean
main 100 96.7 260.2 386.3 12160.8 390.1
this branch 82 0.7 1.2 3.0 4.3 1.7

~217× lower p50; distributions don't overlap (branch max 4.3 ms ≪ main min 96.7 ms). This matches release/3.4's ~1.4 ms at the same boundary. The Timings: Forkchoice commit=…ms line is unchanged — the commit still costs the same, it just no longer blocks the consensus loop. safe/finalized cadence is identical between branch and main, and wall does not correlate with safe/finalized changes, so this is not about forkchoice args.

Both A/B binaries are post-#21327, so the comparison isolates only the <-done change — the gap is not confounded by the #21008/#20035 regression (both sides already have its fix).

Scope / relationship to #21008

This is an independent FCU-response-latency improvement, not the fix for the #21008 in-sync regression. The bisection attributes #21008 to #20035, which is entirely CL-side (cl/phase1/forkchoice/*) and touches no execution/execmodule/ code — its getFilterBlockTree spec-deviation regressed GetHead at epoch boundaries and is itself already fixed on main by #21310/#21327. The <-done wait predates the bisection's good commit 19f44b15cd (it was added in #19981, and that commit measured 100% in-sync), so the EL-side blocking this PR removes is not what caused the 100→3% collapse. It is a real, separate cost worth removing on its own merits: at tip it returns the consensus loop to ~1 ms instead of ~260 ms (and avoids multi-second blocks when the commit overlaps a background merge/prune burst).

Why it's safe

The <-done was added (#19981) so a follow-up AssembleBlock could safely acquire the EL semaphore. That guarantee is preserved without it:

  • In updateForkChoice, the defer e.semaphore.Release(1) is registered first, so it runs last — after every cleanup defer (ResetPendingUpdates, ClearWithUnwind, overlay Close). Any op that subsequently acquires the semaphore necessarily observes fully-settled state.
  • ExecutionClientDirect.ForkChoiceUpdate already retries AssembleBlock (30× 200 ms) on semaphore contention, so a proposer FCU racing an in-flight commit resolves itself.
  • release/3.4 has no <-done and is the proven-good baseline.
  • Non-merge FCUs send their result at the end of updateForkChoice anyway, so their timing is unchanged.

Test plan

  • execution/engineapi TestEngineApi* block-production suite (the FCU→AssembleBlock path <-done guarded)
  • execution/execmodule/...
  • cl/beacon/handler TestCaplinBlockProduction*
  • make lint clean
  • Live-tip A/B (this branch vs main): FCU-response wall p50 260 ms → 1.2 ms, 0 gap errors over an epoch

🤖 Generated with Claude Code

UpdateForkChoice waited on a `done` channel that closes only after the
forkchoice goroutine fully returns — i.e. after flush+commit+prune. For a
merge-extending fork at tip the result is already sent on outcomeCh before
the commit (ParallelStateFlushing), so this wait threw away that early-send
and blocked the consensus client for the whole EL commit (~260 ms at tip,
seconds under background contention). release/3.4 returns on the early-send
and is ~1 ms at the same boundary.

Return as soon as the result lands on outcomeCh. The semaphore is released
last (its defer is registered first, so it runs after all cleanup defers),
so any follow-up op that acquires the semaphore — AssembleBlock (which also
has a retry loop on contention) or the next FCU — observes fully-settled
state. Non-merge FCUs send their result at the end anyway, so their timing
is unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Reduces Engine API ForkChoiceUpdated response latency by making ExecModule.UpdateForkChoice return immediately once updateForkChoice publishes the forkchoice outcome, instead of waiting for the forkchoice goroutine to fully finish (flush/commit/prune).

Changes:

  • Remove the post-outcome done synchronization so FCU responses aren’t blocked by the EL flush+commit tail.
  • Update inline comments to document the new “return-on-outcome” behavior and the semaphore/cleanup ordering expectations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@yperbasis yperbasis marked this pull request as ready for review May 27, 2026 10:31
@yperbasis yperbasis requested a review from mh0lt as a code owner May 27, 2026 10:31
@yperbasis yperbasis requested review from Copilot and taratorio May 27, 2026 10:31

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

@yperbasis yperbasis enabled auto-merge May 27, 2026 10:36
@yperbasis yperbasis changed the title execution/execmodule: don't block FCU response on flush+commit execution/execmodule: return FCU result without waiting for flush+commit May 27, 2026
@yperbasis yperbasis added this to the 3.5.0 milestone May 27, 2026
@yperbasis yperbasis requested a review from AskAlexSharov May 28, 2026 09:06
@yperbasis yperbasis added this pull request to the merge queue May 28, 2026
Merged via the queue into main with commit 219e78e May 28, 2026
91 checks passed
@yperbasis yperbasis deleted the yperbasis/fcu-no-done-wait branch May 28, 2026 16:45
taratorio referenced this pull request May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants