execution/execmodule: return FCU result without waiting for flush+commit#21444
Merged
Conversation
UpdateForkChoice waited on a `done` channel that closes only after the forkchoice goroutine fully returns — i.e. after flush+commit+prune. For a merge-extending fork at tip the result is already sent on outcomeCh before the commit (ParallelStateFlushing), so this wait threw away that early-send and blocked the consensus client for the whole EL commit (~260 ms at tip, seconds under background contention). release/3.4 returns on the early-send and is ~1 ms at the same boundary. Return as soon as the result lands on outcomeCh. The semaphore is released last (its defer is registered first, so it runs after all cleanup defers), so any follow-up op that acquires the semaphore — AssembleBlock (which also has a retry loop on contention) or the next FCU — observes fully-settled state. Non-merge FCUs send their result at the end anyway, so their timing is unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Reduces Engine API ForkChoiceUpdated response latency by making ExecModule.UpdateForkChoice return immediately once updateForkChoice publishes the forkchoice outcome, instead of waiting for the forkchoice goroutine to fully finish (flush/commit/prune).
Changes:
- Remove the post-outcome
donesynchronization so FCU responses aren’t blocked by the EL flush+commit tail. - Update inline comments to document the new “return-on-outcome” behavior and the semaphore/cleanup ordering expectations.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
AskAlexSharov
approved these changes
May 28, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ExecModule.UpdateForkChoicespawns the forkchoice work in a goroutine and waited on adonechannel that closes only after that goroutine fully returns — i.e. after flush + commit + prune. For a merge-extending fork at tip the forkchoice result is already sent onoutcomeChbefore the commit (ParallelStateFlushingis on by default), so the<-donewait threw that early-send away and blocked the consensus client for the entire EL commit.This removes the
<-donewait:UpdateForkChoicereturns as soon as the result lands onoutcomeCh.Live-tip A/B (this branch vs main)
Probe at the
ExecutionClientDirect.ForkChoiceUpdateboundary (what Caplin's clstages loop blocks on), live mainnet tip, same datadir/machine/instrumentation:~217× lower p50; distributions don't overlap (branch max 4.3 ms ≪ main min 96.7 ms). This matches release/3.4's ~1.4 ms at the same boundary. The
Timings: Forkchoice commit=…msline is unchanged — the commit still costs the same, it just no longer blocks the consensus loop. safe/finalized cadence is identical between branch and main, and wall does not correlate with safe/finalized changes, so this is not about forkchoice args.Both A/B binaries are post-#21327, so the comparison isolates only the
<-donechange — the gap is not confounded by the #21008/#20035 regression (both sides already have its fix).Scope / relationship to #21008
This is an independent FCU-response-latency improvement, not the fix for the #21008 in-sync regression. The bisection attributes #21008 to #20035, which is entirely CL-side (
cl/phase1/forkchoice/*) and touches noexecution/execmodule/code — itsgetFilterBlockTreespec-deviation regressedGetHeadat epoch boundaries and is itself already fixed on main by #21310/#21327. The<-donewait predates the bisection's good commit19f44b15cd(it was added in #19981, and that commit measured 100% in-sync), so the EL-side blocking this PR removes is not what caused the 100→3% collapse. It is a real, separate cost worth removing on its own merits: at tip it returns the consensus loop to ~1 ms instead of ~260 ms (and avoids multi-second blocks when the commit overlaps a background merge/prune burst).Why it's safe
The
<-donewas added (#19981) so a follow-upAssembleBlockcould safely acquire the EL semaphore. That guarantee is preserved without it:updateForkChoice, thedefer e.semaphore.Release(1)is registered first, so it runs last — after every cleanup defer (ResetPendingUpdates,ClearWithUnwind, overlayClose). Any op that subsequently acquires the semaphore necessarily observes fully-settled state.ExecutionClientDirect.ForkChoiceUpdatealready retriesAssembleBlock(30× 200 ms) on semaphore contention, so a proposer FCU racing an in-flight commit resolves itself.<-doneand is the proven-good baseline.updateForkChoiceanyway, so their timing is unchanged.Test plan
execution/engineapiTestEngineApi*block-production suite (the FCU→AssembleBlock path<-doneguarded)execution/execmodule/...cl/beacon/handlerTestCaplinBlockProduction*make lintclean🤖 Generated with Claude Code