Add automated tests for block production with execution requests#19981
Merged
Conversation
…requests (#14337) Verify that EIP-7685 execution requests (withdrawal requests via the EIP-7002 system contract) survive the full block production round-trip: build → assemble → validate → extend chain. - Enhance TestEngineApiBuiltBlockWithWithdrawalRequest to verify request contents (source address, validator pubkey, amount), not just presence. - Add TestAssembleBlockWithWithdrawalRequest that exercises ChainReaderWriterEth1.GetAssembledBlock — the interface Caplin uses in production, and the code path fixed by PR #14326 for issue #14319. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test produceBeaconBody against a real Erigon execution layer: submit a withdrawal request to the EIP-7002 system contract, then have Caplin's actual block production code call ForkChoiceUpdate, GetAssembledBlock, and decode execution requests into the beacon body. Verified to fail when the PR #14326 fix is reverted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UpdateForkChoice releases the execution semaphore via defer, so there is a small window where AssembleBlock (called immediately after FCU returns) can observe Busy before the defer runs. This race caused TestCaplinBlockProductionWithWithdrawalRequest to fail with "execution data is still syncing". Retry up to 10 times (500ms total) before giving up. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
559ef6c to
4ac6a66
Compare
2 tasks
…oice updateForkChoice sends its result to a buffered channel, then runs defers that clean up shared state (ResetPendingUpdates, ClearWithUnwind) and release the execution semaphore. The caller can read from the channel and call AssembleBlock before those defers complete — racing on the semaphore and on shared state like currentContext. Releasing the semaphore before sending doesn't help: the defers that access shared state still run after the release, creating data races with the next semaphore holder. Fix: add a done channel closed when the goroutine fully returns (including all defers). After receiving a non-Busy result, UpdateForkChoice waits on done before returning. The timeout path still returns Busy without waiting, preserving the async behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
4ac6a66 to
903fb3a
Compare
domiwei
approved these changes
Mar 20, 2026
github-merge-queue Bot
pushed a commit
that referenced
this pull request
Mar 20, 2026
## Summary - Track goroutines spawned by `ConnectCore`/`ConnectSentries` with a `WaitGroup`; `TxPool.Run` defers `Wait()` so it blocks until they all exit - Replace bare `time.Sleep` calls in retry loops with `select` on `ctx.Done()` so goroutines exit promptly on cancellation instead of sleeping through a 3-second backoff These goroutines were previously fire-and-forget: after context cancellation, `Run()` would return via the errgroup while the fetch goroutines were still in retry sleeps or blocking on streams. Downstream cleanup (`DB.Close()`, etc.) could then race with them. Found while investigating flaky `TestCaplinBlockProductionWithWithdrawalRequest` in #19981. ## Test plan - [x] `go test -race ./txnprovider/txpool/` passes - [x] `go test -race -count=3 ./cl/beacon/handler/ -run TestCaplinBlockProductionWithWithdrawalRequest` passes without goroutine leak 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
This was referenced May 26, 2026
pull Bot
pushed a commit
to Dustin4444/erigon
that referenced
this pull request
May 28, 2026
…mit (erigontech#21444) ## Summary `ExecModule.UpdateForkChoice` spawns the forkchoice work in a goroutine and waited on a `done` channel that closes only after that goroutine **fully returns** — i.e. after flush + commit + prune. For a merge-extending fork at tip the forkchoice result is already sent on `outcomeCh` *before* the commit (`ParallelStateFlushing` is on by default), so the `<-done` wait threw that early-send away and blocked the consensus client for the entire EL commit. This removes the `<-done` wait: `UpdateForkChoice` returns as soon as the result lands on `outcomeCh`. ## Live-tip A/B (this branch vs main) Probe at the `ExecutionClientDirect.ForkChoiceUpdate` boundary (what Caplin's clstages loop blocks on), live mainnet tip, same datadir/machine/instrumentation: | FCU-response wall (ms) | n | min | p50 | p90 | max | mean | |------------------------|--:|----:|----:|----:|----:|-----:| | **main** | 100 | 96.7 | **260.2** | 386.3 | 12160.8 | 390.1 | | **this branch** | 82 | 0.7 | **1.2** | 3.0 | 4.3 | 1.7 | ~217× lower p50; distributions don't overlap (branch max 4.3 ms ≪ main min 96.7 ms). This matches release/3.4's ~1.4 ms at the same boundary. The `Timings: Forkchoice commit=…ms` line is unchanged — the commit still costs the same, it just no longer blocks the consensus loop. safe/finalized cadence is identical between branch and main, and wall does not correlate with safe/finalized changes, so this is not about forkchoice args. Both A/B binaries are post-erigontech#21327, so the comparison isolates only the `<-done` change — the gap is not confounded by the erigontech#21008/erigontech#20035 regression (both sides already have its fix). ## Scope / relationship to erigontech#21008 This is an **independent FCU-response-latency improvement**, not the fix for the erigontech#21008 in-sync regression. The bisection attributes erigontech#21008 to erigontech#20035, which is entirely CL-side (`cl/phase1/forkchoice/*`) and touches no `execution/execmodule/` code — its `getFilterBlockTree` spec-deviation regressed `GetHead` at epoch boundaries and is itself already fixed on main by erigontech#21310/erigontech#21327. The `<-done` wait predates the bisection's good commit `19f44b15cd` (it was added in erigontech#19981, and that commit measured 100% in-sync), so the EL-side blocking this PR removes is **not** what caused the 100→3% collapse. It is a real, separate cost worth removing on its own merits: at tip it returns the consensus loop to ~1 ms instead of ~260 ms (and avoids multi-second blocks when the commit overlaps a background merge/prune burst). ## Why it's safe The `<-done` was added (erigontech#19981) so a follow-up `AssembleBlock` could safely acquire the EL semaphore. That guarantee is preserved without it: - In `updateForkChoice`, the `defer e.semaphore.Release(1)` is registered **first**, so it runs **last** — after every cleanup defer (`ResetPendingUpdates`, `ClearWithUnwind`, overlay `Close`). Any op that subsequently acquires the semaphore necessarily observes fully-settled state. - `ExecutionClientDirect.ForkChoiceUpdate` already retries `AssembleBlock` (30× 200 ms) on semaphore contention, so a proposer FCU racing an in-flight commit resolves itself. - release/3.4 has no `<-done` and is the proven-good baseline. - Non-merge FCUs send their result at the end of `updateForkChoice` anyway, so their timing is unchanged. ## Test plan - [x] `execution/engineapi` `TestEngineApi*` block-production suite (the FCU→AssembleBlock path `<-done` guarded) - [x] `execution/execmodule/...` - [x] `cl/beacon/handler` `TestCaplinBlockProduction*` - [x] `make lint` clean - [x] Live-tip A/B (this branch vs main): FCU-response wall p50 260 ms → 1.2 ms, 0 gap errors over an epoch 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
TestEngineApiBuiltBlockWithWithdrawalRequestto verify execution request contents (source address, validator pubkey, amount), not just presenceTestAssembleBlockWithWithdrawalRequestexercisingChainReaderWriterEth1.GetAssembledBlock— the interface Caplin uses in production (the code path fixed by PR Include execution requests in produced block #14326)TestCaplinBlockProductionWithWithdrawalRequestwhere real Caplin code (produceBeaconBody) drives block production against a real Erigon EL: submits a withdrawal request to the EIP-7002 system contract, then Caplin callsForkChoiceUpdate→GetAssembledBlock→ decodes execution requests into the beacon bodyAll three tests use real EL code (no mocks for the execution layer). The Caplin-driven test is verified to fail when the PR #14326 fix is reverted.
Closes #14337
🤖 Generated with Claude Code