Skip to content

Add automated tests for block production with execution requests#19981

Merged
yperbasis merged 6 commits into
mainfrom
yperbasis/caplin_block_production
Mar 20, 2026
Merged

Add automated tests for block production with execution requests#19981
yperbasis merged 6 commits into
mainfrom
yperbasis/caplin_block_production

Conversation

@yperbasis

@yperbasis yperbasis commented Mar 18, 2026

Copy link
Copy Markdown
Member

Summary

  • Enhance TestEngineApiBuiltBlockWithWithdrawalRequest to verify execution request contents (source address, validator pubkey, amount), not just presence
  • Add TestAssembleBlockWithWithdrawalRequest exercising ChainReaderWriterEth1.GetAssembledBlock — the interface Caplin uses in production (the code path fixed by PR Include execution requests in produced block #14326)
  • Add TestCaplinBlockProductionWithWithdrawalRequest where real Caplin code (produceBeaconBody) drives block production against a real Erigon EL: submits a withdrawal request to the EIP-7002 system contract, then Caplin calls ForkChoiceUpdateGetAssembledBlock → decodes execution requests into the beacon body

All three tests use real EL code (no mocks for the execution layer). The Caplin-driven test is verified to fail when the PR #14326 fix is reverted.

Closes #14337

🤖 Generated with Claude Code

yperbasis and others added 2 commits March 18, 2026 14:02
…requests (#14337)

Verify that EIP-7685 execution requests (withdrawal requests via the
EIP-7002 system contract) survive the full block production round-trip:
build → assemble → validate → extend chain.

- Enhance TestEngineApiBuiltBlockWithWithdrawalRequest to verify request
  contents (source address, validator pubkey, amount), not just presence.
- Add TestAssembleBlockWithWithdrawalRequest that exercises
  ChainReaderWriterEth1.GetAssembledBlock — the interface Caplin uses in
  production, and the code path fixed by PR #14326 for issue #14319.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test produceBeaconBody against a real Erigon execution layer: submit a
withdrawal request to the EIP-7002 system contract, then have Caplin's
actual block production code call ForkChoiceUpdate, GetAssembledBlock,
and decode execution requests into the beacon body. Verified to fail
when the PR #14326 fix is reverted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
yperbasis and others added 3 commits March 18, 2026 14:31
UpdateForkChoice releases the execution semaphore via defer, so there
is a small window where AssembleBlock (called immediately after FCU
returns) can observe Busy before the defer runs. This race caused
TestCaplinBlockProductionWithWithdrawalRequest to fail with
"execution data is still syncing".

Retry up to 10 times (500ms total) before giving up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis force-pushed the yperbasis/caplin_block_production branch from 559ef6c to 4ac6a66 Compare March 19, 2026 11:00
…oice

updateForkChoice sends its result to a buffered channel, then runs
defers that clean up shared state (ResetPendingUpdates, ClearWithUnwind)
and release the execution semaphore. The caller can read from the channel
and call AssembleBlock before those defers complete — racing on the
semaphore and on shared state like currentContext.

Releasing the semaphore before sending doesn't help: the defers that
access shared state still run after the release, creating data races
with the next semaphore holder.

Fix: add a done channel closed when the goroutine fully returns
(including all defers). After receiving a non-Busy result,
UpdateForkChoice waits on done before returning. The timeout path
still returns Busy without waiting, preserving the async behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis force-pushed the yperbasis/caplin_block_production branch from 4ac6a66 to 903fb3a Compare March 19, 2026 11:16
@yperbasis yperbasis enabled auto-merge March 19, 2026 13:22
@yperbasis yperbasis added this pull request to the merge queue Mar 20, 2026
Merged via the queue into main with commit 20b362b Mar 20, 2026
35 checks passed
@yperbasis yperbasis deleted the yperbasis/caplin_block_production branch March 20, 2026 09:21
github-merge-queue Bot pushed a commit that referenced this pull request Mar 20, 2026
## Summary
- Track goroutines spawned by `ConnectCore`/`ConnectSentries` with a
`WaitGroup`; `TxPool.Run` defers `Wait()` so it blocks until they all
exit
- Replace bare `time.Sleep` calls in retry loops with `select` on
`ctx.Done()` so goroutines exit promptly on cancellation instead of
sleeping through a 3-second backoff

These goroutines were previously fire-and-forget: after context
cancellation, `Run()` would return via the errgroup while the fetch
goroutines were still in retry sleeps or blocking on streams. Downstream
cleanup (`DB.Close()`, etc.) could then race with them.

Found while investigating flaky
`TestCaplinBlockProductionWithWithdrawalRequest` in #19981.

## Test plan
- [x] `go test -race ./txnprovider/txpool/` passes
- [x] `go test -race -count=3 ./cl/beacon/handler/ -run
TestCaplinBlockProductionWithWithdrawalRequest` passes without goroutine
leak

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 28, 2026
…mit (erigontech#21444)

## Summary

`ExecModule.UpdateForkChoice` spawns the forkchoice work in a goroutine
and waited on a `done` channel that closes only after that goroutine
**fully returns** — i.e. after flush + commit + prune. For a
merge-extending fork at tip the forkchoice result is already sent on
`outcomeCh` *before* the commit (`ParallelStateFlushing` is on by
default), so the `<-done` wait threw that early-send away and blocked
the consensus client for the entire EL commit.

This removes the `<-done` wait: `UpdateForkChoice` returns as soon as
the result lands on `outcomeCh`.

## Live-tip A/B (this branch vs main)

Probe at the `ExecutionClientDirect.ForkChoiceUpdate` boundary (what
Caplin's clstages loop blocks on), live mainnet tip, same
datadir/machine/instrumentation:

| FCU-response wall (ms) | n | min | p50 | p90 | max | mean |
|------------------------|--:|----:|----:|----:|----:|-----:|
| **main** | 100 | 96.7 | **260.2** | 386.3 | 12160.8 | 390.1 |
| **this branch** | 82 | 0.7 | **1.2** | 3.0 | 4.3 | 1.7 |

~217× lower p50; distributions don't overlap (branch max 4.3 ms ≪ main
min 96.7 ms). This matches release/3.4's ~1.4 ms at the same boundary.
The `Timings: Forkchoice commit=…ms` line is unchanged — the commit
still costs the same, it just no longer blocks the consensus loop.
safe/finalized cadence is identical between branch and main, and wall
does not correlate with safe/finalized changes, so this is not about
forkchoice args.

Both A/B binaries are post-erigontech#21327, so the comparison isolates only the
`<-done` change — the gap is not confounded by the erigontech#21008/erigontech#20035
regression (both sides already have its fix).

## Scope / relationship to erigontech#21008

This is an **independent FCU-response-latency improvement**, not the fix
for the erigontech#21008 in-sync regression. The bisection attributes erigontech#21008 to
erigontech#20035, which is entirely CL-side (`cl/phase1/forkchoice/*`) and touches
no `execution/execmodule/` code — its `getFilterBlockTree`
spec-deviation regressed `GetHead` at epoch boundaries and is itself
already fixed on main by erigontech#21310/erigontech#21327. The `<-done` wait predates the
bisection's good commit `19f44b15cd` (it was added in erigontech#19981, and that
commit measured 100% in-sync), so the EL-side blocking this PR removes
is **not** what caused the 100→3% collapse. It is a real, separate cost
worth removing on its own merits: at tip it returns the consensus loop
to ~1 ms instead of ~260 ms (and avoids multi-second blocks when the
commit overlaps a background merge/prune burst).

## Why it's safe

The `<-done` was added (erigontech#19981) so a follow-up `AssembleBlock` could
safely acquire the EL semaphore. That guarantee is preserved without it:

- In `updateForkChoice`, the `defer e.semaphore.Release(1)` is
registered **first**, so it runs **last** — after every cleanup defer
(`ResetPendingUpdates`, `ClearWithUnwind`, overlay `Close`). Any op that
subsequently acquires the semaphore necessarily observes fully-settled
state.
- `ExecutionClientDirect.ForkChoiceUpdate` already retries
`AssembleBlock` (30× 200 ms) on semaphore contention, so a proposer FCU
racing an in-flight commit resolves itself.
- release/3.4 has no `<-done` and is the proven-good baseline.
- Non-merge FCUs send their result at the end of `updateForkChoice`
anyway, so their timing is unchanged.

## Test plan

- [x] `execution/engineapi` `TestEngineApi*` block-production suite (the
FCU→AssembleBlock path `<-done` guarded)
- [x] `execution/execmodule/...`
- [x] `cl/beacon/handler` `TestCaplinBlockProduction*`
- [x] `make lint` clean
- [x] Live-tip A/B (this branch vs main): FCU-response wall p50 260 ms →
1.2 ms, 0 gap errors over an epoch

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Automated tests of block production with Caplin

2 participants