caplin: unified Engine API client for standalone mode#20035
Conversation
df6b186 to
6ef537d
Compare
Caplin Kurtosis Suite — blocked on upstreamThe caplin-assertoor config ( What needs to happen upstream
Once upstream is readyRe-enable the suite in - suite: caplin
package_args: .github/workflows/kurtosis/caplin-assertoor.io
docker_binaries: "erigon caplin"
ethereum_package_branch: "<version-with-caplin-support>"And restore the |
|
Upstream issue for caplin launcher support: ethpandaops/ethereum-package#1337 |
There was a problem hiding this comment.
Bug: IsCanonicalHash checks existence, not canonicality (remote mode)
Severity: High
func (cc *ExecutionClientEngine) IsCanonicalHash(ctx context.Context, hash common.Hash) (bool, error) {
...
if err := cc.rpcClient.CallContext(ctx, &header, "eth_getBlockByHash", hash, false); err != nil {
...
return header != nil, nil
}
eth_getBlockByHash returns any block with that hash, including uncles and stale/reorged blocks. The implementation in rpc/jsonrpc/eth_block.go calls blockByHashWithSenders without a canonicality check. So this
returns true for non-canonical blocks.
This is used in engine_server.go:getQuickPayloadStatusIfPossible to decide if a block is already validated. Incorrect canonicality could lead to wrong VALID responses for non-canonical blocks.
Fix: Either use eth_getBlockByNumber to read the canonical block at the same height and compare hashes, or add a dedicated eth_isCanonical check, or use the block number from the returned header to fetch the
canonical hash at that height.
Bug: ForkChoiceUpdate hardcodes ForkchoiceUpdatedV3
Severity: High
func (cc *ExecutionClientEngine) ForkChoiceUpdate(...) ([]byte, error) {
...
resp, err := cc.engine.ForkchoiceUpdatedV3(ctx, forkChoiceState, attributes)
The Engine API spec requires version-specific FCU calls:
- V1 for Bellatrix
- V2 for Capella
- V3 for Deneb+
EngineServer.forkchoiceUpdated validates version against the timestamp (engine_server.go:714-727) and returns UnsupportedForkError if V3 is used with pre-Cancun blocks. In remote mode against a third-party EL,
this would also fail.
The PR already does correct version selection for NewPayload (switch on payload.Version()). ForkChoiceUpdate needs the same treatment — the caller needs to pass or derive the fork version. Currently the method
signature has no version parameter, so this requires an interface change or deriving the version from the head block's timestamp.
Memory leak: unrealizedJustifications / unrealizedFinalizations never cleaned up
Severity: High
// In ForkChoiceStore:
unrealizedJustifications sync.Map // blockRoot -> solid.Checkpoint
unrealizedFinalizations sync.Map // blockRoot -> solid.Checkpoint
These are populated in OnBlock with every new block but have no deletion path. The existing sync.Map fields (childrens, checkpointStates) are cleaned in onNewFinalized() (in forkchoice/utils.go:62-78) when
blocks get finalized. These new maps need the same treatment.
Growth: ~350 bytes/entry × 7,200 blocks/day = ~2.5 MB/day, ~900 MB/year.
Fix: Add cleanup in onNewFinalized() that deletes entries for blocks at or before the finalized epoch's start slot, matching the pattern used for childrens.
Bug: EngineAPIRPCClient missing ForkchoiceUpdatedV4
Severity: Medium (compile error if the interface requires it)
The EngineAPI interface in execution/engineapi/interface.go may require ForkchoiceUpdatedV4. The EngineAPIRPCClient only implements V1-V3. If the interface was updated to include V4 (for Fulu/Osaka), this would
be a compile error. Verify against the current interface definition.
Correctness: getQuickPayloadStatusIfPossible behavior change
Severity: Medium
// Old:
if currentHeader != nil && blockHash != currentHeader.Hash() && header != nil && isCanonical {
// New:
if newPayload && currentHeader != nil && header != nil && isCanonical {
The old code had blockHash != currentHeader.Hash() to avoid short-circuiting when FCU points to the current head (it needs to go through HandleForkChoice to call writeForkChoiceHashes). The new code uses
newPayload flag instead, which is cleaner. But verify: for the FCU case, the old code would still return VALID early if blockHash == currentHeader.Hash() was false — now it never returns early for FCU. This
means FCU always goes through the full pipeline even for already-canonical non-head blocks. This is correct but slower.
Also, the re-fetch of parent inside waitForResponse:
if parent == nil {
parent = s.chainRW.GetHeaderByHash(ctx, parentHash)
}
This fixes a real bug where a stale parent captured before an uncommitted write tx would cause false SYNCING responses. Good fix.
Design: HasBlock remote mode also uses eth_getBlockByHash
Severity: Low (same root issue as IsCanonicalHash)
func (cc *ExecutionClientEngine) HasBlock(ctx context.Context, hash common.Hash) (bool, error) {
...
return header != nil, nil
}
HasBlock checking existence via eth_getBlockByHash is semantically correct here (it asks "do you have this block?", not "is it canonical?"). This is fine.
Design: writeGenesisBeaconBlock — well-reasoned but fragile
Severity: Low
The genesis beacon block reconstruction from state is carefully documented and the assumption (empty txs/withdrawals → HashSSZ() == HeaderHashSSZ()) is correct for genesis. However, the comment notes this
invariant explicitly, which is good.
One concern: if a non-standard genesis (e.g., a devnet with genesis transactions) breaks this assumption, body_root would mismatch silently. Consider adding a verification:
bodyRoot, _ := body.HashSSZ()
if bodyRoot != header.BodyRoot {
return fmt.Errorf("genesis body root mismatch: computed %x, expected %x", bodyRoot, header.BodyRoot)
}
Correctness: Eth1Block.MarshalJSON version-gating
Severity: Low (good fix)
The old code always emitted blob_gas_used and excess_blob_gas (as "0") for pre-Deneb blocks. The new code correctly version-gates:
- Bellatrix: base fields only
- Capella: + withdrawals
- Deneb+: + withdrawals + blob gas fields
This matches the Beacon API spec. The test fixture update (block_1.json) follows correctly.
Nit: Typo in engine_api_rpc_client.go
isHTTPpecified := strings.HasPrefix(addr, "http") // typo: missing 'S'
isHTTPSpecified := strings.HasPrefix(addr, "https")
The variable isHTTPpecified should be isHTTPSpecified (or better: hasScheme). Also, the logic is subtly wrong — strings.HasPrefix(addr, "http") matches both http:// and https://, so isHTTPpecified is always
true when isHTTPSpecified is true. The protocol selection still works correctly due to the else if !isHTTPpecified guard, but the names are confusing.
Nit: Duplicate comments in get_head.go
Lines 3506-3508:
// Use per-block unrealized justifications (spec: store.unrealized_justifications[block_root])
// Fall back to realized checkpoints if unrealized not available
// Use per-block unrealized justifications (spec: store.unrealized_justifications[block_root])
// Fall back to realized checkpoints if unrealized not available
The comment block is duplicated.
Nit: getAssembledBlockV4 and V5 are identical
Both methods have exactly the same body — they both parse ExecutionRequests the same way. They could share a helper, or V5 could delegate to V4 (assuming the response format is the same for both API versions).
Summary
│ Issue │ Severity │ Type │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ IsCanonicalHash checks existence not canonicality │ High │ Bug │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ ForkChoiceUpdate hardcodes V3 │ High │ Bug │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ unrealizedJustifications/unrealizedFinalizations memory leak │ High │ Memory leak │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ EngineAPIRPCClient possibly missing ForkchoiceUpdatedV4 │ Medium │ Compile error │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ getQuickPayloadStatusIfPossible behavior change │ Medium │ Verify │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ Genesis beacon block body root not verified │ Low │ Robustness │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ Eth1Block.MarshalJSON version-gating │ Low │ Good fix │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ URL parsing typo / confusing variable names │ Nit │ Style │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ Duplicate comments in get_head.go │ Nit │ Style │
├──────────────────────────────────────────────────────────────┼──────────┼───────────────┤
│ getAssembledBlockV4/V5 duplicate bodies │ Nit │ DRY │
└──────────────────────────────────────────────────────────────┴──────────┴───────────────┘
The three high-severity issues (IsCanonicalHash semantics, FCU version hardcoding, sync.Map leak) should be addressed before merge. The architectural approach — unified ExecutionClientEngine with local/remote
modes, using EngineAPI as the transport boundary — is well-designed and a clear improvement over the previous panicking ExecutionClientRPC.
Replace the broken ExecutionClientRPC (8 panicking methods, HTTP overhead for internal use) with a unified ExecutionClientEngine that works in two modes: - Local: calls EngineServer methods directly (zero HTTP/JWT/JSON overhead) for --caplin.use-engine-api internal mode - Remote: uses EngineAPIRPCClient over HTTP JSON-RPC with JWT auth for standalone cmd/caplin The EngineAPI interface serves as the transport boundary — EngineServer satisfies it in-process, EngineAPIRPCClient satisfies it over HTTP.
…one mode Fix multiple bugs preventing Caplin from producing blocks and finalizing in standalone mode (with external VC). Verified with kurtosis devnet (assertoor synchronized-check + block-proposal-check both pass) and hive engine tests (401/403 pass, 2 known Cancun failures). Bugs fixed: - Genesis block root mismatch (stored under content hash vs state root) - GetPayload version routing (was always V4, now version-aware) - Missing withdrawals/transactions in executionPayloadToEth1Block - Parent root mismatch in produced blocks (HashSSZ vs HeadRoot) - Fulu NewPayload routing (added FuluVersion case) - EL NewPayload quick-reject dropping valid blocks (re-fetch parent inside waitForResponse closure) - Fork choice filter: sync.Map type mismatch ([32]byte vs common.Hash) - Fork choice filter: missing unrealized finalized checkpoints causing leaf rejection at epoch boundaries - Beacon API: withdrawals field omitted from execution_payload JSON (omitempty on nil pointer); now version-gated MarshalJSON
Add caplin assertoor test suite to kurtosis CI matrix with per-suite docker_binaries and ethereum_package_branch support. Document the full hive and kurtosis test setup for caplin+erigon pairs, covering local runs, CI integration, and the embedded engine API mode.
- Add internalCL field to EngineServer to suppress "no CL requests" warning when any embedded CL is active (including --caplin.use-engine-api) - Re-fetch parent header inside waitForResponse closure to avoid false SYNCING responses when the parent is in an uncommitted write tx
…ization The Eth1Block.MarshalJSON change correctly omits Deneb-specific fields (blob_gas_used, excess_blob_gas) for pre-Deneb blocks. Update the test fixture to match.
- Add missing `consuming` parameter to NewEngineServer calls in tests - Update produceBeaconBody call to use new (slot, root) signature
The caplin kurtosis suite requires cl_type support in ethereum-package which doesn't exist yet. Revert workflow changes to match main and keep the caplin config file for future use.
57da39c to
ce72ee0
Compare
- IsCanonicalHash: verify canonicality via eth_getBlockByNumber cross-check instead of just checking block existence with eth_getBlockByHash - ForkChoiceUpdate: add version parameter to select V1-V4 based on fork, matching the existing NewPayload version selection pattern - ForkchoiceUpdatedV4: add to EngineAPI interface and EngineAPIRPCClient - forkchoice: clean up unrealizedJustifications/unrealizedFinalizations sync.Maps in onNewFinalized to prevent unbounded memory growth - clstages: add genesis beacon block body root verification - Fix isHTTPpecified typo, rename to hasScheme/isHTTPS for clarity - Remove duplicate comments in get_head.go - Extract getAssembledBlockFromResponse shared helper for V4/V5
In standalone caplin mode, the execution block reader (eth1Getter) was nil, causing the beacon API to return null for execution_payload.transactions when reading blocks. This made assertoor's transaction inclusion tests fail. Add ExecutionEngineReader that implements ExecutionBlockReaderByNumber by fetching block bodies from the EL via engine_getPayloadBodiesByRangeV1. Wire it up in standalone caplin's main.go.
The beacon API spec defines previous_epoch_participation and
current_epoch_participation as List[ParticipationFlags] — a JSON array
of decimal strings (e.g. ["7", "7", "0"]).
Caplin was serializing them as hex-encoded SSZ bytes ("0x070700..."),
which caused assertoor to fail with "illegal base64 data" when trying
to load the validator set, breaking client tracking and causing the
transaction inclusion test to time out.
UnmarshalJSON supports both the new spec-compliant array format and
the legacy hex format for backwards compatibility.
The beacon DB stores blinded blocks (no transactions). When the beacon API reads a block, it fetches transactions from the EL via engine_getPayloadBodiesByRangeV1. But for recently produced blocks, the EL hasn't committed yet — causing the API to return 0 transactions. Add an in-memory body cache to ExecutionEngineReader. Block production populates the cache immediately, so the beacon API can return transactions before the EL commit completes. This fixes the assertoor eoa-transactions-test which checks block proposals via the beacon API for transaction inclusion.
yperbasis
left a comment
There was a problem hiding this comment.
Additional Issues Claude Found:
Medium: ExecutionClientRpc.GetAssembledBlock still panics
cl/phase1/execution_client/execution_client_rpc.go:288 — The old ExecutionClientRPC still has panic("unimplemented") in GetAssembledBlock. While this path shouldn't be hit if users are migrated to the new
ExecutionClientEngine, leaving a panic in production code is risky. Either implement it or return a clear error.
Medium: ExecutionEngineReader.NewExecutionEngineReader accepts nil engine
cl/persistence/format/snapshot_format/getters/execution_rpc.go — NewExecutionEngineReader is called with executionEngine in cmd/caplin/main.go:121, which could be nil if cfg.RunEngineAPI is false. The reader
will then panic on r.engine.GetBodiesByRange() when Transactions() or Withdrawals() is called and there's no cache hit. Either guard for nil engine or don't create the reader when engine is nil.
Low: CacheBody eviction is O(n)
getters/execution_rpc.go:85-90 — The eviction loop iterates over the entire map. With the cap at 64 entries this is fine in practice, but a minKey tracking approach would be cleaner. Not blocking.
Low: Block production caches transactions but not withdrawals
cl/beacon/handler/block_production.go:795 — CacheBlockBody is called with nil withdrawals in the builder body, while storeBlockAndBlobs (line 1349) also passes nil for withdrawals. If the block reader needs
withdrawals before the EL commits, this will fall through to the engine call and potentially fail. The execution payload has withdrawals available — consider passing them.
Low: ReadBlockByRoot nil check removed
db/snapshotsync/freezeblocks/beacon_block_reader.go:212 — The r.eth1Getter == nil guard was removed from ReadBlockByRoot. If eth1Getter is nil and the code path reaches snapshot deserialization that needs
execution data, this could panic. The new nil check in snapshot_format/blocks.go:117 mitigates this for the snapshot read path, but verify all callers are covered.
Nit: Copyright year
New files have Copyright 2024 but this is 2026 code.
Verdict
The core architecture is solid and the major review feedback has been addressed. The remaining issues are medium/low severity. I'd recommend addressing the nil-engine guard and the panic in GetAssembledBlock
before merge. The PR is close to ready.
- ExecutionClientRpc: replace panic("unimplemented") with ErrNotSupported
error returns for all stub methods
- ExecutionEngineReader: guard for nil engine (returns empty when no
engine available instead of panicking), add highestCached for O(1)
eviction threshold
- Block production: cache both transactions and withdrawals (not just
transactions) via shared cacheExecutionBody helper
- Fix copyright year to 2026 in new file
yperbasis
left a comment
There was a problem hiding this comment.
High Severity
- executionPayloadToEth1Block does not set version or beaconCfg on returned Eth1Block
cl/phase1/execution_client/execution_client_engine.go ~line 3582 in the diff:
block := &cltypes.Eth1Block{
ParentHash: ep.ParentHash,
...
}
The version and beaconCfg fields on Eth1Block are unexported and default to zero/nil. This means:
- block.Version() always returns 0 (Phase0), regardless of actual fork
- MarshalJSON() will never emit withdrawals, blob_gas_used, or excess_blob_gas
- EncodingSizeSSZ() / getSchema() will undercount / omit fields
- PayloadHeader() won't compute withdrawalsRoot or set blob gas fields
- Any method accessing b.beaconCfg (e.g., SSZ encode/decode) will nil-panic
This only affects the remote EL path (standalone caplin). The local path goes through chainRW.GetAssembledBlock which returns properly versioned blocks. But for the stated goal of enabling external EL usage,
this is a real bug.
Fix: Use cltypes.NewEth1Block(version, beaconCfg) then set fields, or add a package-level constructor. The version is available from the clparams.StateVersion parameter already threaded through
GetAssembledBlock.
- FCU version mapping sends V4 for Electra/Fulu — should be V3
cl/phase1/execution_client/execution_client_engine.go FCU switch:
default: // Electra, Fulu, Gloas+
resp, err = cc.engine.ForkchoiceUpdatedV4(ctx, ...)
Per the Engine API spec:
- forkchoiceUpdatedV3 is valid for Cancun and Prague (Deneb + Electra)
- forkchoiceUpdatedV4 is introduced in Amsterdam (Gloas)
Sending V4 for Electra causes the server to process it as GloasVersion (see engine_api_methods.go:142), which could mis-validate payload attributes. The same issue exists in execution_client_rpc.go where
Electra also maps to V4.
Against a non-erigon EL, this would fail — a spec-compliant EL will reject forkchoiceUpdatedV4 before Amsterdam activates.
Medium Severity
- NewPayload does not nil-check payloadStatus before dereferencing
execution_client_engine.go ~line 3328:
if payloadStatus.Status == engine_types.AcceptedStatus {
If the engine returns (nil, nil), this panics. The existing ExecutionClientRpc avoids this by pre-allocating the status struct. Low probability in practice but worth a nil guard.
- Memory leak in unrealized checkpoint maps
cl/phase1/forkchoice/utils.go cleanup in onNewFinalized:
The cleanup only deletes entries where GetHeader returns true and header.Slot <= finalizedSlot. If the fork graph has already pruned a header, GetHeader returns false and the map entry is never deleted. Over a
long-running node, orphaned entries accumulate. Fix:
if !has || header.Slot <= finalizedSlot {
f.unrealizedJustifications.Delete(k)
}
- unrealizedFinalizations map is not in the consensus spec
cl/phase1/forkchoice/forkchoice.go: The spec defines store.unrealized_justifications but not store.unrealized_finalizations. The spec's filter_block_tree uses the block's realized finalized checkpoint for the
finalized check. Using unrealized finalized checkpoints is a spec deviation. In practice this should be safe (unrealized >= realized), but it should be verified against edge cases at epoch boundaries.
- GetBlobs always uses V1 in remote mode
execution_client_engine.go ~line 3673:
result, err := cc.engine.GetBlobsV1(ctx, versionedHashes)
The EngineAPI interface has GetBlobsV1, V2, V3 and the constants are defined, but GetBlobs never dispatches to V2/V3. For Fulu+, GetBlobsV2 returns CellProofs []hexutil.Bytes instead of a single Proof, so using
V1 would return the wrong proof structure.
- No Close() method on ExecutionClientEngine
The rpc.Client created in NewExecutionClientEngineRPC holds an HTTP connection pool and goroutines. There's no cleanup path when the engine is shut down. The other implementations have the same gap, but this
adds another one.
Low Severity
- Hardcoded maxWithdrawals = 16
executionPayloadToEth1Block:
block.Withdrawals = solid.NewStaticListSSZ[*cltypes.Withdrawal](16, 44)
Mainnet MaxWithdrawalsPerPayload is 16, but this should use the beacon config value for correctness on testnets/custom chains. Same issue in getters/execution_rpc.go.
- Hardcoded DenebVersion in block collector FCU
cl/phase1/execution_client/block_collector/persistent_block_collector.go:280:
p.engine.ForkChoiceUpdate(ctx, ..., clparams.DenebVersion)
The block's version is already decoded at line 223 (version := clparams.StateVersion(v[0])). Could propagate it instead of hardcoding.
- Dead constants
rpc_helper/methods.go: EngineNewPayloadV5, EngineGetBlobsV2, EngineGetBlobsV3 are defined but never referenced. Should be removed or documented as future-use.
- writeGenesisBeaconBlock doesn't handle Electra+ fields
cl/phase1/stages/clstages.go: The synthetic genesis block reconstruction handles fields up to Deneb. If genesis is at Electra version, body.ExecutionRequests may need initialization for the body root to match.
- Anchor block root not stored in unrealized maps
forkchoice.go:291-294: NewForkChoiceStore sets the atomic checkpoint values but never populates unrealizedJustifications.Store(anchorRoot, ...). Mitigated by the fallback to fork graph lookups.
Summary
The PR's core architecture is good, but issues #1 and #2 are blockers — the missing version/beaconCfg on Eth1Block will cause panics or wrong serialization in remote mode, and the FCU version mismatch will
break against non-erigon ELs. Issue #4 (memory leak) should also be fixed before merge. The rest can be addressed in follow-ups.
HIGH: - executionPayloadToEth1Block: use NewEth1Block(version, beaconCfg) so the returned block has correct version and config for serialization, SSZ encoding, and payload header computation in remote mode - FCU version mapping: V3 covers Deneb + Electra + Fulu (Prague), V4 is only for Gloas+ (Amsterdam). Previously Electra/Fulu sent V4 which would be rejected by spec-compliant non-erigon ELs MEDIUM: - NewPayload: add nil check for payloadStatus before dereferencing - Unrealized checkpoint cleanup: also delete entries whose headers have been pruned from the fork graph (!has), preventing orphaned map entries - Thread beaconCfg through ExecutionClientEngine constructors and SetBeaconChainConfig for deferred initialization LOW: - Use beaconCfg.MaxWithdrawalsPerPayload instead of hardcoded 16 - Improve block collector FCU comment documenting V3 coverage
- GetBlobs: add version-aware dispatch — V1 for Deneb/Electra (single proof), V2 for Fulu+ (cell proofs). Adds version parameter to ExecutionEngine.GetBlobs interface - ExecutionClientEngine: add Close() to release HTTP connections - Remove dead EngineNewPayloadV5 constant from rpc_helper/methods.go - Store anchor block root in unrealizedJustifications/Finalizations maps during ForkChoiceStore initialization - writeGenesisBeaconBlock: no change needed — NewSignedBeaconBlock already initializes ExecutionRequests for Electra+ via NewBeaconBody
yperbasis
left a comment
There was a problem hiding this comment.
HIGH
- ExecutionClientRpc.NewPayload missing Fulu case (execution_client_rpc.go)
The old ExecutionClientRpc.NewPayload has // TODO: Add Fulu case but no actual Fulu handling — it falls through to return PayloadStatusNone, errors.New("invalid payload version"). The new ExecutionClientEngine
correctly handles Fulu (uses V4 for Electra+Fulu), but if any code path still constructs an ExecutionClientRpc, Fulu payloads will fail. If this type is now dead code, it should be removed or clearly marked as
deprecated. If it's still reachable, the TODO needs to be resolved.
- GetBlobs silently swallows errors (execution_client_engine.go)
Both local and remote paths return nil, nil on error:
result, err := cc.engine.GetBlobsV1(ctx, versionedHashes)
if err != nil {
return nil, nil
}
This makes "no blobs available" indistinguishable from "network failure" to callers. For blob availability checks this could mask real problems. The interface signature (([][]byte, [][][]byte) with no error)
forces this, but it's worth considering whether the interface should return an error.
- onNewFinalized uint64 underflow (forkchoice/utils.go)
slotToPrune := ((newFinalized.Epoch - 3) * f.beaconCfg.SlotsPerEpoch) - 1
If newFinalized.Epoch < 3 (which happens during the first 3 epochs after genesis), this wraps around to ^uint64(0). Pre-existing bug, but this PR adds genesis-aware startup code that will hit this path. Needs a
guard:
if newFinalized.Epoch <= 3 {
return // or slotToPrune = 0
}
MEDIUM
- ExecutionClientRpc should be removed or clearly deprecated
The old type now returns ErrNotSupported for most methods instead of panicking (improvement), but it's effectively dead. ExecutionClientEngine fully supersedes it in both local and remote modes. Keeping it
around adds confusion about which implementation is canonical.
- CacheBlockBody relies on concrete type assertion (beacon_block_reader.go:81-82)
if c, ok := r.eth1Getter.(*getters.ExecutionEngineReader); ok {
c.CacheBody(blockNumber, transactions, withdrawals)
}
This silently no-ops if eth1Getter is any other type. The CacheBlockBody method is on the BeaconSnapshotReader interface but only works with one implementation. Consider adding CacheBody to the
ExecutionBlockReaderByNumber interface instead.
- writeGenesisBeaconBlock is fragile across forks (clstages.go)
The genesis block reconstruction copies Eth1Header fields into Eth1Block one by one. Any new field added in a future fork will be silently missed, causing a body root mismatch that fails with a cryptic error.
Consider adding a constructor like NewEth1BlockFromHeader(header *Eth1Header) to keep this in sync.
- Per-block unrealized checkpoint maps: key type (forkchoice.go:109-110)
unrealizedJustifications sync.Map // blockRoot -> solid.Checkpoint
unrealizedFinalizations sync.Map // blockRoot -> solid.Checkpoint
The commit message mentions fixing a sync.Map type mismatch ([32]byte vs common.Hash). The code stores/loads common.Hash consistently now, which is correct. The earlier fix was critical — glad it's resolved.
LOW
-
Copyright years — execution_client_engine.go and engine_api_rpc_client.go are new files but have Copyright 2024. Should be 2026 per the project convention (as execution_rpc.go correctly uses).
-
FCU version mapping is correct and well-commented — V3 covers Deneb through Fulu (Prague), V4 only for Gloas+ (Amsterdam). Matches the Engine API spec.
-
ParticipationBitList JSON fix is spec-compliant — The old hex-encoded SSZ was wrong per the beacon API spec. The new MarshalJSON produces ["7","7","0"] format, and UnmarshalJSON accepts both formats for
backwards compat. Good. -
Eth1Block.MarshalJSON version-gated output — Correctly omits blob_gas_used/excess_blob_gas for pre-Deneb and ensures withdrawals is never null for Capella+. This fixes the omitempty nil-pointer issue.
Verdict
Solid work that addresses a real gap (caplin couldn't function as a standalone CL). The main concerns are: (1) cleaning up the now-superseded ExecutionClientRpc, (2) the uint64 underflow in onNewFinalized, and
(3) silent error swallowing in GetBlobs. The rest is clean, well-documented, and battle-tested against assertoor.
HIGH: - Remove ExecutionClientRpc entirely — fully superseded by ExecutionClientEngine. Move shared checkPayloadStatus and DefaultRPCHTTPTimeout to execution_client_engine.go. - Fix uint64 underflow in onNewFinalized when epoch < 3: guard against wraparound during first 3 epochs after genesis. MEDIUM: - Move CacheBody to ExecutionBlockReaderByNumber interface instead of relying on concrete type assertion. All implementations (ExecutionEngineReader, ExecutionSnapshotReader) now satisfy it. - Add NewEth1BlockFromExecutionHeader constructor that keeps header→block field mapping in sync across forks. Simplifies writeGenesisBeaconBlock from 25 lines to 1. LOW: - Fix copyright years on new files (2024 → 2026)
The GetBlobs interface previously returned (blobs, proofs) with no error, forcing implementations to silently swallow network failures and return nil — making "no blobs available" indistinguishable from "network error". Add error return to ExecutionEngine.GetBlobs. The caller in on_block.go now logs warnings on failure instead of silently treating errors as missing blobs.
yperbasis
left a comment
There was a problem hiding this comment.
HIGH
-
ExecutionClientEngine.Close() is not called on the local path (execution_client_engine.go:109-111)
-
Close() only closes cc.rpcClient and is a no-op for local mode, which is fine for now. But ExecutionClientEngine is also not closed on the remote path in cmd/caplin/main.go -- there's no defer cc.Close(). If
the caplin process exits ungracefully, HTTP connections won't be cleaned up. Consider adding defer cc.Close() in main.go after construction. -
NewExecutionClientEngineRPC passes nil beaconCfg (cmd/caplin/main.go:80)
cc, err := execution_client.NewExecutionClientEngineRPC(cfg.JwtSecret, cfg.EngineAPIAddr, cfg.EngineAPIPort, nil)
-
beaconCfg is nil at construction time. It's set later via SetBeaconChainConfig, but if any method on ExecutionClientEngine is called before that (e.g., GetAssembledBlock which calls
executionPayloadToEth1Block which uses beaconCfg), you'd get a nil pointer dereference. This is a latent crash risk. Consider either:
- Requiring beaconCfg at construction (preferred)
- Adding nil-guards in executionPayloadToEth1Block -
FCU version mapping may reject future forks (execution_client_engine.go:216-224)
-
The default case in ForkChoiceUpdate sends V4 for anything >= Gloas. If a new fork is added but the Engine API adds V5, the default would silently use V4. This is acceptable for now but should be revisited
when new forks are defined.
MEDIUM
-
ExecutionEngineReader.cache has no upper bound enforcement guarantee (getters/execution_rpc.go:92-97)
-
The eviction logic only triggers when len(r.cache) > 64, but it deletes entries based on r.highestCached - 64. If block numbers are sparse (e.g., blocks cached for numbers 1, 100, 200), the threshold-based
eviction won't clean entries efficiently. For the expected use case (sequential block production) this is fine, but worth noting. -
Race between CacheBody and Transactions/Withdrawals (getters/execution_rpc.go)
-
The getCached method uses RLock and CacheBody uses full Lock, which is correct. However, the pattern of "check cache, then fall through to engine" could return stale data if a block is re-orged and
re-produced at the same number. Given the 64-block window and the fact this is for beacon API reads, the practical risk is very low. -
ParticipationBitList.MarshalJSON allocates per-element strings (solid/participation_bitlist.go:180-184)
-
For large validator sets (1M+ validators), this creates 1M+ string allocations. Consider using a strings.Builder or pre-allocated byte buffer. Not a blocker, but worth noting for mainnet performance.
LOW
-
EngineAPIRPCClient URL construction (engine_api_rpc_client.go:41-49)
-
The scheme detection logic is correct but slightly convoluted. hasScheme checks http which would also match https, then isHTTPS refines it. The empty string for protocol when hasScheme is true works because
the address already contains the scheme. This is fine but could benefit from a brief comment. -
buildExecutionPayload reverses BaseFeePerGas manually (execution_client_engine.go:117-121)
-
This same byte-reversal pattern appears in executionPayloadToEth1Block and in eth1_block.go. Consider extracting a helper to avoid the repeated reversal logic (3 occurrences). Not critical for this PR.
-
Test fixture (block_1.json) is 1974 lines of added JSON (cl/beacon/handler/test_data/block_1.json)
-
This is a very large inline test fixture. It's a common pattern in Go tests, but the diff is hard to review meaningfully. The fixture itself looks properly structured.
-
Documentation file references local paths (.github/docs/caplin-hive-testing.md)
-
The doc references /erigon/mark/hive/ which is a local developer path. This is fine as internal documentation, but should be called out as non-portable.
Summary
This is a well-executed PR that enables a significant new capability (standalone caplin mode). The architecture is sound, the bug fixes are well-motivated by real test failures, and the code is clean. The main
risk areas are around the deferred beaconCfg initialization (issue #2) and missing Close() call (issue #1). The rest of the issues are minor or informational. The PR is ready for merge after addressing the
high-priority items.
…21310) ## Summary Two spec-deviations in `getFilterBlockTree` compound to reject valid current-epoch leaves at every epoch boundary, causing Caplin's `GetHead` to fall back to the justified checkpoint root (a ~30-50 slot regression) and triggering execution-side unwinds. ## Observed symptom (from #21301) On bloatnet, once execution catches up to chain tip, the node enters a steady-state cycle of **22-36 block execution unwinds every ~6 minutes** — one per epoch boundary. Histogram from one ~3h run (110 unwind events): ``` 1 size=11 1 size=14 4 size=15 1 size=16 4 size=18 2 size=19 3 size=20 5 size=21 9 size=22 14 size=23 ← most common 3 size=24 5 size=25 4 size=26 4 size=27 8 size=28 1 size=30 2 size=31 4 size=32 3 size=33 4 size=34 ...many size=35-36 ``` The unwinds are on the **same chain** (no real reorg) — execution is forced to roll back to an older head because Caplin regresses its `GetHead()` return value: ``` 08:38:45 Caplin FCU: headSlot=652928 currentSlot=652988 lagSlots=60 eth1Head=0xe0f3... 08:39:36 Caplin FCU: headSlot=652993 currentSlot=652993 lagSlots=0 eth1Head=0x263b... ← at tip 08:44:00 Caplin FCU: headSlot=652959 currentSlot=653015 lagSlots=56 eth1Head=0x7674... ← regressed 34 slots ``` The 08:44:00 FCU triggers a 23-block exec unwind: ``` [forkchoice] entering unwind path fcuNum=24701328 canonHash=0x7674... fcuHash=0x7674... hashesDiffer=false finishProgress=24701350 executionAhead=true Unwind Execution from=24701350 to=24701327 ``` `hashesDiffer=false` confirms same-chain. `executionAhead=true` because exec raced ahead of Caplin's regressing head. ## Root cause: filter rejects mid-epoch leaves Tracing inside `GetHead`: ``` [GetHead] cache miss, rebuilding from justifiedCheckpoint justifiedEpoch=20410 [filterTree] reject leaf: checkpoint mismatch blockRoot=0xe644... slot=653150 justifiedOk=false finalizedOk=false storeJustEpoch=20410 blockJustEpoch=20411 storeFinalEpoch=20409 blockFinalEpoch=20410 [GetHead] filtered tree size viableBlocks=0 totalHeads=1 [GetHead] rebuilt head headHash=0x... headSlot=653119 ← fallback to justified root Caplin is sending forkchoice headSlot=653119 currentSlot=653174 lagSlots=55 ``` Every leaf in the tree gets filtered because its `unrealizedJustifications[blockRoot].epoch = N+1` but the store's `justifiedCheckpoint.epoch = N` (store's realized lags by 1 epoch mid-epoch). The walk falls back to the justified root → 50+ slot regression. ## Spec deviations (the fix) ### 1. Voting-source selection used unrealized unconditionally Spec [`get_voting_source`](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/fork-choice.md#get_voting_source) picks unrealized only for **prior-epoch** blocks (pull-up justification view); current-epoch blocks should use the block's realized state checkpoint: ```python def get_voting_source(store: Store, block_root: Root) -> Checkpoint: block = store.blocks[block_root] current_epoch = get_current_store_epoch(store) block_epoch = compute_epoch_at_slot(block.slot) if current_epoch > block_epoch: return store.unrealized_justifications[block_root] else: return store.block_states[block_root].current_justified_checkpoint ``` Erigon's code: ```go // Use per-block unrealized justifications (spec: store.unrealized_justifications[block_root]) // Fall back to realized checkpoints if unrealized not available currentJustifiedCheckpoint, has := f.getUnrealizedJustification(blockRoot) if !has { currentJustifiedCheckpoint, has = f.forkGraph.GetCurrentJustifiedCheckpoint(blockRoot) ... } ``` → Always picks unrealized when available. The "fall back" comment misreads the spec — spec is a conditional branch, not a fallback. ### 2. Justified/finalized check was strict equality Spec [`filter_block_tree`](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/fork-choice.md#filter_block_tree) has a lenient `+2 >= current_epoch` lookback: ```python correct_justified = ( store.justified_checkpoint.epoch == GENESIS_EPOCH or voting_source.epoch == store.justified_checkpoint.epoch or voting_source.epoch + 2 >= current_epoch # ← lenient lookback ) ``` Erigon's code only allowed the first two clauses (strict `Equal()`). ## Regression provenance The deviations were introduced in #20035 (`caplin: unified Engine API client for standalone mode`, Mar 25 2026) by Mark Holt: ``` 082697d cl/phase1/forkchoice/get_head.go +Use per-block unrealized justifications ``` Before that PR, `getFilterBlockTree` used `f.forkGraph.GetCurrentJustifiedCheckpoint(blockRoot)` directly — which returns the block's **realized** justified checkpoint. Block.realized matched store.realized (both on the same epoch-boundary timeline), so the equality check passed. PR #20035 introduced the per-block unrealized lookup (`store.unrealized_justifications` per the spec), but missed the conditional branching (spec deviation #1) and didn't add the +2 lookback (spec deviation #2). The result: block.unrealized goes ahead by 1 epoch mid-epoch, and store.realized doesn't catch up until `OnTick`'s epoch-boundary path runs. ## Fix Restore spec-faithful behavior in `getFilterBlockTree`: 1. Branch the voting source selection on `currentEpoch > blockEpoch`: - Prior-epoch → `store.unrealized_justifications[block_root]` - Current/future-epoch → block's realized `current_justified_checkpoint` In both branches, return `false` (reject the leaf) if the lookup is absent. `OnBlock` populates `unrealizedJustifications` for every imported block above the finalized slot, so a missing entry indicates either an invariant violation or a leaf below finalized — neither should produce a viable head. 2. Replace the strict `Equal()` check on the justified side with the spec's three flat disjuncts: `store.justified_checkpoint.epoch == GENESIS_EPOCH || voting_source.epoch == store.justified_checkpoint.epoch || voting_source.epoch + 2 >= current_epoch`. 3. Replace the finalized-side checkpoint-equality check with the spec's ancestor-descent rule: `store.finalized_checkpoint.root == get_checkpoint_block(block_root, finalized.epoch)`, implemented via `f.Ancestor(blockRoot, finalizedSlot)`. 4. Snapshot `currentEpoch` once at the start of the walk (in `getFilteredBlockTree`) and thread it through the recursion so the whole rebuild uses a single store-epoch value — `OnTick` updates `f.time` without holding `f.mu`, so per-leaf `f.Slot()` reads can mix epochs across leaves around a slot boundary. ## Validation Tested on bloatnet at chain tip, on `performance` HEAD + this fix: - **Before fix**: 22-36 block unwinds every ~6 min, one per epoch boundary, persisting indefinitely. 110 events in 3h. - **After fix**: zero unwinds at epoch boundaries observed across 30+ minutes (~5 epoch boundaries crossed). `[filterTree] reject` no longer fires. Effects: - **CPU waste eliminated**: ~30 blocks of execution work were being discarded and re-done every cycle. - **db_size**: unchanged (was already not affected since rolled-back state was in MemoryMutation overlay, not MDBX). - **Logs**: clean — no more spurious `Unwind Execution` lines at tip. ## Test plan - [x] Run on bloatnet to chain tip, observe no large unwinds at epoch boundaries - [x] Confirm no head regression in `Caplin is sending forkchoice` logs (lagSlots stays ~0 at tip) - [ ] Spectest still passes (`make spectest`) ## Refs Full diagnosis trail: #21301
…bitlist.go Restores cl/cltypes/eth1_block.go and cl/cltypes/solid/participation_bitlist.go to their pre-#20035 versions to test whether the perf regression for #21008 lives in these serialization changes. eth1_block.go is on the per-block deserialization path; participation_bitlist.go is touched on every attestation. This is on top of the prior forkchoice/* revert which alone did not restore mainnet sync performance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restores #20035's versions of beacon_block_reader.go, snapshot_format/blocks.go, execution_snapshot.go, test_util.go, and the new execution_rpc.go file. If perf regresses with this batch, regression is here. If clean, it's elsewhere. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mit (erigontech#21444) ## Summary `ExecModule.UpdateForkChoice` spawns the forkchoice work in a goroutine and waited on a `done` channel that closes only after that goroutine **fully returns** — i.e. after flush + commit + prune. For a merge-extending fork at tip the forkchoice result is already sent on `outcomeCh` *before* the commit (`ParallelStateFlushing` is on by default), so the `<-done` wait threw that early-send away and blocked the consensus client for the entire EL commit. This removes the `<-done` wait: `UpdateForkChoice` returns as soon as the result lands on `outcomeCh`. ## Live-tip A/B (this branch vs main) Probe at the `ExecutionClientDirect.ForkChoiceUpdate` boundary (what Caplin's clstages loop blocks on), live mainnet tip, same datadir/machine/instrumentation: | FCU-response wall (ms) | n | min | p50 | p90 | max | mean | |------------------------|--:|----:|----:|----:|----:|-----:| | **main** | 100 | 96.7 | **260.2** | 386.3 | 12160.8 | 390.1 | | **this branch** | 82 | 0.7 | **1.2** | 3.0 | 4.3 | 1.7 | ~217× lower p50; distributions don't overlap (branch max 4.3 ms ≪ main min 96.7 ms). This matches release/3.4's ~1.4 ms at the same boundary. The `Timings: Forkchoice commit=…ms` line is unchanged — the commit still costs the same, it just no longer blocks the consensus loop. safe/finalized cadence is identical between branch and main, and wall does not correlate with safe/finalized changes, so this is not about forkchoice args. Both A/B binaries are post-erigontech#21327, so the comparison isolates only the `<-done` change — the gap is not confounded by the erigontech#21008/erigontech#20035 regression (both sides already have its fix). ## Scope / relationship to erigontech#21008 This is an **independent FCU-response-latency improvement**, not the fix for the erigontech#21008 in-sync regression. The bisection attributes erigontech#21008 to erigontech#20035, which is entirely CL-side (`cl/phase1/forkchoice/*`) and touches no `execution/execmodule/` code — its `getFilterBlockTree` spec-deviation regressed `GetHead` at epoch boundaries and is itself already fixed on main by erigontech#21310/erigontech#21327. The `<-done` wait predates the bisection's good commit `19f44b15cd` (it was added in erigontech#19981, and that commit measured 100% in-sync), so the EL-side blocking this PR removes is **not** what caused the 100→3% collapse. It is a real, separate cost worth removing on its own merits: at tip it returns the consensus loop to ~1 ms instead of ~260 ms (and avoids multi-second blocks when the commit overlaps a background merge/prune burst). ## Why it's safe The `<-done` was added (erigontech#19981) so a follow-up `AssembleBlock` could safely acquire the EL semaphore. That guarantee is preserved without it: - In `updateForkChoice`, the `defer e.semaphore.Release(1)` is registered **first**, so it runs **last** — after every cleanup defer (`ResetPendingUpdates`, `ClearWithUnwind`, overlay `Close`). Any op that subsequently acquires the semaphore necessarily observes fully-settled state. - `ExecutionClientDirect.ForkChoiceUpdate` already retries `AssembleBlock` (30× 200 ms) on semaphore contention, so a proposer FCU racing an in-flight commit resolves itself. - release/3.4 has no `<-done` and is the proven-good baseline. - Non-merge FCUs send their result at the end of `updateForkChoice` anyway, so their timing is unchanged. ## Test plan - [x] `execution/engineapi` `TestEngineApi*` block-production suite (the FCU→AssembleBlock path `<-done` guarded) - [x] `execution/execmodule/...` - [x] `cl/beacon/handler` `TestCaplinBlockProduction*` - [x] `make lint` clean - [x] Live-tip A/B (this branch vs main): FCU-response wall p50 260 ms → 1.2 ms, 0 gap errors over an epoch 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Summary
Inserts an
ExecutionClientEnginethat calls the Engine API directly over HTTP/JWT so that caplin can be run as an api complient external — the same interface external consensus clients use. .The main motivation for this is to allow caplin to be used as a cl in assertor & hive tests - which excersize the standard test suites. It is also a precursor to using caplin as a zk proof varerified, which requires on EL.
It should also mean that caplin can be used with other EL's - but this has not been tested.
GetHead()now returns the justified/finalized checkpoints needed byforkchoiceUpdated, fixing standalone mode FCU callsTest plan
make lint— cleanmake erigon integration— buildsmake test-short— all pass (including beacon handler harness tests)