
cl: implement GLOAS (EIP-7732 ePBS) for Caplin#18956

Merged
domiwei merged 123 commits into main from feature/caplin_gloas
May 13, 2026

@domiwei domiwei commented Feb 4, 2026

Member

Summary

Full GLOAS (Glamsterdam) fork implementation for Caplin, Erigon's embedded consensus layer client. GLOAS introduces enshrined proposer-builder separation (ePBS) via EIP-7732.

Core implementation

  • Beacon state: execution payload envelope, PTC window, builder pending payments
  • Fork choice: deferred payload processing, BPS timing, payload timeliness committee (PTC) voting
  • Transition: ApplyParentExecutionPayload, epoch processing with PTC window state
  • SSZ: new types (ExecutionPayloadEnvelope, SignedExecutionPayloadEnvelope, PartialDataColumn)
  • Engine API: V5/V6 version routing for GLOAS payloads

Sync & networking

  • Forward sync: envelope-aware block processing, fork-boundary BeaconBlocksByRange capping
  • Checkpoint sync: anchor envelope storage for GLOAS state recovery
  • Gossip: ePBS topic scoring, fork digest resubscription fix, REJECT cascade prevention
  • P2P: multistream protocol negotiation fallback, HTTP beacon API fallback for forward sync
  • Block collector: single FCU after full flush (fixes TD overlay destruction between batches)

Stability fixes

  • Guard unrealized checkpoint promotion during forward sync (prevents ErrNotFinalizedDescendant)
  • Fix ChainReorgData.Depth uint64 underflow when headSlot < currentSlot
  • Track highestSeenRoot for consistent status advertisements (avoids Lighthouse "useless peer" ban)
  • Fix gossip REJECT cascade causing network collapse
  • Fix payload attestation slot and getPTCFromWindow underflow

Spec compliance

  • Aligned with consensus-specs v1.7.0-alpha.7
  • Spectests upgraded to v1.7.0-alpha.7
  • Beacon API: blinded block endpoint, GLOAS SSE events, PayloadStatus enum

Testing

  • Kurtosis mixed-CL suite (Lighthouse + Prysm + Caplin) passing
  • Block collector flush tests adapted for GLOAS payloadKey semantics
  • Validated on glamsterdam-devnet-2

Test plan

  • `make lint` clean
  • `make erigon integration` builds
  • Kurtosis mixed-CL test (Lighthouse + Prysm + Caplin) — assertoor block_proposal_check passed
  • glamsterdam-devnet-2 checkpoint sync and block verification
  • Spectests (`make test-short` on cl/spectest)

@domiwei domiwei requested a review from Giulio2002 as a code owner February 4, 2026 06:47
@domiwei domiwei changed the title from "WIP: Gloas hardfork" to "WIP: Gloas upgrade" Feb 4, 2026
@domiwei domiwei force-pushed the feature/caplin_gloas branch 4 times, most recently from 689d7bc to 457288b Compare February 13, 2026 06:39
@domiwei domiwei force-pushed the feature/caplin_gloas branch 2 times, most recently from 826100f to e8029ed Compare February 19, 2026 09:56
@domiwei domiwei force-pushed the feature/caplin_gloas branch 3 times, most recently from 18b0a57 to 37c12f7 Compare March 2, 2026 07:29
@domiwei domiwei requested a review from sudeepdino008 as a code owner March 3, 2026 09:34
@domiwei domiwei force-pushed the feature/caplin_gloas branch from 5ebf036 to 07e93f0 Compare March 6, 2026 04:55
@Giulio2002 Giulio2002 added the dependencies (Pull requests that update a dependency file) and Caplin (Caplin: Consensus Layer, Beacon API) labels Mar 10, 2026
@domiwei domiwei force-pushed the feature/caplin_gloas branch from 9264bc9 to 48b35a1 Compare March 16, 2026 05:27
@yperbasis yperbasis added the Glamsterdam (https://eips.ethereum.org/EIPS/eip-7773) label Mar 16, 2026
@domiwei domiwei requested review from mh0lt and yperbasis as code owners March 18, 2026 16:35

@yperbasis yperbasis left a comment

Member

CRITICAL — Consensus-breaking or will crash

  1. setLatestMessage reads new value instead of old for indexedWeightStore removal (on_attestation.go)

f.latestMessages.set(int(index), message) // writes FIRST
if oldMessage, has := f.latestMessages.get(int(index)); has { // reads AFTER → gets new value
    f.indexedWeightStore.RemoveVote(index, oldMessage.Root) // removes NEW vote root
}
f.indexedWeightStore.IndexVote(index, message) // adds new vote back

The old message must be read before the write. As written, RemoveVote always removes the just-written root and IndexVote then adds it back, so the old vote is never removed. Over time, directVotes accumulates stale entries, producing incorrect fork choice weights.
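
The ordering described above can be sketched in isolation. This is a minimal model, not Caplin's code: `store`, `voteIndex`, and their methods are hypothetical stand-ins for `latestMessages` and `indexedWeightStore`. The only point it makes is that the previous message is read before it is overwritten.

```go
package main

import "fmt"

// LatestMessage and voteIndex are hypothetical stand-ins for Caplin's types.
type LatestMessage struct {
	Slot uint64
	Root [32]byte
}

// voteIndex maps a block root to the set of validators currently voting for it.
type voteIndex map[[32]byte]map[uint64]struct{}

func (v voteIndex) remove(validator uint64, root [32]byte) {
	delete(v[root], validator) // deleting from a nil map is a no-op
	if len(v[root]) == 0 {
		delete(v, root)
	}
}

func (v voteIndex) add(validator uint64, root [32]byte) {
	if v[root] == nil {
		v[root] = make(map[uint64]struct{})
	}
	v[root][validator] = struct{}{}
}

type store struct {
	latest map[uint64]LatestMessage
	index  voteIndex
}

// setLatestMessage reads the previous message before overwriting it, so the
// stale vote is removed from the index rather than the one just written.
func (s *store) setLatestMessage(validator uint64, msg LatestMessage) {
	if old, ok := s.latest[validator]; ok {
		s.index.remove(validator, old.Root)
	}
	s.latest[validator] = msg
	s.index.add(validator, msg.Root)
}

func main() {
	s := &store{latest: map[uint64]LatestMessage{}, index: voteIndex{}}
	a, b := [32]byte{1}, [32]byte{2}
	s.setLatestMessage(7, LatestMessage{Slot: 1, Root: a})
	s.setLatestMessage(7, LatestMessage{Slot: 2, Root: b})
	fmt.Println(len(s.index[a]), len(s.index[b])) // 0 1
}
```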

  2. SSZ baseOffsetSSZ() returns the same value for GLOAS as for Fulu (raw/ssz.go, raw/params.go)

GLOAS adds 7 new state fields (some variable-length = new offsets) and replaces latestExecutionPayloadHeader with latestExecutionPayloadBid. The base offset must differ. EncodingSizeSSZ compensates via subtraction/addition, but this double-accounting only works if the base is correct, and it isn't, since it doesn't account for the new offset slots. SSZ encode/decode will silently corrupt state data.

  3. ExecutionPayloadEnvelope.DecodeSSZ doesn't initialize Payload/ExecutionRequests (epbs_payload.go)

func (e *ExecutionPayloadEnvelope) DecodeSSZ(buf []byte, version int) error {
    return ssz2.UnmarshalSSZ(buf, version,
        e.Payload, // nil if not constructed via NewExecutionPayloadEnvelope
        e.ExecutionRequests, // nil
        ...
    )
}

Nil pointer dereference during unmarshal if called on a zero-value or Clone()-created envelope.

  4. PayloadAttestationData.HashSSZ hashes booleans as uint64 (epbs_payload.go)

boolToUint64(p.PayloadPresent), // uint64(0 or 1)
boolToUint64(p.BlobDataAvailable),

SSZ spec defines boolean leaves as a single byte zero-padded to 32 bytes, not 8-byte LE uint64. This produces incorrect hash tree roots — cross-client verification will fail.
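
Whether this changes the root depends on how the value is turned into a leaf. The sketch below models only leaf construction, not Caplin's HashSSZ. It builds the 32-byte chunk for an SSZ boolean and for a little-endian uint64 holding 0 or 1. When each value occupies its own zero-padded chunk, the bytes coincide, so a mismatch would have to come from packing several values into one chunk or from a different byte order. The helper names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// boolChunk builds the 32-byte leaf for an SSZ boolean: one byte, zero-padded.
func boolChunk(b bool) [32]byte {
	var c [32]byte
	if b {
		c[0] = 1
	}
	return c
}

// uint64Chunk builds the 32-byte leaf for an SSZ uint64:
// eight little-endian bytes, zero-padded.
func uint64Chunk(v uint64) [32]byte {
	var c [32]byte
	binary.LittleEndian.PutUint64(c[:8], v)
	return c
}

func main() {
	for _, b := range []bool{false, true} {
		var v uint64
		if b {
			v = 1
		}
		// Arrays are comparable in Go, so == compares all 32 bytes.
		fmt.Println(b, boolChunk(b) == uint64Chunk(v))
	}
}
```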

  5. Ancestor returns parent-relationship payload status, not the block's own (utils.go)

getParentPayloadStatus(block) returns the payload status of the block relative to its parent, not the block's own status. This is used in isSupportingVote where the caller compares node.PayloadStatus == ancestor.PayloadStatus — the semantics are wrong.


HIGH — Incorrect behavior, races, goroutine leaks

  1. Race: OnBlock releases mutex then calls OnExecutionPayload (on_block.go)

Between f.mu.Unlock() and f.OnExecutionPayload (which re-acquires f.mu), another goroutine can modify fork choice state. The pending envelope error is also logged but not returned.

  2. TOCTOU race on sync.Map for PTC votes (on_payload_attestation_message.go, payload_vote.go)

existing, ok := f.payloadTimelinessVote.Load(blockRoot) // read
timelinessVotes[ptcIndex] = data.PayloadPresent // modify
f.payloadTimelinessVote.Store(blockRoot, timelinessVotes) // write

Two concurrent PTC messages for the same block root can each Load the same snapshot, modify different indices, and Store; the second write then overwrites the first's update. Lost PTC votes affect payload timeliness checks.

  3. pendingCond.Wait() deadlocks on context cancellation (execution_payload_service.go, bid_service, payload_attestation_service)

All three new services share this pattern:

select {
case <-ctx.Done(): // non-blocking check
    return
default:
}
s.pendingCond.Wait() // blocks indefinitely; nothing wakes it on ctx cancel

If context is cancelled while in Wait(), the goroutine hangs forever. Need a separate goroutine calling Signal() on cancel, or switch to channels.

  4. seenSidecar returns nil instead of ErrIgnore (data_column_sidecar_service.go)

Returning nil for already-seen sidecars treats them as successfully processed, causing re-gossip to the network instead of silent drop per spec.

  5. ReadEnvelopeFromDisk can panic on corrupt/malicious length (fork_graph_disk_fs.go)

f.sszBuffer = f.sszBuffer[:binary.BigEndian.Uint64(lengthBytes)]

If the length from disk exceeds cap(f.sszBuffer), this panics. Needs bounds check.
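
A minimal sketch of the bounds check, assuming an 8-byte big-endian length prefix as in the line quoted above. `readFramed` and `maxObjectSize` are hypothetical, and the cap value is illustrative rather than the one used in Erigon.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// maxObjectSize is an illustrative cap, not the value used in Erigon.
const maxObjectSize = 1 << 20

var errBadLength = errors.New("length prefix exceeds cap or available data")

// readFramed parses an 8-byte big-endian length prefix followed by the payload.
// scratch is reused when large enough and grown otherwise, so the reslice
// below can never exceed its capacity.
func readFramed(data, scratch []byte) ([]byte, error) {
	if len(data) < 8 {
		return nil, errBadLength
	}
	n := binary.BigEndian.Uint64(data[:8])
	if n > maxObjectSize || n > uint64(len(data)-8) {
		return nil, errBadLength
	}
	if uint64(cap(scratch)) < n {
		scratch = make([]byte, n)
	}
	scratch = scratch[:n]
	copy(scratch, data[8:8+n])
	return scratch, nil
}

func main() {
	data := append([]byte{0, 0, 0, 0, 0, 0, 0, 3}, "abc"...)
	out, err := readFramed(data, make([]byte, 0, 16))
	fmt.Println(string(out), err) // abc <nil>

	corrupt := []byte{0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff}
	_, err = readFramed(corrupt, nil)
	fmt.Println(err) // length prefix exceeds cap or available data
}
```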

  6. ExecutionPayloadBid.EncodingSizeSSZ missing 4-byte offset (epbs_payload.go)

BlobKzgCommitments is variable-length, so the fixed portion needs a 4-byte offset slot. The current size calculation adds BlobKzgCommitments.EncodingSizeSSZ() directly without the offset. SignedExecutionPayloadBid has the same issue.
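
The size rule behind this item: in an SSZ container, fixed-size fields are encoded inline, and each variable-length field contributes a 4-byte offset in the fixed part plus its data at the end. The sketch below does that accounting for an illustrative layout; `field` and `encodingSize` are hypothetical helpers and the field sizes are not the real ExecutionPayloadBid schema.

```go
package main

import "fmt"

const offsetSize = 4 // SSZ offsets are 4-byte little-endian uint32 values

// field describes one container field for size accounting (hypothetical helper).
type field struct {
	size     int  // encoded byte length of the field's value
	variable bool // true if the field is variable-length
}

// encodingSize returns the SSZ-encoded size of a container: fixed fields
// inline, variable fields as a 4-byte offset plus their data.
func encodingSize(fields []field) int {
	total := 0
	for _, f := range fields {
		if f.variable {
			total += offsetSize
		}
		total += f.size
	}
	return total
}

func main() {
	// Illustrative layout only: a 32-byte hash, a uint64, and a list of
	// two 48-byte commitments.
	fields := []field{
		{size: 32},
		{size: 8},
		{size: 2 * 48, variable: true},
	}
	fmt.Println(encodingSize(fields)) // 140
}
```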

  7. BuilderPendingPayment.HashSSZ passes &b.Weight (pointer) (epbs_builder.go)

Other types pass uint64 directly. If HashTreeRoot doesn't dereference, it hashes the pointer value, not the integer.

  8. Builder payment weight may double-count (operations.go)

Same validator in multiple aggregated attestations for the same slot can accumulate weight multiple times. The willSetNewFlag check is per-attestation, not per-validator globally.


MEDIUM — Correctness and robustness issues

  1. Wall-clock slot underflow (on_block.go) — If wallNow < f.genesisTime (clock skew), uint64 subtraction wraps, producing a huge slot number that disables the "too early" check.

  2. validateParentPayloadPath skips validation when parent bid is nil (payload_vote.go) — Should check version, not just nil, to distinguish pre-GLOAS boundary from malformed GLOAS blocks.

  3. fetchParentEnvelopes can block chain_tip_sync for 150 seconds (chain_tip_sync.go) — 10 retries × 15s timeout with no total timeout.

  4. No bound on pendingGloasSidecars map size (data_column_sidecar_service.go) — Attacker flooding sidecars for non-existent blocks grows this unboundedly for 24s.

  5. pendingBidKey allows only one pending bid per (builder, slot) (execution_payload_bid_service.go) — Higher-value replacement bids are silently dropped.

  6. ExecutionPayloadEnvelope.Clone() is shallow (epbs_payload.go) — Payload and ExecutionRequests are shared with original; mutations affect both.

  7. updateLatestMessagesGloas vs updateLatestMessagesPreGloas at fork boundary (on_attestation.go) — LatestMessage serves double duty (Epoch for pre-GLOAS, Slot for GLOAS). Late pre-GLOAS attestation after fork
    activation would see stale Epoch=0.
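
For the wall-clock underflow in item 1 of this list, a guarded subtraction is enough to keep the slot sane under clock skew. This is a sketch with a hypothetical helper, `currentSlot`; clamping to slot 0 is one possible policy, and returning an error instead would be equally reasonable.

```go
package main

import "fmt"

// currentSlot returns the wall-clock slot, clamping to 0 when the clock is
// behind genesis instead of letting the uint64 subtraction wrap around.
func currentSlot(now, genesisTime, secondsPerSlot uint64) uint64 {
	if secondsPerSlot == 0 || now < genesisTime {
		return 0
	}
	return (now - genesisTime) / secondsPerSlot
}

func main() {
	fmt.Println(currentSlot(1000, 1024, 12)) // 0 (clock is behind genesis)
	fmt.Println(currentSlot(1060, 1000, 12)) // 5
}
```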


LOW / Cleanup — Must fix before merge

  1. Debug logging in production code:
  • log.Info("[DEBUG] Block matched", ...) in backward_beacon_downloader.go
  • logger.Info("[DEBUG] Backward sync starting", ...) in clstages.go

These fire on every block/sync start — noisy in production.

  2. Commented-out code blocks in forward_sync.go with bug notes — remove or move to issue.

  3. .claude/skills/ files (caplin-gloas-audit/SKILL.md, launch-epbs-devnet-0/SKILL.md) — These are developer-local Claude Code configuration files and should not be committed to the repo.

  4. IsValidDepositSignature swallows the BLS error: it returns false, nil where it should return false, err.

  5. Missing copyright headers in payload_vote.go, types.go, weight_store_indexed.go.

@domiwei domiwei force-pushed the feature/caplin_gloas branch from 0e0249c to 7a293b8 Compare April 8, 2026 08:30
@yperbasis yperbasis removed the dependencies (Pull requests that update a dependency file) label Apr 17, 2026
domiwei added a commit that referenced this pull request Apr 18, 2026
…cs, and spec deviations

Squash of fix/gloas-review-items addressing PR #18956 review findings:

- Fix pendingCond.Wait() deadlock on context cancellation in gossip services
- Return ErrIgnore (not nil) for already-seen data column sidecars
- Fix SSZ EncodingSizeSSZ +4 offsets for variable-length ePBS types
- Fix DecodeSSZ nil pointer deref via beaconCfg propagation
- Fix getPayloadStatusTiebreaker to match EIP-7732 spec
- Fix data race in OnPayloadAttestationMessage (GetState alwaysCopy=true)
- Filter unsolicited envelope responses by requested root set
- Return ErrIgnore when block not found for pending envelopes
- Deep copy BlobKzgCommitments in ExecutionPayloadBid.Copy()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@domiwei domiwei force-pushed the feature/caplin_gloas branch 2 times, most recently from aaf5552 to bb8d2d3 Compare April 22, 2026 04:28
domiwei added a commit that referenced this pull request Apr 24, 2026
…compliance

Address verified issues from PR #18956 review and Beacon API spec audit:

- forkchoice: eliminate TOCTOU race in OnBlock pending envelope processing by
  extracting applyEnvelopeLocked and calling it while f.mu is held, deferring
  DB index writes to after unlock to avoid deadlock
- fork_graph: add bounds checking in readBeaconStateFromDisk and
  ReadEnvelopeFromDisk to prevent panic on corrupt file lengths, with automatic
  buffer growth when needed
- beacon/handler: add dependent_root and execution_optimistic to PTC duties
  response, matching attester/proposer duty patterns
- beacon/handler: change POST pool/payload_attestations to accept array input
  with per-item error tracking, matching existing pool endpoint semantics
- beacon/handler: add WithOptimistic and WithFinalized metadata to GET
  execution_payload_envelope response
- beaconevents: fix execution_payload_available SSE event to emit flat
  {slot, block_root} per spec instead of full envelope with version wrapper

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@domiwei domiwei requested a review from mriccobene as a code owner April 25, 2026 04:35
@domiwei domiwei force-pushed the feature/caplin_gloas branch from 76bed42 to 6f33634 Compare April 28, 2026 16:41
domiwei added a commit that referenced this pull request Apr 28, 2026
…cs, and spec deviations

Squash of fix/gloas-review-items addressing PR #18956 review findings:

- Fix pendingCond.Wait() deadlock on context cancellation in gossip services
- Return ErrIgnore (not nil) for already-seen data column sidecars
- Fix SSZ EncodingSizeSSZ +4 offsets for variable-length ePBS types
- Fix DecodeSSZ nil pointer deref via beaconCfg propagation
- Fix getPayloadStatusTiebreaker to match EIP-7732 spec
- Fix data race in OnPayloadAttestationMessage (GetState alwaysCopy=true)
- Filter unsolicited envelope responses by requested root set
- Return ErrIgnore when block not found for pending envelopes
- Deep copy BlobKzgCommitments in ExecutionPayloadBid.Copy()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
domiwei and others added 18 commits May 12, 2026 16:20
The bal-devnet-3 image does not implement the full GLOAS (EIP-7732)
BeaconBlockBody encoding, causing Caplin to reject all gossipped
beacon blocks after the GLOAS fork with SSZ decode errors. Upgrade
to glamsterdam-devnet-2 which supports the v1.7.0-alpha.7 spec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ess tests

common.Hash is [32]byte — Go's encoding/json does not treat zero-value
arrays as empty for omitempty, so the GLOAS-only ExecutionBlockHash field
leaked into pre-GLOAS JSON responses. Add MarshalJSON() that matches the
SSZ schema version gating.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
recoverMissingEnvelopes was guarded by SupportInsertion(), which
returns false for Engine API (HTTP) connections where chainRW is nil.
This caused Caplin to never recover missed gossip envelopes in kurtosis
devnets, leading to a stuck head once gaps accumulated.

Move recoverMissingEnvelopes outside the SupportInsertion guard — it
only needs P2P and fork choice, not local EL insertion. Also fix the
backward walk to continue past the first envelope-present parent all
the way to the finalized slot, and check the HEAD block itself for a
missing envelope.

Add a 3-CL kurtosis devnet config (Lighthouse + Prysm + Caplin) used
to validate the fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…w underflow

Two fixes for GLOAS block production with skipped slots:

1. aggregatePayloadAttestations used baseBlockSlot (the actual parent
   block slot) instead of targetSlot-1 (state.slot - 1) when filtering
   PTC votes from the pool. With skipped slots the two differ, producing
   blocks with payload attestations that fail the spec check
   data.slot + 1 == state.slot. Both Lighthouse and Prysm use
   state.slot - 1; align Caplin to match.

2. getPTCFromWindow panicked with index out of range [-5] when the
   requested slot was 2+ epochs behind the state epoch, causing uint64
   underflow in the index arithmetic. Add an explicit epoch range check
   before the arithmetic, guard stateEpoch-1 against epoch 0 underflow,
   and fix the bounds check to compare unsigned values. GetPTC now falls
   back to ComputePTC when the slot is outside the ptcWindow range.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
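
The second fix amounts to checking the range in unsigned arithmetic before computing an index. The sketch below assumes a window that starts at the first slot of the epoch before the state epoch (or at epoch 0); that layout, the helper `ptcIndex`, and the window length of 96 are illustrative assumptions, not Caplin's actual ptcWindow structure.

```go
package main

import "fmt"

// ptcIndex returns the position of slot inside a window that starts at the
// first slot of the epoch before stateEpoch (or at epoch 0), and false when
// the slot falls outside the window. All comparisons are on unsigned values,
// and no subtraction happens before its operands are known to be ordered.
func ptcIndex(slot, stateEpoch, slotsPerEpoch uint64, windowLen int) (int, bool) {
	var startEpoch uint64
	if stateEpoch > 0 { // guard stateEpoch-1 against epoch 0
		startEpoch = stateEpoch - 1
	}
	startSlot := startEpoch * slotsPerEpoch
	if slot < startSlot {
		return 0, false // too old: the caller would fall back to recomputing
	}
	idx := slot - startSlot
	if idx >= uint64(windowLen) {
		return 0, false // beyond the end of the window
	}
	return int(idx), true
}

func main() {
	const slotsPerEpoch, windowLen = 32, 96
	fmt.Println(ptcIndex(288, 10, slotsPerEpoch, windowLen)) // 0 true
	fmt.Println(ptcIndex(250, 10, slotsPerEpoch, windowLen)) // 0 false
	fmt.Println(ptcIndex(384, 10, slotsPerEpoch, windowLen)) // 0 false
	fmt.Println(ptcIndex(5, 0, slotsPerEpoch, windowLen))    // 5 true
}
```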
…lapse in GLOAS

The GLOAS multi-client testnet (Caplin + Lighthouse) collapsed within
3 minutes due to a chain of gossipsub scoring failures:

1. data_column_sidecars_by_range rate limit was set to 128 tokens, but
   a single legitimate PeerDAS sync request costs ~991 tokens (8 slots
   × 124 columns). Every column sync request was denied, triggering a
   30-second peer punishment — peers could never catch up, causing fork
   divergence and cascading attestation/proposer REJECTs.

2. Transient errors from OnExecutionPayload (block state not yet
   available, block not in fork graph) were returned as plain errors
   instead of wrapping ErrIgnore, causing gossip REJECT + BanPeer for
   execution_payload messages that simply arrived before their block.

3. execution_payload_service.ProcessMessage did not propagate ErrIgnore
   or ErrEIP7594ColumnDataNotAvailable from forkchoice, converting all
   errors into REJECT.

4. maxScore() in gossip scoring omitted the four GLOAS topic weights
   (executionPayload, executionPayloadBid, payloadAttestation,
   proposerPreferences), making InvalidMessageDeliveriesWeight
   disproportionately harsh.

Fix: raise data column rate limit to 16384 tokens (proportional to
blockRate × NumberOfColumns), wrap transient forkchoice errors with
ErrIgnore, propagate ignore/column-unavailable errors in the service
layer, and include all GLOAS weights in maxScore().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- cl/network: add HTTP beacon API fallback for backward and forward
  block downloaders when P2P blocks_by_range fails
- cl/network: fix prevBatchTopBlock assignment direction in backward
  download (high→low means next batch needs responses[0])
- cl/network: add fork boundary capping to prevent cross-fork range
  requests rejected by peers
- cl/network: skip BanPeer for HTTP-fallback pseudo peer ID
- cl/network: split DB error from zero-hash in canSkipSlot to avoid
  silently treating DB failures as legitimate GLOAS EMPTY blocks
- cl/forkchoice: cap finalizedSlot to anchorSlot after checkpoint sync
  so Ancestor() stays within the fork graph horizon
- cl/forkchoice: initialize currentStateBlockRoot from anchor in fork
  graph
- cl/stages: allow ForwardSync with 0 peers (HTTP fallback works
  without P2P); handle ErrNotFinalizedDescendant gracefully
- cl/stages: fix history download finished condition when
  destinationSlotForEL is MaxUint64
- cl/checkpoint_sync: use actual state_id from URL instead of
  hardcoded "finalized" for envelope fetch
- execution/engineapi: gate BAL requirement on !IsEIPDisabled(7928) so
  Engine API stays consistent with builder/validator paths
- cl/network: revert diagnostic log levels back to Debug/Trace

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… checkpoint promotion guard

Move ForkChoiceUpdate out of per-batch insertBatch into a single call after
all batches are flushed, preventing the EL's in-memory TD overlay from being
destroyed mid-sync.  Remove the now-dead syncBackLoop throttle field and its
plumbing through ClStagesCfg.

Guard unrealized checkpoint promotion in on_tick_per_slot behind a
highestSeen proximity check so forward-syncing nodes don't jump the
finalized checkpoint past the blocks the fork graph actually contains.

Fix uint64 underflow in ChainReorgData.Depth when headSlot < currentSlot.

Track highestSeenRoot alongside highestSeen so status advertisements use a
consistent head root/slot pair (avoids Lighthouse "useless peer" penalties).

Auto-select GloasVersion FCU for post-GLOAS blocks based on SlotNumber
presence in the header.

Clamp lowestBlockToReach to 1 in history download progress to avoid
off-by-one stall at N-1/N.

Remove SupportBackfilling gate from history download so non-mainnet chains
(e.g. devnets) can backfill EL blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pure whitespace — re-indent the P2P probing block to match the surrounding
brace level.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Demote 49 log statements across 16 files that fire on hot paths
(per-slot, per-attestation, per-gossip-message) during normal GLOAS
operation. Info/Warn → Debug for routine subnet toggles, HTTP fallback
batches, builder nil-guards, and block publishing. Debug → Trace for
per-block forkchoice, payload attestation (512 PTC members/slot), and
execution payload service logs. Remove 6 previousStateRoot tracking
Debug logs that were leftover developer instrumentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cleanup

- Add missing GPL copyright headers to 7 new GLOAS files
- Fix IsValidDepositSignature to propagate BLS verify error instead
  of swallowing it (return false, err instead of false, nil)
- Remove GLOAS-specific .claude/skills/ from tracking and add to
  .gitignore to prevent re-adding

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 30s context timeout to fetchParentEnvelopes to cap total retry
  budget (was unbounded at ~150s worst case)
- Add pendingGloasSidecarCount with 4096 cap to prevent unbounded
  sync.Map growth from attacker-crafted sidecars
- Add nil guard for payment.Withdrawal in ProcessBuilderPendingPayments
  to prevent nil append if a BuilderPendingPayment has nil Withdrawal

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move gloas-caplin-mixed out of test-kurtosis-assertoor.yml (ci-gate) and
into a new test-kurtosis-gloas.yml that triggers on pull_request and
workflow_dispatch but is not called by ci-gate. Add gloas-three-cl-mixed
(Lighthouse + Prysm + Caplin) to the new workflow as well.

This keeps GLOAS e2e coverage visible on every PR without blocking merge
for suites that use devnet-only images.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the rm -rf lines for gloas and eip7732 fixture directories from
the test-fixtures Makefile target so that GLOAS spectests can actually
run against consensus-specs v1.7.0-alpha.7 test vectors.

Fix the IsBuilderIndex comment: BuilderIndexFlag is 1<<40 (bit 40),
not the most significant bit (bit 63).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update the cl_mainnet fixture manifest to consensus-specs v1.7.0-alpha.7
which includes GLOAS (EIP-7732) test vectors. The previous v1.6.0-alpha.6
did not match the alpha.7-based implementation, causing SSZ static test
failures for GLOAS types.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… handling

- Checkpoint sync: auto-normalize URLs missing /eth/ path, validate
  Content-Type to reject HTML responses, use Eth-Consensus-Version header
- Backward sync: add envelope failure recovery with HTTP beacon API
  fallback, root-based block fetch for fork choice divergence
- Forward sync: respect checkpoint anchor boundary via minSlot, clear
  error state between block iterations
- Envelope requests: limit by-range to single attempt (peers return EOF
  on devnets), increase retry backoff to 300ms
- Fork graph: fix anchor state root computation to distinguish fresh
  checkpoint sync vs restart from disk
- P2P: register mplex muxer for peer compatibility

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ld envelopes

verifyEnvelopeBuilderSignature and verifyExecutionPayloadEnvelopeSignature
both skipped BLS verification when BuilderIndex == SELF_BUILD and
Signature == InfiniteSignature. This bypass was reachable from the gossip
path, allowing an attacker to forge an envelope that Erigon accepts while
spec-compliant clients reject it, causing fork-choice weight divergence.

Fix: remove the content-based bypass entirely. Introduce a dedicated
ApplyLocalSelfBuildEnvelope method for the local block-production path
that uses DefaultMachine (FullValidation=false) to skip BLS checks.
Gossip/peer/REST paths always go through OnExecutionPayload which now
unconditionally verifies BLS signatures.

To handle the common timing where the envelope arrives before OnBlock
completes, local self-build envelopes queue into a separate
pendingLocalSelfBuildEnvelopes cache (not the general pendingEnvelopes).
OnBlock replay dispatches to the correct path based on which queue holds
the entry, not by inspecting envelope contents.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@domiwei domiwei force-pushed the feature/caplin_gloas branch from 7cd6f8c to 16f7279 Compare May 12, 2026 16:26
@domiwei domiwei added this pull request to the merge queue May 13, 2026
Merged via the queue into main with commit f3ad70f May 13, 2026
60 checks passed
@domiwei domiwei deleted the feature/caplin_gloas branch May 13, 2026 07:54
lystopad pushed a commit to lmorett1/erigon that referenced this pull request May 19, 2026
…ech#21228)

## Summary

Fixes erigontech#21272 — Caplin gets stuck with `baseState not found in graph` on
mainnet after the GLOAS merge (erigontech#18956).

**Root cause**: The GLOAS PR added `maxSSZBufferSize = 128 MB` in
`readBeaconStateFromDisk`. Mainnet beacon state with ~1.5M validators is
~327 MB after decompression, so every state read was rejected as
"corrupt". This made `getCheckpointState` always fail.

Three additional issues compounded the problem:

- **Duplicate `InitBeaconState`**: `DecodeSSZ` already calls
`InitBeaconState` internally (`ssz.go:40`). The explicit second call
doubled the decode time by rebuilding the pubkey index for ~1.5M
validators.

- **Lock contention**: `getHead` held `f.mu.Lock` while calling
`getCheckpointState`, which reads and decodes a large state from disk.
This blocked the `OnTick` goroutine for the entire duration.

- **Missing fallback**: After checkpoint sync, the justified checkpoint
root may predate the anchor and not exist in the fork graph. Before
GLOAS this was masked because forward sync always ran to completion (no
stale timeout).

## Changes

- Replace 128 MB `maxSSZBufferSize` with a 1 GiB `maxSSZObjectSize` cap
in `readBeaconStateFromDisk` / `ReadEnvelopeFromDisk` — ~3x headroom
over current mainnet ~327 MB states while still bounding `make([]byte,
length)` against a corrupt length field
- Remove duplicate `InitBeaconState()` in `readBeaconStateFromDisk`
- Move `getCheckpointState` before `f.mu.Lock()` in `getHead`
- Fall back to anchor state in `getCheckpointState` when checkpoint root
is not in graph

## Test plan

- Verified on `dev-bm-e3-sepolia-n1` (mainnet, fresh datadir):
  - Before: `baseState not found in graph` on every ForkChoice round
  - After: `Imported chain segment` every ~12s, ForkChoice in ~2.9s
- Existing unit tests pass
(`TestGetState_InfiniteLoopOnMissingStateFile`, `TestForkGraphInDisk`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com>
sudeepdino008 added a commit that referenced this pull request May 21, 2026
The post-GLOAS rewrite (#18956, f3ad70f) restored ancestor-descent
for the finalized check but left two justified-side deviations from
the spec's filter_block_tree / get_voting_source:

1. Voting source selection used unrealized unconditionally with a
   realized fallback. Spec get_voting_source branches strictly on
   current_epoch > block_epoch — unrealized only for prior-epoch
   blocks; current-epoch blocks must use the block's realized
   current_justified_checkpoint.

2. correct_justified used voting_source.epoch >= justified.epoch with
   an is_previous_epoch_justified-gated +2 fallback. Spec is the
   three flat disjuncts: epoch == GENESIS, voting_source.epoch ==
   justified.epoch, or voting_source.epoch + 2 >= current_epoch.

The finalized ancestor-descent check from f3ad70f is preserved
unchanged.

Observed effect on bloatnet: epoch-boundary head regression that
caused ~30-block execution unwinds every ~6 minutes is eliminated.
Validated by 1 hour at chain tip on bloatnet covering 5+ epoch
boundaries — zero unwinds, zero filterTree rejects, Caplin lagSlots=0
on all 172 forkchoice updates in the window.

Refs: #21301, #21310
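
The two spec rules named in this commit message are small enough to write out. The sketch below transcribes them into Go under simplified, hypothetical types (`Checkpoint`, plain epoch arguments); it follows the consensus-specs `get_voting_source` and the `correct_justified` condition in `filter_block_tree`, and is not the code from the commit.

```go
package main

import "fmt"

// Checkpoint is a simplified stand-in for the spec's Checkpoint container.
type Checkpoint struct {
	Epoch uint64
	Root  [32]byte
}

const genesisEpoch = 0

// votingSource follows get_voting_source: blocks from a prior epoch use the
// unrealized justification, current-epoch blocks use the realized one.
func votingSource(currentEpoch, blockEpoch uint64, unrealized, realized Checkpoint) Checkpoint {
	if currentEpoch > blockEpoch {
		return unrealized
	}
	return realized
}

// correctJustified is the three-way disjunction from filter_block_tree.
func correctJustified(storeJustifiedEpoch, votingSourceEpoch, currentEpoch uint64) bool {
	return storeJustifiedEpoch == genesisEpoch ||
		votingSourceEpoch == storeJustifiedEpoch ||
		votingSourceEpoch+2 >= currentEpoch
}

func main() {
	unrealized := Checkpoint{Epoch: 11}
	realized := Checkpoint{Epoch: 10}
	fmt.Println(votingSource(12, 11, unrealized, realized).Epoch) // 11
	fmt.Println(votingSource(12, 12, unrealized, realized).Epoch) // 10

	fmt.Println(correctJustified(10, 10, 12)) // true: epochs match
	fmt.Println(correctJustified(10, 9, 11))  // true: 9 + 2 >= 11
	fmt.Println(correctJustified(10, 8, 12))  // false
}
```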
pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request Jun 9, 2026
…LOAS (erigontech#21698)

## Problem

Lido Hoodi validators lost attestation score after switching to
`release/3.5`. They missed **head votes** at ~30–60% per epoch (≈2× a
random network sample) while **target and source votes were always
correct** — the attestations were produced, published, and included
on-chain on time, but voted for a **stale head**.

## Root cause

The GLOAS merge (erigontech#18956) added `indexedWeightStore`
(`cl/phase1/forkchoice/weight_store_indexed.go`). It is instantiated
unconditionally, and its `IndexVote`/`RemoveVote` are called per
validator index on **every** attestation via `setLatestMessage`,
regardless of fork.

However its results are only consumed by GLOAS `get_head`. Pre-GLOAS
`get_head` (and `timing.go`) use the non-indexed `weightStore`, and
`GetIndexedWeightStore()` has no callers. So on pre-GLOAS chains the
index is maintained but never read.

On a high-validator-count network (Hoodi, ~1.2M active validators) this
maintenance was the single largest CPU consumer (`RemoveVote`, ~15% of
CPU, fresh slice allocation per call) plus a large GC load — all under
the fork-choice write lock. The lock contention delayed `OnBlock` (block
import) and `get_head` past the attestation deadline, so the head served
to the validator client was stale → wrong head votes.

## Fix

Gate the indexed-store maintenance to the GLOAS vote path only, via an
explicit `maintainIndexedVotes` flag on `setLatestMessage` (mirroring
the existing `updateLatestMessages` pre-GLOAS/GLOAS dispatch).

## Validation (live Hoodi node, before vs after rebuild)

- Node CPU (operator Grafana): ~7.5% → <2%.
- 60s CPU profile: total samples 148.9% → 41.6%;
`indexedWeightStore.RemoveVote` (was #1 hotspot) and the GC storm both
gone.
- Head-at-attestation-deadline staleness: 64% → 4%.
- Validator head-vote misses: 40% → 0% in the first fully post-fix epoch
(network sample ~23% in the same epoch); target/source unaffected
throughout.

Affects all pre-GLOAS networks on this branch (mainnet/Sepolia/Gnosis),
with impact scaling by validator count.

## Tests

- `TestPreGloasDoesNotMaintainIndexedWeightStore` — fails before the fix
(the pre-GLOAS path populates the index), passes after.
- `TestGloasMaintainsIndexedWeightStore` — locks in that the GLOAS path
still indexes votes.
- `go test ./cl/phase1/forkchoice/...`, `make lint`, `make erigon` all
clean.

## Follow-up

`indexedWeightStore` is currently unused even on GLOAS (`get_head` uses
the non-indexed store). The Caplin team should either wire it into GLOAS
`get_head` or remove it; tracking separately.

Labels

Caplin (Caplin: Consensus Layer, Beacon API)
Glamsterdam (https://eips.ethereum.org/EIPS/eip-7773)


4 participants