cl: implement GLOAS (EIP-7732 ePBS) for Caplin #18956
yperbasis left a comment
CRITICAL — Consensus-breaking or will crash
- setLatestMessage reads new value instead of old for indexedWeightStore removal (on_attestation.go)
f.latestMessages.set(int(index), message) // writes FIRST
if oldMessage, has := f.latestMessages.get(int(index)); has { // reads AFTER → gets new value
f.indexedWeightStore.RemoveVote(index, oldMessage.Root) // removes NEW vote root
}
f.indexedWeightStore.IndexVote(index, message) // adds new vote back
The old message must be read before the write. As written, RemoveVote always removes the just-written root, then IndexVote adds it back — the old vote is never removed. Over time directVotes accumulates stale
entries, producing incorrect fork choice weights.
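A minimal sketch of the read-before-write ordering suggested above. The types here (`LatestMessage`, `voteIndex`, `forkChoice`) are simplified stand-ins invented for illustration and are not Caplin's real structures; only the ordering of the three steps is the point.

```go
package main

import "fmt"

// LatestMessage, voteIndex and forkChoice are hypothetical, simplified
// stand-ins; they are not Caplin's actual types.
type LatestMessage struct {
	Slot uint64
	Root [32]byte
}

type voteIndex struct {
	votes map[[32]byte]map[uint64]struct{} // root -> set of validator indices
}

func (v *voteIndex) RemoveVote(idx uint64, root [32]byte) {
	delete(v.votes[root], idx) // deleting from a nil map is a no-op
	if len(v.votes[root]) == 0 {
		delete(v.votes, root)
	}
}

func (v *voteIndex) IndexVote(idx uint64, msg LatestMessage) {
	if v.votes[msg.Root] == nil {
		v.votes[msg.Root] = make(map[uint64]struct{})
	}
	v.votes[msg.Root][idx] = struct{}{}
}

type forkChoice struct {
	latest map[uint64]LatestMessage
	index  *voteIndex
}

func newForkChoice() *forkChoice {
	return &forkChoice{
		latest: make(map[uint64]LatestMessage),
		index:  &voteIndex{votes: make(map[[32]byte]map[uint64]struct{})},
	}
}

func (f *forkChoice) setLatestMessage(idx uint64, msg LatestMessage) {
	// 1. Read the previous message BEFORE overwriting it.
	if old, ok := f.latest[idx]; ok {
		f.index.RemoveVote(idx, old.Root) // removes the OLD vote root
	}
	// 2. Only now store the new message.
	f.latest[idx] = msg
	// 3. Index the new vote.
	f.index.IndexVote(idx, msg)
}

// rootsAfterRevote reports how many distinct roots remain indexed after
// a single validator votes twice for different roots.
func rootsAfterRevote() int {
	f := newForkChoice()
	f.setLatestMessage(7, LatestMessage{Slot: 1, Root: [32]byte{0xaa}})
	f.setLatestMessage(7, LatestMessage{Slot: 2, Root: [32]byte{0xbb}})
	return len(f.index.votes)
}

func main() {
	fmt.Println(rootsAfterRevote())
}
```

With this ordering a re-vote leaves one indexed root; with the write-first ordering quoted above, the stale root would stay in the index alongside the new one.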
- SSZ baseOffsetSSZ() returns the same value for GLOAS as for Fulu (raw/ssz.go, raw/params.go)
GLOAS adds 7 new state fields (some variable-length = new offsets) and replaces latestExecutionPayloadHeader with latestExecutionPayloadBid. The base offset must differ. The EncodingSizeSSZ compensates via
subtraction/addition, but this double-accounting only works if the base is correct — and it isn't since it doesn't account for new offset slots. SSZ encode/decode will silently corrupt state data.
- ExecutionPayloadEnvelope.DecodeSSZ doesn't initialize Payload/ExecutionRequests (epbs_payload.go)
func (e *ExecutionPayloadEnvelope) DecodeSSZ(buf []byte, version int) error {
return ssz2.UnmarshalSSZ(buf, version,
e.Payload, // nil if not constructed via NewExecutionPayloadEnvelope
e.ExecutionRequests, // nil
...
)
}
Nil pointer dereference during unmarshal if called on a zero-value or Clone()-created envelope.
- PayloadAttestationData.HashSSZ hashes booleans as uint64 (epbs_payload.go)
boolToUint64(p.PayloadPresent), // uint64(0 or 1)
boolToUint64(p.BlobDataAvailable),
SSZ spec defines boolean leaves as a single byte zero-padded to 32 bytes, not 8-byte LE uint64. This produces incorrect hash tree roots — cross-client verification will fail.
- Ancestor returns parent-relationship payload status, not the block's own (utils.go)
getParentPayloadStatus(block) returns the payload status of the block relative to its parent, not the block's own status. This is used in isSupportingVote where the caller compares node.PayloadStatus == ancestor.PayloadStatus — the semantics are wrong.
HIGH — Incorrect behavior, races, goroutine leaks
- Race: OnBlock releases mutex then calls OnExecutionPayload (on_block.go)
Between f.mu.Unlock() and f.OnExecutionPayload (which re-acquires f.mu), another goroutine can modify fork choice state. The pending envelope error is also logged but not returned.
- TOCTOU race on sync.Map for PTC votes (on_payload_attestation_message.go, payload_vote.go)
existing, ok := f.payloadTimelinessVote.Load(blockRoot) // read
timelinessVotes[ptcIndex] = data.PayloadPresent // modify
f.payloadTimelinessVote.Store(blockRoot, timelinessVotes) // write
Two concurrent PTC messages for the same block root can each Load the same snapshot, modify different indices, and Store — second write overwrites the first's update. Lost PTC votes affect payload timeliness
checks.
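One possible way to close this window, sketched under assumptions: replace the `sync.Map` Load/modify/Store sequence with a plain map guarded by a mutex, so the whole read-modify-write is atomic. The `ptcVotes` type and its methods are hypothetical names for this sketch, not the PR's code.

```go
package main

import (
	"fmt"
	"sync"
)

// ptcVotes is a hypothetical replacement for the sync.Map-based store:
// one mutex makes each read-modify-write a single critical section.
type ptcVotes struct {
	mu    sync.Mutex
	size  int // number of PTC members per block root (immutable)
	votes map[[32]byte][]bool
}

func newPTCVotes(size int) *ptcVotes {
	return &ptcVotes{size: size, votes: make(map[[32]byte][]bool)}
}

// Set records one member's vote. It returns false for an out-of-range index.
func (p *ptcVotes) Set(root [32]byte, ptcIndex int, present bool) bool {
	if ptcIndex < 0 || ptcIndex >= p.size {
		return false
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	v, ok := p.votes[root]
	if !ok {
		v = make([]bool, p.size)
		p.votes[root] = v
	}
	v[ptcIndex] = present
	return true
}

// CountPresent returns how many members voted "payload present" for root.
func (p *ptcVotes) CountPresent(root [32]byte) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	n := 0
	for _, present := range p.votes[root] {
		if present {
			n++
		}
	}
	return n
}

// concurrentVotes has every member vote from its own goroutine and
// returns the number of votes that survived.
func concurrentVotes(members int) int {
	p := newPTCVotes(members)
	root := [32]byte{1}
	var wg sync.WaitGroup
	for i := 0; i < members; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			p.Set(root, i, true)
		}(i)
	}
	wg.Wait()
	return p.CountPresent(root)
}

func main() {
	fmt.Println(concurrentVotes(512))
}
```

Because no goroutine ever works on a private snapshot, no vote can be overwritten by a concurrent writer.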
- pendingCond.Wait() deadlocks on context cancellation (execution_payload_service.go, bid_service, payload_attestation_service)
All three new services share this pattern:
select {
case <-ctx.Done(): return // non-blocking check
default:
}
s.pendingCond.Wait() // blocks indefinitely — nothing wakes it on ctx cancel
If context is cancelled while in Wait(), the goroutine hangs forever. Need a separate goroutine calling Signal() on cancel, or switch to channels.
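A sketch of the first option (waking the waiter on cancel), assuming Go 1.21+ for `context.AfterFunc`. The `queue` type, `push`, and `waitForWork` are hypothetical names; the real services would keep their own pending structures.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// queue is a hypothetical stand-in for a service's pending-message list.
type queue struct {
	mu      sync.Mutex
	cond    *sync.Cond
	pending []int
}

func newQueue() *queue {
	q := &queue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

func (q *queue) push(item int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending = append(q.pending, item)
	q.cond.Signal()
}

// waitForWork blocks until an item is pending or ctx is cancelled.
func (q *queue) waitForWork(ctx context.Context) (int, error) {
	// Wake every waiter when ctx is cancelled. Taking the lock before
	// Broadcast ensures the wake-up cannot slip in between the ctx.Err()
	// check and cond.Wait() below.
	stop := context.AfterFunc(ctx, func() {
		q.mu.Lock()
		defer q.mu.Unlock()
		q.cond.Broadcast()
	})
	defer stop()

	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.pending) == 0 {
		if err := ctx.Err(); err != nil {
			return 0, err
		}
		q.cond.Wait()
	}
	item := q.pending[0]
	q.pending = q.pending[1:]
	return item, nil
}

// waitWithCancel waits on an empty queue and cancels shortly afterwards.
func waitWithCancel() error {
	q := newQueue()
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel()
	}()
	_, err := q.waitForWork(ctx)
	return err
}

// pushThenWait shows the normal path: a queued item is returned at once.
func pushThenWait() (int, error) {
	q := newQueue()
	q.push(42)
	return q.waitForWork(context.Background())
}

func main() {
	fmt.Println(waitWithCancel())
	fmt.Println(pushThenWait())
}
```

The context check sits inside the wait loop rather than before it, so a waiter woken by the cancel broadcast returns instead of going back to sleep.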
- seenSidecar returns nil instead of ErrIgnore (data_column_sidecar_service.go)
Returning nil for already-seen sidecars treats them as successfully processed, causing re-gossip to the network instead of silent drop per spec.
- ReadEnvelopeFromDisk can panic on corrupt/malicious length (fork_graph_disk_fs.go)
f.sszBuffer = f.sszBuffer[:binary.BigEndian.Uint64(lengthBytes)]
If the length from disk exceeds cap(f.sszBuffer), this panics. Needs bounds check.
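A sketch of the bounds check, under assumptions: `sliceForLength` and `maxObjectSize` are hypothetical names, and the cap value is chosen only for the example. The idea is to reject implausible lengths and grow the buffer instead of reslicing past its capacity.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// maxObjectSize is a hypothetical cap for this sketch; a real limit must
// sit comfortably above the largest legitimate object.
const maxObjectSize = 1 << 20

var errLengthTooLarge = errors.New("length prefix exceeds maximum object size")

// sliceForLength returns a slice of exactly the length encoded in
// lengthBytes, reusing buf when it is large enough.
func sliceForLength(buf, lengthBytes []byte) ([]byte, error) {
	if len(lengthBytes) < 8 {
		return nil, errors.New("short length prefix")
	}
	n := binary.BigEndian.Uint64(lengthBytes)
	if n > maxObjectSize {
		return nil, errLengthTooLarge // corrupt or malicious length
	}
	if n > uint64(cap(buf)) {
		buf = make([]byte, n) // grow instead of panicking on reslice
	}
	return buf[:n], nil
}

// prefix encodes n as an 8-byte big-endian length prefix.
func prefix(n uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, n)
	return b
}

func main() {
	buf := make([]byte, 0, 4)
	out, err := sliceForLength(buf, prefix(16))
	fmt.Println(len(out), err)
	_, err = sliceForLength(buf, prefix(1<<40))
	fmt.Println(err)
}
```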
- ExecutionPayloadBid.EncodingSizeSSZ missing 4-byte offset (epbs_payload.go)
BlobKzgCommitments is variable-length, so the fixed portion needs a 4-byte offset slot. Current size calculation adds BlobKzgCommitments.EncodingSizeSSZ() directly without the offset. Same issue in
SignedExecutionPayloadBid.
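To illustrate the offset accounting, here is a toy container with two fixed fields and one variable-length list. `toyBid` is a hypothetical type for this sketch and does not mirror the real ExecutionPayloadBid layout; it only shows that the fixed portion must include a 4-byte offset per variable-length field.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const offsetSize = 4 // SSZ uses 4-byte offsets for variable-length fields

// toyBid is a hypothetical container: two uint64 fields plus one
// variable-length list of 48-byte commitments.
type toyBid struct {
	Slot        uint64
	Value       uint64
	Commitments [][48]byte
}

// fixedSize covers the fixed fields AND the offset slot of the list.
func (b *toyBid) fixedSize() int { return 8 + 8 + offsetSize }

func (b *toyBid) EncodingSizeSSZ() int {
	return b.fixedSize() + len(b.Commitments)*48
}

// EncodeSSZ writes the fixed part (including the offset pointing at the
// start of the variable part) followed by the list contents.
func (b *toyBid) EncodeSSZ() []byte {
	out := make([]byte, 0, b.EncodingSizeSSZ())
	out = binary.LittleEndian.AppendUint64(out, b.Slot)
	out = binary.LittleEndian.AppendUint64(out, b.Value)
	out = binary.LittleEndian.AppendUint32(out, uint32(b.fixedSize()))
	for i := range b.Commitments {
		out = append(out, b.Commitments[i][:]...)
	}
	return out
}

func main() {
	b := &toyBid{Slot: 1, Value: 2, Commitments: make([][48]byte, 2)}
	fmt.Println(b.EncodingSizeSSZ(), len(b.EncodeSSZ()))
}
```

A cheap guard against this class of bug is a test asserting that `len(EncodeSSZ())` equals `EncodingSizeSSZ()` for every new type.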
- BuilderPendingPayment.HashSSZ passes &b.Weight (pointer) (epbs_builder.go)
Other types pass uint64 directly. If HashTreeRoot doesn't dereference, it hashes the pointer value, not the integer.
- Builder payment weight may double-count (operations.go)
Same validator in multiple aggregated attestations for the same slot can accumulate weight multiple times. The willSetNewFlag check is per-attestation, not per-validator globally.
MEDIUM — Correctness and robustness issues
- Wall-clock slot underflow (on_block.go): If wallNow < f.genesisTime (clock skew), uint64 subtraction wraps, producing a huge slot number that disables the "too early" check.
- validateParentPayloadPath skips validation when parent bid is nil (payload_vote.go): Should check version, not just nil, to distinguish pre-GLOAS boundary from malformed GLOAS blocks.
- fetchParentEnvelopes can block chain_tip_sync for 150 seconds (chain_tip_sync.go): 10 retries × 15s timeout with no total timeout.
- No bound on pendingGloasSidecars map size (data_column_sidecar_service.go): An attacker flooding sidecars for non-existent blocks grows this unboundedly for 24s.
- pendingBidKey allows only one pending bid per (builder, slot) (execution_payload_bid_service.go): Higher-value replacement bids are silently dropped.
- ExecutionPayloadEnvelope.Clone() is shallow (epbs_payload.go): Payload and ExecutionRequests are shared with the original; mutations affect both.
- updateLatestMessagesGloas vs updateLatestMessagesPreGloas at fork boundary (on_attestation.go): LatestMessage serves double duty (Epoch for pre-GLOAS, Slot for GLOAS). A late pre-GLOAS attestation after fork activation would see stale Epoch=0.
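For the wall-clock underflow item above, a guarded slot computation could look like the following sketch. `currentSlot` and its parameters are hypothetical names; the check simply refuses to subtract when the clock is before genesis.

```go
package main

import "fmt"

// currentSlot returns the wall-clock slot, or ok=false when the clock is
// before genesis (or the slot duration is zero), instead of letting the
// unsigned subtraction wrap around.
func currentSlot(now, genesisTime, secondsPerSlot uint64) (slot uint64, ok bool) {
	if secondsPerSlot == 0 || now < genesisTime {
		return 0, false
	}
	return (now - genesisTime) / secondsPerSlot, true
}

func main() {
	fmt.Println(currentSlot(224, 200, 12)) // clock after genesis
	fmt.Println(currentSlot(100, 200, 12)) // clock skewed before genesis
}
```

Callers can then treat `ok == false` as "too early" rather than comparing against a wrapped, enormous slot number.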
LOW / Cleanup — Must fix before merge
- Debug logging in production code:
- log.Info("[DEBUG] Block matched", ...) in backward_beacon_downloader.go
- logger.Info("[DEBUG] Backward sync starting", ...) in clstages.go
These fire on every block/sync start — noisy in production.
- Commented-out code blocks in forward_sync.go with bug notes: remove or move to issue.
- .claude/skills/ files (caplin-gloas-audit/SKILL.md, launch-epbs-devnet-0/SKILL.md): These are developer-local Claude Code configuration files and should not be committed to the repo.
- IsValidDepositSignature swallows BLS error: it returns false, nil where it should return false, err.
- Missing copyright headers in payload_vote.go, types.go, weight_store_indexed.go.
…cs, and spec deviations
Squash of fix/gloas-review-items addressing PR #18956 review findings:
- Fix pendingCond.Wait() deadlock on context cancellation in gossip services
- Return ErrIgnore (not nil) for already-seen data column sidecars
- Fix SSZ EncodingSizeSSZ +4 offsets for variable-length ePBS types
- Fix DecodeSSZ nil pointer deref via beaconCfg propagation
- Fix getPayloadStatusTiebreaker to match EIP-7732 spec
- Fix data race in OnPayloadAttestationMessage (GetState alwaysCopy=true)
- Filter unsolicited envelope responses by requested root set
- Return ErrIgnore when block not found for pending envelopes
- Deep copy BlobKzgCommitments in ExecutionPayloadBid.Copy()
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…compliance
Address verified issues from PR #18956 review and Beacon API spec audit:
- forkchoice: eliminate TOCTOU race in OnBlock pending envelope processing by extracting applyEnvelopeLocked and calling it while f.mu is held, deferring DB index writes to after unlock to avoid deadlock
- fork_graph: add bounds checking in readBeaconStateFromDisk and ReadEnvelopeFromDisk to prevent panic on corrupt file lengths, with automatic buffer growth when needed
- beacon/handler: add dependent_root and execution_optimistic to PTC duties response, matching attester/proposer duty patterns
- beacon/handler: change POST pool/payload_attestations to accept array input with per-item error tracking, matching existing pool endpoint semantics
- beacon/handler: add WithOptimistic and WithFinalized metadata to GET execution_payload_envelope response
- beaconevents: fix execution_payload_available SSE event to emit flat {slot, block_root} per spec instead of full envelope with version wrapper
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The bal-devnet-3 image does not implement the full GLOAS (EIP-7732) BeaconBlockBody encoding, causing Caplin to reject all gossipped beacon blocks after the GLOAS fork with SSZ decode errors. Upgrade to glamsterdam-devnet-2 which supports the v1.7.0-alpha.7 spec. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ess tests common.Hash is [32]byte — Go's encoding/json does not treat zero-value arrays as empty for omitempty, so the GLOAS-only ExecutionBlockHash field leaked into pre-GLOAS JSON responses. Add MarshalJSON() that matches the SSZ schema version gating. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
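The `omitempty` behaviour described in this commit can be reproduced in isolation. In the sketch below, `Hash`, `plainHeader`, `gatedHeader`, and the `Gloas` flag are hypothetical names standing in for the real types and version gating; only the `encoding/json` behaviour is the point.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hash mirrors the shape of a 32-byte hash type (hypothetical stand-in).
type Hash [32]byte

// plainHeader relies on omitempty, which does NOT drop a zero-valued
// fixed-size array: only arrays of length zero count as empty.
type plainHeader struct {
	Slot               uint64 `json:"slot"`
	ExecutionBlockHash Hash   `json:"execution_block_hash,omitempty"`
}

// gatedHeader decides explicitly in MarshalJSON. Gloas is a hypothetical
// version flag used only for this sketch.
type gatedHeader struct {
	Slot               uint64
	ExecutionBlockHash Hash
	Gloas              bool
}

func (h gatedHeader) MarshalJSON() ([]byte, error) {
	out := map[string]any{"slot": h.Slot}
	if h.Gloas {
		out["execution_block_hash"] = fmt.Sprintf("0x%x", h.ExecutionBlockHash[:])
	}
	return json.Marshal(out)
}

// hasField reports whether v's JSON encoding contains the top-level key.
func hasField(v any, name string) bool {
	raw, err := json.Marshal(v)
	if err != nil {
		return false
	}
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(raw, &fields); err != nil {
		return false
	}
	_, ok := fields[name]
	return ok
}

func main() {
	fmt.Println(hasField(plainHeader{Slot: 1}, "execution_block_hash"))              // field leaks
	fmt.Println(hasField(gatedHeader{Slot: 1}, "execution_block_hash"))              // field omitted
	fmt.Println(hasField(gatedHeader{Slot: 1, Gloas: true}, "execution_block_hash")) // field present
}
```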
recoverMissingEnvelopes was guarded by SupportInsertion(), which returns false for Engine API (HTTP) connections where chainRW is nil. This caused Caplin to never recover missed gossip envelopes in kurtosis devnets, leading to a stuck head once gaps accumulated. Move recoverMissingEnvelopes outside the SupportInsertion guard — it only needs P2P and fork choice, not local EL insertion. Also fix the backward walk to continue past the first envelope-present parent all the way to the finalized slot, and check the HEAD block itself for a missing envelope. Add a 3-CL kurtosis devnet config (Lighthouse + Prysm + Caplin) used to validate the fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…w underflow
Two fixes for GLOAS block production with skipped slots:
1. aggregatePayloadAttestations used baseBlockSlot (the actual parent block slot) instead of targetSlot-1 (state.slot - 1) when filtering PTC votes from the pool. With skipped slots the two differ, producing blocks with payload attestations that fail the spec check data.slot + 1 == state.slot. Both Lighthouse and Prysm use state.slot - 1; align Caplin to match.
2. getPTCFromWindow panicked with index out of range [-5] when the requested slot was 2+ epochs behind the state epoch, causing uint64 underflow in the index arithmetic. Add an explicit epoch range check before the arithmetic, guard stateEpoch-1 against epoch 0 underflow, and fix the bounds check to compare unsigned values. GetPTC now falls back to ComputePTC when the slot is outside the ptcWindow range.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
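The first fix can be illustrated with a small filter. This is a sketch under assumptions: `payloadAttestation` and `attestationsForBlock` are hypothetical names, and the pool is reduced to a slice of slots.

```go
package main

import "fmt"

// payloadAttestation is a hypothetical, reduced pool entry.
type payloadAttestation struct {
	Slot uint64
}

// attestationsForBlock keeps only attestations that satisfy the spec
// check data.slot + 1 == state.slot, i.e. those for state.slot - 1,
// regardless of which slot the parent block actually occupies.
func attestationsForBlock(pool []payloadAttestation, stateSlot uint64) []payloadAttestation {
	if stateSlot == 0 {
		return nil // no previous slot exists; avoid unsigned underflow
	}
	want := stateSlot - 1
	var out []payloadAttestation
	for _, a := range pool {
		if a.Slot == want {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	// Parent block at slot 8, slots 9 and 10 skipped, producing at slot 11.
	pool := []payloadAttestation{{Slot: 8}, {Slot: 10}, {Slot: 10}}
	fmt.Println(len(attestationsForBlock(pool, 11)))
}
```

Filtering by the parent block's slot (8 here) would select an attestation that fails `data.slot + 1 == state.slot`, which is the failure mode the commit describes.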
…lapse in GLOAS
The GLOAS multi-client testnet (Caplin + Lighthouse) collapsed within 3 minutes due to a chain of gossipsub scoring failures:
1. data_column_sidecars_by_range rate limit was set to 128 tokens, but a single legitimate PeerDAS sync request costs ~991 tokens (8 slots × 124 columns). Every column sync request was denied, triggering a 30-second peer punishment; peers could never catch up, causing fork divergence and cascading attestation/proposer REJECTs.
2. Transient errors from OnExecutionPayload (block state not yet available, block not in fork graph) were returned as plain errors instead of wrapping ErrIgnore, causing gossip REJECT + BanPeer for execution_payload messages that simply arrived before their block.
3. execution_payload_service.ProcessMessage did not propagate ErrIgnore or ErrEIP7594ColumnDataNotAvailable from forkchoice, converting all errors into REJECT.
4. maxScore() in gossip scoring omitted the four GLOAS topic weights (executionPayload, executionPayloadBid, payloadAttestation, proposerPreferences), making InvalidMessageDeliveriesWeight disproportionately harsh.
Fix: raise data column rate limit to 16384 tokens (proportional to blockRate × NumberOfColumns), wrap transient forkchoice errors with ErrIgnore, propagate ignore/column-unavailable errors in the service layer, and include all GLOAS weights in maxScore().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- cl/network: add HTTP beacon API fallback for backward and forward block downloaders when P2P blocks_by_range fails
- cl/network: fix prevBatchTopBlock assignment direction in backward download (high→low means next batch needs responses[0])
- cl/network: add fork boundary capping to prevent cross-fork range requests rejected by peers
- cl/network: skip BanPeer for HTTP-fallback pseudo peer ID
- cl/network: split DB error from zero-hash in canSkipSlot to avoid silently treating DB failures as legitimate GLOAS EMPTY blocks
- cl/forkchoice: cap finalizedSlot to anchorSlot after checkpoint sync so Ancestor() stays within the fork graph horizon
- cl/forkchoice: initialize currentStateBlockRoot from anchor in fork graph
- cl/stages: allow ForwardSync with 0 peers (HTTP fallback works without P2P); handle ErrNotFinalizedDescendant gracefully
- cl/stages: fix history download finished condition when destinationSlotForEL is MaxUint64
- cl/checkpoint_sync: use actual state_id from URL instead of hardcoded "finalized" for envelope fetch
- execution/engineapi: gate BAL requirement on !IsEIPDisabled(7928) so Engine API stays consistent with builder/validator paths
- cl/network: revert diagnostic log levels back to Debug/Trace
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… checkpoint promotion guard Move ForkChoiceUpdate out of per-batch insertBatch into a single call after all batches are flushed, preventing the EL's in-memory TD overlay from being destroyed mid-sync. Remove the now-dead syncBackLoop throttle field and its plumbing through ClStagesCfg. Guard unrealized checkpoint promotion in on_tick_per_slot behind a highestSeen proximity check so forward-syncing nodes don't jump the finalized checkpoint past the blocks the fork graph actually contains. Fix uint64 underflow in ChainReorgData.Depth when headSlot < currentSlot. Track highestSeenRoot alongside highestSeen so status advertisements use a consistent head root/slot pair (avoids Lighthouse "useless peer" penalties). Auto-select GloasVersion FCU for post-GLOAS blocks based on SlotNumber presence in the header. Clamp lowestBlockToReach to 1 in history download progress to avoid off-by-one stall at N-1/N. Remove SupportBackfilling gate from history download so non-mainnet chains (e.g. devnets) can backfill EL blocks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pure whitespace — re-indent the P2P probing block to match the surrounding brace level. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Demote 49 log statements across 16 files that fire on hot paths (per-slot, per-attestation, per-gossip-message) during normal GLOAS operation. Info/Warn → Debug for routine subnet toggles, HTTP fallback batches, builder nil-guards, and block publishing. Debug → Trace for per-block forkchoice, payload attestation (512 PTC members/slot), and execution payload service logs. Remove 6 previousStateRoot tracking Debug logs that were leftover developer instrumentation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cleanup
- Add missing GPL copyright headers to 7 new GLOAS files
- Fix IsValidDepositSignature to propagate BLS verify error instead of swallowing it (return false, err instead of false, nil)
- Remove GLOAS-specific .claude/skills/ from tracking and add to .gitignore to prevent re-adding
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add 30s context timeout to fetchParentEnvelopes to cap total retry budget (was unbounded at ~150s worst case)
- Add pendingGloasSidecarCount with 4096 cap to prevent unbounded sync.Map growth from attacker-crafted sidecars
- Add nil guard for payment.Withdrawal in ProcessBuilderPendingPayments to prevent nil append if a BuilderPendingPayment has nil Withdrawal
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move gloas-caplin-mixed out of test-kurtosis-assertoor.yml (ci-gate) and into a new test-kurtosis-gloas.yml that triggers on pull_request and workflow_dispatch but is not called by ci-gate. Add gloas-three-cl-mixed (Lighthouse + Prysm + Caplin) to the new workflow as well. This keeps GLOAS e2e coverage visible on every PR without blocking merge for suites that use devnet-only images. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove the rm -rf lines for gloas and eip7732 fixture directories from the test-fixtures Makefile target so that GLOAS spectests can actually run against consensus-specs v1.7.0-alpha.7 test vectors. Fix the IsBuilderIndex comment: BuilderIndexFlag is 1<<40 (bit 40), not the most significant bit (bit 63). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update the cl_mainnet fixture manifest to consensus-specs v1.7.0-alpha.7 which includes GLOAS (EIP-7732) test vectors. The previous v1.6.0-alpha.6 did not match the alpha.7-based implementation, causing SSZ static test failures for GLOAS types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… handling
- Checkpoint sync: auto-normalize URLs missing /eth/ path, validate Content-Type to reject HTML responses, use Eth-Consensus-Version header
- Backward sync: add envelope failure recovery with HTTP beacon API fallback, root-based block fetch for fork choice divergence
- Forward sync: respect checkpoint anchor boundary via minSlot, clear error state between block iterations
- Envelope requests: limit by-range to single attempt (peers return EOF on devnets), increase retry backoff to 300ms
- Fork graph: fix anchor state root computation to distinguish fresh checkpoint sync vs restart from disk
- P2P: register mplex muxer for peer compatibility
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ld envelopes verifyEnvelopeBuilderSignature and verifyExecutionPayloadEnvelopeSignature both skipped BLS verification when BuilderIndex == SELF_BUILD and Signature == InfiniteSignature. This bypass was reachable from the gossip path, allowing an attacker to forge an envelope that Erigon accepts while spec-compliant clients reject it, causing fork-choice weight divergence. Fix: remove the content-based bypass entirely. Introduce a dedicated ApplyLocalSelfBuildEnvelope method for the local block-production path that uses DefaultMachine (FullValidation=false) to skip BLS checks. Gossip/peer/REST paths always go through OnExecutionPayload which now unconditionally verifies BLS signatures. To handle the common timing where the envelope arrives before OnBlock completes, local self-build envelopes queue into a separate pendingLocalSelfBuildEnvelopes cache (not the general pendingEnvelopes). OnBlock replay dispatches to the correct path based on which queue holds the entry, not by inspecting envelope contents. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ech#21228)

## Summary

Fixes erigontech#21272: Caplin gets stuck with `baseState not found in graph` on mainnet after the GLOAS merge (erigontech#18956).

**Root cause**: The GLOAS PR added `maxSSZBufferSize = 128 MB` in `readBeaconStateFromDisk`. Mainnet beacon state with ~1.5M validators is ~327 MB after decompression, so every state read was rejected as "corrupt". This made `getCheckpointState` always fail.

Three additional issues compounded the problem:
- **Duplicate `InitBeaconState`**: `DecodeSSZ` already calls `InitBeaconState` internally (`ssz.go:40`). The explicit second call doubled the decode time by rebuilding the pubkey index for ~1.5M validators.
- **Lock contention**: `getHead` held `f.mu.Lock` while calling `getCheckpointState`, which reads and decodes a large state from disk. This blocked the `OnTick` goroutine for the entire duration.
- **Missing fallback**: After checkpoint sync, the justified checkpoint root may predate the anchor and not exist in the fork graph. Before GLOAS this was masked because forward sync always ran to completion (no stale timeout).

## Changes

- Replace 128 MB `maxSSZBufferSize` with a 1 GiB `maxSSZObjectSize` cap in `readBeaconStateFromDisk` / `ReadEnvelopeFromDisk`, giving ~3x headroom over current mainnet ~327 MB states while still bounding `make([]byte, length)` against a corrupt length field
- Remove duplicate `InitBeaconState()` in `readBeaconStateFromDisk`
- Move `getCheckpointState` before `f.mu.Lock()` in `getHead`
- Fall back to anchor state in `getCheckpointState` when checkpoint root is not in graph

## Test plan

- Verified on `dev-bm-e3-sepolia-n1` (mainnet, fresh datadir):
  - Before: `baseState not found in graph` on every ForkChoice round
  - After: `Imported chain segment` every ~12s, ForkChoice in ~2.9s
- Existing unit tests pass (`TestGetState_InfiniteLoopOnMissingStateFile`, `TestForkGraphInDisk`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: yperbasis <andrey.ashikhmin@gmail.com>
The post-GLOAS rewrite (#18956, f3ad70f) restored ancestor-descent for the finalized check but left two justified-side deviations from the spec's filter_block_tree / get_voting_source:
1. Voting source selection used unrealized unconditionally with a realized fallback. Spec get_voting_source branches strictly on current_epoch > block_epoch: unrealized only for prior-epoch blocks; current-epoch blocks must use the block's realized current_justified_checkpoint.
2. correct_justified used voting_source.epoch >= justified.epoch with an is_previous_epoch_justified-gated +2 fallback. Spec is the three flat disjuncts: epoch == GENESIS, voting_source.epoch == justified.epoch, or voting_source.epoch + 2 >= current_epoch.
The finalized ancestor-descent check from f3ad70f is preserved unchanged.
Observed effect on bloatnet: epoch-boundary head regression that caused ~30-block execution unwinds every ~6 minutes is eliminated. Validated by 1 hour at chain tip on bloatnet covering 5+ epoch boundaries: zero unwinds, zero filterTree rejects, Caplin lagSlots=0 on all 172 forkchoice updates in the window.
Refs: #21301, #21310
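The two spec rules named in this commit can be written down compactly. The sketch below uses hypothetical Go names (`checkpoint`, `votingSource`, `correctJustified`) and takes the first disjunct to mean the store's justified epoch equals the genesis epoch; it is not Caplin's implementation.

```go
package main

import "fmt"

// checkpoint is reduced to its epoch for this sketch.
type checkpoint struct {
	Epoch uint64
}

const genesisEpoch = 0

// votingSource mirrors the branch in the spec's get_voting_source:
// blocks from a prior epoch use the unrealized justification, blocks
// from the current epoch use the realized current_justified_checkpoint.
func votingSource(currentEpoch, blockEpoch uint64, unrealized, realized checkpoint) checkpoint {
	if currentEpoch > blockEpoch {
		return unrealized
	}
	return realized
}

// correctJustified is the three flat disjuncts quoted in the commit.
func correctJustified(justified, source checkpoint, currentEpoch uint64) bool {
	return justified.Epoch == genesisEpoch ||
		source.Epoch == justified.Epoch ||
		source.Epoch+2 >= currentEpoch
}

func main() {
	unrealized, realized := checkpoint{Epoch: 7}, checkpoint{Epoch: 6}
	fmt.Println(votingSource(10, 9, unrealized, realized).Epoch)  // prior-epoch block
	fmt.Println(votingSource(10, 10, unrealized, realized).Epoch) // current-epoch block
	fmt.Println(correctJustified(checkpoint{Epoch: 5}, checkpoint{Epoch: 3}, 10))
	fmt.Println(correctJustified(checkpoint{Epoch: 5}, checkpoint{Epoch: 8}, 10))
}
```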
…LOAS (erigontech#21698)

## Problem

Lido Hoodi validators lost attestation score after switching to `release/3.5`. They missed **head votes** at ~30–60% per epoch (≈2× a random network sample) while **target and source votes were always correct**: the attestations were produced, published, and included on-chain on time, but voted for a **stale head**.

## Root cause

The GLOAS merge (erigontech#18956) added `indexedWeightStore` (`cl/phase1/forkchoice/weight_store_indexed.go`). It is instantiated unconditionally, and its `IndexVote`/`RemoveVote` are called per validator index on **every** attestation via `setLatestMessage`, regardless of fork. However its results are only consumed by GLOAS `get_head`. Pre-GLOAS `get_head` (and `timing.go`) use the non-indexed `weightStore`, and `GetIndexedWeightStore()` has no callers. So on pre-GLOAS chains the index is maintained but never read.

On a high-validator-count network (Hoodi, ~1.2M active validators) this maintenance was the single largest CPU consumer (`RemoveVote`, ~15% of CPU, fresh slice allocation per call) plus a large GC load, all under the fork-choice write lock. The lock contention delayed `OnBlock` (block import) and `get_head` past the attestation deadline, so the head served to the validator client was stale → wrong head votes.

## Fix

Gate the indexed-store maintenance to the GLOAS vote path only, via an explicit `maintainIndexedVotes` flag on `setLatestMessage` (mirroring the existing `updateLatestMessages` pre-GLOAS/GLOAS dispatch).

## Validation (live Hoodi node, before vs after rebuild)

- Node CPU (operator Grafana): ~7.5% → <2%.
- 60s CPU profile: total samples 148.9% → 41.6%; `indexedWeightStore.RemoveVote` (was #1 hotspot) and the GC storm both gone.
- Head-at-attestation-deadline staleness: 64% → 4%.
- Validator head-vote misses: 40% → 0% in the first fully post-fix epoch (network sample ~23% in the same epoch); target/source unaffected throughout.

Affects all pre-GLOAS networks on this branch (mainnet/Sepolia/Gnosis), with impact scaling by validator count.

## Tests

- `TestPreGloasDoesNotMaintainIndexedWeightStore`: fails before the fix (the pre-GLOAS path populates the index), passes after.
- `TestGloasMaintainsIndexedWeightStore`: locks in that the GLOAS path still indexes votes.
- `go test ./cl/phase1/forkchoice/...`, `make lint`, `make erigon` all clean.

## Follow-up

`indexedWeightStore` is currently unused even on GLOAS (`get_head` uses the non-indexed store). The Caplin team should either wire it into GLOAS `get_head` or remove it; tracking separately.
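A rough sketch of the gating described in the Fix section, under assumptions: `voteStore` and its fields are hypothetical, the index is reduced to a per-root counter, and only the `maintainIndexedVotes` parameter name is taken from the description above.

```go
package main

import "fmt"

// voteStore is a hypothetical, reduced model: latest holds each
// validator's last voted root, indexed counts votes per root.
type voteStore struct {
	latest  map[uint64][32]byte
	indexed map[[32]byte]int
}

func newVoteStore() *voteStore {
	return &voteStore{
		latest:  make(map[uint64][32]byte),
		indexed: make(map[[32]byte]int),
	}
}

// setLatestMessage always records the latest vote, but touches the
// per-root index only when maintainIndexedVotes is true.
func (s *voteStore) setLatestMessage(idx uint64, root [32]byte, maintainIndexedVotes bool) {
	old, had := s.latest[idx] // read the old vote before overwriting it
	s.latest[idx] = root
	if !maintainIndexedVotes {
		return // pre-GLOAS path: skip the index work entirely
	}
	if had && s.indexed[old] > 0 {
		s.indexed[old]--
		if s.indexed[old] == 0 {
			delete(s.indexed, old)
		}
	}
	s.indexed[root]++
}

// indexedRoots replays two votes from one validator and returns how
// many roots end up in the index.
func indexedRoots(maintain bool) int {
	s := newVoteStore()
	s.setLatestMessage(1, [32]byte{0xaa}, maintain)
	s.setLatestMessage(1, [32]byte{0xbb}, maintain)
	return len(s.indexed)
}

func main() {
	fmt.Println(indexedRoots(false)) // pre-GLOAS path
	fmt.Println(indexedRoots(true))  // GLOAS path
}
```

The early return keeps the hot pre-GLOAS attestation path free of index maintenance while leaving the latest-message bookkeeping identical on both paths.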
Summary
Full GLOAS (Glamsterdam) fork implementation for Caplin, Erigon's embedded consensus layer client. GLOAS introduces enshrined proposer-builder separation (ePBS) via EIP-7732.
Core implementation
ApplyParentExecutionPayload, epoch processing with PTC window state
Sync & networking
Stability fixes
Spec compliance
Testing
Test plan