cl/forkchoice: fix Caplin stuck on mainnet after GLOAS merge#21228
Conversation
Three issues introduced by the GLOAS PR caused Caplin to get stuck on mainnet after checkpoint sync: 1. maxSSZBufferSize (128 MB) too small for mainnet beacon state (~327 MB uncompressed with ~1.5M validators). readBeaconStateFromDisk rejected the state as "corrupt", so getState could never load a saved state from disk. Raise the limit to 512 MB. 2. Duplicate InitBeaconState call in readBeaconStateFromDisk. DecodeSSZ already calls InitBeaconState internally (ssz.go:40). The explicit second call rebuilt the pubkey index for ~1.5M validators redundantly, doubling the decode time. Remove it. 3. getHead held f.mu.Lock while calling getCheckpointState, which reads and decodes a ~327 MB state from disk. This blocked the OnTick goroutine (which also needs f.mu) for the entire duration. Move getCheckpointState before f.mu.Lock — it only uses atomic values, sync.Map, and its own stateDumpLock. Additionally, after checkpoint sync the justified checkpoint root may point to a block that predates the anchor and is not in the fork graph. Before GLOAS this was masked because forward sync ran to completion. The GLOAS stale timeout can exit early, leaving justified at the pre-anchor value. Add a fallback to the anchor state in getCheckpointState. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bf25e31 to
d596b07
Compare
There was a problem hiding this comment.
Pull request overview
Fixes Caplin forkchoice stalls after the GLOAS merge by unblocking beacon-state reads/decodes on mainnet-sized states and reducing lock contention in head computation.
Changes:
- Removes the fixed SSZ size cap and avoids redundant beacon-state cache initialization during disk reads.
- Computes checkpoint state before acquiring the forkchoice mutex to prevent long OnTick stalls.
- Adds a fallback to use the anchor state when a checkpoint state can’t be obtained from the fork graph.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
cl/phase1/forkchoice/utils.go |
Adds anchor-state fallback in getCheckpointState when the checkpoint state can’t be retrieved. |
cl/phase1/forkchoice/get_head.go |
Moves checkpoint-state computation outside f.mu to reduce lock contention during head finding. |
cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go |
Removes max-size checks and redundant cache re-init during state/envelope disk reads. |
Comments suppressed due to low confidence (2)
cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go:82
- Removing the SSZ length upper bound means
length(read from disk) can trigger an unboundedmake([]byte, length)before any SSZ validation runs. A corrupted/truncated file can therefore cause a panic (len out of range) or OOM during allocation. Consider keeping a defensive high-water mark (possibly configurable or derived from on-disk file size) before resizingf.sszBuffer, and return a clear error when exceeded.
This issue also appears on line 223 of the same file.
return nil, fmt.Errorf("failed to read length: %d, want 8, root: %x", n, blockRoot)
}
length := binary.BigEndian.Uint64(lengthBytes)
if length > uint64(cap(f.sszBuffer)) {
f.sszBuffer = make([]byte, length)
} else {
f.sszBuffer = f.sszBuffer[:length]
}
cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go:231
- Similar to beacon states,
envelopeLengthis trusted to sizef.sszBufferwith no sanity limit. A corrupted envelope file can force a huge allocation/panic beforeDecodeSSZruns. Add a defensive upper bound (even if much larger than before) or validate against the underlying file size before allocating.
return nil, fmt.Errorf("failed to read length: %d, want 8, root: %x", n, blockRoot)
}
envelopeLength := binary.BigEndian.Uint64(lengthBytes)
if envelopeLength > uint64(cap(f.sszBuffer)) {
f.sszBuffer = make([]byte, envelopeLength)
} else {
f.sszBuffer = f.sszBuffer[:envelopeLength]
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
yperbasis
left a comment
There was a problem hiding this comment.
Nice diagnosis and the four issues are all real, but I'd like to see a couple of correctness/safety concerns addressed before merging.
Major
1. Anchor fallback cached under the wrong key
In utils.go, after the fallback to the anchor state:
f.checkpointStates.Store(checkpoint, checkpointState)checkpointState is built from the anchor — actives, activeBalance, and state.Epoch(baseState.BeaconState) reflect the anchor's epoch, not checkpoint.Epoch. This gets cached under the missing checkpoint's key and serves all subsequent calls for that checkpoint, so:
- Attestation weights in
computeVotesuse anchor-epoch active sets / balances rather than the checkpoint-epoch ones. - If the real checkpoint state becomes available later (e.g. via backfill), the cache keeps returning the stale fallback.
Options: build actives/balances from a state aligned to checkpoint.Epoch, or skip caching the fallback so the next call re-resolves. At minimum, document the inaccuracy and consider not caching when baseState came from the fallback path.
2. Latent data race on forkGraphDisk.currentState / currentStateBlockRoot
The pre-existing implicit protection was that OnBlock → AddChainSegment writes f.currentState/f.currentStateBlockRoot under f.mu, and all callers of forkGraph.GetState were also under f.mu. Moving getCheckpointState out from under f.mu means the read at fork_graph_disk.go:480 (if f.currentState != nil && f.currentStateBlockRoot == blockRoot) can now race with the AddChainSegment write. Worst case is a torn 32-byte hash read causing a spurious cache miss (still correct, just slower), but the race detector will likely flag this. Consider giving forkGraphDisk its own RWMutex over those two fields, or atomic.Pointer[CachingBeaconState] + a parallel atomic hash.
Minor
3. Unbounded allocation from disk length field
Removing maxSSZBufferSize lets a corrupt or truncated *.snappy_ssz file drive make([]byte, length) with an arbitrary 64-bit value before any SSZ validation. It's our own datadir (not an attack surface), but a partially-written file after an unclean shutdown could now OOM/panic instead of returning a clean error. Consider keeping a much more generous cap (e.g. bump to 1 GiB, or bound by file size × max snappy expansion).
4. Stale error message after the fallback
if baseState == nil {
return nil, errors.New("getCheckpointState: baseState not found in graph")
}After the anchor fallback also fails, this still says "not found in graph". Suggest "getCheckpointState: baseState and anchor fallback both missing".
5. Performance: anchor re-read per missing checkpoint
Each new justified checkpoint that misses the graph triggers a fresh anchor-state read from disk (~327 MB on mainnet). A small LRU (or just remembering the anchor *checkpointState) would amortize this during the post-sync window.
Test coverage
The bug was a silent multi-month stall on mainnet that escaped CI. Worth adding:
- A unit test for the
getCheckpointStatefallback path (fork graph returns nil for the checkpoint root, verify anchor is consulted). - A unit test for
readBeaconStateFromDiskround-tripping a >128 MB blob to lock in that mainnet-sized states are accepted.
…allback Address yperbasis's review on #21228: Major 1 — Remove anchor fallback from getCheckpointState. The fallback built a checkpointState from the anchor's validator set and registered wrong pubkeys via publicKeysRegistry.AddState, causing attestation verification to reject valid attestations. Instead, guard in GetHead: when the justified root is absent from the fork graph, return the anchor as head. This covers both getHead and getHeadGloas paths. Major 2 — Add currentStateMu (sync.RWMutex) to forkGraphDisk. Moving getCheckpointState outside forkchoice mu exposed a data race on currentState/currentStateBlockRoot between getState reads and AddChainSegment writes. Reads snapshot under RLock; writes take Lock. Minor 3 — Replace removed maxSSZBufferSize with maxSSZObjectSize (1 GiB) for both beacon state and envelope reads. Catches corrupt length fields before OOM while accommodating mainnet's ~327 MB states. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@yperbasis Thanks for the thorough review. Pushed Major 1 — Anchor fallback removed from
|
| // current state data — protected by currentStateMu when accessed | ||
| // concurrently (e.g. getCheckpointState runs outside forkchoice mu). | ||
| currentStateMu sync.RWMutex | ||
| currentState *state.CachingBeaconState | ||
| currentStateBlockRoot common.Hash // block root of the last processed block (avoids BlockRoot() zeroed-StateRoot issue) | ||
|
|
There was a problem hiding this comment.
Follow-up commit 1c66977 cleanly addresses my prior review. One small nit from Copilot #2 worth tightening — the field comment on currentState / currentStateBlockRoot (fork_graph_disk.go:97-99) says they're "protected by currentStateMu when accessed concurrently", but two read sites still skip the mutex: isBlockRootTheCurrentState (lines 215-219) and the missing-segment diagnostic in AddChainSegment (lines 254-256). Currently safe because AddChainSegment only runs under the outer f.mu (held by OnBlock), so f.mu serializes those reads against the writer — but the comment promises a uniform invariant the code doesn't actually maintain, and a future read path added outside f.mu without currentStateMu would silently race.
Two options, your pick:
- Cheapest fix — wrap the two unguarded reads in
currentStateMu.RLock()/RUnlock(). Overhead is negligible and the invariant becomes uniform. - Narrow the doc — keep the dual-protection scheme but rewrite the field comment to spell it out, e.g. "Reads under
f.murely on that lock (onlyAddChainSegmentwrites); reads outsidef.mumust takecurrentStateMu".
Either is fine — approving so this isn't blocked.
…ment Address yperbasis's nit on #21228: the previous comment promised a uniform "protected by currentStateMu when accessed concurrently" invariant that the code doesn't actually maintain — isBlockRootTheCurrentState and the AddChainSegment missing-segment diagnostic read these fields without currentStateMu, relying on the outer f.mu to serialize against the writer. Spell out the dual-protection scheme so a future read path added outside f.mu doesn't silently race. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fixes #21272 — Caplin gets stuck with
baseState not found in graphon mainnet after the GLOAS merge (#18956).Root cause: The GLOAS PR added
maxSSZBufferSize = 128 MBinreadBeaconStateFromDisk. Mainnet beacon state with ~1.5M validators is ~327 MB after decompression, so every state read was rejected as "corrupt". This madegetCheckpointStatealways fail.Three additional issues compounded the problem:
Duplicate
InitBeaconState:DecodeSSZalready callsInitBeaconStateinternally (ssz.go:40). The explicit second call doubled the decode time by rebuilding the pubkey index for ~1.5M validators.Lock contention:
getHeadheldf.mu.Lockwhile callinggetCheckpointState, which reads and decodes a large state from disk. This blocked theOnTickgoroutine for the entire duration.Missing fallback: After checkpoint sync, the justified checkpoint root may predate the anchor and not exist in the fork graph. Before GLOAS this was masked because forward sync always ran to completion (no stale timeout).
Changes
maxSSZBufferSizewith a 1 GiBmaxSSZObjectSizecap inreadBeaconStateFromDisk/ReadEnvelopeFromDisk— ~3x headroom over current mainnet ~327 MB states while still boundingmake([]byte, length)against a corrupt length fieldInitBeaconState()inreadBeaconStateFromDiskgetCheckpointStatebeforef.mu.Lock()ingetHeadgetCheckpointStatewhen checkpoint root is not in graphTest plan
dev-bm-e3-sepolia-n1(mainnet, fresh datadir):baseState not found in graphon every ForkChoice roundImported chain segmentevery ~12s, ForkChoice in ~2.9sTestGetState_InfiniteLoopOnMissingStateFile,TestForkGraphInDisk)🤖 Generated with Claude Code