Skip to content

cl/forkchoice: fix Caplin stuck on mainnet after GLOAS merge#21228

Merged
yperbasis merged 3 commits into
mainfrom
kewei/fix_basestate_not_found_35
May 19, 2026
Merged

cl/forkchoice: fix Caplin stuck on mainnet after GLOAS merge#21228
yperbasis merged 3 commits into
mainfrom
kewei/fix_basestate_not_found_35

Conversation

@domiwei

@domiwei domiwei commented May 16, 2026

Copy link
Copy Markdown
Member

Summary

Fixes #21272 — Caplin gets stuck with baseState not found in graph on mainnet after the GLOAS merge (#18956).

Root cause: The GLOAS PR added maxSSZBufferSize = 128 MB in readBeaconStateFromDisk. Mainnet beacon state with ~1.5M validators is ~327 MB after decompression, so every state read was rejected as "corrupt". This made getCheckpointState always fail.

Three additional issues compounded the problem:

  • Duplicate InitBeaconState: DecodeSSZ already calls InitBeaconState internally (ssz.go:40). The explicit second call doubled the decode time by rebuilding the pubkey index for ~1.5M validators.

  • Lock contention: getHead held f.mu.Lock while calling getCheckpointState, which reads and decodes a large state from disk. This blocked the OnTick goroutine for the entire duration.

  • Missing fallback: After checkpoint sync, the justified checkpoint root may predate the anchor and not exist in the fork graph. Before GLOAS this was masked because forward sync always ran to completion (no stale timeout).

Changes

  • Replace 128 MB maxSSZBufferSize with a 1 GiB maxSSZObjectSize cap in readBeaconStateFromDisk / ReadEnvelopeFromDisk — ~3x headroom over current mainnet ~327 MB states while still bounding make([]byte, length) against a corrupt length field
  • Remove duplicate InitBeaconState() in readBeaconStateFromDisk
  • Move getCheckpointState before f.mu.Lock() in getHead
  • Fall back to anchor state in getCheckpointState when checkpoint root is not in graph

Test plan

  • Verified on dev-bm-e3-sepolia-n1 (mainnet, fresh datadir):
    • Before: baseState not found in graph on every ForkChoice round
    • After: Imported chain segment every ~12s, ForkChoice in ~2.9s
  • Existing unit tests pass (TestGetState_InfiniteLoopOnMissingStateFile, TestForkGraphInDisk)

🤖 Generated with Claude Code

@domiwei domiwei requested a review from Giulio2002 as a code owner May 16, 2026 19:29
Three issues introduced by the GLOAS PR caused Caplin to get stuck
on mainnet after checkpoint sync:

1. maxSSZBufferSize (128 MB) too small for mainnet beacon state
   (~327 MB uncompressed with ~1.5M validators). readBeaconStateFromDisk
   rejected the state as "corrupt", so getState could never load a
   saved state from disk.  Raise the limit to 512 MB.

2. Duplicate InitBeaconState call in readBeaconStateFromDisk.
   DecodeSSZ already calls InitBeaconState internally (ssz.go:40).
   The explicit second call rebuilt the pubkey index for ~1.5M
   validators redundantly, doubling the decode time.  Remove it.

3. getHead held f.mu.Lock while calling getCheckpointState, which
   reads and decodes a ~327 MB state from disk. This blocked the
   OnTick goroutine (which also needs f.mu) for the entire duration.
   Move getCheckpointState before f.mu.Lock — it only uses atomic
   values, sync.Map, and its own stateDumpLock.

Additionally, after checkpoint sync the justified checkpoint root
may point to a block that predates the anchor and is not in the
fork graph.  Before GLOAS this was masked because forward sync ran
to completion.  The GLOAS stale timeout can exit early, leaving
justified at the pre-anchor value.  Add a fallback to the anchor
state in getCheckpointState.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@domiwei domiwei force-pushed the kewei/fix_basestate_not_found_35 branch from bf25e31 to d596b07 Compare May 16, 2026 19:36
@yperbasis yperbasis requested a review from Copilot May 18, 2026 09:30
@yperbasis yperbasis added the Caplin Caplin: Consensus Layer, Beacon API label May 18, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Caplin forkchoice stalls after the GLOAS merge by unblocking beacon-state reads/decodes on mainnet-sized states and reducing lock contention in head computation.

Changes:

  • Removes the fixed SSZ size cap and avoids redundant beacon-state cache initialization during disk reads.
  • Computes checkpoint state before acquiring the forkchoice mutex to prevent long OnTick stalls.
  • Adds a fallback to use the anchor state when a checkpoint state can’t be obtained from the fork graph.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
cl/phase1/forkchoice/utils.go Adds anchor-state fallback in getCheckpointState when the checkpoint state can’t be retrieved.
cl/phase1/forkchoice/get_head.go Moves checkpoint-state computation outside f.mu to reduce lock contention during head finding.
cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go Removes max-size checks and redundant cache re-init during state/envelope disk reads.
Comments suppressed due to low confidence (2)

cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go:82

  • Removing the SSZ length upper bound means length (read from disk) can trigger an unbounded make([]byte, length) before any SSZ validation runs. A corrupted/truncated file can therefore cause a panic (len out of range) or OOM during allocation. Consider keeping a defensive high-water mark (possibly configurable or derived from on-disk file size) before resizing f.sszBuffer, and return a clear error when exceeded.

This issue also appears on line 223 of the same file.

		return nil, fmt.Errorf("failed to read length: %d, want 8, root: %x", n, blockRoot)
	}

	length := binary.BigEndian.Uint64(lengthBytes)
	if length > uint64(cap(f.sszBuffer)) {
		f.sszBuffer = make([]byte, length)
	} else {
		f.sszBuffer = f.sszBuffer[:length]
	}

cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go:231

  • Similar to beacon states, envelopeLength is trusted to size f.sszBuffer with no sanity limit. A corrupted envelope file can force a huge allocation/panic before DecodeSSZ runs. Add a defensive upper bound (even if much larger than before) or validate against the underlying file size before allocating.
		return nil, fmt.Errorf("failed to read length: %d, want 8, root: %x", n, blockRoot)
	}

	envelopeLength := binary.BigEndian.Uint64(lengthBytes)
	if envelopeLength > uint64(cap(f.sszBuffer)) {
		f.sszBuffer = make([]byte, envelopeLength)
	} else {
		f.sszBuffer = f.sszBuffer[:envelopeLength]
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread cl/phase1/forkchoice/utils.go Outdated

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice diagnosis and the four issues are all real, but I'd like to see a couple of correctness/safety concerns addressed before merging.

Major

1. Anchor fallback cached under the wrong key

In utils.go, after the fallback to the anchor state:

f.checkpointStates.Store(checkpoint, checkpointState)

checkpointState is built from the anchor — actives, activeBalance, and state.Epoch(baseState.BeaconState) reflect the anchor's epoch, not checkpoint.Epoch. This gets cached under the missing checkpoint's key and serves all subsequent calls for that checkpoint, so:

  • Attestation weights in computeVotes use anchor-epoch active sets / balances rather than the checkpoint-epoch ones.
  • If the real checkpoint state becomes available later (e.g. via backfill), the cache keeps returning the stale fallback.

Options: build actives/balances from a state aligned to checkpoint.Epoch, or skip caching the fallback so the next call re-resolves. At minimum, document the inaccuracy and consider not caching when baseState came from the fallback path.

2. Latent data race on forkGraphDisk.currentState / currentStateBlockRoot

The pre-existing implicit protection was that OnBlockAddChainSegment writes f.currentState/f.currentStateBlockRoot under f.mu, and all callers of forkGraph.GetState were also under f.mu. Moving getCheckpointState out from under f.mu means the read at fork_graph_disk.go:480 (if f.currentState != nil && f.currentStateBlockRoot == blockRoot) can now race with the AddChainSegment write. Worst case is a torn 32-byte hash read causing a spurious cache miss (still correct, just slower), but the race detector will likely flag this. Consider giving forkGraphDisk its own RWMutex over those two fields, or atomic.Pointer[CachingBeaconState] + a parallel atomic hash.

Minor

3. Unbounded allocation from disk length field

Removing maxSSZBufferSize lets a corrupt or truncated *.snappy_ssz file drive make([]byte, length) with an arbitrary 64-bit value before any SSZ validation. It's our own datadir (not an attack surface), but a partially-written file after an unclean shutdown could now OOM/panic instead of returning a clean error. Consider keeping a much more generous cap (e.g. bump to 1 GiB, or bound by file size × max snappy expansion).

4. Stale error message after the fallback

if baseState == nil {
    return nil, errors.New("getCheckpointState: baseState not found in graph")
}

After the anchor fallback also fails, this still says "not found in graph". Suggest "getCheckpointState: baseState and anchor fallback both missing".

5. Performance: anchor re-read per missing checkpoint

Each new justified checkpoint that misses the graph triggers a fresh anchor-state read from disk (~327 MB on mainnet). A small LRU (or just remembering the anchor *checkpointState) would amortize this during the post-sync window.

Test coverage

The bug was a silent multi-month stall on mainnet that escaped CI. Worth adding:

  • A unit test for the getCheckpointState fallback path (fork graph returns nil for the checkpoint root, verify anchor is consulted).
  • A unit test for readBeaconStateFromDisk round-tripping a >128 MB blob to lock in that mainnet-sized states are accepted.

…allback

Address yperbasis's review on #21228:

Major 1 — Remove anchor fallback from getCheckpointState. The fallback
built a checkpointState from the anchor's validator set and registered
wrong pubkeys via publicKeysRegistry.AddState, causing attestation
verification to reject valid attestations. Instead, guard in GetHead:
when the justified root is absent from the fork graph, return the anchor
as head. This covers both getHead and getHeadGloas paths.

Major 2 — Add currentStateMu (sync.RWMutex) to forkGraphDisk. Moving
getCheckpointState outside forkchoice mu exposed a data race on
currentState/currentStateBlockRoot between getState reads and
AddChainSegment writes. Reads snapshot under RLock; writes take Lock.

Minor 3 — Replace removed maxSSZBufferSize with maxSSZObjectSize (1 GiB)
for both beacon state and envelope reads. Catches corrupt length fields
before OOM while accommodating mainnet's ~327 MB states.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@domiwei

domiwei commented May 18, 2026

Copy link
Copy Markdown
Member Author

@yperbasis Thanks for the thorough review. Pushed 1c66977 addressing all points:

Major 1 — Anchor fallback removed from getCheckpointState

Agreed the fallback was fundamentally wrong — it built a checkpointState with the anchor's validator set/balances and registered wrong pubkeys via publicKeysRegistry.AddState, which would reject valid attestations.

Removed the fallback entirely. Instead, added a guard at the top of GetHead (the public entry point, before the getHead/getHeadGloas dispatch):

if _, hasJustified := f.forkGraph.GetHeader(justifiedCheckpoint.Root); !hasJustified {
    return f.forkGraph.AnchorRoot(), f.forkGraph.AnchorSlot(), nil
}

This covers both pre-GLOAS and GLOAS paths. getCheckpointState now only returns correct data or errors.

Major 2 — currentStateMu RWMutex added

Added currentStateMu sync.RWMutex to forkGraphDisk. getState snapshots currentState/currentStateBlockRoot under RLock; all write sites (AddChainSegment success/error, getState transition error) take Lock. Also switched the diagnostic block after unlock to use newState instead of f.currentState.

Minor 3 — SSZ cap raised to 1 GiB

Replaced the removed 128 MB cap with maxSSZObjectSize = 1 << 30 (1 GiB) for both beacon state and envelope reads. ~3x headroom over current mainnet ~327 MB states.

Minor 4 — Error message

Moot — the fallback path is gone, so the original "baseState not found in graph" message is restored.

Minor 5 — Anchor re-read performance

Also moot — no anchor reads happen in getCheckpointState anymore.

Test coverage

Didn't add unit tests for the GetHead guard in this commit — the scenario (justified root absent from fork graph) requires a fork graph mock with missing headers, which is a bit involved. Happy to add if you think it's important.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment thread cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go
Comment on lines 98 to 103
// current state data — protected by currentStateMu when accessed
// concurrently (e.g. getCheckpointState runs outside forkchoice mu).
currentStateMu sync.RWMutex
currentState *state.CachingBeaconState
currentStateBlockRoot common.Hash // block root of the last processed block (avoids BlockRoot() zeroed-StateRoot issue)

Comment thread cl/phase1/forkchoice/get_head.go

@yperbasis yperbasis left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up commit 1c66977 cleanly addresses my prior review. One small nit from Copilot #2 worth tightening — the field comment on currentState / currentStateBlockRoot (fork_graph_disk.go:97-99) says they're "protected by currentStateMu when accessed concurrently", but two read sites still skip the mutex: isBlockRootTheCurrentState (lines 215-219) and the missing-segment diagnostic in AddChainSegment (lines 254-256). Currently safe because AddChainSegment only runs under the outer f.mu (held by OnBlock), so f.mu serializes those reads against the writer — but the comment promises a uniform invariant the code doesn't actually maintain, and a future read path added outside f.mu without currentStateMu would silently race.

Two options, your pick:

  1. Cheapest fix — wrap the two unguarded reads in currentStateMu.RLock() / RUnlock(). Overhead is negligible and the invariant becomes uniform.
  2. Narrow the doc — keep the dual-protection scheme but rewrite the field comment to spell it out, e.g. "Reads under f.mu rely on that lock (only AddChainSegment writes); reads outside f.mu must take currentStateMu".

Either is fine — approving so this isn't blocked.

@yperbasis yperbasis added this pull request to the merge queue May 19, 2026
@yperbasis yperbasis removed this pull request from the merge queue due to a manual request May 19, 2026
…ment

Address yperbasis's nit on #21228: the previous comment promised a
uniform "protected by currentStateMu when accessed concurrently"
invariant that the code doesn't actually maintain — isBlockRootTheCurrentState
and the AddChainSegment missing-segment diagnostic read these fields
without currentStateMu, relying on the outer f.mu to serialize against
the writer. Spell out the dual-protection scheme so a future read path
added outside f.mu doesn't silently race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yperbasis yperbasis enabled auto-merge May 19, 2026 10:10
@yperbasis yperbasis added this pull request to the merge queue May 19, 2026
Merged via the queue into main with commit 4f0a7fe May 19, 2026
68 checks passed
@yperbasis yperbasis deleted the kewei/fix_basestate_not_found_35 branch May 19, 2026 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Caplin Caplin: Consensus Layer, Beacon API

Projects

None yet

Development

Successfully merging this pull request may close these issues.

checkpoint state error, unable to get in sync (main branch)

3 participants