cl/forkchoice: fix Caplin stuck on mainnet after GLOAS merge by domiwei · Pull Request #21228 · erigontech/erigon

domiwei · 2026-05-16T19:29:32Z

Summary

Fixes #21272 — Caplin gets stuck with baseState not found in graph on mainnet after the GLOAS merge (#18956).

Root cause: The GLOAS PR added maxSSZBufferSize = 128 MB in readBeaconStateFromDisk. Mainnet beacon state with ~1.5M validators is ~327 MB after decompression, so every state read was rejected as "corrupt". This made getCheckpointState always fail.

Three additional issues compounded the problem:

Duplicate InitBeaconState: DecodeSSZ already calls InitBeaconState internally (ssz.go:40). The explicit second call doubled the decode time by rebuilding the pubkey index for ~1.5M validators.
Lock contention: getHead held f.mu.Lock while calling getCheckpointState, which reads and decodes a large state from disk. This blocked the OnTick goroutine for the entire duration.
Missing fallback: After checkpoint sync, the justified checkpoint root may predate the anchor and not exist in the fork graph. Before GLOAS this was masked because forward sync always ran to completion (no stale timeout).

Changes

Replace 128 MB maxSSZBufferSize with a 1 GiB maxSSZObjectSize cap in readBeaconStateFromDisk / ReadEnvelopeFromDisk — ~3x headroom over current mainnet ~327 MB states while still bounding make([]byte, length) against a corrupt length field
Remove duplicate InitBeaconState() in readBeaconStateFromDisk
Move getCheckpointState before f.mu.Lock() in getHead
Fall back to anchor state in getCheckpointState when checkpoint root is not in graph

Test plan

Verified on dev-bm-e3-sepolia-n1 (mainnet, fresh datadir):
- Before: baseState not found in graph on every ForkChoice round
- After: Imported chain segment every ~12s, ForkChoice in ~2.9s
Existing unit tests pass (TestGetState_InfiniteLoopOnMissingStateFile, TestForkGraphInDisk)

🤖 Generated with Claude Code

Three issues introduced by the GLOAS PR caused Caplin to get stuck on mainnet after checkpoint sync: 1. maxSSZBufferSize (128 MB) too small for mainnet beacon state (~327 MB uncompressed with ~1.5M validators). readBeaconStateFromDisk rejected the state as "corrupt", so getState could never load a saved state from disk. Raise the limit to 512 MB. 2. Duplicate InitBeaconState call in readBeaconStateFromDisk. DecodeSSZ already calls InitBeaconState internally (ssz.go:40). The explicit second call rebuilt the pubkey index for ~1.5M validators redundantly, doubling the decode time. Remove it. 3. getHead held f.mu.Lock while calling getCheckpointState, which reads and decodes a ~327 MB state from disk. This blocked the OnTick goroutine (which also needs f.mu) for the entire duration. Move getCheckpointState before f.mu.Lock — it only uses atomic values, sync.Map, and its own stateDumpLock. Additionally, after checkpoint sync the justified checkpoint root may point to a block that predates the anchor and is not in the fork graph. Before GLOAS this was masked because forward sync ran to completion. The GLOAS stale timeout can exit early, leaving justified at the pre-anchor value. Add a fallback to the anchor state in getCheckpointState. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Fixes Caplin forkchoice stalls after the GLOAS merge by unblocking beacon-state reads/decodes on mainnet-sized states and reducing lock contention in head computation.

Changes:

Removes the fixed SSZ size cap and avoids redundant beacon-state cache initialization during disk reads.
Computes checkpoint state before acquiring the forkchoice mutex to prevent long OnTick stalls.
Adds a fallback to use the anchor state when a checkpoint state can’t be obtained from the fork graph.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`cl/phase1/forkchoice/utils.go`	Adds anchor-state fallback in `getCheckpointState` when the checkpoint state can’t be retrieved.
`cl/phase1/forkchoice/get_head.go`	Moves checkpoint-state computation outside `f.mu` to reduce lock contention during head finding.
`cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go`	Removes max-size checks and redundant cache re-init during state/envelope disk reads.

Comments suppressed due to low confidence (2)

cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go:82

Removing the SSZ length upper bound means length (read from disk) can trigger an unbounded make([]byte, length) before any SSZ validation runs. A corrupted/truncated file can therefore cause a panic (len out of range) or OOM during allocation. Consider keeping a defensive high-water mark (possibly configurable or derived from on-disk file size) before resizing f.sszBuffer, and return a clear error when exceeded.

This issue also appears on line 223 of the same file.

		return nil, fmt.Errorf("failed to read length: %d, want 8, root: %x", n, blockRoot)
	}

	length := binary.BigEndian.Uint64(lengthBytes)
	if length > uint64(cap(f.sszBuffer)) {
		f.sszBuffer = make([]byte, length)
	} else {
		f.sszBuffer = f.sszBuffer[:length]
	}

cl/phase1/forkchoice/fork_graph/fork_graph_disk_fs.go:231

Similar to beacon states, envelopeLength is trusted to size f.sszBuffer with no sanity limit. A corrupted envelope file can force a huge allocation/panic before DecodeSSZ runs. Add a defensive upper bound (even if much larger than before) or validate against the underlying file size before allocating.

		return nil, fmt.Errorf("failed to read length: %d, want 8, root: %x", n, blockRoot)
	}

	envelopeLength := binary.BigEndian.Uint64(lengthBytes)
	if envelopeLength > uint64(cap(f.sszBuffer)) {
		f.sszBuffer = make([]byte, envelopeLength)
	} else {
		f.sszBuffer = f.sszBuffer[:envelopeLength]
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

yperbasis

Nice diagnosis and the four issues are all real, but I'd like to see a couple of correctness/safety concerns addressed before merging.

Major

1. Anchor fallback cached under the wrong key

In utils.go, after the fallback to the anchor state:

f.checkpointStates.Store(checkpoint, checkpointState)

checkpointState is built from the anchor — actives, activeBalance, and state.Epoch(baseState.BeaconState) reflect the anchor's epoch, not checkpoint.Epoch. This gets cached under the missing checkpoint's key and serves all subsequent calls for that checkpoint, so:

Attestation weights in computeVotes use anchor-epoch active sets / balances rather than the checkpoint-epoch ones.
If the real checkpoint state becomes available later (e.g. via backfill), the cache keeps returning the stale fallback.

Options: build actives/balances from a state aligned to checkpoint.Epoch, or skip caching the fallback so the next call re-resolves. At minimum, document the inaccuracy and consider not caching when baseState came from the fallback path.

2. Latent data race on `forkGraphDisk.currentState` / `currentStateBlockRoot`

The pre-existing implicit protection was that OnBlock → AddChainSegment writes f.currentState/f.currentStateBlockRoot under f.mu, and all callers of forkGraph.GetState were also under f.mu. Moving getCheckpointState out from under f.mu means the read at fork_graph_disk.go:480 (if f.currentState != nil && f.currentStateBlockRoot == blockRoot) can now race with the AddChainSegment write. Worst case is a torn 32-byte hash read causing a spurious cache miss (still correct, just slower), but the race detector will likely flag this. Consider giving forkGraphDisk its own RWMutex over those two fields, or atomic.Pointer[CachingBeaconState] + a parallel atomic hash.

Minor

3. Unbounded allocation from disk length field

Removing maxSSZBufferSize lets a corrupt or truncated *.snappy_ssz file drive make([]byte, length) with an arbitrary 64-bit value before any SSZ validation. It's our own datadir (not an attack surface), but a partially-written file after an unclean shutdown could now OOM/panic instead of returning a clean error. Consider keeping a much more generous cap (e.g. bump to 1 GiB, or bound by file size × max snappy expansion).

4. Stale error message after the fallback

if baseState == nil {
    return nil, errors.New("getCheckpointState: baseState not found in graph")
}

After the anchor fallback also fails, this still says "not found in graph". Suggest "getCheckpointState: baseState and anchor fallback both missing".

5. Performance: anchor re-read per missing checkpoint

Each new justified checkpoint that misses the graph triggers a fresh anchor-state read from disk (~327 MB on mainnet). A small LRU (or just remembering the anchor *checkpointState) would amortize this during the post-sync window.

Test coverage

The bug was a silent multi-month stall on mainnet that escaped CI. Worth adding:

A unit test for the getCheckpointState fallback path (fork graph returns nil for the checkpoint root, verify anchor is consulted).
A unit test for readBeaconStateFromDisk round-tripping a >128 MB blob to lock in that mainnet-sized states are accepted.

…allback Address yperbasis's review on #21228: Major 1 — Remove anchor fallback from getCheckpointState. The fallback built a checkpointState from the anchor's validator set and registered wrong pubkeys via publicKeysRegistry.AddState, causing attestation verification to reject valid attestations. Instead, guard in GetHead: when the justified root is absent from the fork graph, return the anchor as head. This covers both getHead and getHeadGloas paths. Major 2 — Add currentStateMu (sync.RWMutex) to forkGraphDisk. Moving getCheckpointState outside forkchoice mu exposed a data race on currentState/currentStateBlockRoot between getState reads and AddChainSegment writes. Reads snapshot under RLock; writes take Lock. Minor 3 — Replace removed maxSSZBufferSize with maxSSZObjectSize (1 GiB) for both beacon state and envelope reads. Catches corrupt length fields before OOM while accommodating mainnet's ~327 MB states. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

domiwei · 2026-05-18T15:59:52Z

@yperbasis Thanks for the thorough review. Pushed 1c66977 addressing all points:

Major 1 — Anchor fallback removed from `getCheckpointState`

Agreed the fallback was fundamentally wrong — it built a checkpointState with the anchor's validator set/balances and registered wrong pubkeys via publicKeysRegistry.AddState, which would reject valid attestations.

Removed the fallback entirely. Instead, added a guard at the top of GetHead (the public entry point, before the getHead/getHeadGloas dispatch):

if _, hasJustified := f.forkGraph.GetHeader(justifiedCheckpoint.Root); !hasJustified {
    return f.forkGraph.AnchorRoot(), f.forkGraph.AnchorSlot(), nil
}

This covers both pre-GLOAS and GLOAS paths. getCheckpointState now only returns correct data or errors.

Major 2 — `currentStateMu` RWMutex added

Added currentStateMu sync.RWMutex to forkGraphDisk. getState snapshots currentState/currentStateBlockRoot under RLock; all write sites (AddChainSegment success/error, getState transition error) take Lock. Also switched the diagnostic block after unlock to use newState instead of f.currentState.

Minor 3 — SSZ cap raised to 1 GiB

Replaced the removed 128 MB cap with maxSSZObjectSize = 1 << 30 (1 GiB) for both beacon state and envelope reads. ~3x headroom over current mainnet ~327 MB states.

Minor 4 — Error message

Moot — the fallback path is gone, so the original "baseState not found in graph" message is restored.

Minor 5 — Anchor re-read performance

Also moot — no anchor reads happen in getCheckpointState anymore.

Test coverage

Didn't add unit tests for the GetHead guard in this commit — the scenario (justified root absent from fork graph) requires a fork graph mock with missing headers, which is a bit involved. Happy to add if you think it's important.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

+	// current state data — protected by currentStateMu when accessed
+	// concurrently (e.g. getCheckpointState runs outside forkchoice mu).
+	currentStateMu        sync.RWMutex
 	currentState          *state.CachingBeaconState
 	currentStateBlockRoot common.Hash // block root of the last processed block (avoids BlockRoot() zeroed-StateRoot issue)



yperbasis

Follow-up commit 1c66977 cleanly addresses my prior review. One small nit from Copilot #2 worth tightening — the field comment on currentState / currentStateBlockRoot (fork_graph_disk.go:97-99) says they're "protected by currentStateMu when accessed concurrently", but two read sites still skip the mutex: isBlockRootTheCurrentState (lines 215-219) and the missing-segment diagnostic in AddChainSegment (lines 254-256). Currently safe because AddChainSegment only runs under the outer f.mu (held by OnBlock), so f.mu serializes those reads against the writer — but the comment promises a uniform invariant the code doesn't actually maintain, and a future read path added outside f.mu without currentStateMu would silently race.

Two options, your pick:

Cheapest fix — wrap the two unguarded reads in currentStateMu.RLock() / RUnlock(). Overhead is negligible and the invariant becomes uniform.
Narrow the doc — keep the dual-protection scheme but rewrite the field comment to spell it out, e.g. "Reads under f.mu rely on that lock (only AddChainSegment writes); reads outside f.mu must take currentStateMu".

Either is fine — approving so this isn't blocked.

…ment Address yperbasis's nit on #21228: the previous comment promised a uniform "protected by currentStateMu when accessed concurrently" invariant that the code doesn't actually maintain — isBlockRootTheCurrentState and the AddChainSegment missing-segment diagnostic read these fields without currentStateMu, relying on the outer f.mu to serialize against the writer. Spell out the dual-protection scheme so a future read path added outside f.mu doesn't silently race. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

domiwei requested a review from Giulio2002 as a code owner May 16, 2026 19:29

domiwei force-pushed the kewei/fix_basestate_not_found_35 branch from bf25e31 to d596b07 Compare May 16, 2026 19:36

yperbasis requested a review from Copilot May 18, 2026 09:30

Copilot started reviewing on behalf of yperbasis May 18, 2026 09:31 View session

yperbasis added the Caplin Caplin: Consensus Layer, Beacon API label May 18, 2026

Copilot AI reviewed May 18, 2026

View reviewed changes

Comment thread cl/phase1/forkchoice/utils.go Outdated

yperbasis requested changes May 18, 2026

View reviewed changes

yperbasis requested a review from Copilot May 19, 2026 08:59

Copilot started reviewing on behalf of yperbasis May 19, 2026 08:59 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

yperbasis approved these changes May 19, 2026

View reviewed changes

yperbasis added this pull request to the merge queue May 19, 2026

yperbasis removed this pull request from the merge queue due to a manual request May 19, 2026

yperbasis mentioned this pull request May 19, 2026

checkpoint state error, unable to get in sync (main branch) #21272

Closed

yperbasis enabled auto-merge May 19, 2026 10:10

yperbasis added this pull request to the merge queue May 19, 2026

Merged via the queue into main with commit 4f0a7fe May 19, 2026
68 checks passed

yperbasis deleted the kewei/fix_basestate_not_found_35 branch May 19, 2026 11:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cl/forkchoice: fix Caplin stuck on mainnet after GLOAS merge#21228

cl/forkchoice: fix Caplin stuck on mainnet after GLOAS merge#21228
yperbasis merged 3 commits into
mainfrom
kewei/fix_basestate_not_found_35

domiwei commented May 16, 2026 •

edited by yperbasis

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

yperbasis left a comment

Uh oh!

domiwei commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

yperbasis left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

domiwei commented May 16, 2026 • edited by yperbasis Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

yperbasis left a comment

Choose a reason for hiding this comment

Major

1. Anchor fallback cached under the wrong key

2. Latent data race on forkGraphDisk.currentState / currentStateBlockRoot

Minor

3. Unbounded allocation from disk length field

4. Stale error message after the fallback

5. Performance: anchor re-read per missing checkpoint

Test coverage

Uh oh!

domiwei commented May 18, 2026

Major 1 — Anchor fallback removed from getCheckpointState

Major 2 — currentStateMu RWMutex added

Minor 3 — SSZ cap raised to 1 GiB

Minor 4 — Error message

Minor 5 — Anchor re-read performance

Test coverage

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

yperbasis left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

domiwei commented May 16, 2026 •

edited by yperbasis

Loading

2. Latent data race on `forkGraphDisk.currentState` / `currentStateBlockRoot`

Major 1 — Anchor fallback removed from `getCheckpointState`

Major 2 — `currentStateMu` RWMutex added

yperbasis left a comment •

edited

Loading