State collation is parked at step 8903. Block snapshots cover up to txnum 3,477,899,736 (~225k txns short of the step-8904 boundary at 3,478,125,000), so even step 8904 can't be drained yet.
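The boundary arithmetic above can be checked with a small sketch. It assumes step N begins at txnum N × step_size, which is consistent with the 3,478,125,000 figure quoted here. The function name is illustrative and is not an Erigon API.

```python
STEP_SIZE = 390_625  # txns per step, from the node config quoted in this issue


def next_step_boundary(txnum: int, step_size: int = STEP_SIZE) -> tuple[int, int]:
    """Return (step, first_txnum) of the first step boundary strictly above txnum."""
    step = txnum // step_size + 1
    return step, step * step_size


frozen_blocks_txnum = 3_477_899_736  # highest txnum covered by block snapshots
step, boundary = next_step_boundary(frozen_blocks_txnum)
shortfall = boundary - frozen_blocks_txnum
print(f"step {step} boundary at {boundary:,}; block snapshots are {shortfall:,} txns short")
```

The shortfall is what block retirement still has to cover before the cap can move past that boundary.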
Why the cap made this worse
Before the cap, state files could outrun block files — unsafe (recovery required erigon seg rm-state --latest) but it kept the DB drained. The cap fixed the safety issue and exposed a cadence problem that was always there:
Block retirement runs only in the Prune phase — RetireBlocksInBackground is called from SnapshotsPrune (execution/stagedsync/stage_snapshots.go:442), after Execution finishes.
The cap stays frozen during ExecV3 — aggregator.readyForCollation re-reads FrozenBlocks() per step, but block snapshots don't grow during ExecV3.
Serial ExecV3 doesn't commit to MDBX mid-cycle — state accumulates in SharedDomains (buf=X/512MB log above); rwTx commits only when isBatchFull (exec3_serial.go:227) or blockLimit (exec3.go:688) trips. Steps committed in cycle N only become visible to the aggregator goroutine in cycle N+1's BuildFilesInBackground.
Net effect:
Cycle N: 5000 blocks / ~5–7 steps committed at cycle end
Prune: RetireBlocksInBackground → ~1 chunk → cap advances ~1 step
→ +5–7 steps committed, ≤1 step drained per cycle. Backlog grows.
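To make the growth rate concrete, here is a toy model of the net effect. It assumes a constant number of steps committed per cycle and at most one step drained per Prune. Both are simplifications of the numbers above, not measurements, and `simulate_backlog` is a hypothetical helper.

```python
def simulate_backlog(cycles: int, committed_per_cycle: int,
                     drained_per_cycle: int = 1, start: int = 0) -> list[int]:
    """Return the backlog (steps held in DB) after each cycle."""
    backlog = start
    history = []
    for _ in range(cycles):
        backlog += committed_per_cycle              # steps committed at cycle end
        backlog -= min(drained_per_cycle, backlog)  # Prune advances the cap by <= 1 step
        history.append(backlog)
    return history


low = simulate_backlog(cycles=10, committed_per_cycle=5)
high = simulate_backlog(cycles=10, committed_per_cycle=7)
print("after 10 cycles:", low[-1], "to", high[-1], "steps backlog")
```

Under these assumptions the backlog grows linearly, by (committed − drained) steps per cycle, for as long as catchup continues.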
Possible directions (suggestions, not a fix-list)
Eager block retirement driven off block progress rather than stage cadence, so the cap advances promptly as new blocks land instead of once-per-prune.
Revisit --sync.loop.block.limit (5000) and --batchSize (512M) defaults. With current mainnet density (~2M txs / 5000 blocks ≈ 5+ steps), one cycle straddles many step boundaries and the cap-induced backlog scales with cycle width. A batch sized to roughly one step might naturally keep the drain in pace.
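A back-of-the-envelope sketch of what "roughly one step" means in blocks, using only the rough density figure quoted above (~2M txs per 5000 blocks). The result is an estimate under that assumption, not a recommended flag value, and it says nothing about the byte size of a batch.

```python
STEP_SIZE = 390_625  # txns per step, from the node config quoted in this issue


def cycle_geometry(txs_per_cycle: float, blocks_per_cycle: int,
                   step_size: int = STEP_SIZE) -> tuple[float, float, float]:
    """Return (txs per block, steps per cycle, blocks per step)."""
    txs_per_block = txs_per_cycle / blocks_per_cycle
    steps_per_cycle = txs_per_cycle / step_size
    blocks_per_step = step_size / txs_per_block
    return txs_per_block, steps_per_cycle, blocks_per_step


txs_per_block, steps_per_cycle, blocks_per_step = cycle_geometry(2_000_000, 5_000)
print(f"{txs_per_block:.0f} txs/block, {steps_per_cycle:.2f} steps/cycle, "
      f"~{blocks_per_step:.0f} blocks/step")
```

At this density one step corresponds to roughly a thousand blocks, about a fifth of the current 5000-block cycle.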
Yield-on-collation in ExecV3. Mid-cycle BuildFiles is a non-starter because the rwTx isn't committed. Instead, ExecV3 could end the cycle early when it detects that the next step is collateable but blocked by uncommitted in-flight data — the outer loop commits, the next cycle's BuildFilesInBackground picks up the freshly-visible step. Trades cycle-start overhead for tighter drain cadence.
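The yield condition could look roughly like the predicate below. This is a hypothetical sketch: `should_yield` and its parameters are made up for illustration, and it assumes the cap lets a step collate once block snapshots cover that step's end boundary. The real check would live in ExecV3 and use the aggregator's own bookkeeping.

```python
def should_yield(boundary_txnum: int, frozen_blocks_txnum: int,
                 committed_txnum: int, inflight_txnum: int) -> bool:
    """Decide whether ending the cycle early would unblock one more step.

    boundary_txnum      -- end boundary of the next step waiting to be collated
    frozen_blocks_txnum -- highest txnum covered by block snapshots (the cap)
    committed_txnum     -- highest txnum already committed to the DB
    inflight_txnum      -- highest txnum executed in the current, uncommitted rwTx
    """
    cap_allows = boundary_txnum <= frozen_blocks_txnum    # block files cover the step
    fully_executed = boundary_txnum <= inflight_txnum     # step is complete in memory
    not_yet_visible = boundary_txnum > committed_txnum    # aggregator cannot see it yet
    return cap_allows and fully_executed and not_yet_visible


# Illustrative values only (not taken from the logs in this issue).
boundary = 3_478_125_000
print(should_yield(boundary, 3_479_000_000, 3_478_000_000, 3_478_500_000))  # cap ahead
print(should_yield(boundary, 3_477_899_736, 3_478_000_000, 3_478_500_000))  # cap behind
```

When the cap is behind the boundary, yielding would not help, so the predicate only trades cycle-start overhead for drain cadence when a commit would actually unblock collation.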
Problem
During snapshot-to-chaintip catchup on mainnet, executed state accumulates in the DB faster than it can be collated into state files.
As a result, ~5–7 steps are at times held in the DB, which defeats the point of the step size reduction.
Symptoms
Mainnet archive node, ~block 24.98M, step_size=390625, steps_in_frozen_file=256. Two consecutive cycles:
Cycle ending at 24,978,999 → next cycle 24,979,000 → 24,983,999 (5000 blocks, block-limit hit):
In-cycle progress shows the in-memory buffer climbing monotonically toward the 512MB cap — no mid-cycle drain to DB:
How wide is one cycle, in steps?
erigon seg txnum for the cycle's boundary blocks:
The cycle spans steps 8905 → 8912 ≈ 7 steps ((3,481,512,634 − 3,478,848,826) / 390,625 ≈ 6.82).
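The cycle-width figure can be reproduced from the two boundary txnums. A minimal sketch, assuming a txnum's step is its integer quotient by step_size; `steps_spanned` is an illustrative helper, not part of Erigon.

```python
STEP_SIZE = 390_625  # txns per step, from the node config quoted in this issue


def steps_spanned(first_txnum: int, last_txnum: int,
                  step_size: int = STEP_SIZE) -> tuple[int, int, float]:
    """Return (first step, last step, width in steps) for a txnum range."""
    first_step = first_txnum // step_size
    last_step = last_txnum // step_size
    width = (last_txnum - first_txnum) / step_size
    return first_step, last_step, width


first_step, last_step, width = steps_spanned(3_478_848_826, 3_481_512_634)
print(f"steps {first_step} -> {last_step}, width {width:.2f} steps")
```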
Refs
BuildFilesInBackground arg
aggregator.readyForCollation