simplify state aggregation and pruning logic by sudeepdino008 · Pull Request #21398 · erigontech/erigon

sudeepdino008 · 2026-05-25T08:03:56Z

issue: #21326.

Summary

- The reorgBlockDepth gate is skipped when frozenBlocks is wired: block-snapshot retirement already factors reorgBlockDepth in, so the boundary check transitively inherits it. The legacy arm (reorgBlockDepth>0 with frozenBlocks=nil) is preserved.
  - bloatnet has frozenBlocks disabled, which is why reorgBlockDepth is still maintained. It might be possible to remove it once a decision is made on the state collation strategy (capping or recovery, #21366).
- CollateAndPrune is collapsed to one prune plus one BuildFilesInBackground call. targetSteps=1.5, the spin loop, and CollateAndPruneIfNeeded are gone; the inner buildFilesInBackground loop already drains every uncollated step in the DB per invocation.
- The SetMaxCollationTxNum / MaxCollationTxNum / maxCollationTxNum atomic is removed. The block-snapshot boundary gate inside readyForCollation is the single source of truth for "don't collate past block files"; the three external cap callers (exec3.go, stageloop.go, executor.go) were either redundant with that gate or strictly more permissive.
- lastFlushedCommitmentTxNum is removed; stepFullyCommitted does the same thing, so it was a duplicate mechanism. I'm sceptical of stepFullyCommitted too, but not convinced enough to remove it.
- readyForCollation now reads lastCollatableStepInDB from TblAccountHistoryKeys.LastKey (which matches StepsInDB) instead of TxNums.Last, which actually reflected the chain head and made the log line misleading.

Test plan

make lint (two passes, 0 issues)
make erigon integration (clean build)
Smoke test on mirrored mainnet minimal datadir: stepsInDB stayed ~1.5, "holding state collation at block snapshot boundary" fired with the renamed lastCollatableStepInDB field, no errors/panics. Cap math verified: step 9059 collated (allowed), step 9060 held ((9060+1)*stepSize > capTxNum).
CI: full test suite + race

sudeepdino008 · 2026-05-25T10:27:35Z

BlockTransaction is stable (refer to gh issue)

steps_in_db is holding fine across multiple runs.

AskAlexSharov · 2026-05-25T10:38:44Z

On your picture I see steps_in_db = 2. Why not 1?
Run with COLLECT_TABLE_SIZES_FREQUENCY=3s

AskAlexSharov · 2026-05-25T10:40:30Z

Here is how the performance branch behaves:

(I understand that state depends on blocks in main, but steps_in_db=2 looks like a bug to me)

sudeepdino008 · 2026-05-25T11:10:35Z

> On your picture I see steps_in_db = 2. Why not 1? Run with COLLECT_TABLE_SIZES_FREQUENCY=3s

Table sizes are fine (whole DB is around 10 GB).

The reason steps_in_db is close to 2: block collation caps state collation, and the block step size is 1000.
Mainnet is at 300-400 tx/block and the state stepSize is 390k. So, approximately, in the worst case we can get close to 2 steps in the DB.

e.g. I'm at chaintip now:

exec head: 25,171,878 (9062.22 step)
last frozen block: 25,170,999 (file v1.1-025170-025171-headers.seg) -- 9061.58

which means 9061 cannot be collated. So we will have 1 step + the growing/active step of 9062 in db...
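The arithmetic above can be sketched in Go as follows. This is only an illustration of the cap rule quoted in the test plan (a step is held when `(step+1)*stepSize > capTxNum`), not erigon's code: `stepHeld` and `lastCollatableStep` are hypothetical helpers, `stepSize` is taken as 390625, and the two tx numbers are approximations derived from the fractional steps 9061.58 and 9062.22.

```go
package main

import "fmt"

// stepSize is the state step size in transactions. 390625 is the value used
// in the arithmetic in this thread; it is assumed here, not read from erigon.
const stepSize uint64 = 390_625

// stepHeld reports whether a step must wait because its end lies past the cap.
func stepHeld(step, capTxNum uint64) bool {
	return (step+1)*stepSize > capTxNum
}

// lastCollatableStep returns the highest step that is not held by the cap.
// ok is false when not even step 0 fits under the cap.
func lastCollatableStep(capTxNum uint64) (step uint64, ok bool) {
	full := capTxNum / stepSize // number of whole steps below the cap
	if full == 0 {
		return 0, false
	}
	return full - 1, true
}

func main() {
	// Approximate tx numbers (assumptions): 9061.58 and 9062.22 steps.
	capTxNum := uint64(3_539_679_687)
	headTxNum := uint64(3_539_929_687)

	last, _ := lastCollatableStep(capTxNum)
	active := headTxNum / stepSize // step currently being filled at the head

	fmt.Println("last collatable step:", last)
	fmt.Println("step 9061 held:", stepHeld(9061, capTxNum))
	fmt.Println("steps kept in DB:", active-last)
}
```

Under these assumptions step 9060 is the last one that may be collated, so steps 9061 and 9062 both stay in the DB.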

This is maybe more evidence for why we should reduce the block stepSize.
On bloatnet it was so bad (because the state stepSize is very low) that we had to disable the cap altogether so that the block stepSize didn't matter.

AskAlexSharov · 2026-05-25T11:37:18Z

"because of block collation cap state collation" - this is impossible to tell from Grafana. If tomorrow I see steps_in_db=3, is it a bug or is it because of the cap by blocks?

AskAlexSharov · 2026-05-25T11:38:15Z

Does it mean we need to decrease the block files step size?

sudeepdino008 · 2026-05-25T12:03:04Z

> "because of block collation cap state collation" - this is impossible to tell from Grafana. If tomorrow I see steps_in_db=3, is it a bug or is it because of the cap by blocks?

Assuming mainnet stays at 300-400 tx/block: yes, it's probably a bug. It shouldn't reach 3.

If it reaches 700-800 tx/block for a couple of hours, then 3 might be expected.
E.g. blocks 0-1000 are collated (which might correspond to state step 700_000.9, i.e. step 700_000 can't be collated) and the chain tip is at block 1990. Block collation is as far ahead as it can be, but those 990 blocks can contain 2 steps' worth of data (990 blocks * 800 txs / 390625 ≈ 2.03). So the total is about 3 steps.

As a formula it's something like: worst-case steps_in_db = 1 + avg_tx_per_block*1000/stepSize
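A minimal Go sketch of that rule of thumb, for plugging in different rates. The function name `worstStepsInDB` and its parameters are hypothetical; the inputs (1000 blocks per block file, stepSize 390625, 350 and 800 tx/block) are the figures mentioned in this thread.

```go
package main

import "fmt"

// worstStepsInDB estimates the worst-case number of state steps held in the DB
// when state collation is capped at the block-file boundary:
// one step that cannot be collated yet, plus whatever a full block-file span
// of blocks adds on top of it.
func worstStepsInDB(avgTxPerBlock, blocksPerFile, stepSize float64) float64 {
	return 1 + avgTxPerBlock*blocksPerFile/stepSize
}

func main() {
	const blocksPerFile = 1000 // block files step size discussed in the thread
	const stepSize = 390625    // state step size used in the thread's arithmetic

	for _, rate := range []float64{350, 800} {
		fmt.Printf("%.0f tx/block -> %.2f steps\n",
			rate, worstStepsInDB(rate, blocksPerFile, stepSize))
	}
}
```

With these inputs the estimate comes out at roughly 1.9 steps for 350 tx/block and roughly 3.05 steps for 800 tx/block, in line with the "close to 2" and "about 3" figures above. Lowering `blocksPerFile` to 100 shrinks the second term by a factor of ten.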

> Does it mean we need to decrease the block files step size?

Yes. We should consider it, especially given we're moving towards smaller step sizes.

There's an assumption behind this: the "block collation cap state" feature lets us avoid the "state ahead of blocks" error. @mh0lt seems to have experience here, and aligning snapshots is the solution he has settled on (instead of "just download blocks", which he said is fragile and bites later). I tried the "download blocks" solution but got an error in a different place some days later. So I'm convinced about "block collation cap state".

taratorio · 2026-05-25T12:12:24Z

> Does it mean we need to decrease the block files step size?

I agree with Alex on this. We should always start collating as soon as 1 step is available in the DB and its last tx num is past the reorg range. If we are now introducing a "pause collation until frozen blocks" then it means that we need to decrease block files step size to 100 so that we maintain the "start collation of state as soon as a step is available and non-reorgable". Or, get rid of this "frozen block pause" and think of another solution to the "state ahead of blocks" error.

taratorio · 2026-05-25T12:13:57Z

I actually think that the "state ahead of blocks" error is solve-able without this. The way we had solved it before was we just let execution skip this error and let the node catch up and re-exec.

AskAlexSharov · 2026-05-25T12:25:28Z

> I actually think that the "state ahead of blocks" error is solve-able without this. The way we had solved it before was we just let execution skip this error and let the node catch up and re-exec.

Mark is worried it can cause a non-canonical block to end up in .seg files

AskAlexSharov · 2026-05-25T12:32:21Z

Anyway, main has 5 steps in DB, which is a bug.
If this PR solves it, then it's already a step forward.

taratorio · 2026-05-25T12:34:44Z

> > I actually think that the "state ahead of blocks" error is solve-able without this. The way we had solved it before was we just let execution skip this error and let the node catch up and re-exec.
>
> Mark is worried it can cause a non-canonical block to end up in .seg files

that's never happened.
sounds like we are creating complications for no reason

taratorio · 2026-05-25T12:37:07Z

> Anyway, main has 5 steps in DB, which is a bug. If this PR solves it, then it's already a step forward.

the bug is due to a regression introduced by Mark's agent
nothing necessitated the addition of "frozen block cap on collation", just the regression needed fixing that's all
I personally don't like this solution

taratorio · 2026-05-25T12:38:23Z

we never had more than 1.2 steps in DB before, now we have 2+....

taratorio · 2026-05-25T12:59:35Z

> I tried the "download blocks" solution but got an error in a different place some days later. So I'm convinced about "block collation cap state".

My suspicion is that "got an error in a different place some days later" is not related to "skip state ahead of blocks error and let the node catch up" but to the many regressions that were introduced to execution on main over the last few months.

before we introduce this block cap I think we should investigate what this error was and what its real root cause is/was

AskAlexSharov · 2026-05-25T13:02:00Z

Agreed with Milen to switch back to #20546, as it will allow us to have steps_in_db=1 (and it will also be more predictable, with fewer factors impacting each other). That applies on Bloatnet too (the current solution can't survive on Bloatnet). It also lets us avoid touching BlockFiles for now: they have been stable for a long time, and touching them may delay the parallel-exec release (regressions, etc.). Also, the bottlenecks in chaindata now are Commitment.Domain (bloatnet) and Commitment.History (mainnet), so there is not much reason to touch BlockFiles.

taratorio

This PR does some good cleanups, so let's merge it. My comments were about collation block cap vs recovery.

taratorio · 2026-05-26T01:10:57Z

> This PR does some good cleanups, so let's merge it. My comments were about collation block cap vs recovery.

Going back to recovery instead of block cap can be a follow-up PR.

- Merge `origin/main` (up to #21546) into the `performance` branch.
- Conflicts resolved by taking main's finalized form where the perf branch was behind (`ExistenceFilterVersion`→1 per #21164, `mdbx-go`→v0.40.1, `merge.go` `findMergeRangeInFiles` refactor, dropped `erigon-snapshot` module dep, fusefilter deferred-close refactor, `Versions.MustSupport`, atomic per-key prune throttle).
- Adopted main's collation-at-tip design (`CollateAndPrune` in the FCU path, #21398/#21415) and removed the perf branch's older `frozenBlocks`-gating (`SetFrozenBlocksProvider`/`MaxCollatableTxNum`, `db/services/snapshot_progress.go`, its gating tests, and callers).
- Verified: `make erigon integration` build, `make lint` (clean), `make test-short` (green).

sudeepdino008 added 3 commits May 25, 2026 13:32

save

8711095

save

735efb3

save

b7cf996

sudeepdino008 changed the title ~~[wip] simplify state aggregate and pruning logic~~ simplify state aggregate and pruning logic May 25, 2026

sudeepdino008 requested a review from taratorio May 25, 2026 10:23

sudeepdino008 marked this pull request as ready for review May 25, 2026 10:24

sudeepdino008 requested review from AskAlexSharov, mh0lt and yperbasis as code owners May 25, 2026 10:24

sudeepdino008 changed the title ~~simplify state aggregate and pruning logic~~ simplify state aggregation and pruning logic May 25, 2026

AskAlexSharov approved these changes May 25, 2026

taratorio approved these changes May 26, 2026

taratorio added this pull request to the merge queue May 26, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 26, 2026

AskAlexSharov added this pull request to the merge queue May 26, 2026

Merged via the queue into main with commit ee74ba5 May 26, 2026
83 of 134 checks passed

AskAlexSharov deleted the sudeep/collate-and-prune-simplify branch May 26, 2026 02:17

sudeepdino008 mentioned this pull request May 26, 2026

re-introduce block catchup recovery #21415

Merged

sudeepdino008 mentioned this pull request Jun 3, 2026

merge origin/main into performance #21595

Merged

sudeepdino008 mentioned this pull request Jun 8, 2026

initial sync: big BlockTransactions table and stepsInDB=5 #21326

Open
