simplify state aggregation and pruning logic #21398
Table sizes are fine (whole DB is around 10 GB). The reason steps_in_db is close to 2 is that block collation caps state collation, combined with a block step size of 1000. E.g. I'm at chain tip now:
which means step 9061 cannot be collated. So we will have 1 full step + the growing/active step 9062 in the DB... This is maybe more evidence for why we should reduce the blocks step size.
Does it mean we need to decrease the Blocks files step size?
Assuming mainnet stays at 300-400 tx/block -- yes, it's probably a bug. It shouldn't reach 3. If it reaches 700-800 tx/block for a couple of hours, then 3 might be expected. As a formula it's something like: worst-case steps_in_db = 1 + avg_rate*1000/stepSize (avg_rate in tx/block, 1000 being the block step size).
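The rough formula above can be tried out with a small Go sketch. The function name `worstStepsInDB` and the state step size of 390,625 tx nums are assumptions made for illustration only; the thread does not state the actual step size, so substitute the value your node uses.

```go
package main

import "fmt"

// worstStepsInDB is a hypothetical helper that evaluates the estimate from the
// comment: 1 + avgRate*blockStepSize/stateStepSize.
//   avgTxPerBlock: average transactions per block
//   blockStepSize: blocks per block-file step (1000 in the discussion)
//   stateStepSize: tx nums per state step (assumed, not taken from the thread)
func worstStepsInDB(avgTxPerBlock, blockStepSize, stateStepSize float64) float64 {
	return 1 + avgTxPerBlock*blockStepSize/stateStepSize
}

func main() {
	// Assumed state step size; replace with the real configured value.
	const assumedStateStepSize = 390_625
	for _, rate := range []float64{350, 750} {
		est := worstStepsInDB(rate, 1000, assumedStateStepSize)
		fmt.Printf("%.0f tx/block -> worst-case steps_in_db ~ %.2f\n", rate, est)
	}
}
```

Under that assumed step size, the estimate stays below 2 at 350 tx/block and approaches 3 at 750 tx/block, which is in line with the ranges mentioned in the comment.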
Yes. We should consider it, especially given we're moving towards smaller step sizes. There's an assumption behind this: the "block collation cap state" feature allows us to avoid the "state ahead of blocks" error. @mh0lt seems to have experience here, and aligning snapshots is the solution he's settled on (instead of "just download blocks", which he said is fragile and bites later). I tried the "download blocks" solution, but got an error in a different place some days later. So I'm convinced about "block collation cap state".
I agree with Alex on this. We should always start collating as soon as 1 step is available in the DB and its last tx num is past the reorg range. If we are now introducing a "pause collation until frozen blocks" rule, then we need to decrease the block files step size to 100 so that we maintain the "start collation of state as soon as a step is available and non-reorgable" behaviour. Or, get rid of this "frozen block pause" and think of another solution to the "state ahead of blocks" error.
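The rule described in this comment (collate a step once it is fully in the DB and its last tx num is past the reorg range) could be sketched as below. This is not Erigon's code: `stepIsCollatable`, `maxTxNumInDB` and `firstReorgableTxNum` are hypothetical names, and the sketch assumes step `s` covers tx nums `[s*stepSize, (s+1)*stepSize)` with `stepSize > 0`.

```go
package main

import "fmt"

// stepIsCollatable is a hypothetical predicate, not an Erigon API.
// A step qualifies when:
//   - its last tx num is already written to the DB, and
//   - its last tx num is below the first tx num that can still be reorged.
func stepIsCollatable(step, stepSize, maxTxNumInDB, firstReorgableTxNum uint64) bool {
	lastTxNumOfStep := (step+1)*stepSize - 1
	fullyInDB := lastTxNumOfStep <= maxTxNumInDB
	pastReorgRange := lastTxNumOfStep < firstReorgableTxNum
	return fullyInDB && pastReorgRange
}

func main() {
	const stepSize = 10 // tiny illustrative step size; step 2 covers tx nums 20..29

	fmt.Println(stepIsCollatable(2, stepSize, 35, 30)) // in DB, not reorgable
	fmt.Println(stepIsCollatable(2, stepSize, 35, 29)) // tx num 29 is still reorgable
	fmt.Println(stepIsCollatable(2, stepSize, 28, 40)) // step not yet fully in DB
}
```

Note that this predicate has no dependency on frozen block files, which is the property the comment argues for.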
I actually think that the "state ahead of blocks" error is solvable without this. The way we solved it before was to let execution skip this error and let the node catch up and re-exec.
Mark is worried it can cause a non-canonical block to go into .seg files
anyway -
that's never happened. |
the bug is due to a regression introduced by Mark's agent |
we never had more than 1.2 steps in DB before, now we have 2+.... |
my view is that before we introduce this block cap we should investigate what this error was and what its real root cause is/was
Agreed with Milen to switch back to #20546, as it will allow us to have step_in_db=1 (and it will also be more predictable: fewer factors impacting each other). That holds on Bloatnet too (the current solution can't survive on Bloatnet). It will also let us not touch BlockFiles for now: they have been stable for a long time, and touching them may delay the parallel-exec release (regressions, etc.). Also, the bottlenecks in chaindata right now are Commitment.Domain (bloatnet) and Commitment.History (mainnet), so there is not much reason to touch BlockFiles.
taratorio
left a comment
This PR does some good cleanups, so let's merge it. My comments were about collation block cap vs recovery.
Going back to recovery instead of block cap can be a follow-up PR.
- Merge `origin/main` (up to #21546) into the `performance` branch.
- Conflicts resolved by taking main's finalized form where the perf branch was behind (`ExistenceFilterVersion`→1 per #21164, `mdbx-go`→v0.40.1, `merge.go` `findMergeRangeInFiles` refactor, dropped `erigon-snapshot` module dep, fusefilter deferred-close refactor, `Versions.MustSupport`, atomic per-key prune throttle).
- Adopted main's collation-at-tip design (`CollateAndPrune` in the FCU path, #21398/#21415) and removed the perf branch's older `frozenBlocks`-gating (`SetFrozenBlocksProvider`/`MaxCollatableTxNum`, `db/services/snapshot_progress.go`, its gating tests, and callers).
- Verified: `make erigon integration` build, `make lint` (clean), `make test-short` (green).



issue: #21326.
Summary
- `reorgBlockDepth` gate is skipped when `frozenBlocks` is wired (block-snapshot retirement already factors `reorgBlockDepth` in, so the boundary check transitively inherits it). The legacy `reorgBlockDepth>0 + frozenBlocks=nil` arm is preserved. `frozenBlocks` disabled -- that's why `reorgBlockDepth` is still maintained. Might be possible to remove once some decision on "state collation strategy: capping or recovery" #21366 is made.
- `CollateAndPrune` collapsed to one prune + one `BuildFilesInBackground` call. `targetSteps=1.5`, the spin loop, and `CollateAndPruneIfNeeded` are gone: the inner `buildFilesInBackground` loop already drains every uncollated step in DB per invocation.
- `SetMaxCollationTxNum` / `MaxCollationTxNum` / `maxCollationTxNum` atomic removed. The block-snapshot boundary gate inside `readyForCollation` is the single source of truth for "don't collate past block files"; the three external cap callers (`exec3.go`, `stageloop.go`, `executor.go`) were either redundant with that gate or strictly more permissive.
- `lastFlushedCommitmentTxNum` removed; `stepFullyCommitted` does the same thing, so it was a duplicate mechanism. I'm sceptical of `stepFullyCommitted` too, but not convinced enough to remove it.
- `readyForCollation` now reads `lastCollatableStepInDB` from `TblAccountHistoryKeys.LastKey` (matches `StepsInDB`) instead of `TxNums.Last`, which actually reflected chain head and made the log line misleading.

Test plan
- `make lint` (two passes, 0 issues)
- `make erigon integration` (clean build)
- `stepsInDB` stayed ~1.5, `holding state collation at block snapshot boundary` fired with the renamed `lastCollatableStepInDB` field, no errors/panics. Cap math verified: step 9059 collated (allowed), step 9060 held (`(9060+1)*stepSize > capTxNum`).
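The cap math from the test plan, together with the "one invocation drains every uncollated step" behaviour from the summary, can be illustrated with a minimal Go sketch. It is not the PR's implementation: `stepWithinCap` and `drainCollatable` are hypothetical names, and both the step size (390,625) and the `capTxNum` value are assumptions picked so that step 9059 passes and step 9060 is held, as in the reported run.

```go
package main

import "fmt"

// stepWithinCap is a hypothetical stand-in for the boundary gate:
// a step is held when (step+1)*stepSize > capTxNum.
func stepWithinCap(step, stepSize, capTxNum uint64) bool {
	return (step+1)*stepSize <= capTxNum
}

// drainCollatable returns the steps that a single invocation would collate,
// walking from firstStep to lastStepInDB and stopping at the first held step.
func drainCollatable(firstStep, lastStepInDB, stepSize, capTxNum uint64) []uint64 {
	var collated []uint64
	for step := firstStep; step <= lastStepInDB; step++ {
		if !stepWithinCap(step, stepSize, capTxNum) {
			break // gate holds this step and everything after it
		}
		collated = append(collated, step)
	}
	return collated
}

func main() {
	const stepSize = 390_625       // assumed state step size
	const capTxNum = 3_539_200_000 // illustrative cap between 9060*stepSize and 9061*stepSize

	fmt.Println("step 9059 allowed:", stepWithinCap(9059, stepSize, capTxNum))
	fmt.Println("step 9060 allowed:", stepWithinCap(9060, stepSize, capTxNum))
	fmt.Println("collated in one pass:", drainCollatable(9058, 9061, stepSize, capTxNum))
}
```

Because the loop itself stops at the gate, a single call is enough per invocation and no outer spin loop or separate atomic cap is needed in this sketch.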