db/snapshotsync/freezeblocks: run blocks snapshot merge off the shared build semaphore by sudeepdino008 · Pull Request #21526 · erigontech/erigon

sudeepdino008 · 2026-05-30T08:50:26Z

Problem

block retirement holds snapshotBuild semaphore (snBuildAllowed, default size 1) across both collation and merge.
this semaphore is also shared by state aggregator.
when a large block merge runs (100k blocks), it blocks state collation, causing steps_in_db to accumulate (specially during catchup phase).
depends on db/snapshotsync: fix crash due to double close of decompressor #21545

Observed on a minimal node: an 8.2 GB / ~19-minute 025100→025200 transactions merge held the semaphore the whole time, starving state collation — stepsInDB climbed to 3.82 (chaindata bloats while collation waits).

Change

Release the semaphore after the collate phase and run the merge in its own goroutine, off the semaphore — mirroring how the aggregator runs MergeLoop.
The fast dump stays serialized against state-snapshot building (preserving the I/O-throttle intent); the slow merge no longer starves state collation.

Effect

With the same 8.2 GB block-tx merge in flight, state collation now proceeds concurrently — stepsInDB stays < 2 (was 3.82). The CLI retire path calls MergeBlocks explicitly so its behavior is unchanged.

Tests

TestBlockMergeRunsWithoutSemaphore — merge completes while the semaphore is fully held (regression-guards re-acquisition).
TestRetireBlocksInBackgroundReleasesSemaphore — background retire releases the semaphore, no leak.

Draft — full make lint + integration verification pending.

sudeepdino008 · 2026-06-01T07:48:54Z

Stacked on #21545 (refcount-safe closeWhatNotInList). This off-semaphore merge is only safe with that fix — without it the long merge's View gets closed out from under it by a concurrent OpenFolder, causing a use-after-close crash. Base is temporarily sudeep/snapshot-close-refcount-safety; will rebase onto main once #21545 merges.

…List closeWhatNotInList closed dirty segments that were not in the canonical on-disk list without checking their refcount, so it could close a segment (nilling its decompressor) while a live View/RoTx still held it. After a merge, TypedSegments->NoOverlaps drops the merged-away sub-segments from the list even though their files remain on disk; a concurrent OpenFolder then closed those subsumed segments out from under the merge's own View, and the View's later closeAndRemoveFiles hit FilePath() on the nil decompressor and crashed. Skip refcount-held segments in closeWhatNotInList; they are reaped on a later pass once the reader releases them. View/BeginRo stays lock-free.

… build semaphore Block retire held the shared snapshot-build semaphore across both the dump and the slow, expensive merge, blocking state-snapshot collation/prune for the merge's full duration. Release the semaphore after the dump and run the merge in its own goroutine (mirroring the aggregator's MergeLoop). The fast dump stays serialized against state building; the slow merge no longer starves state collation.

…ntech#21545) ## Problem - `dirtySegment.close()` (closes seg and idx) can happen on subsegments once some collation does `OpenFolder`, which uses `TypedSegments` which closes the subsegments. - `closeWhatNotInList`-- merge calls this and can crash because of close earlier. - maybe user in erigontech#19930 observed this - This started happening more after I tried to take snapshot merge off the build semaphore - erigontech#21526 ``` panic: runtime error: invalid memory address or nil pointer dereference seg.(*Decompressor).FilePath snapshotsync.(*DirtySegment).closeAndRemoveFiles snapshots.go:420 snapshotsync.(*RoTx).Close snapshots.go:537 snapshotsync.(*View).Close ``` ## Fix In `closeWhatNotInList`, skip segments with `refcount > 0`: a live reader still references them, so closing now would invalidate that reader. They are reaped on a later pass once the reader releases them (`closeWhatNotInList` already runs on every `OpenFolder`). `View`/`BeginRo` stays lock-free (erigontech#20490) — the fix is purely in the close path. ## Test `TestCloseWhatNotInListVsLiveViewDoesNotCrash` reproduces the crash deterministically (pure `snapshotsync`, no merge machinery): it builds sub-segments, opens a `View` over them, drops a covering merged file on disk, reopens (so `NoOverlaps` removes the subs from the list), and asserts `View.Close` does not crash. It fails before this change and passes after. Co-authored-by: Sudeep Kumar <sudeep.kumar@erigon.tech>

awskii

Order after #21571. Detaching the merge and clearing working before it finishes lets the next cycle's dump+OpenFolder run mid-merge. The merge publishes its .seg before the .idx (merger.go); with no registry on this branch, openSegments opens the bare .seg as a duplicate DirtySegment, then integrateMergedDirtyFiles→dirtySegments.Set drops it without closing the Decompressor (no finalizer) → mmap/fd leak, plus buildMissedIndices rebuilds the same .idx. This is the race #21571 closes, and the two aren't stacked (siblings off main) — merge #21571 first and rebase this onto it. One stale call after that rebase: DumpBlocks(...) at block_retire_merge_test.go:36 (#21571's new inProgress arg) — auto-merges clean, won't compile.

Bor still under the semaphore. retireBorBlocks→MergeBorBlocks (bor_snapshots.go:108) still runs inline inside RetireBlocks while snBuildAllowed is held, so bor chains keep the collation stall this targets.

Dead code: WaitForMerges() only has a test caller — the merge isn't awaited at shutdown (Stop() closes chainDB without waiting on mergeWg). Wire it into Stop() or drop it.

…ther files building events (erigontech#21571) ## Problem - BuildMissedIndices/openfolder/merge -- seem to have a race. e.g for ongoing merge if seg file is created and idx is not. And then we call `BuildMissedIndices`, it'll try to build the idx (even though merge goroutine is already building one). - erigontech#21526: allowing merge and RetireBlocks->BuildMissedIndices to run concurrently which exposes this issue. (before this PR, merge was under the semaphore, so collate+merge were okay) - even without the above PR, BuildMissedIndices is not safe (it can be called when starting erigon) to use concurrently with collate/merge or other BuildMissedIndices - this PR will make erigontech#21526 safe to merge ## Fix A per-`(type, from, to)` in-flight registry on `RoSnapshots`: - `TryAcquireRange` / `ReleaseRange` — atomic claim (`LoadOrStore`); `false` ⇒ another builder owns it. - `isInProgress` — read-only check. Wired in: - **merge** claims every `(type,range)` it produces before writing; a range it can't fully claim is being built elsewhere → skipped this pass; released after integrate. - **dump** (`dumpRange`) claims its `(type,range)` across `.seg`+`.idx` build. - **`BuildMissedIndices`** `TryAcquire`s per segment, skips if held — so it never rebuilds an in-flight file and concurrent recovery calls can't double-build. - **`openSegments`** skips in-progress ranges (read-only), so it never creates a duplicate `DirtySegment` for a half-built file. Once the owning builder is done processing, it can readd to `dirtyFiles` and do `OpenFolder`, so we don't miss the new file. --------- Co-authored-by: Sudeep Kumar <sudeep.kumar@erigon.tech>

…f-build-semaphore # Conflicts: # db/snapshotsync/freezeblocks/block_snapshots.go # db/snapshotsync/snapshots_test.go

…osing chainDB

sudeepdino008 · 2026-06-12T06:35:22Z

Order after #21571. Detaching the merge and clearing working before it finishes lets the next cycle's dump+OpenFolder run mid-merge. The merge publishes its .seg before the .idx (merger.go); with no registry on this branch, openSegments opens the bare .seg as a duplicate DirtySegment, then integrateMergedDirtyFiles→dirtySegments.Set drops it without closing the Decompressor (no finalizer) → mmap/fd leak, plus buildMissedIndices rebuilds the same .idx. This is the race #21571 closes, and the two aren't stacked (siblings off main) — merge #21571 first and rebase this onto it. One stale call after that rebase: DumpBlocks(...) at block_retire_merge_test.go:36 (#21571's new inProgress arg) — auto-merges clean, won't compile.

Bor still under the semaphore. retireBorBlocks→MergeBorBlocks (bor_snapshots.go:108) still runs inline inside RetireBlocks while snBuildAllowed is held, so bor chains keep the collation stall this targets.

Dead code: WaitForMerges() only has a test caller — the merge isn't awaited at shutdown (Stop() closes chainDB without waiting on mergeWg). Wire it into Stop() or drop it.

bor - don't care
WaitForMerges - done

Copilot

Pull request overview

This PR changes block snapshot retirement so the shared snapshot-build semaphore is held only for the fast “dump/collate” phase, while the slow “merge” phase runs asynchronously off the semaphore. This prevents long block snapshot merges from starving state snapshot collation/pruning (which can otherwise allow chaindata growth during catchup), and adds shutdown/test hooks to ensure background merges drain safely.

Changes:

Release snBuildAllowed after RetireBlocks (dump phase) and run MergeBlocks in a separate goroutine guarded by a single-merge gate.
Add WaitForMerges() to BlockRetire and call it during node shutdown to avoid DB-close hangs from in-flight merge RO transactions.
Add regression tests ensuring merges don’t block on the shared semaphore and that the semaphore is released in the background retire path; update retire CLI to explicitly call MergeBlocks.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
node/eth/backend.go	Waits for background block-snapshot merges to complete before closing `chainDB`.
db/snapshotsync/freezeblocks/block_snapshots.go	Splits retire into “dump under semaphore” + “merge off semaphore”; adds merge tracking/waiting API.
db/snapshotsync/freezeblocks/block_retire_merge_test.go	Adds tests covering semaphore-independence of merges and semaphore release in background retire.
db/services/interfaces.go	Extends `services.BlockRetire` interface with `WaitForMerges()`.
cmd/utils/app/snapshots_cmd.go	Ensures CLI retire path triggers merge explicitly after retiring blocks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

 	maxScheduledBlock  atomic.Uint64
 	working            atomic.Bool
+	merging            atomic.Bool
+	mergeWg            sync.WaitGroup
 	lastRetireGapStart atomic.Uint64


+func (br *BlockRetire) mergeBlocksInBackground(ctx context.Context, lvl log.Lvl, seeder downloader.SeederClient) {
+	if !br.merging.CompareAndSwap(false, true) {
+		return
+	}
+	br.mergeWg.Add(1)
+	go func() {
+		defer br.mergeWg.Done()
+		defer br.merging.Store(false)
+		if _, err := br.MergeBlocks(ctx, lvl, seeder); err != nil {
+			br.logger.Error("[snapshots] merge blocks", "err", err)
+		}
+	}()
+}
+
+// WaitForMerges blocks until in-flight background block merges complete.
+func (br *BlockRetire) WaitForMerges() { br.mergeWg.Wait() }


+	if _, err := br.MergeBlocks(ctx, log.LvlInfo, downloader.NoopSeederClient{}); err != nil {
+		return err
+	}


sudeepdino008 mentioned this pull request Jun 1, 2026

db/snapshotsync: fix crash due to double close of decompressor #21545

Merged

sudeepdino008 force-pushed the sudeep/block-merge-off-build-semaphore branch from 71413d1 to fd4ecdb Compare June 1, 2026 07:48

sudeepdino008 changed the base branch from main to sudeep/snapshot-close-refcount-safety June 1, 2026 07:48

sudeepdino008 marked this pull request as ready for review June 1, 2026 08:30

sudeepdino008 requested review from AskAlexSharov and yperbasis as code owners June 1, 2026 08:30

sudeepdino008 changed the title ~~db/snapshotsync/freezeblocks: run block-snapshot merge off the shared build semaphore~~ db/snapshotsync/freezeblocks: run blocks snapshot merge off the shared build semaphore Jun 1, 2026

sudeepdino008 changed the title ~~db/snapshotsync/freezeblocks: run blocks snapshot merge off the shared build semaphore~~ [wip] db/snapshotsync/freezeblocks: run blocks snapshot merge off the shared build semaphore Jun 1, 2026

sudeepdino008 force-pushed the sudeep/snapshot-close-refcount-safety branch from 1fa270f to c2eba3d Compare June 1, 2026 08:32

sudeepdino008 force-pushed the sudeep/block-merge-off-build-semaphore branch from fd4ecdb to 85779d4 Compare June 1, 2026 08:37

sudeepdino008 marked this pull request as draft June 1, 2026 08:39

Base automatically changed from sudeep/snapshot-close-refcount-safety to main June 1, 2026 10:22

sudeepdino008 mentioned this pull request Jun 2, 2026

db/snapshotsync: make index building and opening segments safe with other files building events #21571

Merged

sudeepdino008 requested a review from awskii June 5, 2026 05:08

awskii reviewed Jun 5, 2026

View reviewed changes

sudeepdino008 mentioned this pull request Jun 8, 2026

initial sync: big BlockTransactions table and stepsInDB=5 #21326

Open

sudeepdino008 added this to the 3.6.0 milestone Jun 8, 2026

yperbasis modified the milestones: 3.6.0, 3.7.0 Jun 10, 2026

Merge remote-tracking branch 'origin/main' into sudeep/block-merge-of…

77a3d48

…f-build-semaphore # Conflicts: # db/snapshotsync/freezeblocks/block_snapshots.go # db/snapshotsync/snapshots_test.go

sudeepdino008 marked this pull request as ready for review June 12, 2026 06:29

node/eth: wait for background block-snapshot merges in Stop before cl…

d05e586

…osing chainDB

sudeepdino008 changed the title ~~[wip] db/snapshotsync/freezeblocks: run blocks snapshot merge off the shared build semaphore~~ db/snapshotsync/freezeblocks: run blocks snapshot merge off the shared build semaphore Jun 12, 2026

yperbasis requested a review from Copilot June 12, 2026 08:14

Copilot started reviewing on behalf of yperbasis June 12, 2026 08:14 View session

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

db/snapshotsync/freezeblocks: run blocks snapshot merge off the shared build semaphore#21526

db/snapshotsync/freezeblocks: run blocks snapshot merge off the shared build semaphore#21526
sudeepdino008 wants to merge 4 commits into
mainfrom
sudeep/block-merge-off-build-semaphore

sudeepdino008 commented May 30, 2026 •

edited

Loading

Uh oh!

sudeepdino008 commented Jun 1, 2026

Uh oh!

awskii left a comment

Uh oh!

sudeepdino008 commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

sudeepdino008 commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Change

Effect

Tests

Uh oh!

sudeepdino008 commented Jun 1, 2026

Uh oh!

awskii left a comment

Choose a reason for hiding this comment

Uh oh!

sudeepdino008 commented Jun 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sudeepdino008 commented May 30, 2026 •

edited

Loading