db/state: serialize Aggregator.Set*Workers to fix race with collate (blocks #21039 CI)#21062
Merged
Conversation
The race detector flagged a concurrent read/write on InvertedIndex.CompressorCfg.Workers when ExecV3's chain-tip-driven PresetChainTipConcurrency() ran while a background buildFiles goroutine was inside InvertedIndex.collate, which reads ii.CompressorCfg by value to construct a seg.NewCompressor. Stack from the failing race-tests run on PR #21039 CI: Read at 0x... by goroutine N: db/state.(*InvertedIndex).collate inverted_index.go:1003 db/state.(*Aggregator).buildFiles aggregator.go:927 Previous write at 0x... by goroutine M: db/state.(*Aggregator).SetCompressWorkers aggregator.go:623 db/state.(*Aggregator).PresetChainTipConcurrency aggregator.go:651 execution/stagedsync.ExecV3 exec3.go:131 The lockWorkersEditing flag already existed as a soft-skip for Set*Workers calls, but its raw-bool read/write under no mutex meant (a) the read of the flag could observe stale state and let the write proceed concurrently with collate, and (b) the LockWorkersEditing / UnlockWorkersEditing toggle wasn't being called around buildFiles. Fix: - Add Aggregator.workersMu sync.Mutex. - All four Set*Workers (Collate/Build, Merge, Compress, BuildAccessors) acquire workersMu before checking lockWorkersEditing and writing. The combined check+write under the mutex closes the load-then-write race window. - LockWorkersEditing / UnlockWorkersEditing acquire workersMu around the bool toggle, providing happens-before so any prior Set*Workers' write is visible to the buildFiles goroutine that subsequently reads CompressorCfg, and any concurrent Set*Workers blocks until the toggle is observable. - Aggregator.buildFiles wraps its body in LockWorkersEditing / defer UnlockWorkersEditing so the chain-tip-driven Set*Workers calls are no-oped for the duration of collation, eliminating the race. Verified locally: TestHistoryVerification_SimpleBlocks now passes under -race; full ./db/state/... -race -short clean.
5 tasks
sudeepdino008
approved these changes
May 8, 2026
Contributor
Author
|
@yperbasis @erigontech/maintainers — prioritising this for review: this small, self-contained race fix in |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR addresses a Go data race in db/state where ExecV3’s chain-tip-driven worker reconfiguration (PresetChainTipConcurrency → Set*Workers) can overlap with snapshot collation/build work (buildFiles) and concurrently mutate/read compressor configuration.
Changes:
- Added
Aggregator.workersMuand madelockWorkersEditingreads/writes occur under that mutex to establish a clear happens-before relationship. - Updated all
Set*Workersmethods to serialize the “check lock + write” sequence underworkersMu, avoiding unsynchronized flag checks and concurrent config writes. - Wrapped
Aggregator.buildFileswithLockWorkersEditing/UnlockWorkersEditingto suppress worker-config writes during collation/build reads.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+885
to
+886
| a.LockWorkersEditing() | ||
| defer a.UnlockWorkersEditing() |
Comment on lines
+720
to
+724
| func (a *Aggregator) LockWorkersEditing() { | ||
| a.workersMu.Lock() | ||
| defer a.workersMu.Unlock() | ||
| a.lockWorkersEditing = true | ||
| } |
Member
|
Do you have plans to cherry-pick it to r3.4 ? |
pull Bot
pushed a commit
to Dustin4444/erigon
that referenced
this pull request
May 13, 2026
## Summary - Extract `merge`, `collateAndBuild`, `allowEditing`, and the mutex from `Aggregator` into a new `aggWorkers` value type - Public `Set*Workers` methods become unexported helpers (`setCompressWorkers`, `setBuildAccessorsWorkers`); the merge/collate counts are accessed via `workers.getMerge()` / `workers.setMerge()` etc. - Remove the now-unused `SetCollateAndBuildWorkers` / `SetMergeWorkers` / `SetCompressWorkers` stubs from `ForkableAgg` No behaviour change — this is a structural cleanup follow-up to erigontech#21062. --------- Co-authored-by: lupin012 <58134934+lupin012@users.noreply.github.com> Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a long-standing data race that the Go race detector flags when a chain-tip
PresetChainTipConcurrencycall fromExecV3overlaps with a backgroundbuildFilescollation. The race is between:Aggregator.SetCompressWorkers(aggregator.go:623) —ii.CompressorCfg.Workers = iInvertedIndex.collate(inverted_index.go:1003) —seg.NewCompressor(..., ii.CompressorCfg, ...)reads the cfg by valueSurfaced as a CI failure on PR #21039 (where the race-tests job actually ran):
Root cause
Aggregator.lockWorkersEditingalready existed as a soft-skip flag forSet*Workers, but:if lockWorkersEditingcheck could observe a stalefalsewhilebuildFileshad just toggled it true, leading to a concurrentCompressorCfg.Workers = iwrite.LockWorkersEditing/UnlockWorkersEditingwere never actually called aroundbuildFilesitself.Fix
Aggregator.workersMu sync.Mutex.Set*Workers(Collate/Build, Merge, Compress, BuildAccessors) acquireworkersMubefore the flag-check + write. The combined check+write under the mutex closes the load-then-write race window.LockWorkersEditing/UnlockWorkersEditingacquireworkersMuaround the bool toggle — provides a happens-before edge so any priorSet*Workers' write is visible to thebuildFilesgoroutine that subsequently readsCompressorCfg, and any concurrentSet*Workersblocks briefly until the toggle is observable.Aggregator.buildFileswraps its body inLockWorkersEditing/defer UnlockWorkersEditingso chain-tip-drivenSet*Workerscalls are no-oped for the duration of collation.Test plan
TestHistoryVerification_SimpleBlocks -racepasses (was the original repro)go test -race -short ./db/state/...clean (includes the aggregator + state-domain test suites)make lintcleanLockWorkersEditing/UnlockWorkersEditingsemantically stronger — they now actually synchronise rather than being a flag with no fence