Skip to content

db/state: serialize Aggregator.Set*Workers to fix race with collate (blocks #21039 CI)#21062

Merged
yperbasis merged 1 commit into
mainfrom
fix/aggregator-workers-race
May 8, 2026
Merged

db/state: serialize Aggregator.Set*Workers to fix race with collate (blocks #21039 CI)#21062
yperbasis merged 1 commit into
mainfrom
fix/aggregator-workers-race

Conversation

@mh0lt

@mh0lt mh0lt commented May 8, 2026

Copy link
Copy Markdown
Contributor

🚨 Blocks #21039 CI — the data race this PR fixes is the only reason #21039's race-tests / execution-other job fails. #21039 cannot merge until its CI is green, and the only way that happens is for this fix to land first (or be folded in, which would muddy two unrelated PRs). Please prioritise reviewing/merging.

Summary

Fixes a long-standing data race that the Go race detector flags when a chain-tip PresetChainTipConcurrency call from ExecV3 overlaps with a background buildFiles collation. The race is between:

  • Write in Aggregator.SetCompressWorkers (aggregator.go:623) — ii.CompressorCfg.Workers = i
  • Read in InvertedIndex.collate (inverted_index.go:1003) — seg.NewCompressor(..., ii.CompressorCfg, ...) reads the cfg by value

Surfaced as a CI failure on PR #21039 (where the race-tests job actually ran):

race-tests / tests-linux (ubuntu-latest, execution-other) → TestHistoryVerification_SimpleBlocks

Root cause

Aggregator.lockWorkersEditing already existed as a soft-skip flag for Set*Workers, but:

  1. The bool read/write was unsynchronised — the if lockWorkersEditing check could observe a stale false while buildFiles had just toggled it true, leading to a concurrent CompressorCfg.Workers = i write.
  2. LockWorkersEditing / UnlockWorkersEditing were never actually called around buildFiles itself.

Fix

  • New Aggregator.workersMu sync.Mutex.
  • All four Set*Workers (Collate/Build, Merge, Compress, BuildAccessors) acquire workersMu before the flag-check + write. The combined check+write under the mutex closes the load-then-write race window.
  • LockWorkersEditing / UnlockWorkersEditing acquire workersMu around the bool toggle — provides a happens-before edge so any prior Set*Workers' write is visible to the buildFiles goroutine that subsequently reads CompressorCfg, and any concurrent Set*Workers blocks briefly until the toggle is observable.
  • Aggregator.buildFiles wraps its body in LockWorkersEditing / defer UnlockWorkersEditing so chain-tip-driven Set*Workers calls are no-oped for the duration of collation.

Test plan

  • TestHistoryVerification_SimpleBlocks -race passes (was the original repro)
  • go test -race -short ./db/state/... clean (includes the aggregator + state-domain test suites)
  • make lint clean
  • No public-API change beyond making LockWorkersEditing / UnlockWorkersEditing semantically stronger — they now actually synchronise rather than being a flag with no fence

The race detector flagged a concurrent read/write on
InvertedIndex.CompressorCfg.Workers when ExecV3's chain-tip-driven
PresetChainTipConcurrency() ran while a background buildFiles
goroutine was inside InvertedIndex.collate, which reads
ii.CompressorCfg by value to construct a seg.NewCompressor.

Stack from the failing race-tests run on PR #21039 CI:

  Read at 0x... by goroutine N:
    db/state.(*InvertedIndex).collate inverted_index.go:1003
    db/state.(*Aggregator).buildFiles aggregator.go:927

  Previous write at 0x... by goroutine M:
    db/state.(*Aggregator).SetCompressWorkers aggregator.go:623
    db/state.(*Aggregator).PresetChainTipConcurrency aggregator.go:651
    execution/stagedsync.ExecV3 exec3.go:131

The lockWorkersEditing flag already existed as a soft-skip for
Set*Workers calls, but its raw-bool read/write under no mutex meant
(a) the read of the flag could observe stale state and let the write
proceed concurrently with collate, and (b) the LockWorkersEditing /
UnlockWorkersEditing toggle wasn't being called around buildFiles.

Fix:

  - Add Aggregator.workersMu sync.Mutex.
  - All four Set*Workers (Collate/Build, Merge, Compress,
    BuildAccessors) acquire workersMu before checking
    lockWorkersEditing and writing. The combined check+write under
    the mutex closes the load-then-write race window.
  - LockWorkersEditing / UnlockWorkersEditing acquire workersMu
    around the bool toggle, providing happens-before so any prior
    Set*Workers' write is visible to the buildFiles goroutine
    that subsequently reads CompressorCfg, and any concurrent
    Set*Workers blocks until the toggle is observable.
  - Aggregator.buildFiles wraps its body in
    LockWorkersEditing / defer UnlockWorkersEditing so the
    chain-tip-driven Set*Workers calls are no-oped for the duration
    of collation, eliminating the race.

Verified locally: TestHistoryVerification_SimpleBlocks now passes
under -race; full ./db/state/... -race -short clean.
@mh0lt mh0lt changed the title db/state: serialize Aggregator.Set*Workers to fix race with collate db/state: serialize Aggregator.Set*Workers to fix race with collate (blocks #21039 CI) May 8, 2026
@mh0lt

mh0lt commented May 8, 2026

Copy link
Copy Markdown
Contributor Author

@yperbasis @erigontech/maintainers — prioritising this for review: this small, self-contained race fix in db/state/aggregator.go is the only thing blocking #21039's CI from going green (race surfaced on the matrix run). Standalone change, 1 file, ~46 lines added, no public API change.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a Go data race in db/state where ExecV3’s chain-tip-driven worker reconfiguration (PresetChainTipConcurrencySet*Workers) can overlap with snapshot collation/build work (buildFiles) and concurrently mutate/read compressor configuration.

Changes:

  • Added Aggregator.workersMu and made lockWorkersEditing reads/writes occur under that mutex to establish a clear happens-before relationship.
  • Updated all Set*Workers methods to serialize the “check lock + write” sequence under workersMu, avoiding unsynchronized flag checks and concurrent config writes.
  • Wrapped Aggregator.buildFiles with LockWorkersEditing / UnlockWorkersEditing to suppress worker-config writes during collation/build reads.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread db/state/aggregator.go
Comment on lines +885 to +886
a.LockWorkersEditing()
defer a.UnlockWorkersEditing()
Comment thread db/state/aggregator.go
Comment on lines +720 to +724
func (a *Aggregator) LockWorkersEditing() {
a.workersMu.Lock()
defer a.workersMu.Unlock()
a.lockWorkersEditing = true
}
@yperbasis yperbasis added this pull request to the merge queue May 8, 2026
Merged via the queue into main with commit 2d4e53d May 8, 2026
41 of 42 checks passed
@yperbasis yperbasis deleted the fix/aggregator-workers-race branch May 8, 2026 11:01
@lystopad

Copy link
Copy Markdown
Member

Do you have plans to cherry-pick it to r3.4 ?

pull Bot pushed a commit to Dustin4444/erigon that referenced this pull request May 13, 2026
## Summary

- Extract `merge`, `collateAndBuild`, `allowEditing`, and the mutex from
`Aggregator` into a new `aggWorkers` value type
- Public `Set*Workers` methods become unexported helpers
(`setCompressWorkers`, `setBuildAccessorsWorkers`); the merge/collate
counts are accessed via `workers.getMerge()` / `workers.setMerge()` etc.
- Remove the now-unused `SetCollateAndBuildWorkers` / `SetMergeWorkers`
/ `SetCompressWorkers` stubs from `ForkableAgg`

No behaviour change — this is a structural cleanup follow-up to erigontech#21062.

---------

Co-authored-by: lupin012 <58134934+lupin012@users.noreply.github.com>
Co-authored-by: info@weblogix.biz <admin@10gbps.weblogix.it>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants