fix(validator): synchronise mtpStore with sync.RWMutex#838
Conversation
|
🤖 Claude Code Review Status: Complete SummaryThis PR correctly addresses a critical race condition in the validator's MTP cache. The implementation is sound and all previously raised review feedback has been addressed. Code Quality: The synchronization is correctly implemented with Test Coverage: Comprehensive race tests added covering both failure modes. Mock expectations properly enforce single-fetch behavior. All Copilot feedback addressed in commit d334b5f. Documentation: Inline comments accurately describe the locking strategy and trade-offs. The PR description provides excellent context on root cause and timing. No issues found. The fix is minimal, surgical, and correctly synchronized. History:
|
Two distinct races on Validator.mtpStore:
1. Same-block writer-writer. Two concurrent CheckSubtreeFromBlock RPCs
(legacy retry while the first call is still in-flight) both reached
EnsureMTPLoaded with an empty mtpStore. Both fetched ~1.45 M MTP
values and raced on the append, corrupting the store. The corruption
triggered the guard in validateTransaction, which returned an error
and fed the outer retry loop in ValidateSubtreeInternal: each retry
re-fetched 1.45 M entries, producing the multi-hour CPU peg and sync
stall observed on testnet at block 1,451,504.
2. Cross-block reader-writer. While block N's per-transaction goroutines
read mtpStore inside validateTransaction, block N+1's EnsureMTPLoaded
can run concurrently. Its append re-allocates the backing array and
its overlap patch mutates entries in place, so unsynchronised readers
observe torn slice headers / cell values.
Fix: introduce mtpMu sync.RWMutex on Validator.
- EnsureMTPLoaded acquires the write lock for the full body. The second
same-block caller fast-paths out after acquiring the lock when the
range it needs is already populated.
- validateTransaction acquires the read lock around its MTP lookups
(length guard + utxo / block MTP fetches), so cross-block writers
serialise behind in-flight readers.
Per-block contention is negligible: EnsureMTPLoaded runs once per block
before per-tx goroutines start, and per-tx readers only contend with each
other on the read lock.
Tests:
- TestEnsureMTPLoaded_ConcurrentCallsNeitherHangsNorRaces fires two
same-block EnsureMTPLoaded calls against an empty store; fails under
-race without the write lock, passes with it.
- TestEnsureMTPLoaded_CrossBlockReadersAndWritersDoNotRace runs eight
reader goroutines indexing mtpStore via the validateTransaction
pattern while a writer extends the store to a later block height;
fails under -race without the read lock, passes with it.
9ac5a94 to
cdec5b3
Compare
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-09 20:05 UTC |
- validateTransaction releases the read lock immediately after the mtpStore lookups (extracted to readMTPsLocked) instead of holding it through ValidateBIP68. Reduces contention with cross-block writers. - Same-block race test now asserts exactly-one fetch via Once() + AssertExpectations, so a regression that re-fetches without racing (the production failure mode) is caught. - Cross-block race test asserts EnsureMTPLoaded errors from the main goroutine instead of via require.* in a child goroutine (require uses Goexit which only stops the goroutine it runs in), and asserts mock expectations to defend against accidental over-fetch.
Three unit tests for the readMTPsLocked helper extracted in d334b5f: - TestReadMTPsLocked_HappyPath: returns the expected per-input MTPs and block MTP for a fully populated store. - TestReadMTPsLocked_ClampsOutOfRangeUTXOs: utxoHeights at or above the store length fall back to the blockMTP value (matches production behaviour for unconfirmed parents whose effective height exceeds blockMTPHeight). - TestReadMTPsLocked_GuardFiresOnUnpopulatedStore: the missing-load guard returns a processing error when mtpStore is too short for the requested blockMTPHeight. Lifts coverage on the new code added by the mutex / readMTPsLocked refactor.
Two unit tests driving validateTransaction through the BIP68 branch: - TestValidateTransaction_BIP68PathReadsMTPStore: exercises the SkipPolicyChecks=true path with a populated mtpStore so the readMTPsLocked call site, error wiring, and ValidateBIP68 invocation are all covered. - TestValidateTransaction_BIP68GuardFiresOnUnpopulatedStore: drives validateTransaction with an empty mtpStore so the missing-load guard surfaces a processing error to the caller. Together with the readMTPsLocked unit tests, lifts coverage on the new code introduced by the mutex / readMTPsLocked refactor above the Sonar quality-gate threshold.
|



TL;DR
Validator's in-memory Median Time Past cache (
mtpStore) had no synchronisation. Two concurrent populate calls — followed by a third reader — could each see an empty cache, both fetch ~1.45 M entries, race on the append, and corrupt the slice. Once corrupted, the BIP68 guard invalidateTransactionrejected every transaction, andValidateSubtreeInternal's outer retry loop re-issued the 1.45 M-entry fetch on every retry — pegging CPU and stalling sync indefinitely.This PR adds a
sync.RWMutexaroundmtpStore, so populates serialise (the second caller fast-paths out instead of re-fetching) and readers don't observe torn state during a populate.Why this only surfaced at one block on testnet
The bug is not block-specific — there is nothing about block 1,451,504 that triggers it. The trigger is a timing window during the first
EnsureMTPLoadedcall after validator startup, which exists at every block ≥CSVHeight(testnetCSVHeight= 770,112).What aligned on the testnet stall:
mtpStore) and was being driven by a legacy sync.CheckSubtreeFromBlockRPC for a post-CSV block kicked offEnsureMTPLoaded, which had to fetch ~1.45 M MTP entries from the blockchain store. That fetch is the only slow operation in the call.CheckSubtreeFromBlock. The retry enteredEnsureMTPLoadedwithlen(mtpStore)still 0, started its own 1.45 M-entry fetch, and raced the original on the slice append.validateTransaction.ValidateSubtreeInternal's outer retry loop re-issued the fetch on every retry — minutes per attempt — producing the multi-hour CPU peg observed on the box.Block 1,451,504 is simply where the validator happened to be when these timings aligned. Mainnet has not surfaced it because mainnet validators tend to be long-lived (the cache populates once and the race window closes), and mainnet legacy peers are less prone to the retry-on-timeout pattern testnet sees.
Detail: the two races
CheckSubtreeFromBlockRPCs (legacy retries a call while the first is still in flight) both observed an emptymtpStore, both fetched ~1.45 M MTP entries, and both raced on the append — corrupting the store. The corruption tripped the guard invalidateTransaction, which surfaced an error to the outer retry loop inValidateSubtreeInternal. Each retry repeated the 1.45 M-entry fetch, producing the multi-hour CPU peg and sync stall observed on testnet around block 1,451,504.mtpStoreinsidevalidateTransaction, block N+1'sEnsureMTPLoadedcan run concurrently. The append re-allocates the backing array and the overlap patch mutates entries in place, so unsynchronised readers observe torn slice headers / cell values.Fix
services/validator/Validator.go:mtpMu sync.RWMutexonValidator.EnsureMTPLoadedacquires the write lock for the full body. The second same-block caller fast-paths out after acquiring the lock when the range it needs is already populated (no second gRPC fetch).validateTransactionreads the MTP values via a smallreadMTPsLockedhelper that takes the read lock only for the lookups themselves and releases it beforeValidateBIP68runs. Cross-block writers serialise behind any in-flight readers; the per-tx ECDSA / sequence-lock work is unconstrained by the lock.services/validator/Validator_test.go:TestEnsureMTPLoaded_ConcurrentCallsNeitherHangsNorRaces: two same-blockEnsureMTPLoadedcalls against an empty store. Mock is.Once()and the test assertsmockClient.AssertExpectations(t), so a regression that re-fetches without racing (the production failure mode) fails the assertion. Fails under-racewithout the write lock; passes with it.TestEnsureMTPLoaded_CrossBlockReadersAndWritersDoNotRace: eight readers indexingmtpStorevia thevalidateTransactionpattern while a writer extends the store to a later block height. Fails under-racewithout the read lock; passes with it. Asserts mock expectations to defend against accidental over-fetch.Test plan
go vet ./services/validator/...— cleango test -race ./services/validator/— 277 passed, 0 failedgo test -race -run TestEnsureMTPLoaded ./services/validator/— both targeted race tests passgolangci-lint run ./services/validator/...— cleanstaticcheck ./services/validator/...— cleanmtpStoreoutsidevalidateTransaction(already grepped — onlyEnsureMTPLoadeditself touches it elsewhere, under the write lock)Commits
cdec5b3d3— original fix:sync.RWMutex+ reader / writer locking + both race testsd334b5f77— review feedback: extractreadMTPsLockedto drop the read lock beforeValidateBIP68; tighten test mock expectations (Once()+AssertExpectations); replacerequire.*in goroutines with main-goroutine assertionNotes for reviewer
The
mtpStorefield andEnsureMTPLoadedwere introduced in #547 ("Re-implement BIP68"); the race has been latent since that PR.EnsureMTPLoadedintentionally holds the write lock throughGetMedianTimePastRange— releasing for the fetch reintroduces the same-block redundant-fetch race. The cold-start fetch (~1.45 M entries) runs once at startup before any per-tx readers; per-block extension fetches are tiny (12-overlap + new heights) and fast. Agolang.org/x/sync/singleflightwrapper would be a cleaner alternative if read-blocking ever shows up in profiling. (Open thread: r3213721754.)