fix(blockchain/sql): snapshot-isolate on_main_chain rebuild to prevent startup race#1044
Conversation
…t startup race rebuildOnMainChainFlag ran under READ COMMITTED, so its bestQ SELECT and q1/q2 UPDATEs used independent snapshots. The startup goroutine calls it without slowPathMu; a block committed by a concurrent StoreBlock between bestQ and q1 is visible to q1 but absent from the main_chain CTE (rooted at the stale pre-block bestBlockID), so q1's NOT-IN clause clears that block's on_main_chain flag — leaving an on-chain block with an off-chain parent and wedging GetLatestBlockHeaderFromBlockLocator (BLOCK_NOT_FOUND). Run the rebuild at REPEATABLE READ so all three statements share one snapshot; retry up to 3x on 40001/40P01 (the startup goroutine holds no lock and can conflict with a slow-path StoreBlock). Return the raw driver error on those codes so the retry wrapper can classify it. SQLite is unaffected (single pass); the single-statement reconcileOnMainChain reorg path is unchanged.
|
🤖 Claude Code Review Status: Complete Current Review: Key Observations:
History:
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-08 13:06 UTC |
ordishs
left a comment
There was a problem hiding this comment.
Approve — the fix is correct, minimal, and well-documented.
Verified the load-bearing claims independently:
- Root cause is accurate: under READ COMMITTED the recursive CTE roots at the stale
bestBlockID = genesiswhile q1'sWHERE on_main_chain = truestatement-snapshot sees the freshly committed block1, so it's wrongly cleared. REPEATABLE READ makes the post-snapshot block fully invisible to the UPDATEs, andStoreBlockalready owns its own flag — consistent either way. - SQLite no-op confirmed against the pinned modernc/sqlite v1.47.0:
newTx(tx.go) ignoresopts.Isolationentirely, soLevelRepeatableReadnever errors there and the retry loop is a single pass. usql.PgErrSerializationFail/PgErrDeadlockDetected(40001/40P01) match exactly what a REPEATABLE READ write-write conflict raises; dual pgx + lib/pq handling mirrors the existing pattern in util/usql/retry.go.- Returning the raw typed driver error on the retry codes is the right call —
errors.NewStorageErrorflattens the cause and would defeaterrors.As(*pgconn.PgError). The deferred rollback still fires via the named return. - Both other callers (Invalidate/Revalidate) hold slowPathMu and only log the result, so the unwrapped error on exhaustion is benign.
One request before merge: please add at least a unit test for isSerializationRetry. It's pure and trivially testable (the existing util/usql/retry_test.go cases are a ready template) — feed it &pq.Error{Code:"40001"}, &pgconn.PgError{Code:"40P01"}, sql.ErrNoRows, and a plain error. This satisfies the repo's test-first guardrail and locks in the classification logic that the retry path depends on.
Minor / non-blocking:
- The retry loop re-runs immediately with no backoff, whereas usql.retryOperation uses a 100ms base delay — a small sleep/jitter would match house style.
isSerializationRetrypartially duplicates the unexportedusql.isRetriable; exporting a narrowusql.IsSerializationConflict(err)later would give both sites one source of truth.
…y logging Address review on bsv-blockchain#1044: - Add TestIsSerializationRetry (Simon's pre-merge request): table-driven coverage of the 40001/40P01 classification across both pgx (*pgconn.PgError) and lib/pq (*pq.Error) drivers, plus wrapped errors, non-retryable SQLSTATEs, sql.ErrNoRows, a plain error, and nil. Pure, no DB. - Fix the retry loop's misleading log (Copilot): "retrying" no longer fires on the final attempt where no retry follows; a distinct one-shot log records exhaustion instead. The exhausted error is still returned raw (unwrapped) so callers can classify it via errors.As — wrapping would flatten the typed *pgconn.PgError. - Add context-aware exponential backoff (100ms, 200ms) between attempts to match util/usql house style, aborting early on ctx cancellation.
…forbidigo) golangci forbidigo forbids fmt.Errorf repo-wide. The wrapped-error case needs a chain errors.As can traverse (the teranode errors package flattens the typed *pgconn.PgError cause), so use a local Unwrap helper; the plain-error case uses the sql.ErrConnDone stdlib sentinel. No behaviour change.
|


Summary
rebuildOnMainChainFlagnow runs its rebuild inside a REPEATABLE READ transaction (with a bounded retry on serialization conflicts) instead of READ COMMITTED. This closes a startup race that could leave a main-chain block flaggedon_main_chain = false— corrupting the canonical chain view.The bug
New()launches a startup goroutine that callsrebuildOnMainChainFlag(ctx, false)without holdingslowPathMu. The function opens a READ COMMITTED transaction, so each statement gets its own snapshot:bestQreadsbestBlockID = genesis(captured before any blocks are stored).StoreBlock(block1)commits block1 withon_main_chain = true.q1then runs. Under READ COMMITTED its statement snapshot now sees the committed block1, but themain_chainCTE is still rooted at the stalebestBlockID = genesis, which walks to{genesis}only. block1 (id != genesis) matchesq1's... AND id NOT IN (SELECT id FROM main_chain)clause and is cleared toon_main_chain = false.q2cannot repair it (block1 is not inmain_chain = {genesis}).Final state:
genesis = true, block1 = false, block2 = true— an impossible chain (an on-chain block whose parent is off-chain).GetLatestBlockHeaderFromBlockLocatorthen can't find the common ancestor and returnsBLOCK_NOT_FOUND, wedging header sync. It reproduces ~60% of the time under-raceon a node that starts syncing immediately after boot, and is the dominant source ofstores/blockchain/sqlunit-test flakiness.The fix
rebuildOnMainChainFlag's transaction now usessql.LevelRepeatableRead, sobestQ,q1andq2all share one snapshot. Blocks committed afterbestQare invisible to the UPDATEs and can no longer be wrongly cleared.40001(serialization_failure) /40P01(deadlock_detected) on a write-write conflict. The one caller withoutslowPathMu(the startup goroutine) can hit this against a slow-pathStoreBlock, so the whole transaction is retried (up to 3×).*sql.Txbypasses usql's per-statement retry, hence the retry lives here. The raw driver error is returned on those codes so the wrapper can classify it (errors.NewStorageErrorwould otherwise flatten the typed*pgconn.PgError).reconcileOnMainChain, a single-statement UPDATE that already shares one snapshot) is unchanged.InvalidateBlock/RevalidateBlockalready holdslowPathMuand are unaffected.Verification
-racerepro (block2's plain extend flipping parent block1 off-main).-race: the targeted repro ×60 → 0 failures (was ~60%);GetLatestBlockHeaderFromBlockLocatorfork tests (postgres + sqlite) ×20 → 0 failures; fullstores/blockchain/sqlpackage → green.Note
The retry is load-bearing:
backgroundRefreshLoopdoes not re-runrebuildOnMainChainFlag, so a rebuild that exhausted its retries would self-heal only on the next restart. Startup forks are rare, but worth a metric/alert if this retry is ever observed firing in production.Test plan
go test -race -tags testtxmetacache ./stores/blockchain/sql/...BLOCK_NOT_FOUNDfromGetLatestBlockHeaderFromBlockLocator)