fix(legacy/netsync): chunk SetMinedMulti in createUtxos (#936)#937
Conversation
…in#936) createUtxos was calling SetMinedMulti with one unbounded slice containing every pre-existing tx in the block. On fat mainnet blocks (e.g. 755880 = 2.87M txs) this monolithic batch exhausts the aerospike client connection pool and stalls sync in a MAX_RETRIES_EXCEEDED retry loop. Regression from bsv-blockchain#854. Split the merge across a worker pool bounded by UtxoStore.MaxMinedRoutines, each worker iterating its range in MaxMinedBatchSize chunks. Mirrors the proven MarkTransactionsOnLongestChain pattern. Fail-fast errgroup semantics match the existing createUtxos Create() loop: first chunk error cancels siblings via gCtx and propagates a wrapped ProcessingError. Fixes bsv-blockchain#936
- Restore tx count in error wrap for operator forensics - Fix incorrect chunk-count comment (4 chunks not 3 with 2 workers) - Drop redundant inline comments per repo conventions
ChunkErrorPropagates with 20 txs / batchSize=4 / 2 workers yields 6 chunks (2 workers × ceil(10/4) chunks each), not 5. The previous comment carried the pre-worker-pool math.
|
🤖 Claude Code Review Status: Complete No issues found. This PR successfully addresses the connection pool exhaustion issue (#936) by chunking SetMinedMulti calls in createUtxos. Strengths:
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-25 13:31 UTC |
…v-blockchain#937) - TestSyncManager_createUtxos_ChunkFailureCancelsSiblings: orchestrates a scenario that observably fails when the mergeCtx.Err() short-circuit is removed. A naive "fail every call after N" cannot distinguish a working check from a stripped impl because each worker dies on its first error anyway. Instead we block the surviving worker inside SetMinedMulti until the failing call has triggered errgroup cancellation, then assert the surviving worker's second iteration is suppressed by the loop-top check. Verified by mutation: removing the check produces failureCount == 2 (assertion at handle_block_test.go:1418 fails). - Rename ChunkErrorPropagates -> ChunkErrorReturnsWrappedProcessingError to accurately describe what it tests (drops the misleading short-circuit assertion now owned by the new test). - Add boundary tests: ExactBatchSize (n == batchSize) and OneOverBatchSize (n == batchSize + 1). - Add clamp tests: BatchSizeZeroClamped and RoutinesZeroClamped, exercising the defensive clamps that were previously untested. - Extract newChunkingTestSetup, recordChunksOnMock, and assertUnionCovers helpers to DRY repeated boilerplate across the chunking tests. Follow-up on review feedback from PR bsv-blockchain#937.
ordishs
left a comment
There was a problem hiding this comment.
LGTM — the chunking pattern correctly mirrors stores/utxo/aerospike/longest_chain.go:MarkTransactionsOnLongestChain, the defensive clamps are a small improvement over the reference, and the loop-variable capture / non-overlapping slice access are race-free. A few notes before / after merge:
TestSyncManager_createUtxos_ChunkFailureCancelsSiblings is potentially flaky
The test is the only one that exercises the if mergeCtx.Err() != nil { return mergeCtx.Err() } short-circuit. With totalTxs=16, batchSize=4, routines=2, each worker performs exactly 2 iterations, and the .Once() chain orders exp1 → exp2 → exp3 in mock-match order, not worker order. Two scheduler interleavings are legal:
- Path A (test intent). Worker A iter 1 → exp1; Worker B iter 1 → exp2 (blocks); Worker A iter 2 → exp3 (fails, cancels); Worker B unblocks; Worker B iter 2 short-circuits. Removing the short-circuit fires exp4 →
failureCount == 2→ test fails. ✅ - Path B (also legal). Worker A iter 1 → exp1; Worker A iter 2 → exp2 (blocks); Worker B iter 1 → exp3 (fails, cancels); Worker A iter 2 unblocks → returns nil; A done (no iter 3), B done. Neither worker has another iteration left to short-circuit. Removing the short-circuit makes no observable difference; the test passes regardless.
Under Path B the test green-lights a stripped implementation. Suggest either bumping totalTxs so both workers have ≥ 2 post-trigger iterations, or constraining the call order with an explicit handoff (e.g. atomic counter pinning the first call to a specific worker via the captured chunk identity).
Manual production validation still unchecked
The last item in the test plan — resume mainnet sync on bsva-ovh-teranode-eu-3 at block 755880 — is the only check that actually proves the connection-pool exhaustion is resolved. With default settings (MaxMinedRoutines=128, ConnectionQueueSize=16 per node per the PR description) the worker count still exceeds the per-node connection pool — the fix relies on Aerospike sharding spreading the writes across nodes. If shard skew is high on 755880, the same symptom could recur. Worth gating merge on completing that run.
Minor — discarded return map
SetMinedMulti returns map[chainhash.Hash][]uint32 of prior BlockIDs. Both pre- and post-PR code discards it, so no regression — but worth a sentence in the comment block on why the legacy netsync path is fine without prior-block-ID tracking, or a TODO if it should aggregate across chunks later.
Minor — settings doc drift
settings/utxostore_settings.go:49 still describes MaxMinedRoutines as "Block with 10,000 transactions split across 128 parallel workers (approximately 78 transactions per worker)." That math reflects the SafeSetLimit topology in model/update-tx-mined.go, not the worker-range topology now used here and in longest_chain.go. Two patterns coexist with one shared setting key — fine, but a follow-up to align the docs would help operators tune it.
Nit — symmetry with longest_chain.go
The batchSize < 1 and numWorkers < 1 clamps added here would also protect longest_chain.go:51-53 from misconfiguration. Optional follow-up.
…tant Existing test orchestrated worker identity via channels but worker A vs worker B at runtime isn't predictable — existingTxHashes is appended by parallel Create() goroutines in scheduler-derived order. Under one legal interleaving (Path B from PR bsv-blockchain#937 review), removing the short-circuit check did not change the assertion outcome. Replace with a scheduler-agnostic design: 32 txs / batchSize=4 / routines=2 gives 4 chunks per worker. After the trigger fires on the 3rd call, the surviving worker has multiple remaining iterations across every interleaving. With the check intact, the surviving worker bails at the top of its loop on every remaining iteration → postTriggerCount ≤ 1 (in-flight). Without the check, the surviving worker continues issuing SetMinedMulti → postTriggerCount ≥ 2 (observed 2-4 across interleavings). Verified by 20× mutation runs (all fail under mutation; observed counts: 14×4, 4×3, 2×2) and 20× normal runs (all pass).
|



Summary
SetMinedMulticall at the tail ofcreateUtxoswith a caller-side worker pool: divideexistingTxHashesintoUtxoStore.MaxMinedRoutinesranges, each worker iterates its range inUtxoStore.MaxMinedBatchSizechunks. Pattern mirrorsstores/utxo/aerospike/longest_chain.go:MarkTransactionsOnLongestChain.errgroup.WithContextsemantics — matches the existingCreate()loop already increateUtxos. First chunk failure cancels siblings viagCtx; wrappedProcessingErrorpropagated up.fix(legacy): merge blockID into pre-existing tx in createUtxos) which moved the per-block merge into the synchronous critical path without chunking. On block 755880 (2.87M txs, almost all pre-existing via propagation) the monolithic 2.87M-recordBatchOperateexhausts the aerospike client connection pool (ConnectionQueueSize=16on.docker.m), hitsMAX_RETRIES_EXCEEDED/NETWORK_ERROR/connection reset by peer, and stalls mainnet sync in an infinite retry loop.Fixes #936.
Test plan
go test -race ./services/legacy/netsync/...passes (88/88).go vet ./services/legacy/netsync/...clean.golangci-lint run ./services/legacy/netsync/...clean.TestSyncManager_createUtxos_ChunksExistingTxsasserts call count, per-chunk size bound, and union coverage withMaxMinedBatchSize=4/MaxMinedRoutines=2/ 10 txs.TestSyncManager_createUtxos_ChunkErrorPropagatesasserts wrappedProcessingErrorand sibling short-circuit on first chunk failure.TestSyncManager_createUtxos_MergesBlockIDsForExistingTxsleft untouched and still green (1-tx case flows through new path as 1 worker × 1 chunk).bsva-ovh-teranode-eu-3at block 755880; expect[createUtxos] merging blockID … into 2867287 pre-existing tx(s)followed by completion within the existingHandleBlockDirectdeadline (post-merge field validation).