Summary
services/legacy/netsync/handle_block.go:createUtxos calls SetMinedMulti with one unbounded slice containing every pre-existing tx in the block. On large blocks (mainnet 755,880 = 2.87M txs, almost all pre-existing via propagation) this becomes a single 2.87M-record aerospike BatchOperate. That call exhausts the client connection pool, fails with MAX_RETRIES_EXCEEDED and NETWORK_ERROR / connection reset by peer, and stalls sync in an infinite retry loop.
Regression introduced by #854 (fix(legacy): merge blockID into pre-existing tx in createUtxos, merged 2026-05-13).
Reproduction
Mainnet sync past block 755,880 (2.3 GB, 2,867,288 txs) on a single-node aerospike .docker.m deployment.
Observed on bsva-ovh-teranode-eu-3 running v0.15.2-beta-1 (commit f44080f06, includes #929 arena fix).
Log signature, repeating every ~30s:
ERROR | netsync/handle_block.go:698 | legacy| [HandleBlockDirect][000000000000000002c365bec2f13cb3ba6334ebee5a0325201464e67b7fcecc 755880] 2867288 txs, peer ... DONE in 34.5s with error: PROCESSING (4): failed to merge blockID into 2867287 pre-existing txs -> STORAGE_ERROR (69): aerospike BatchOperate error -> UNKNOWN (0): ResultCode: MAX_RETRIES_EXCEEDED, Iteration: 5, InDoubt: true, Node: A1 172.18.0.5:3000: command execution timed out on client: Exceeded number of retries.
ResultCode: NETWORK_ERROR, ...
write tcp 172.18.0.10:53374->172.18.0.5:3000: write: connection reset by peer
Block download + decode succeeds (34s). The SetMinedMulti merge step is where it fails. After the failure, sync resets to 755,879, downloads the block again, and repeats.
Root cause
services/legacy/netsync/handle_block.go:643-700 (post-#854):
var (
	existingTxsMu    sync.Mutex
	existingTxHashes []*chainhash.Hash
)
// create all the utxos first
for _, txHash := range txMap.Keys() { // iterates every tx in the block
	g.Go(func() error {
		if _, err := sm.utxoStore.Create(...); err != nil {
			if errors.Is(err, errors.ErrTxExists) {
				existingTxsMu.Lock()
				existingTxHashes = append(existingTxHashes, &txHash) // accumulates
				existingTxsMu.Unlock()
				return nil
			}
			...
		}
		...
	})
}
g.Wait()
if len(existingTxHashes) > 0 {
	if _, err = sm.utxoStore.SetMinedMulti(ctx, existingTxHashes, utxo.MinedBlockInfo{...}); err != nil { // monolithic call
		return errors.NewProcessingError("failed to merge blockID into %d pre-existing txs", len(existingTxHashes), err)
	}
}
SetMinedMulti itself (stores/utxo/aerospike/set_mined.go:158) does not chunk — it submits len(hashes) records in a single executeBatchOperation. The aerospike client splits internally at its default BatchSize=5000, producing ~574 sub-requests for 2.87M entries. With ConnectionQueueSize=16 (current utxostore.docker.m URL setting) and LimitConnectionsToQueueSize=true, the connection pool saturates and sub-requests time out / reset → whole BatchOperate fails after 5 retries.
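The sub-request count above is a ceiling division, and the same arithmetic shows what caller-side chunking would produce. A minimal sketch, using only figures quoted in this issue (2,867,287 pre-existing txs from the log line, the client default BatchSize of 5000, and the 1024 chunk size noted for MaxMinedBatchSize); the helper name ceilDiv is illustrative:

```go
package main

import "fmt"

// ceilDiv returns how many groups of at most size items are needed for n items.
func ceilDiv(n, size int) int {
	return (n + size - 1) / size
}

func main() {
	const preExisting = 2_867_287 // count reported in the log line above

	// Sub-requests inside the single monolithic BatchOperate call.
	fmt.Println(ceilDiv(preExisting, 5000)) // prints 574

	// Independent SetMinedMulti calls if the caller chunks at 1024.
	fmt.Println(ceilDiv(preExisting, 1024)) // prints 2801
}
```

Chunking produces more calls overall, but each one is small enough to succeed or fail on its own, and the number in flight can be capped separately.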
Why the #854 reference pattern works in its original site
#854 mirrored services/blockvalidation/quick_validate.go:1090-1160 createAndSpendUTXOsForBatch. That function is invoked per-batch (batch *SubtreeProcessingBatch), so existingTxHashes is naturally bounded by batch size — typically thousands at most, not millions. The legacy implementation re-uses the same SetMinedMulti call but lost the per-batch invocation boundary.
stores/utxo/aerospike/longest_chain.go:51-53 already demonstrates the chunked pattern for the closely related MarkTransactionsOnLongestChain flow:
batchSize := s.settings.UtxoStore.MaxMinedBatchSize // 1024
numChunks := (len(txHashes) + batchSize - 1) / batchSize
numWorkers := min(s.settings.UtxoStore.MaxMinedRoutines, numChunks) // 8 on docker.m
createUtxos should adopt this.
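For illustration, the slicing half of that pattern can be isolated into a small helper. This is a sketch, not code from the repository: chunkSlice is a hypothetical name, strings stand in for *chainhash.Hash, and the fallback to a chunk size of 1 is an assumption made only to keep the loop from stalling on a zero setting.

```go
package main

import "fmt"

// chunkSlice splits items into consecutive sub-slices of at most size elements.
// The sub-slices alias the backing array of items; nothing is copied.
func chunkSlice[T any](items []T, size int) [][]T {
	if size <= 0 {
		size = 1 // assumption: guard against a zero or negative setting
	}
	chunks := make([][]T, 0, (len(items)+size-1)/size)
	for start := 0; start < len(items); start += size {
		end := min(start+size, len(items))
		chunks = append(chunks, items[start:end])
	}
	return chunks
}

func main() {
	hashes := []string{"a", "b", "c", "d", "e"}
	for i, c := range chunkSlice(hashes, 2) {
		fmt.Println(i, c)
	}
}
```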
Pre-#854 behaviour for context
Pre-v0.15 the async setTxMinedStatus → SetMinedMulti path handled this merge after block accept. PR #711 added a quickValidation fast path that skipped that step; PR #854 reinstated the merge but moved it into the synchronous critical-path createUtxos without chunking. So this hot path went from async-and-tolerated to synchronous-and-unbounded in one step.
Proposed fix
Caller-side chunk in createUtxos. Use existing settings (UtxoStore.MaxMinedBatchSize, UtxoStore.MaxMinedRoutines). Roughly:
if len(existingTxHashes) > 0 {
	batchSize := sm.settings.UtxoStore.MaxMinedBatchSize
	if batchSize <= 0 {
		batchSize = 1024 // guard: a zero step would loop forever; 1024 matches the value noted above
	}
	numWorkers := sm.settings.UtxoStore.MaxMinedRoutines
	g, gCtx := errgroup.WithContext(ctx)
	util.SafeSetLimit(g, numWorkers) // cap concurrent SetMinedMulti calls
	for i := 0; i < len(existingTxHashes); i += batchSize {
		// each chunk holds at most batchSize hashes
		chunk := existingTxHashes[i:min(i+batchSize, len(existingTxHashes))]
		g.Go(func() error {
			_, err := sm.utxoStore.SetMinedMulti(gCtx, chunk, utxo.MinedBlockInfo{...})
			return err
		})
	}
	if err := g.Wait(); err != nil {
		return errors.NewProcessingError("failed to merge blockID into %d pre-existing txs", len(existingTxHashes), err)
	}
}
Alternative: push the chunking into SetMinedMulti itself, which fixes every caller (there are at least two: this one and the per-batch one in quick_validate.go). Trade-off — the per-batch caller doesn't need the chunking but wouldn't be harmed by it either.
Affected hosts
bsva-ovh-teranode-eu-3 (mainnet sync, currently stuck on 755,880)
Captured artifacts (local, available on request)
- Heap-raw + goroutines + allocs profile during the stall:
Workarounds while waiting for the fix
- Increase ConnectionQueueSize in the utxostore.docker.m URL from 16 to 64+
- Add BatchSize=1024 and SocketTimeout=120s to aerospike_batchPolicy
- Server-side: set proto-fd-max 30000 and explicit service-threads in config/aerospike.conf
None of these eliminate the underlying monolithic batch; they just buy headroom.
Related
- go-bt.Output.appendTo + Tx.toBytesHelper (mitigated by #929, "fix(blockvalidation): arena-backed tx decode to eliminate catch-up OOM", in v0.15.2)