
legacy: createUtxos calls SetMinedMulti with unbounded slice — stalls aerospike on fat blocks (regression from #854) #936

@oskarszoon


Summary

services/legacy/netsync/handle_block.go:createUtxos calls SetMinedMulti with a single unbounded slice containing every pre-existing tx in the block. On fat blocks (mainnet block 755,880 has 2.87M txs, almost all pre-existing via propagation) this becomes a monolithic 2.87M-record aerospike BatchOperate that exhausts the client connection pool, fails with MAX_RETRIES_EXCEEDED (NETWORK_ERROR / connection reset by peer), and stalls sync in an infinite retry loop.

Regression introduced by #854 (fix(legacy): merge blockID into pre-existing tx in createUtxos, merged 2026-05-13).

Reproduction

Mainnet sync past block 755,880 (2.3 GB, 2,867,288 txs) on a single-node aerospike .docker.m deployment.

Observed on bsva-ovh-teranode-eu-3 running v0.15.2-beta-1 (commit f44080f06, includes #929 arena fix).

Log signature, repeating every ~30s:

ERROR | netsync/handle_block.go:698 | legacy| [HandleBlockDirect][000000000000000002c365bec2f13cb3ba6334ebee5a0325201464e67b7fcecc 755880] 2867288 txs, peer ... DONE in 34.5s with error: PROCESSING (4): failed to merge blockID into 2867287 pre-existing txs -> STORAGE_ERROR (69): aerospike BatchOperate error -> UNKNOWN (0): ResultCode: MAX_RETRIES_EXCEEDED, Iteration: 5, InDoubt: true, Node: A1 172.18.0.5:3000: command execution timed out on client: Exceeded number of retries.
  ResultCode: NETWORK_ERROR, ...
  write tcp 172.18.0.10:53374->172.18.0.5:3000: write: connection reset by peer

Block download + decode succeeds (34s). The SetMinedMulti merge step is where it dies. After the failure, sync resets to 755,879, downloads the block again, and repeats.

Root cause

services/legacy/netsync/handle_block.go:643-700 (post-#854):

var (
    existingTxsMu    sync.Mutex
    existingTxHashes []*chainhash.Hash
)

// create all the utxos first
for _, txHash := range txMap.Keys() {           // iterates every tx in the block
    g.Go(func() error {
        if _, err := sm.utxoStore.Create(...); err != nil {
            if errors.Is(err, errors.ErrTxExists) {
                existingTxsMu.Lock()
                existingTxHashes = append(existingTxHashes, &txHash)   // accumulates
                existingTxsMu.Unlock()
                return nil
            }
            ...
        }
        ...
    })
}
g.Wait()

if len(existingTxHashes) > 0 {
    if _, err = sm.utxoStore.SetMinedMulti(ctx, existingTxHashes, utxo.MinedBlockInfo{...}); err != nil {   // monolithic call
        return errors.NewProcessingError("failed to merge blockID into %d pre-existing txs", len(existingTxHashes), err)
    }
}

SetMinedMulti itself (stores/utxo/aerospike/set_mined.go:158) does not chunk: it submits len(hashes) records in a single executeBatchOperation. The aerospike client splits internally at its default BatchSize=5000, producing ~574 sub-requests for 2.87M entries. With ConnectionQueueSize=16 (current utxostore.docker.m URL setting) and LimitConnectionsToQueueSize=true, the connection pool saturates, sub-requests time out or are reset, and the whole BatchOperate fails after 5 retries.
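The sub-request figure above is a ceiling division and can be checked with a few lines of Go. This is only an arithmetic sketch: the three constants are the values quoted in this issue, and ceilDiv is a helper written for the illustration, not something from the codebase.

```go
package main

import "fmt"

// ceilDiv returns ceil(n / size) for positive size.
func ceilDiv(n, size int) int {
	return (n + size - 1) / size
}

func main() {
	const (
		preExistingTxs = 2_867_287 // count from the log line in this issue
		clientBatch    = 5000      // client-side split size quoted in this issue
		connQueueSize  = 16        // ConnectionQueueSize quoted in this issue
	)

	subRequests := ceilDiv(preExistingTxs, clientBatch)
	fmt.Printf("sub-requests from one SetMinedMulti call: %d\n", subRequests)
	fmt.Printf("sub-requests per pooled connection: %.1f\n",
		float64(subRequests)/float64(connQueueSize))
}
```

With these inputs the single call fans out into 574 sub-requests competing for 16 pooled connections.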

Why the #854 reference pattern works in its original site

#854 mirrored services/blockvalidation/quick_validate.go:1090-1160 createAndSpendUTXOsForBatch. That function is invoked per-batch (batch *SubtreeProcessingBatch), so existingTxHashes is naturally bounded by batch size — typically thousands at most, not millions. The legacy implementation re-uses the same SetMinedMulti call but lost the per-batch invocation boundary.

stores/utxo/aerospike/longest_chain.go:51-53 already demonstrates the chunked pattern for the closely related MarkTransactionsOnLongestChain flow:

batchSize := s.settings.UtxoStore.MaxMinedBatchSize             // 1024
numChunks := (len(txHashes) + batchSize - 1) / batchSize
numWorkers := min(s.settings.UtxoStore.MaxMinedRoutines, numChunks)   // 8 on docker.m

createUtxos should adopt this.
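To see what that pattern would mean for block 755,880, here is a small sketch that applies the same chunk and worker arithmetic to the numbers in this issue. planChunks and chunkPlan are hypothetical names for the illustration; the 1024 and 8 defaults are taken from the comments on the snippet above and are assumed to match the deployed settings.

```go
package main

import "fmt"

// chunkPlan summarises how n records would be split for bounded processing.
type chunkPlan struct {
	NumChunks   int // number of store calls
	NumWorkers  int // calls allowed in flight at once
	LastChunk   int // size of the final, possibly short, chunk
	MaxInFlight int // upper bound on records in flight at once
}

// planChunks mirrors the arithmetic quoted from longest_chain.go:
// ceil(n/batchSize) chunks, with workers capped at the chunk count.
func planChunks(n, batchSize, maxRoutines int) chunkPlan {
	if n <= 0 || batchSize <= 0 || maxRoutines <= 0 {
		return chunkPlan{}
	}
	numChunks := (n + batchSize - 1) / batchSize
	numWorkers := maxRoutines
	if numChunks < numWorkers {
		numWorkers = numChunks
	}
	inFlight := numWorkers * batchSize
	if inFlight > n {
		inFlight = n
	}
	return chunkPlan{
		NumChunks:   numChunks,
		NumWorkers:  numWorkers,
		LastChunk:   n - (numChunks-1)*batchSize,
		MaxInFlight: inFlight,
	}
}

func main() {
	// 2,867,287 pre-existing txs, MaxMinedBatchSize=1024, MaxMinedRoutines=8.
	fmt.Printf("%+v\n", planChunks(2_867_287, 1024, 8))
}
```

For the numbers quoted here that is 2,801 calls of at most 1,024 records each, with no more than 8 x 1,024 = 8,192 records in flight, instead of one 2.87M-record call.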

Pre-#854 behaviour for context

Pre-v0.15 the async setTxMinedStatus → SetMinedMulti path handled this merge after block accept. PR #711 added a quickValidation fast path that skipped that step; PR #854 reinstated the merge but moved it into the synchronous critical-path createUtxos without chunking. So this hot path went from async-and-tolerated to synchronous-and-unbounded in one step.

Proposed fix

Caller-side chunk in createUtxos. Use existing settings (UtxoStore.MaxMinedBatchSize, UtxoStore.MaxMinedRoutines). Roughly:

if len(existingTxHashes) > 0 {
    batchSize := sm.settings.UtxoStore.MaxMinedBatchSize
    numWorkers := sm.settings.UtxoStore.MaxMinedRoutines

    // bound concurrency so only numWorkers chunks are in flight at once
    g, gCtx := errgroup.WithContext(ctx)
    util.SafeSetLimit(g, numWorkers)

    for i := 0; i < len(existingTxHashes); i += batchSize {
        // each SetMinedMulti call sees at most batchSize hashes
        chunk := existingTxHashes[i:min(i+batchSize, len(existingTxHashes))]
        g.Go(func() error {
            _, err := sm.utxoStore.SetMinedMulti(gCtx, chunk, utxo.MinedBlockInfo{...})
            return err
        })
    }
    // first chunk error cancels gCtx and is returned here
    if err := g.Wait(); err != nil {
        return errors.NewProcessingError("failed to merge blockID into %d pre-existing txs", len(existingTxHashes), err)
    }
}

Alternative: push the chunking into SetMinedMulti itself, which fixes every caller (there are at least two: this one and the per-batch one in quick_validate.go). Trade-off: the per-batch caller doesn't need the chunking but wouldn't be harmed by it either.

Affected hosts

  • bsva-ovh-teranode-eu-3 (mainnet sync, currently stuck on 755,880)

Captured artifacts (local, available on request)

Workarounds while waiting for the fix

  • Increase ConnectionQueueSize in utxostore.docker.m URL from 16 to 64+
  • Add BatchSize=1024 and SocketTimeout=120s to aerospike_batchPolicy
  • Server-side: set proto-fd-max 30000 and explicit service-threads in config/aerospike.conf

None of these eliminate the underlying monolithic batch; they just buy headroom.
