chore(settings): raise aerospike client pool to 128 for docker.m#941

Merged
oskarszoon merged 4 commits into
bsv-blockchain:mainfrom
oskarszoon:chore/aerospike-pool-128-docker-m
May 26, 2026

@oskarszoon oskarszoon commented May 26, 2026

Contributor

Summary

Raise ConnectionQueueSize from 16 to 128 (and MinConnectionsPerNode from 8 to 16) on the utxostore.docker.m URL. docker.m (docker-microservice) context only — used by teranode-quickstart and other docker-based deployments. Other contexts unchanged.
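
As a rough illustration of how pool parameters carried on a store URL could be read, here is a minimal Go sketch. The URL shape (scheme, host, port, namespace) and the `parsePoolSettings` helper are assumptions made for this example and are not taken from the teranode code; only the parameter names `ConnectionQueueSize` and `MinConnectionsPerNode` come from the description above.

```go
package main

import (
	"fmt"
	"net/url"
	"strconv"
)

// poolSettings holds the two values this PR changes.
type poolSettings struct {
	ConnectionQueueSize   int
	MinConnectionsPerNode int
}

// parsePoolSettings is a hypothetical helper that reads the pool
// parameters from the query string of a store URL.
func parsePoolSettings(raw string) (poolSettings, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return poolSettings{}, err
	}
	q := u.Query()
	queue, err := strconv.Atoi(q.Get("ConnectionQueueSize"))
	if err != nil {
		return poolSettings{}, fmt.Errorf("ConnectionQueueSize: %w", err)
	}
	minConns, err := strconv.Atoi(q.Get("MinConnectionsPerNode"))
	if err != nil {
		return poolSettings{}, fmt.Errorf("MinConnectionsPerNode: %w", err)
	}
	return poolSettings{ConnectionQueueSize: queue, MinConnectionsPerNode: minConns}, nil
}

func main() {
	// Placeholder URL: host, port, and namespace are invented for the example.
	raw := "aerospike://aerospike:3000/utxo-store?ConnectionQueueSize=128&MinConnectionsPerNode=16&LimitConnectionsToQueueSize=true"
	s, err := parsePoolSettings(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", s)
}
```

A check like this could be run against a rendered settings file to confirm the docker.m context picked up the new values.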

Why

ConnectionQueueSize=16 + LimitConnectionsToQueueSize=true is too tight for the default pruner partition-worker fanout, and for legacy block-processing batch ops under mainnet IBD load.

The pruner detects this and emits:

WARN | pruner/pruner_service.go:425 | utxos | 
Pruner concurrency would exhaust Aerospike connection pool. 
Max pruner connections: 64, ConnectionQueueSize: 16, Recommended max: 11. 
Auto-adjusting pruner_utxoChunkGroupLimit from 1 to 1 to prevent exhaustion.

It auto-throttles pruner_utxoChunkGroupLimit but not pruner_utxoPartitionQueries (the outer 32-worker fanout). Even at chunk_group_limit=1, 32 partition workers × 1 = 32 concurrent ops, which still exceeds the 16-conn pool. Every pruner batch then errors with NO_AVAILABLE_CONNECTIONS_TO_NODE followed by TIMEOUT.
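
The arithmetic in the paragraph above can be sketched in Go as follows. The function names are hypothetical and the model (concurrency = partition workers × chunk group limit) is the simplified one used in this description, not the pruner's actual accounting.

```go
package main

import "fmt"

// peakPrunerOps estimates concurrent Aerospike operations as
// partition workers multiplied by the chunk group limit.
func peakPrunerOps(partitionWorkers, chunkGroupLimit int) int {
	return partitionWorkers * chunkGroupLimit
}

// oversubscribed reports whether the estimated concurrency exceeds
// the client connection pool (ConnectionQueueSize).
func oversubscribed(partitionWorkers, chunkGroupLimit, connectionQueueSize int) bool {
	return peakPrunerOps(partitionWorkers, chunkGroupLimit) > connectionQueueSize
}

func main() {
	// 32 workers with the chunk group limit already throttled to 1.
	fmt.Println(peakPrunerOps(32, 1))       // 32
	fmt.Println(oversubscribed(32, 1, 16))  // true: old pool of 16
	fmt.Println(oversubscribed(32, 1, 128)) // false: new pool of 128
}
```

Even with the inner limit at its floor, the outer fanout alone is twice the old pool size, which matches the errors described above.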

Observed

bsva-ovh-teranode-eu-3, mainnet, v0.15.2-beta-1, height ~795,360.

Cascade with the old pool size:

  1. Legacy retries on fat blocks (compounded by legacy: createUtxos calls SetMinedMulti with unbounded slice — stalls aerospike on fat blocks (regression from #854) #936 before its fix) leak some client connections faster than they close
  2. Pool saturates at 16
  3. Pruner can't issue parent-update writes — perpetually NO_AVAILABLE_CONNECTIONS_TO_NODE
  4. Spent UTXOs aren't deleted; objects climb to 832M
  5. stop-writes-used-pct=70 (inherited from evict-used-pct=70) trips
  6. Aerospike enters stop_writes=true + hwm_breached=true
  7. Pruner needs to write to delete things, but stop_writes blocks writes → deadlock

The 16-conn ceiling was the bottleneck, not the aerospike server: proto-fd-max defaults to 15,000, and the actual server-side connection count peaked around 72 (across all teranode services combined) before the lockup, so there's plenty of server headroom.

Why 128 specifically

With the default pruner_utxoPartitionQueries=0 (auto-detect, ~32 workers on a typical host) and a chunk_group_limit that can rise to its default of 10 once the pool is comfortable, peak pruner concurrency is ~320 ops (32 workers × 10 chunk groups), about 2.5× the 128-connection pool. That leaves the auto-adjust room to land at 5–8 chunk groups rather than being pinned at 1. Every utxostore consumer (legacy, blockvalidation, subtreevalidation, blockassembly, blockchain, propagation, asset, pruner) gets its own pool; at 128 connections per pool × 8 clients, that is 1,024 total potential connections, well under proto-fd-max=15000.
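
The sizing figures in this paragraph can be reproduced with a short Go calculation. This is only a worked check of the numbers quoted above; the constant and function names are made up for the example.

```go
package main

import "fmt"

const (
	partitionWorkers  = 32    // auto-detected worker count cited above
	chunkGroupDefault = 10    // default chunk_group_limit cited above
	poolSize          = 128   // new ConnectionQueueSize
	utxostoreClients  = 8     // legacy, blockvalidation, ..., pruner
	protoFdMax        = 15000 // Aerospike default cited above
)

// totalPotentialConnections is the worst case in which every client
// fills its own pool at the same time.
func totalPotentialConnections(pool, clients int) int {
	return pool * clients
}

func main() {
	peak := partitionWorkers * chunkGroupDefault
	total := totalPotentialConnections(poolSize, utxostoreClients)

	fmt.Printf("peak pruner ops: %d\n", peak)                            // 320
	fmt.Printf("peak ops / pool size: %.1f\n", float64(peak)/poolSize)   // 2.5
	fmt.Printf("total potential connections: %d\n", total)               // 1024
	fmt.Printf("under proto-fd-max: %v\n", total < protoFdMax)           // true
}
```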

MinConnectionsPerNode raised proportionally (8 → 16) so warm-up provisions a reasonable baseline.

Scope

Only utxostore.docker.m changes. The commented-out utxostore.docker template right below remains at ConnectionQueueSize=32 (it's a template, not active). Operator/k8s contexts unchanged.

Verification

  • Config diff is one line: ConnectionQueueSize 16 → 128, MinConnectionsPerNode 8 → 16
  • Operator validation: bring up a docker.m deployment (e.g. teranode-quickstart), confirm docker exec aerospike asinfo -v statistics | grep client_connections reflects the larger ceiling and the pruner WARN at pruner_service.go:425 no longer auto-throttles to chunk_group_limit=1
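
For scripted verification, the `client_connections` value can be pulled out of the statistics output programmatically. The sketch below assumes the semicolon-separated `key=value` format that `asinfo -v statistics` prints; the sample string is invented and the `parseStat` helper is hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseStat extracts an integer statistic from semicolon-separated
// key=value output. It returns false if the key is missing or the
// value is not an integer.
func parseStat(stats, key string) (int, bool) {
	for _, pair := range strings.Split(stats, ";") {
		k, v, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok || k != key {
			continue
		}
		n, err := strconv.Atoi(v)
		if err != nil {
			return 0, false
		}
		return n, true
	}
	return 0, false
}

func main() {
	// Invented sample; real output contains many more fields.
	sample := "cluster_size=1;client_connections=72;uptime=3600"
	n, ok := parseStat(sample, "client_connections")
	fmt.Println(n, ok) // 72 true
}
```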

Related

Not in this PR

  • The underlying aerospike-client-go/v8 connection handling on timeout (InDoubt: true paths) leaks connections, though more slowly than steady-state churn returns them. Worth a separate investigation. A bigger pool buys time but doesn't eliminate the leak.
  • evict-used-pct=70 in config/aerospike.conf template silently anchors stop-writes-used-pct=70 even though eviction is a no-op for default-ttl 0. Separate concern; affects quickstart configs.
  • The pruner auto-adjust should also throttle pruner_utxoPartitionQueries, not just pruner_utxoChunkGroupLimit. Separate fix.
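
One possible shape for the last item is sketched below in Go. This is not the pruner's code and not a proposed patch: `fitToPool` is a hypothetical helper that shrinks the chunk group limit first and the partition worker count second until their product fits a connection budget, using the same simplified concurrency model as the description.

```go
package main

import "fmt"

// fitToPool is a hypothetical adjustment: it lowers chunkLimit first,
// then workers, so that workers*chunkLimit does not exceed budget.
// Both returned values are at least 1.
func fitToPool(workers, chunkLimit, budget int) (int, int) {
	if workers < 1 {
		workers = 1
	}
	if chunkLimit < 1 {
		chunkLimit = 1
	}
	if budget < 1 {
		budget = 1
	}
	// Step 1: throttle the inner limit (what the pruner already does).
	if workers*chunkLimit > budget {
		chunkLimit = budget / workers
		if chunkLimit < 1 {
			chunkLimit = 1
		}
	}
	// Step 2: if that is not enough, throttle the outer fanout too.
	if workers*chunkLimit > budget {
		workers = budget / chunkLimit
		if workers < 1 {
			workers = 1
		}
	}
	return workers, chunkLimit
}

func main() {
	// Budget 11 is the "Recommended max" from the WARN line above.
	fmt.Println(fitToPool(32, 10, 11))  // 11 1
	fmt.Println(fitToPool(32, 10, 128)) // 32 4
}
```

With a budget of 11, throttling only the inner limit leaves 32 concurrent ops; reducing the worker count as well brings the product within budget.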

ConnectionQueueSize=16 + LimitConnectionsToQueueSize=true is too tight
for the pruner's default partition-worker fanout (32 workers ×
chunk_group_limit) and for legacy block-processing batch ops under
mainnet IBD load. Once the pool fills, requests time out, the pruner
auto-adjusts pruner_utxoChunkGroupLimit to 1 (per the WARN at
pruner_service.go:425) but its outer partition-worker count is not
auto-throttled, so it still oversubscribes the pool and pruning stalls.

Observed on bsva-ovh-teranode-eu-3 mainnet (v0.15.2-beta-1, mainnet
height ~795360): pruner stuck with NO_AVAILABLE_CONNECTIONS_TO_NODE
errors despite legacy stopped; aerospike namespace hwm-breached +
stop_writes=true while 832M non-expirable records accumulated because
the pruner couldn't write deletes through the saturated pool.

Bump ConnectionQueueSize 16 -> 128 and MinConnectionsPerNode 8 -> 16
for the docker.m context only. Single-node aerospike with proto-fd-max
default 15000 has plenty of server-side headroom; the 16-conn ceiling
was the constraint, not the server.

github-actions Bot commented May 26, 2026

Contributor

🤖 Claude Code Review

Status: Complete


Review Summary

This PR increases Aerospike connection pool settings for the docker.m context to prevent connection exhaustion under mainnet IBD load. The changes are minimal, well-scoped, and properly documented.

Configuration changes verified:

  • settings.conf: ConnectionQueueSize 16→128, MinConnectionsPerNode 8→16 for utxostore.docker.m
  • New context-specific override: pruner_utxoPartitionQueries.docker.m = 8
  • Documentation updated to reflect actual code default (128)

Scope correctly limited:

  • Only docker.m context affected as intended
  • Kubernetes operator config unchanged (still 16, which is appropriate per PR description)
  • Commented template unchanged

Documentation accuracy:
All documentation changes accurately reflect the code. The docs previously showed default 256 in the table but the actual code default is 128 (util/uaerospike/client.go:18), so the doc update to 128 is a correction not just cosmetic alignment.

No issues found. Changes align with AGENTS.md principles: minimal scope, properly verified plan, clear rationale in PR description.

@oskarszoon oskarszoon enabled auto-merge (squash) May 26, 2026 07:39

github-actions Bot commented May 26, 2026

Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-941 (0f98f93)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.974µ 1.722µ ~ 0.200
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.67n 61.69n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.72n 61.90n ~ 0.100
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.75n 61.73n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 29.77n 29.93n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 51.51n 50.12n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 110.6n 116.2n ~ 0.700
MiningCandidate_Stringify_Short-4 262.5n 261.4n ~ 0.400
MiningCandidate_Stringify_Long-4 1.892µ 1.855µ ~ 0.100
MiningSolution_Stringify-4 984.4n 970.8n ~ 0.100
BlockInfo_MarshalJSON-4 1.789µ 1.793µ ~ 1.000
NewFromBytes-4 128.4n 128.4n ~ 1.000
AddTxBatchColumnar_Validation-4 2.471µ 2.535µ ~ 0.100
OffsetValidationLoop-4 635.3n 634.5n ~ 1.000
Mine_EasyDifficulty-4 65.96µ 65.65µ ~ 0.700
Mine_WithAddress-4 7.026µ 7.928µ ~ 0.100
BlockAssembler_AddTx-4 0.02819n 0.02859n ~ 1.000
AddNode-4 11.74 10.82 ~ 0.200
AddNodeWithMap-4 11.56 11.27 ~ 1.000
DiskTxMap_SetIfNotExists-4 3.647µ 3.738µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 4.226µ 18.282µ ~ 0.700
DiskTxMap_ExistenceOnly-4 403.4n 341.6n ~ 0.700
Queue-4 148.3n 150.5n ~ 0.100
AtomicPointer-4 2.515n 2.492n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 629.6µ 632.5µ ~ 1.000
ReorgOptimizations/DedupFilterPipeline/New/10K-4 616.3µ 601.0µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 80.74µ 81.28µ ~ 0.400
ReorgOptimizations/AllMarkFalse/New/10K-4 49.96µ 49.51µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/10K-4 39.09µ 41.40µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/10K-4 8.509µ 8.716µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/10K-4 3.317µ 3.253µ ~ 0.700
ReorgOptimizations/NodeFlags/New/10K-4 1.120µ 1.133µ ~ 0.400
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 7.658m 7.585m ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4 8.432m 7.783m ~ 0.400
ReorgOptimizations/AllMarkFalse/Old/100K-4 869.0µ 863.3µ ~ 0.400
ReorgOptimizations/AllMarkFalse/New/100K-4 547.2µ 545.5µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/100K-4 408.4µ 378.8µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 202.0µ 199.3µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/100K-4 33.71µ 36.27µ ~ 0.700
ReorgOptimizations/NodeFlags/New/100K-4 12.73µ 11.75µ ~ 0.100
TxMapSetIfNotExists-4 38.14n 38.88n ~ 0.100
TxMapSetIfNotExistsDuplicate-4 31.86n 32.22n ~ 0.100
ChannelSendReceive-4 447.3n 443.1n ~ 1.000
DirectSubtreeAdd/4_per_subtree-4 76.36n 76.77n ~ 0.400
DirectSubtreeAdd/64_per_subtree-4 40.96n 41.47n ~ 0.200
DirectSubtreeAdd/256_per_subtree-4 40.38n 39.85n ~ 0.200
DirectSubtreeAdd/1024_per_subtree-4 38.43n 38.46n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 38.12n 38.02n ~ 0.400
SubtreeProcessorAdd/4_per_subtree-4 369.3n 358.2n ~ 0.100
SubtreeProcessorAdd/64_per_subtree-4 357.2n 350.7n ~ 0.100
SubtreeProcessorAdd/256_per_subtree-4 339.1n 336.8n ~ 0.400
SubtreeProcessorAdd/1024_per_subtree-4 334.5n 336.3n ~ 0.700
SubtreeProcessorAdd/2048_per_subtree-4 340.8n 346.4n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 340.7n 350.9n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 339.0n 349.0n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 336.0n 349.8n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 339.0n 338.2n ~ 0.700
SubtreeNodeAddOnly/4_per_subtree-4 88.15n 88.37n ~ 0.700
SubtreeNodeAddOnly/64_per_subtree-4 65.05n 64.90n ~ 0.100
SubtreeNodeAddOnly/256_per_subtree-4 64.37n 64.06n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 63.60n 63.65n ~ 0.700
SubtreeCreationOnly/4_per_subtree-4 147.4n 147.8n ~ 1.000
SubtreeCreationOnly/64_per_subtree-4 526.8n 538.4n ~ 0.100
SubtreeCreationOnly/256_per_subtree-4 1.907µ 1.925µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 6.203µ 6.254µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 11.24µ 11.18µ ~ 0.700
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 342.5n 341.2n ~ 1.000
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 341.6n 337.2n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 2.389m 2.342m ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 6.675m 6.480m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 8.497m 8.139m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 11.72m 11.24m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 1.977m 1.955m ~ 0.700
SequentialGetAndSetIfNotExists/10k_nodes-4 5.586m 5.477m ~ 0.400
SequentialGetAndSetIfNotExists/50k_nodes-4 17.02m 16.21m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 29.90m 31.45m ~ 0.200
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.399m 2.417m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 9.500m 9.500m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 14.68m 14.51m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 2.057m 2.028m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 9.259m 8.819m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 58.03m 55.00m ~ 0.100
CalcBlockWork-4 357.0n 364.1n ~ 1.000
CalculateWork-4 480.1n 500.0n ~ 0.100
BuildBlockLocatorString_Helpers/Size_10-4 1.342µ 1.356µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 13.10µ 13.18µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 156.4µ 160.3µ ~ 0.700
CatchupWithHeaderCache-4 104.5m 104.5m ~ 1.000
SubtreeSizes/10k_tx_4_per_subtree-4 1.341m 1.320m ~ 0.700
SubtreeSizes/10k_tx_16_per_subtree-4 313.6µ 313.8µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 75.09µ 75.11µ ~ 1.000
SubtreeSizes/10k_tx_256_per_subtree-4 18.71µ 18.92µ ~ 0.200
SubtreeSizes/10k_tx_512_per_subtree-4 9.377µ 9.397µ ~ 0.200
SubtreeSizes/10k_tx_1024_per_subtree-4 4.696µ 4.664µ ~ 0.700
SubtreeSizes/10k_tx_2k_per_subtree-4 2.335µ 2.325µ ~ 0.700
BlockSizeScaling/10k_tx_64_per_subtree-4 73.66µ 74.49µ ~ 1.000
BlockSizeScaling/10k_tx_256_per_subtree-4 18.65µ 18.72µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.672µ 4.683µ ~ 0.500
BlockSizeScaling/50k_tx_64_per_subtree-4 385.2µ 386.0µ ~ 1.000
BlockSizeScaling/50k_tx_256_per_subtree-4 91.94µ 92.84µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 22.87µ 23.03µ ~ 0.700
SubtreeAllocations/small_subtrees_exists_check-4 160.3µ 161.4µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 159.6µ 160.1µ ~ 1.000
SubtreeAllocations/small_subtrees_full_validation-4 321.5µ 321.6µ ~ 1.000
SubtreeAllocations/medium_subtrees_exists_check-4 9.504µ 9.630µ ~ 0.700
SubtreeAllocations/medium_subtrees_data_fetch-4 9.353µ 9.435µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 18.75µ 18.81µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.296µ 2.265µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.253µ 2.268µ ~ 0.700
SubtreeAllocations/large_subtrees_full_validation-4 4.683µ 4.717µ ~ 0.700
_BufferPoolAllocation/16KB-4 4.980µ 3.681µ ~ 0.100
_BufferPoolAllocation/32KB-4 7.415µ 7.184µ ~ 0.100
_BufferPoolAllocation/64KB-4 14.31µ 16.94µ ~ 0.400
_BufferPoolAllocation/128KB-4 25.05µ 27.02µ ~ 0.200
_BufferPoolAllocation/512KB-4 113.4µ 106.3µ ~ 0.700
_BufferPoolConcurrent/32KB-4 18.28µ 18.22µ ~ 1.000
_BufferPoolConcurrent/64KB-4 29.01µ 29.00µ ~ 1.000
_BufferPoolConcurrent/512KB-4 142.8µ 149.6µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 611.9µ 607.6µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/32KB-4 614.9µ 606.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 613.7µ 613.1µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/128KB-4 606.8µ 591.9µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 595.6µ 586.5µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.60m 37.19m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.58m 37.06m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.44m 37.00m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.15m 36.74m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 36.45m 36.89m ~ 0.700
_PooledVsNonPooled/Pooled-4 833.9n 834.8n ~ 0.400
_PooledVsNonPooled/NonPooled-4 7.072µ 7.292µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 7.493µ 7.045µ ~ 0.700
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.650µ 9.808µ ~ 0.700
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.406µ 10.110µ ~ 0.100
_prepareTxsPerLevel-4 427.7m 428.2m ~ 1.000
_prepareTxsPerLevelOrdered-4 4.093m 5.300m ~ 0.200
_prepareTxsPerLevel_Comparison/Original-4 429.1m 429.5m ~ 1.000
_prepareTxsPerLevel_Comparison/Optimized-4 4.319m 4.942m ~ 0.400
StoreBlock_Sequential/BelowCSVHeight-4 335.3µ 334.4µ ~ 0.700
StoreBlock_Sequential/AboveCSVHeight-4 335.9µ 335.9µ ~ 1.000
GetUtxoHashes-4 255.5n 252.8n ~ 1.000
GetUtxoHashes_ManyOutputs-4 48.77µ 49.14µ ~ 0.400
_NewMetaDataFromBytes-4 225.0n 228.7n ~ 0.200
_Bytes-4 410.0n 413.9n ~ 0.100
_MetaBytes-4 137.4n 137.4n ~ 1.000

Threshold: >10% with p < 0.05 | Generated: 2026-05-26 08:33 UTC

@blockpusher blockpusher left a comment

LGTM

@oskarszoon oskarszoon disabled auto-merge May 26, 2026 07:53