Skip to content

Look up parent transactions in parallel instead of one at a time#864

Closed
freemans13 wants to merge 2 commits into
bsv-blockchain:mainfrom
freemans13:stu/aerospike-batch-prev-outputs-concurrent
Closed

Look up parent transactions in parallel instead of one at a time#864
freemans13 wants to merge 2 commits into
bsv-blockchain:mainfrom
freemans13:stu/aerospike-batch-prev-outputs-concurrent

Conversation

@freemans13

@freemans13 freemans13 commented May 14, 2026

Copy link
Copy Markdown
Collaborator

What this changes

When a block arrives, we need to look up the parent transaction outputs (amount + locking script) for every input in every transaction. Today we do this one transaction at a time. This PR does several transactions in parallel.

How it works today

Aerospike has a batcher in front of it. Instead of firing one call per input, the batcher collects inputs and sends them off in a single batch when either:

  • it has collected 100 inputs (size cap), or
  • 10 ms have passed since the first input arrived (timer cap).

Good design — but the old code fed the batcher one transaction at a time:

submit tx1's 3 inputs → wait for results → submit tx2's 3 inputs → wait → …

A typical tx only has 2 or 3 inputs, nowhere near the 100-input size cap. So the batcher always fell back to the 10 ms timer before flushing. Every transaction paid that 10 ms wait. On a 1000-tx block that's about 10 seconds of pure timer waiting, even if Aerospike itself is fast.

The fix

Dispatch each transaction in its own goroutine, bounded by UtxoStore.BatchPreviousOutputsDecorateConcurrency (the same knob the SQL store already uses, default 4). Now several transactions feed inputs into the shared batcher at the same time, so:

  • each flush carries inputs from multiple txs instead of just one
  • the 10 ms timer wait gets shared across many txs instead of paid once per tx
  • if you raise the concurrency high enough, the batcher hits its 100-input size cap and flushes immediately — no timer wait at all

The default of 4 is intentionally cautious. Each goroutine is mostly blocked waiting for batcher results, so a small fan-out is enough to keep the batcher busy. Operators can dial it up.

Benchmark — proof it's actually faster

stores/utxo/aerospike/batch_decorate_bench_test.go drives the real BatchPreviousOutputsDecorate against the real go-batcher with production-default settings (100 items / 10 ms timer). The only thing faked is the Aerospike call itself — replaced with a 500 µs sleep that models a real Aerospike round-trip.

1000-tx × 3-input block, Apple M3 Max:

Mode Time per block Speedup
Sequential (today's code) 10.68 s 1.00× — baseline
Concurrent, cap=1 10.72 s 1.00× — same as sequential, sanity check
Concurrent, cap=4 (PR default) 2.68 s 3.99×
Concurrent, cap=16 0.68 s 15.8×
Concurrent, cap=64 0.019 s ~570×

Two things to take from this:

  1. Cap=1 behaves exactly like the old code. That's the zero-risk fallback — if anyone hits trouble, setting utxostore_batchPreviousOutputsDecorateConcurrency=1 restores the old behaviour.
  2. At the default cap=4, blocks decorate about 4× faster. That's the timer-sharing effect exactly as the maths predicts.

Why is the default only 4 if cap=64 is 570× faster?

Two reasons:

  • Matches the existing SQL store knob. SQL store uses the same setting at default 4, so operators only have one number to think about.
  • Cautious. An earlier version of this change (in perf(netsync): broaden quickValidationMode to cover CATCHINGBLOCKS + behind-tip #723) used OutpointBatcherSize=1024 as the concurrency cap, which triggered a known legacy-sync OOM. Default 4 has no memory risk; operators with headroom can raise it.

Tests

  • go build ./... clean
  • go vet ./stores/utxo/aerospike/... clean
  • go test -race -short ./stores/utxo/aerospike/ — 464 passed
  • New benchmark above

How this relates to #865

Both PRs attack the same 10 ms timer problem, at different layers:

They compose. Drain mode plus concurrent dispatch is faster than either alone.

Context

Originally bundled in #723. Pulled out as a focused PR after #723's core change (broadening quickValidationMode to CATCHINGBLOCKS) was superseded by #847.

Aerospike's BatchPreviousOutputsDecorate ran PreviousOutputsDecorate
sequentially per tx. PreviousOutputsDecorate enqueues each input into
the shared outpointBatcher and then waits for its inputs' results;
running it concurrently across txs lets the batcher coalesce inputs
from many transactions into single Aerospike batch reads — the whole
point of having a shared batcher.

The sequential loop submitted one tx's inputs, drained them, then
moved on, so the batcher only ever saw one tx's inputs at a time and
sized batches accordingly. On large blocks that left most of the
batcher's capacity unused on every flush.

Fix: dispatch each tx in its own goroutine via errgroup, bounded by
UtxoStore.BatchPreviousOutputsDecorateConcurrency (the same knob the
SQL store already uses, default 4). The default is intentionally low
— each goroutine is mostly waiting on batcher result channels, so
small fan-out is enough to keep the batcher saturated without
spawning thousands of goroutines on huge blocks.

Originally bundled in bsv-blockchain#723; pulled out as a focused PR after bsv-blockchain#723's
core change (CATCHINGBLOCKS quick-validation gating) was superseded
by bsv-blockchain#847. The original PR used OutpointBatcherSize as the concurrency
cap, which would default to 1024 — known to trigger OOM on legacy
sync. Reusing the SQL store's existing setting keeps the operator
surface identical and the default safe.

Tests:
- go build ./...
- go vet ./stores/utxo/aerospike/...
- go test -race -short ./stores/utxo/aerospike/ — 464 passed
  (TestStore_IncrementSpentRecords passes standalone; an earlier
  full-suite timeout was unrelated to this change)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

Current Review:

  • No issues found
  • Implementation is correct and follows project patterns
  • Benchmark design accurately models the performance improvement
  • Documentation is clear and explains the rationale well

Summary:
The PR introduces bounded concurrency to BatchPreviousOutputsDecorate for the Aerospike UTXO store, allowing multiple transactions to feed the shared batcher concurrently. This matches the existing SQL store implementation pattern and uses the same configuration knob (BatchPreviousOutputsDecorateConcurrency).

The implementation is sound: it properly bounds goroutines using errgroup with SafeSetLimit, handles context cancellation, and maintains backward compatibility (concurrency=1 restores sequential behavior). The benchmark accurately models the batcher's size-or-timer flush rule and demonstrates the expected ~4× speedup at the default concurrency of 4.

Code follows AGENTS.md requirements for minimal changes, clear documentation, and correct error handling.

@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-864 (4125006)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.677µ 1.687µ ~ 0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.64n 61.88n ~ 0.200
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.82n 61.59n ~ 0.300
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.72n 61.58n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 30.47n 31.20n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 52.57n 53.75n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 116.7n 122.2n ~ 0.400
MiningCandidate_Stringify_Short-4 263.0n 264.7n ~ 0.100
MiningCandidate_Stringify_Long-4 1.917µ 1.920µ ~ 1.000
MiningSolution_Stringify-4 996.2n 993.4n ~ 0.100
BlockInfo_MarshalJSON-4 1.801µ 1.790µ ~ 0.700
NewFromBytes-4 130.9n 158.9n ~ 0.700
Mine_EasyDifficulty-4 65.42µ 65.86µ ~ 0.400
Mine_WithAddress-4 7.288µ 7.027µ ~ 0.700
BlockAssembler_AddTx-4 0.02965n 0.02929n ~ 1.000
AddNode-4 10.96 11.07 ~ 1.000
AddNodeWithMap-4 11.05 10.74 ~ 0.400
DiskTxMap_SetIfNotExists-4 3.500µ 3.607µ ~ 0.100
DiskTxMap_SetIfNotExists_Parallel-4 3.451µ 3.443µ ~ 1.000
DiskTxMap_ExistenceOnly-4 311.7n 332.8n ~ 0.100
Queue-4 188.8n 191.0n ~ 0.400
AtomicPointer-4 4.892n 4.654n ~ 0.700
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 904.9µ 906.2µ ~ 1.000
ReorgOptimizations/DedupFilterPipeline/New/10K-4 827.2µ 854.5µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 105.6µ 123.3µ ~ 0.100
ReorgOptimizations/AllMarkFalse/New/10K-4 62.66µ 62.52µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/10K-4 64.79µ 69.77µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/10K-4 12.55µ 11.82µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/10K-4 4.814µ 5.102µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.607µ 1.809µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.171m 9.964m ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4 9.404m 9.573m ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.104m 1.189m ~ 0.100
ReorgOptimizations/AllMarkFalse/New/100K-4 687.6µ 691.1µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/100K-4 541.1µ 551.0µ ~ 0.400
ReorgOptimizations/HashSlicePool/New/100K-4 306.9µ 342.1µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/100K-4 49.47µ 50.53µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 17.26µ 17.65µ ~ 0.700
TxMapSetIfNotExists-4 51.40n 51.49n ~ 0.400
TxMapSetIfNotExistsDuplicate-4 38.09n 38.10n ~ 1.000
ChannelSendReceive-4 616.5n 615.3n ~ 0.700
DirectSubtreeAdd/4_per_subtree-4 58.82n 57.45n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 29.79n 29.09n ~ 0.200
DirectSubtreeAdd/256_per_subtree-4 27.90n 28.67n ~ 0.100
DirectSubtreeAdd/1024_per_subtree-4 26.51n 26.46n ~ 1.000
DirectSubtreeAdd/2048_per_subtree-4 25.99n 26.03n ~ 0.700
SubtreeProcessorAdd/4_per_subtree-4 288.3n 283.5n ~ 0.700
SubtreeProcessorAdd/64_per_subtree-4 281.9n 278.8n ~ 0.700
SubtreeProcessorAdd/256_per_subtree-4 283.0n 279.5n ~ 0.200
SubtreeProcessorAdd/1024_per_subtree-4 271.7n 269.4n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 272.5n 269.3n ~ 0.400
SubtreeProcessorRotate/4_per_subtree-4 274.6n 275.6n ~ 0.800
SubtreeProcessorRotate/64_per_subtree-4 273.8n 272.4n ~ 0.200
SubtreeProcessorRotate/256_per_subtree-4 274.8n 272.0n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 272.2n 271.9n ~ 1.000
SubtreeNodeAddOnly/4_per_subtree-4 55.07n 55.19n ~ 0.400
SubtreeNodeAddOnly/64_per_subtree-4 35.98n 36.05n ~ 0.700
SubtreeNodeAddOnly/256_per_subtree-4 35.16n 34.98n ~ 0.100
SubtreeNodeAddOnly/1024_per_subtree-4 34.45n 34.68n ~ 0.100
SubtreeCreationOnly/4_per_subtree-4 111.1n 111.2n ~ 1.000
SubtreeCreationOnly/64_per_subtree-4 358.2n 346.1n ~ 0.100
SubtreeCreationOnly/256_per_subtree-4 1.218µ 1.309µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 3.776µ 3.939µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 6.662µ 7.228µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 273.2n 271.5n ~ 0.700
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 275.3n 274.0n ~ 1.000
ParallelGetAndSetIfNotExists/1k_nodes-4 548.7µ 544.0µ ~ 0.700
ParallelGetAndSetIfNotExists/10k_nodes-4 1.345m 1.357m ~ 0.400
ParallelGetAndSetIfNotExists/50k_nodes-4 6.840m 6.734m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 13.78m 13.57m ~ 0.200
SequentialGetAndSetIfNotExists/1k_nodes-4 625.8µ 619.4µ ~ 0.700
SequentialGetAndSetIfNotExists/10k_nodes-4 2.877m 2.884m ~ 1.000
SequentialGetAndSetIfNotExists/50k_nodes-4 11.27m 11.12m ~ 0.200
SequentialGetAndSetIfNotExists/100k_nodes-4 21.85m 21.31m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 590.3µ 592.6µ ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.601m 4.592m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 17.34m 17.38m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 670.8µ 669.0µ ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 6.271m 6.208m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 39.91m 39.40m ~ 0.100
CalcBlockWork-4 464.3n 468.7n ~ 0.200
CalculateWork-4 661.1n 625.3n ~ 1.000
BuildBlockLocatorString_Helpers/Size_10-4 1.349µ 1.292µ ~ 0.200
BuildBlockLocatorString_Helpers/Size_100-4 12.54µ 13.28µ ~ 0.700
BuildBlockLocatorString_Helpers/Size_1000-4 122.5µ 123.0µ ~ 0.200
CatchupWithHeaderCache-4 104.5m 104.7m ~ 0.200
SubtreeSizes/10k_tx_4_per_subtree-4 1.346m 1.374m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 320.7µ 322.2µ ~ 1.000
SubtreeSizes/10k_tx_64_per_subtree-4 76.91µ 77.45µ ~ 0.700
SubtreeSizes/10k_tx_256_per_subtree-4 19.10µ 19.21µ ~ 1.000
SubtreeSizes/10k_tx_512_per_subtree-4 9.477µ 9.396µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.703µ 4.754µ ~ 0.700
SubtreeSizes/10k_tx_2k_per_subtree-4 2.363µ 2.342µ ~ 0.100
BlockSizeScaling/10k_tx_64_per_subtree-4 74.59µ 74.47µ ~ 0.400
BlockSizeScaling/10k_tx_256_per_subtree-4 18.70µ 18.91µ ~ 0.200
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.641µ 4.692µ ~ 0.200
BlockSizeScaling/50k_tx_64_per_subtree-4 388.6µ 394.4µ ~ 0.100
BlockSizeScaling/50k_tx_256_per_subtree-4 93.84µ 93.50µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.03µ 22.97µ ~ 0.400
SubtreeAllocations/small_subtrees_exists_check-4 154.7µ 156.1µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 166.6µ 166.5µ ~ 1.000
SubtreeAllocations/small_subtrees_full_validation-4 325.2µ 320.2µ ~ 0.200
SubtreeAllocations/medium_subtrees_exists_check-4 9.146µ 9.151µ ~ 1.000
SubtreeAllocations/medium_subtrees_data_fetch-4 9.748µ 9.645µ ~ 1.000
SubtreeAllocations/medium_subtrees_full_validation-4 18.96µ 18.79µ ~ 0.400
SubtreeAllocations/large_subtrees_exists_check-4 2.173µ 2.193µ ~ 0.400
SubtreeAllocations/large_subtrees_data_fetch-4 2.354µ 2.359µ ~ 0.800
SubtreeAllocations/large_subtrees_full_validation-4 4.686µ 4.719µ ~ 0.700
_prepareTxsPerLevel-4 404.7m 400.9m ~ 1.000
_prepareTxsPerLevelOrdered-4 3.795m 3.441m ~ 0.200
_prepareTxsPerLevel_Comparison/Original-4 406.1m 407.7m ~ 1.000
_prepareTxsPerLevel_Comparison/Optimized-4 3.654m 3.634m ~ 1.000
_BufferPoolAllocation/16KB-4 3.645µ 3.724µ ~ 1.000
_BufferPoolAllocation/32KB-4 7.771µ 7.789µ ~ 1.000
_BufferPoolAllocation/64KB-4 16.59µ 18.33µ ~ 0.400
_BufferPoolAllocation/128KB-4 32.87µ 30.56µ ~ 0.100
_BufferPoolAllocation/512KB-4 121.7µ 125.1µ ~ 0.200
_BufferPoolConcurrent/32KB-4 17.28µ 18.58µ ~ 0.100
_BufferPoolConcurrent/64KB-4 27.24µ 28.54µ ~ 0.100
_BufferPoolConcurrent/512KB-4 173.8µ 175.0µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 707.9µ 705.7µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/32KB-4 712.7µ 700.2µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/64KB-4 698.4µ 686.3µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/128KB-4 705.1µ 673.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 704.5µ 717.3µ ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/16KB-4 38.10m 38.08m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/32KB-4 37.97m 38.15m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/64KB-4 37.96m 38.44m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/128KB-4 37.80m 38.51m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 38.02m 38.02m ~ 0.700
_PooledVsNonPooled/Pooled-4 821.2n 821.5n ~ 0.400
_PooledVsNonPooled/NonPooled-4 6.840µ 6.584µ ~ 0.700
_MemoryFootprint/Current_512KB_32concurrent-4 9.314µ 8.502µ ~ 0.200
_MemoryFootprint/Proposed_32KB_32concurrent-4 11.15µ 11.64µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 10.97µ 11.14µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 320.8µ 312.5µ ~ 0.400
StoreBlock_Sequential/AboveCSVHeight-4 307.6µ 307.7µ ~ 1.000
GetUtxoHashes-4 258.1n 256.2n ~ 1.000
GetUtxoHashes_ManyOutputs-4 43.32µ 45.45µ ~ 0.100
_NewMetaDataFromBytes-4 237.8n 237.0n ~ 0.400
_Bytes-4 628.7n 622.2n ~ 0.700
_MetaBytes-4 567.5n 568.5n ~ 0.700

Threshold: >10% with p < 0.05 | Generated: 2026-05-14 11:38 UTC

…s concurrent

Direct head-to-head benchmark of the old sequential per-tx loop against
the new concurrent dispatch, holding everything else equal. Uses the
real go-batcher with production defaults (100 items / 10 ms timer) and
stubs the Aerospike flush callback with a 500 µs simulated batch
round-trip — modelling per-batch latency that's independent of batch
size, which matches Aerospike's real behaviour.

On a 1000-tx × 3-input block, Apple M3 Max:

  sequential       10.68 s/op   (baseline)
  concurrent_1     10.72 s/op   1.00x (sanity check)
  concurrent_4      2.68 s/op   3.99x (PR default)
  concurrent_16     0.68 s/op  15.8x
  concurrent_64    0.019 s/op  570x

concurrent_1 matches sequential almost exactly, confirming the new
implementation behaves identically to the old one when concurrency=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@freemans13 freemans13 self-assigned this May 14, 2026
@freemans13 freemans13 changed the title perf(utxo/aerospike): parallelise BatchPreviousOutputsDecorate Look up parent transactions in parallel instead of one at a time May 19, 2026
@freemans13

Copy link
Copy Markdown
Collaborator Author

Closing as superseded by #893, which landed essentially the same fix (concurrent fan-out of BatchPreviousOutputsDecorate via errgroup).

One thing worth flagging from this PR's history: #893 bounds the fan-out by UtxoStore.OutpointBatcherSize (default 1024), whereas this PR deliberately used a new dedicated knob BatchPreviousOutputsDecorateConcurrency with default 4 to avoid the legacy-sync OOM seen when an earlier attempt (#723) ran at 1024. Operators running legacy sync should keep an eye on RAM on the #893 path; a follow-up to either lower the default or introduce a dedicated cap may be worth considering.

The benchmark added here (stores/utxo/aerospike/batch_decorate_bench_test.go) isn't in #893 — happy to port it across in a separate PR if that's useful.

@freemans13 freemans13 closed this May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant