Look up parent transactions in parallel instead of one at a time by freemans13 · Pull Request #864 · bsv-blockchain/teranode

freemans13 · 2026-05-14T11:01:22Z

What this changes

When a block arrives, we need to look up the parent transaction outputs (amount + locking script) for every input in every transaction. Today we do this one transaction at a time. This PR does several transactions in parallel.

How it works today

Aerospike has a batcher in front of it. Instead of firing one call per input, the batcher collects inputs and sends them off in a single batch when either:

it has collected 100 inputs (size cap), or
10 ms have passed since the first input arrived (timer cap).

Good design — but the old code fed the batcher one transaction at a time:

submit tx1's 3 inputs → wait for results → submit tx2's 3 inputs → wait → …

A typical tx only has 2 or 3 inputs, nowhere near the 100-input size cap. So the batcher always fell back to the 10 ms timer before flushing. Every transaction paid that 10 ms wait. On a 1000-tx block that's about 10 seconds of pure timer waiting, even if Aerospike itself is fast.

The fix

Dispatch each transaction in its own goroutine, bounded by UtxoStore.BatchPreviousOutputsDecorateConcurrency (the same knob the SQL store already uses, default 4). Now several transactions feed inputs into the shared batcher at the same time, so:

each flush carries inputs from multiple txs instead of just one
the 10 ms timer wait gets shared across many txs instead of paid once per tx
if you raise the concurrency high enough, the batcher hits its 100-input size cap and flushes immediately — no timer wait at all

The default of 4 is intentionally cautious. Each goroutine is mostly blocked waiting for batcher results, so a small fan-out is enough to keep the batcher busy. Operators can dial it up.

Benchmark — proof it's actually faster

stores/utxo/aerospike/batch_decorate_bench_test.go drives the real BatchPreviousOutputsDecorate against the real go-batcher with production-default settings (100 items / 10 ms timer). The only thing faked is the Aerospike call itself — replaced with a 500 µs sleep that models a real Aerospike round-trip.

1000-tx × 3-input block, Apple M3 Max:

Mode	Time per block	Speedup
Sequential (today's code)	10.68 s	1.00× — baseline
Concurrent, cap=1	10.72 s	1.00× — same as sequential, sanity check
Concurrent, cap=4 (PR default)	2.68 s	3.99×
Concurrent, cap=16	0.68 s	15.8×
Concurrent, cap=64	0.019 s	~570×

Two things to take from this:

Cap=1 behaves exactly like the old code. That's the zero-risk fallback — if anyone hits trouble, setting utxostore_batchPreviousOutputsDecorateConcurrency=1 restores the old behaviour.
At the default cap=4, blocks decorate about 4× faster. That's the timer-sharing effect exactly as the maths predicts.

Why is the default only 4 if cap=64 is 570× faster?

Two reasons:

Matches the existing SQL store knob. SQL store uses the same setting at default 4, so operators only have one number to think about.
Cautious. An earlier version of this change (in perf(netsync): broaden quickValidationMode to cover CATCHINGBLOCKS + behind-tip #723) used OutpointBatcherSize=1024 as the concurrency cap, which triggered a known legacy-sync OOM. Default 4 has no memory risk; operators with headroom can raise it.

Tests

go build ./... clean
go vet ./stores/utxo/aerospike/... clean
go test -race -short ./stores/utxo/aerospike/ — 464 passed
New benchmark above

How this relates to #865

Both PRs attack the same 10 ms timer problem, at different layers:

Add opt-in drain mode for the outpoint batcher (off by default) #865: tells the batcher to flush immediately instead of waiting for the timer (one setting, opt-in).
This PR: parallelises transaction dispatch above the batcher, so the batcher fills up faster.

They compose. Drain mode plus concurrent dispatch is faster than either alone.

Context

Originally bundled in #723. Pulled out as a focused PR after #723's core change (broadening quickValidationMode to CATCHINGBLOCKS) was superseded by #847.

Aerospike's BatchPreviousOutputsDecorate ran PreviousOutputsDecorate sequentially per tx. PreviousOutputsDecorate enqueues each input into the shared outpointBatcher and then waits for its inputs' results; running it concurrently across txs lets the batcher coalesce inputs from many transactions into single Aerospike batch reads — the whole point of having a shared batcher. The sequential loop submitted one tx's inputs, drained them, then moved on, so the batcher only ever saw one tx's inputs at a time and sized batches accordingly. On large blocks that left most of the batcher's capacity unused on every flush. Fix: dispatch each tx in its own goroutine via errgroup, bounded by UtxoStore.BatchPreviousOutputsDecorateConcurrency (the same knob the SQL store already uses, default 4). The default is intentionally low — each goroutine is mostly waiting on batcher result channels, so small fan-out is enough to keep the batcher saturated without spawning thousands of goroutines on huge blocks. Originally bundled in bsv-blockchain#723; pulled out as a focused PR after bsv-blockchain#723's core change (CATCHINGBLOCKS quick-validation gating) was superseded by bsv-blockchain#847. The original PR used OutpointBatcherSize as the concurrency cap, which would default to 1024 — known to trigger OOM on legacy sync. Reusing the SQL store's existing setting keeps the operator surface identical and the default safe. Tests: - go build ./... - go vet ./stores/utxo/aerospike/... - go test -race -short ./stores/utxo/aerospike/ — 464 passed (TestStore_IncrementSpentRecords passes standalone; an earlier full-suite timeout was unrelated to this change) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-14T11:03:11Z

🤖 Claude Code Review

Status: Complete

Current Review:

No issues found
Implementation is correct and follows project patterns
Benchmark design accurately models the performance improvement
Documentation is clear and explains the rationale well

Summary:
The PR introduces bounded concurrency to BatchPreviousOutputsDecorate for the Aerospike UTXO store, allowing multiple transactions to feed the shared batcher concurrently. This matches the existing SQL store implementation pattern and uses the same configuration knob (BatchPreviousOutputsDecorateConcurrency).

The implementation is sound: it properly bounds goroutines using errgroup with SafeSetLimit, handles context cancellation, and maintains backward compatibility (concurrency=1 restores sequential behavior). The benchmark accurately models the batcher's size-or-timer flush rule and demonstrates the expected ~4× speedup at the default concurrency of 4.

Code follows AGENTS.md requirements for minimal changes, clear documentation, and correct error handling.

github-actions · 2026-05-14T11:16:11Z

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-864 (4125006)

Summary

Regressions: 0
Improvements: 0
Unchanged: 142
Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark	Baseline	Current	Change	p-value
_NewBlockFromBytes-4	1.677µ	1.687µ	~	0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4	61.64n	61.88n	~	0.200
SplitSyncedParentMap_SetIfNotExists/16_buckets-4	61.82n	61.59n	~	0.300
SplitSyncedParentMap_SetIfNotExists/1_bucket-4	61.72n	61.58n	~	0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets...	30.47n	31.20n	~	0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_...	52.57n	53.75n	~	0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa...	116.7n	122.2n	~	0.400
MiningCandidate_Stringify_Short-4	263.0n	264.7n	~	0.100
MiningCandidate_Stringify_Long-4	1.917µ	1.920µ	~	1.000
MiningSolution_Stringify-4	996.2n	993.4n	~	0.100
BlockInfo_MarshalJSON-4	1.801µ	1.790µ	~	0.700
NewFromBytes-4	130.9n	158.9n	~	0.700
Mine_EasyDifficulty-4	65.42µ	65.86µ	~	0.400
Mine_WithAddress-4	7.288µ	7.027µ	~	0.700
BlockAssembler_AddTx-4	0.02965n	0.02929n	~	1.000
AddNode-4	10.96	11.07	~	1.000
AddNodeWithMap-4	11.05	10.74	~	0.400
DiskTxMap_SetIfNotExists-4	3.500µ	3.607µ	~	0.100
DiskTxMap_SetIfNotExists_Parallel-4	3.451µ	3.443µ	~	1.000
DiskTxMap_ExistenceOnly-4	311.7n	332.8n	~	0.100
Queue-4	188.8n	191.0n	~	0.400
AtomicPointer-4	4.892n	4.654n	~	0.700
ReorgOptimizations/DedupFilterPipeline/Old/10K-4	904.9µ	906.2µ	~	1.000
ReorgOptimizations/DedupFilterPipeline/New/10K-4	827.2µ	854.5µ	~	0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4	105.6µ	123.3µ	~	0.100
ReorgOptimizations/AllMarkFalse/New/10K-4	62.66µ	62.52µ	~	0.700
ReorgOptimizations/HashSlicePool/Old/10K-4	64.79µ	69.77µ	~	0.100
ReorgOptimizations/HashSlicePool/New/10K-4	12.55µ	11.82µ	~	0.100
ReorgOptimizations/NodeFlags/Old/10K-4	4.814µ	5.102µ	~	0.100
ReorgOptimizations/NodeFlags/New/10K-4	1.607µ	1.809µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4	9.171m	9.964m	~	0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4	9.404m	9.573m	~	0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4	1.104m	1.189m	~	0.100
ReorgOptimizations/AllMarkFalse/New/100K-4	687.6µ	691.1µ	~	0.400
ReorgOptimizations/HashSlicePool/Old/100K-4	541.1µ	551.0µ	~	0.400
ReorgOptimizations/HashSlicePool/New/100K-4	306.9µ	342.1µ	~	0.200
ReorgOptimizations/NodeFlags/Old/100K-4	49.47µ	50.53µ	~	0.100
ReorgOptimizations/NodeFlags/New/100K-4	17.26µ	17.65µ	~	0.700
TxMapSetIfNotExists-4	51.40n	51.49n	~	0.400
TxMapSetIfNotExistsDuplicate-4	38.09n	38.10n	~	1.000
ChannelSendReceive-4	616.5n	615.3n	~	0.700
DirectSubtreeAdd/4_per_subtree-4	58.82n	57.45n	~	1.000
DirectSubtreeAdd/64_per_subtree-4	29.79n	29.09n	~	0.200
DirectSubtreeAdd/256_per_subtree-4	27.90n	28.67n	~	0.100
DirectSubtreeAdd/1024_per_subtree-4	26.51n	26.46n	~	1.000
DirectSubtreeAdd/2048_per_subtree-4	25.99n	26.03n	~	0.700
SubtreeProcessorAdd/4_per_subtree-4	288.3n	283.5n	~	0.700
SubtreeProcessorAdd/64_per_subtree-4	281.9n	278.8n	~	0.700
SubtreeProcessorAdd/256_per_subtree-4	283.0n	279.5n	~	0.200
SubtreeProcessorAdd/1024_per_subtree-4	271.7n	269.4n	~	0.100
SubtreeProcessorAdd/2048_per_subtree-4	272.5n	269.3n	~	0.400
SubtreeProcessorRotate/4_per_subtree-4	274.6n	275.6n	~	0.800
SubtreeProcessorRotate/64_per_subtree-4	273.8n	272.4n	~	0.200
SubtreeProcessorRotate/256_per_subtree-4	274.8n	272.0n	~	0.100
SubtreeProcessorRotate/1024_per_subtree-4	272.2n	271.9n	~	1.000
SubtreeNodeAddOnly/4_per_subtree-4	55.07n	55.19n	~	0.400
SubtreeNodeAddOnly/64_per_subtree-4	35.98n	36.05n	~	0.700
SubtreeNodeAddOnly/256_per_subtree-4	35.16n	34.98n	~	0.100
SubtreeNodeAddOnly/1024_per_subtree-4	34.45n	34.68n	~	0.100
SubtreeCreationOnly/4_per_subtree-4	111.1n	111.2n	~	1.000
SubtreeCreationOnly/64_per_subtree-4	358.2n	346.1n	~	0.100
SubtreeCreationOnly/256_per_subtree-4	1.218µ	1.309µ	~	0.100
SubtreeCreationOnly/1024_per_subtree-4	3.776µ	3.939µ	~	0.100
SubtreeCreationOnly/2048_per_subtree-4	6.662µ	7.228µ	~	0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4	273.2n	271.5n	~	0.700
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4	275.3n	274.0n	~	1.000
ParallelGetAndSetIfNotExists/1k_nodes-4	548.7µ	544.0µ	~	0.700
ParallelGetAndSetIfNotExists/10k_nodes-4	1.345m	1.357m	~	0.400
ParallelGetAndSetIfNotExists/50k_nodes-4	6.840m	6.734m	~	0.100
ParallelGetAndSetIfNotExists/100k_nodes-4	13.78m	13.57m	~	0.200
SequentialGetAndSetIfNotExists/1k_nodes-4	625.8µ	619.4µ	~	0.700
SequentialGetAndSetIfNotExists/10k_nodes-4	2.877m	2.884m	~	1.000
SequentialGetAndSetIfNotExists/50k_nodes-4	11.27m	11.12m	~	0.200
SequentialGetAndSetIfNotExists/100k_nodes-4	21.85m	21.31m	~	0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4	590.3µ	592.6µ	~	1.000
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4	4.601m	4.592m	~	0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4	17.34m	17.38m	~	1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4	670.8µ	669.0µ	~	1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4	6.271m	6.208m	~	0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4	39.91m	39.40m	~	0.100
CalcBlockWork-4	464.3n	468.7n	~	0.200
CalculateWork-4	661.1n	625.3n	~	1.000
BuildBlockLocatorString_Helpers/Size_10-4	1.349µ	1.292µ	~	0.200
BuildBlockLocatorString_Helpers/Size_100-4	12.54µ	13.28µ	~	0.700
BuildBlockLocatorString_Helpers/Size_1000-4	122.5µ	123.0µ	~	0.200
CatchupWithHeaderCache-4	104.5m	104.7m	~	0.200
SubtreeSizes/10k_tx_4_per_subtree-4	1.346m	1.374m	~	1.000
SubtreeSizes/10k_tx_16_per_subtree-4	320.7µ	322.2µ	~	1.000
SubtreeSizes/10k_tx_64_per_subtree-4	76.91µ	77.45µ	~	0.700
SubtreeSizes/10k_tx_256_per_subtree-4	19.10µ	19.21µ	~	1.000
SubtreeSizes/10k_tx_512_per_subtree-4	9.477µ	9.396µ	~	0.700
SubtreeSizes/10k_tx_1024_per_subtree-4	4.703µ	4.754µ	~	0.700
SubtreeSizes/10k_tx_2k_per_subtree-4	2.363µ	2.342µ	~	0.100
BlockSizeScaling/10k_tx_64_per_subtree-4	74.59µ	74.47µ	~	0.400
BlockSizeScaling/10k_tx_256_per_subtree-4	18.70µ	18.91µ	~	0.200
BlockSizeScaling/10k_tx_1024_per_subtree-4	4.641µ	4.692µ	~	0.200
BlockSizeScaling/50k_tx_64_per_subtree-4	388.6µ	394.4µ	~	0.100
BlockSizeScaling/50k_tx_256_per_subtree-4	93.84µ	93.50µ	~	0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4	23.03µ	22.97µ	~	0.400
SubtreeAllocations/small_subtrees_exists_check-4	154.7µ	156.1µ	~	0.700
SubtreeAllocations/small_subtrees_data_fetch-4	166.6µ	166.5µ	~	1.000
SubtreeAllocations/small_subtrees_full_validation-4	325.2µ	320.2µ	~	0.200
SubtreeAllocations/medium_subtrees_exists_check-4	9.146µ	9.151µ	~	1.000
SubtreeAllocations/medium_subtrees_data_fetch-4	9.748µ	9.645µ	~	1.000
SubtreeAllocations/medium_subtrees_full_validation-4	18.96µ	18.79µ	~	0.400
SubtreeAllocations/large_subtrees_exists_check-4	2.173µ	2.193µ	~	0.400
SubtreeAllocations/large_subtrees_data_fetch-4	2.354µ	2.359µ	~	0.800
SubtreeAllocations/large_subtrees_full_validation-4	4.686µ	4.719µ	~	0.700
_prepareTxsPerLevel-4	404.7m	400.9m	~	1.000
_prepareTxsPerLevelOrdered-4	3.795m	3.441m	~	0.200
_prepareTxsPerLevel_Comparison/Original-4	406.1m	407.7m	~	1.000
_prepareTxsPerLevel_Comparison/Optimized-4	3.654m	3.634m	~	1.000
_BufferPoolAllocation/16KB-4	3.645µ	3.724µ	~	1.000
_BufferPoolAllocation/32KB-4	7.771µ	7.789µ	~	1.000
_BufferPoolAllocation/64KB-4	16.59µ	18.33µ	~	0.400
_BufferPoolAllocation/128KB-4	32.87µ	30.56µ	~	0.100
_BufferPoolAllocation/512KB-4	121.7µ	125.1µ	~	0.200
_BufferPoolConcurrent/32KB-4	17.28µ	18.58µ	~	0.100
_BufferPoolConcurrent/64KB-4	27.24µ	28.54µ	~	0.100
_BufferPoolConcurrent/512KB-4	173.8µ	175.0µ	~	0.100
_SubtreeDeserializationWithBufferSizes/16KB-4	707.9µ	705.7µ	~	1.000
_SubtreeDeserializationWithBufferSizes/32KB-4	712.7µ	700.2µ	~	0.400
_SubtreeDeserializationWithBufferSizes/64KB-4	698.4µ	686.3µ	~	0.400
_SubtreeDeserializationWithBufferSizes/128KB-4	705.1µ	673.8µ	~	0.100
_SubtreeDeserializationWithBufferSizes/512KB-4	704.5µ	717.3µ	~	0.200
_SubtreeDataDeserializationWithBufferSizes/16KB-4	38.10m	38.08m	~	1.000
_SubtreeDataDeserializationWithBufferSizes/32KB-4	37.97m	38.15m	~	0.400
_SubtreeDataDeserializationWithBufferSizes/64KB-4	37.96m	38.44m	~	0.200
_SubtreeDataDeserializationWithBufferSizes/128KB-4	37.80m	38.51m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4	38.02m	38.02m	~	0.700
_PooledVsNonPooled/Pooled-4	821.2n	821.5n	~	0.400
_PooledVsNonPooled/NonPooled-4	6.840µ	6.584µ	~	0.700
_MemoryFootprint/Current_512KB_32concurrent-4	9.314µ	8.502µ	~	0.200
_MemoryFootprint/Proposed_32KB_32concurrent-4	11.15µ	11.64µ	~	0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4	10.97µ	11.14µ	~	0.100
StoreBlock_Sequential/BelowCSVHeight-4	320.8µ	312.5µ	~	0.400
StoreBlock_Sequential/AboveCSVHeight-4	307.6µ	307.7µ	~	1.000
GetUtxoHashes-4	258.1n	256.2n	~	1.000
GetUtxoHashes_ManyOutputs-4	43.32µ	45.45µ	~	0.100
_NewMetaDataFromBytes-4	237.8n	237.0n	~	0.400
_Bytes-4	628.7n	622.2n	~	0.700
_MetaBytes-4	567.5n	568.5n	~	0.700

Threshold: >10% with p < 0.05 | Generated: 2026-05-14 11:38 UTC

…s concurrent Direct head-to-head benchmark of the old sequential per-tx loop against the new concurrent dispatch, holding everything else equal. Uses the real go-batcher with production defaults (100 items / 10 ms timer) and stubs the Aerospike flush callback with a 500 µs simulated batch round-trip — modelling per-batch latency that's independent of batch size, which matches Aerospike's real behaviour. On a 1000-tx × 3-input block, Apple M3 Max: sequential 10.68 s/op (baseline) concurrent_1 10.72 s/op 1.00x (sanity check) concurrent_4 2.68 s/op 3.99x (PR default) concurrent_16 0.68 s/op 15.8x concurrent_64 0.019 s/op 570x concurrent_1 matches sequential almost exactly, confirming the new implementation behaves identically to the old one when concurrency=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-05-14T11:35:31Z

Quality Gate failed

Failed conditions
0.0% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

freemans13 · 2026-05-19T12:37:22Z

Closing as superseded by #893, which landed essentially the same fix (concurrent fan-out of BatchPreviousOutputsDecorate via errgroup).

One thing worth flagging from this PR's history: #893 bounds the fan-out by UtxoStore.OutpointBatcherSize (default 1024), whereas this PR deliberately used a new dedicated knob BatchPreviousOutputsDecorateConcurrency with default 4 to avoid the legacy-sync OOM seen when an earlier attempt (#723) ran at 1024. Operators running legacy sync should keep an eye on RAM on the #893 path; a follow-up to either lower the default or introduce a dedicated cap may be worth considering.

The benchmark added here (stores/utxo/aerospike/batch_decorate_bench_test.go) isn't in #893 — happy to port it across in a separate PR if that's useful.

freemans13 mentioned this pull request May 14, 2026

Add opt-in drain mode for the outpoint batcher (off by default) #865

Merged

freemans13 self-assigned this May 14, 2026

freemans13 changed the title ~~perf(utxo/aerospike): parallelise BatchPreviousOutputsDecorate~~ Look up parent transactions in parallel instead of one at a time May 19, 2026

freemans13 closed this May 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Look up parent transactions in parallel instead of one at a time#864

Look up parent transactions in parallel instead of one at a time#864
freemans13 wants to merge 2 commits into
bsv-blockchain:mainfrom
freemans13:stu/aerospike-batch-prev-outputs-concurrent

freemans13 commented May 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 14, 2026 •

edited

Loading

Uh oh!

sonarqubecloud Bot commented May 14, 2026

Uh oh!

freemans13 commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

freemans13 commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this changes

How it works today

The fix

Benchmark — proof it's actually faster

Why is the default only 4 if cap=64 is 570× faster?

Tests

How this relates to #865

Context

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Comparison Report

Summary

Uh oh!

sonarqubecloud Bot commented May 14, 2026

Quality Gate failed

Uh oh!

freemans13 commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

freemans13 commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading

github-actions Bot commented May 14, 2026 •

edited

Loading