perf(aerospike): fan out BatchPreviousOutputsDecorate per-tx#893
Merged
Conversation
The previous implementation looped per-tx and called PreviousOutputsDecorate sequentially, which on legacy sync forced one OutpointBatcherDurationMillis (default 10 ms, 5 ms on prod) flush per transaction. Each per-tx call only queued ~2 inputs - far below OutpointBatcherSize (100/4096) - so the batcher always waited the full duration before firing. Observed on mainnet legacy sync at height ~381.5k: 2856 tx extendTx 14.34s 2820 tx extendTx 13.84s 2076 tx extendTx 11.18s 1620 tx extendTx 8.31s 997 tx extendTx 5.39s Wall time tracked 5 ms x N_tx almost exactly, confirming the per-tx timer floor as the bottleneck. CPU profile + goroutine dump showed exactly one PreviousOutputsDecorate active and zero sendOutpointBatch workers running during the wait - the aerospike batcher pool sat idle while the loop drained one tx at a time. Flatten every input across every tx into a single Put loop so the batcher fills by size for the bulk of the work, with at most one timer flush for the tail. Channel semantics are unchanged: each errChan is buffered cap 1 and sendErrorAndClose is non-blocking, so returning on first error never blocks in-flight workers. Expected: extendTransactions for a 2856-tx block drops from ~14 s to ~10-50 ms. The SQL store already implements this shape with BatchPreviousOutputsDecorateConcurrency + chunked parallel; this brings the aerospike store in line.
Contributor
|
🤖 Claude Code Review Status: Complete Current Review: The previously reported concurrency issue has been fixed. The code now properly includes Analysis:
No issues found. History:
|
Contributor
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-19 10:35 UTC |
Replace the prior flatten-into-one-Put-loop approach with the simpler fix: wrap the existing per-tx PreviousOutputsDecorate calls in an errgroup so all of them push into the shared outpoint batcher concurrently. The batcher fills by size from concurrent producers instead of idling at its per-tx duration timer waiting for ~2 inputs to accumulate. Same wall-time effect as the flatten variant, but the function body shrinks back to a few lines and we stop having two near-identical loops in the package (PreviousOutputsDecorate already owns the Put/drain logic). Follow-up to review feedback on the prior commit; supersedes that approach.
BatchPreviousOutputsDecorate previously had no direct test in the aerospike package - the prior implementation was a per-tx serial loop that delegated straight to PreviousOutputsDecorate, so coverage on that single-tx path was treated as sufficient. After switching to per-tx fan-out via errgroup, the function now runs concurrent producers against the shared outpoint batcher, which is exactly the path that was untested. Add an aerospike_BatchPreviousOutputsDecorate subtest under TestAerospike covering: - Multiple child txs decorated in one call - asserts every input across every child is populated with the correct satoshis / locking script. - Shared parent across children - same parent vout referenced by two children exercises the batcher's dedup of the underlying aerospike fetch even though goroutines race to request it. - Empty / nil tx slice - no-op. - Already-decorated input preserved - PreviousTxScript != nil inputs must not be overwritten (the early-skip in PreviousOutputsDecorate is what stops Phase 1's txMap work from being clobbered by Phase 2 in the legacy caller). Runs under -race so any data-race introduced by the goroutine fan-out into the per-input slot writes would be caught here.
Cap the per-tx errgroup at UtxoStore.OutpointBatcherSize (default 100, 4096 on prod overrides). The unbounded version was fine in practice - producers spend most of their time parked on errChan receives, and the real ceiling on aerospike round-trips is BatcherMaxConcurrent - but bounding the goroutine count: - Keeps memory predictable on pathologically large blocks (a 20k-tx block would otherwise spawn 20k goroutines, ~160 MB of stack at minimum, plus per-input channel allocations). - Mirrors the Phase 1 errgroup bound in the legacy caller (services/legacy/netsync/handle_block.go:980), so Phase 2 no longer silently 20x's the producer-side parallelism of Phase 1. - Matches the codebase pattern of bounding every errgroup that fans out per-tx or per-record work. Throughput is unchanged: the bound is well above the BatcherMaxConcurrent ceiling that actually gates aerospike batch throughput, and the upstream batcher fills by size as soon as a few hundred producers have pushed. Addresses review feedback on PR bsv-blockchain#893.
icellan
approved these changes
May 19, 2026
|
ordishs
approved these changes
May 19, 2026
This was referenced May 19, 2026
freemans13
added a commit
to freemans13/teranode
that referenced
this pull request
May 22, 2026
Updates from @ordishs and Copilot reviews: - Extend the comment above the drain-mode toggles in aerospike.go to cover the new outpoint case and the post-bsv-blockchain#893 nuance. - Rewrite the OutpointBatcherDrainMode longdesc so the 'when to enable / when to disable' guidance reflects the post-bsv-blockchain#893 fan-out path rather than the pre-bsv-blockchain#893 sequential producer. - Add TestAerospikeOutpointBatcherDrainMode happy-path smoke test asserting the toggle stays wired (per-tx + concurrent decorate).
freemans13
added a commit
to freemans13/teranode
that referenced
this pull request
Jun 2, 2026
…fan-out; align docs Add BenchmarkBatchPreviousOutputsDecorateDrainMode comparing the outpoint batcher's drain mode on vs off across the concurrent BatchPreviousOutputsDecorate path (bsv-blockchain#893 errgroup fan-out), using distinct parents per input so batch width maps to real BatchOperate record counts. Findings (count=10 + benchstat, distinct-parent inputs): drain mode is neutral-to-faster in the mean at every tx count (-8% at 2856 tx to -94% at 16 tx), but bimodal/heavy-tailed at mid tx counts (~64-256). So the concurrent node path keeps drain mode default off (predictable latency), and the clean win is the single-producer, separate-process cmd/rewindblockchain caller. Update the aerospike.go comment and the setting longdesc to match the measured behaviour and to note the batcher is store-wide (the toggle cannot be applied to one caller without affecting the other in the same process).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
stores/utxo/aerospike.(*Store).BatchPreviousOutputsDecoratelooped per-tx and called the per-tx decorate path serially. On legacy sync this forced oneOutpointBatcherDurationMillisflush per transaction (default 10 ms; 5 ms on our mainnet prod nodes) because each per-tx call only queued ~2 inputs - well belowOutpointBatcherSize- so the batcher always waited the full timer before firing.Observed on
bsva-ovh-teranode-eu-3mainnet legacy sync at height ~381.5k:extendTransactionswallWall time tracked
5 ms x N_txalmost exactly. CPU was ~16% of one core; the 30 s CPU profile + goroutine dump showed exactly onePreviousOutputsDecoratein flight and zerosendOutpointBatchworkers running while the loop waited - the aerospike batcher pool sat idle.Change
BatchPreviousOutputsDecoratenow wraps the per-txPreviousOutputsDecoratecalls in anerrgroupso they push into the shared outpoint batcher concurrently. The batcher fills by size from concurrent producers instead of idling at its per-tx duration timer.Existing per-tx behaviour (skip already-decorated inputs, dedup parent fetches across the batch, external-store cache) is unchanged - we just stop serialising the producers.
Expected impact
For a 2856-tx block:
extendTransactions~14 s -> a few aerospike round-trips (~10-50 ms).Callers that benefit:
services/legacy/netsync/handle_block.go:1029- block ingest phase 2services/blockvalidation/quick_validate.go:1081, 1419- catchup decorateTests added
aerospike_BatchPreviousOutputsDecoratesubtest underTestAerospikecovers:PreviousTxScript != nilinputs must not be overwritten (the early-skip inPreviousOutputsDecorateis what stops Phase 1's txMap work from being clobbered by Phase 2 in the legacy caller).The function had no direct test in the aerospike package before this PR.
Test plan
go build ./...cleango vet ./stores/utxo/aerospike/...cleango test -race -count=1 ./stores/utxo/aerospike/...- all subtests pass (524 total, +1 new)bsva-ovh-teranode-eu-3, confirmextendTransactions DONE inlog line drops by at least one order of magnitude on multi-thousand-tx blocks