Look up parent transactions in parallel instead of one at a time#864
Look up parent transactions in parallel instead of one at a time#864freemans13 wants to merge 2 commits into
Conversation
Aerospike's BatchPreviousOutputsDecorate ran PreviousOutputsDecorate sequentially per tx. PreviousOutputsDecorate enqueues each input into the shared outpointBatcher and then waits for its inputs' results; running it concurrently across txs lets the batcher coalesce inputs from many transactions into single Aerospike batch reads — the whole point of having a shared batcher. The sequential loop submitted one tx's inputs, drained them, then moved on, so the batcher only ever saw one tx's inputs at a time and sized batches accordingly. On large blocks that left most of the batcher's capacity unused on every flush. Fix: dispatch each tx in its own goroutine via errgroup, bounded by UtxoStore.BatchPreviousOutputsDecorateConcurrency (the same knob the SQL store already uses, default 4). The default is intentionally low — each goroutine is mostly waiting on batcher result channels, so small fan-out is enough to keep the batcher saturated without spawning thousands of goroutines on huge blocks. Originally bundled in bsv-blockchain#723; pulled out as a focused PR after bsv-blockchain#723's core change (CATCHINGBLOCKS quick-validation gating) was superseded by bsv-blockchain#847. The original PR used OutpointBatcherSize as the concurrency cap, which would default to 1024 — known to trigger OOM on legacy sync. Reusing the SQL store's existing setting keeps the operator surface identical and the default safe. Tests: - go build ./... - go vet ./stores/utxo/aerospike/... - go test -race -short ./stores/utxo/aerospike/ — 464 passed (TestStore_IncrementSpentRecords passes standalone; an earlier full-suite timeout was unrelated to this change) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🤖 Claude Code Review Status: Complete Current Review:
Summary: The implementation is sound: it properly bounds goroutines using Code follows AGENTS.md requirements for minimal changes, clear documentation, and correct error handling. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-14 11:38 UTC |
…s concurrent Direct head-to-head benchmark of the old sequential per-tx loop against the new concurrent dispatch, holding everything else equal. Uses the real go-batcher with production defaults (100 items / 10 ms timer) and stubs the Aerospike flush callback with a 500 µs simulated batch round-trip — modelling per-batch latency that's independent of batch size, which matches Aerospike's real behaviour. On a 1000-tx × 3-input block, Apple M3 Max: sequential 10.68 s/op (baseline) concurrent_1 10.72 s/op 1.00x (sanity check) concurrent_4 2.68 s/op 3.99x (PR default) concurrent_16 0.68 s/op 15.8x concurrent_64 0.019 s/op 570x concurrent_1 matches sequential almost exactly, confirming the new implementation behaves identically to the old one when concurrency=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
Closing as superseded by #893, which landed essentially the same fix (concurrent fan-out of One thing worth flagging from this PR's history: #893 bounds the fan-out by The benchmark added here ( |


What this changes
When a block arrives, we need to look up the parent transaction outputs (amount + locking script) for every input in every transaction. Today we do this one transaction at a time. This PR does several transactions in parallel.
How it works today
Aerospike has a batcher in front of it. Instead of firing one call per input, the batcher collects inputs and sends them off in a single batch when either:
Good design — but the old code fed the batcher one transaction at a time:
A typical tx only has 2 or 3 inputs, nowhere near the 100-input size cap. So the batcher always fell back to the 10 ms timer before flushing. Every transaction paid that 10 ms wait. On a 1000-tx block that's about 10 seconds of pure timer waiting, even if Aerospike itself is fast.
The fix
Dispatch each transaction in its own goroutine, bounded by
UtxoStore.BatchPreviousOutputsDecorateConcurrency(the same knob the SQL store already uses, default 4). Now several transactions feed inputs into the shared batcher at the same time, so:The default of 4 is intentionally cautious. Each goroutine is mostly blocked waiting for batcher results, so a small fan-out is enough to keep the batcher busy. Operators can dial it up.
Benchmark — proof it's actually faster
stores/utxo/aerospike/batch_decorate_bench_test.godrives the realBatchPreviousOutputsDecorateagainst the realgo-batcherwith production-default settings (100 items / 10 ms timer). The only thing faked is the Aerospike call itself — replaced with a 500 µs sleep that models a real Aerospike round-trip.1000-tx × 3-input block, Apple M3 Max:
Two things to take from this:
utxostore_batchPreviousOutputsDecorateConcurrency=1restores the old behaviour.Why is the default only 4 if cap=64 is 570× faster?
Two reasons:
OutpointBatcherSize=1024as the concurrency cap, which triggered a known legacy-sync OOM. Default 4 has no memory risk; operators with headroom can raise it.Tests
go build ./...cleango vet ./stores/utxo/aerospike/...cleango test -race -short ./stores/utxo/aerospike/— 464 passedHow this relates to #865
Both PRs attack the same 10 ms timer problem, at different layers:
They compose. Drain mode plus concurrent dispatch is faster than either alone.
Context
Originally bundled in #723. Pulled out as a focused PR after #723's core change (broadening
quickValidationModetoCATCHINGBLOCKS) was superseded by #847.