perf(utxo): eliminate encode-side allocations in the store Create path (#1002) #1011

Merged
oskarszoon merged 12 commits into bsv-blockchain:main from oskarszoon:fix/go-bt-encode
Jun 2, 2026
@oskarszoon

Contributor

What

Cuts avoidable heap allocations in the Aerospike utxo-store Create path — the live source of the encode-side churn from #1002. The repeated TxIDChainHash re-serialization the issue describes is already neutralised on the legacy path (createTxMapSetTxHash); the remaining churn is per-tx/per-output/per-input serialization inside GetBinsToStore.

teranode-only, no go-bt change (v2.6.4 already ships the zero-alloc primitives). No stored byte format, ExtendedSize bin value, or hashing semantics change — output is byte-identical.

Changes

  • util.UTXOHashInto(scratch, ...) — scratch-reusing, zero-alloc UTXO hash. UTXOHash/UTXOHashFromInput/UTXOHashFromOutput become thin wrappers (signatures unchanged).
  • GetUtxoHashes / GetFeesAndUtxoHashes — allocations now O(1) in output count (contiguous []chainhash.Hash backing + reused scratch) instead of O(N). Signatures unchanged.
  • Sizing: output.Size() instead of len(output.Bytes()); arithmetic extendedTxSize(tx) instead of len(tx.ExtendedBytes()) at both measure-only sites — removes two full-tx serializations done purely to read a length.
  • appendOutputInto / appendInputExtendedInto: zero-alloc serialization into a pooled per-batch bt.Arena, reset after BatchOperate. This mirrors the subtreevalidation/arena_pool.go convention from #929 ("fix(blockvalidation): arena-backed tx decode to eliminate catch-up OOM"). The multi-record / external escape paths rebuild bins heap-owned so the arena reset can't corrupt the fire-and-forget goroutines that outlive sendStoreBatch.
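
The scratch-reuse and contiguous-backing ideas in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `hashInto`, `hashAll`, the `output` type, and the preimage layout (txid, vout, script, satoshis) are all assumptions made up for the example, and the real `util.UTXOHashInto` signature and preimage may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"testing"
)

// output is a hypothetical stand-in for a transaction output.
type output struct {
	script []byte
	sats   uint64
}

// hashInto builds an illustrative preimage (txid | vout | script | satoshis)
// inside scratch and returns its SHA-256 together with the possibly grown
// scratch buffer, so the caller can pass it to the next call.
// The field layout is an assumption, not teranode's actual preimage.
func hashInto(scratch []byte, txid [32]byte, vout uint32, script []byte, sats uint64) ([32]byte, []byte) {
	b := scratch[:0] // reuse the capacity, discard the old contents
	b = append(b, txid[:]...)
	b = binary.LittleEndian.AppendUint32(b, vout)
	b = append(b, script...)
	b = binary.LittleEndian.AppendUint64(b, sats)
	return sha256.Sum256(b), b
}

// hashAll hashes every output using one contiguous result slice and one
// scratch buffer, so the number of allocations does not grow with len(outs)
// as long as the scratch capacity is sufficient.
func hashAll(txid [32]byte, outs []output) [][32]byte {
	hashes := make([][32]byte, len(outs))
	scratch := make([]byte, 0, 256)
	for i, o := range outs {
		hashes[i], scratch = hashInto(scratch, txid, uint32(i), o.script, o.sats)
	}
	return hashes
}

func main() {
	var txid [32]byte
	script := []byte{0x76, 0xa9, 0x14}
	scratch := make([]byte, 0, 256)

	// Measure allocations per call when the scratch buffer is reused.
	allocs := testing.AllocsPerRun(100, func() {
		_, scratch = hashInto(scratch, txid, 0, script, 1000)
	})
	fmt.Println("allocs per hashInto call:", allocs)
	fmt.Println("hashes computed:", len(hashAll(txid, []output{{script, 1}, {script, 2}})))
}
```

The measured allocation count depends on the Go toolchain, which is why the sketch prints it rather than asserting a value; the PR's own alloc gates are the authoritative numbers.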

Verification

  • Byte-equivalence (property/golden): appendOutputInto == Output.Bytes(), appendInputExtendedInto == prior manual layout, extendedTxSize == len(ExtendedBytes()) (incl. >252B scripts, coinbase, decoded-tx shape), UTXOHashInto == UTXOHash, and arena-vs-nil bins byte-identical across coinbase / multi-output / OP_RETURN / no-input shapes.
  • Alloc gates: UTXOHashInto 0 allocs/op with scratch; GetUtxoHashes flat in output count (195→4 allocs for 64 outputs); BenchmarkAppendOutputInto_Arena 0 allocs/op.
  • Lifetime: -race test on concurrent arena reuse + the escape-rebuild path. Arena bytes are only referenced until BatchOperate (NewBytesValue does not copy); reset is deferred to after it returns.
  • go build ./... clean; touched-package tests + go vet + staticcheck green.
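
The byte-equivalence checks for sizing can be pictured with a small property test. The sketch below assumes the standard Bitcoin output layout (8-byte little-endian value, variable-length integer script length, script bytes); `varIntLen`, `appendVarInt`, `outputSize`, and `appendOutput` are illustrative helpers, not the go-bt or teranode functions. It shows why scripts longer than 252 bytes deserve their own case: the length prefix grows from 1 to 3 bytes at that boundary.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// varIntLen returns the encoded length of a Bitcoin-style variable-length integer.
func varIntLen(n uint64) int {
	switch {
	case n < 0xfd:
		return 1
	case n <= 0xffff:
		return 3
	case n <= 0xffffffff:
		return 5
	default:
		return 9
	}
}

// appendVarInt appends the variable-length encoding of n to b.
func appendVarInt(b []byte, n uint64) []byte {
	switch {
	case n < 0xfd:
		return append(b, byte(n))
	case n <= 0xffff:
		return binary.LittleEndian.AppendUint16(append(b, 0xfd), uint16(n))
	case n <= 0xffffffff:
		return binary.LittleEndian.AppendUint32(append(b, 0xfe), uint32(n))
	default:
		return binary.LittleEndian.AppendUint64(append(b, 0xff), n)
	}
}

// outputSize computes the serialized size arithmetically, without serializing.
func outputSize(script []byte) int {
	return 8 + varIntLen(uint64(len(script))) + len(script)
}

// appendOutput serializes an output into b and returns the extended slice.
func appendOutput(b []byte, sats uint64, script []byte) []byte {
	b = binary.LittleEndian.AppendUint64(b, sats)
	b = appendVarInt(b, uint64(len(script)))
	return append(b, script...)
}

func main() {
	// Check the arithmetic size against the real serialized length,
	// including both sides of the 252/253-byte varint boundary.
	for _, n := range []int{0, 25, 252, 253, 70000} {
		script := make([]byte, n)
		fmt.Println(n, outputSize(script) == len(appendOutput(nil, 1, script)))
	}
}
```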

Test plan

  • CI: full aerospike testcontainer suite — the integration tests need a live Aerospike + grpc endpoints and don't run in a bare local env.
  • Post-merge profile on a busy legacy node during IBD: appendTo + toBytesHelper + Output.Bytes should drop from ~50% to <20% of cumulative alloc_space. This is the acceptance criterion of #1002 ("go-bt encode-side allocations still 50% of legacy alloc churn after #929 (decode arena)"). Mainnet IBD isn't reproducible locally, so the alloc-gate benchmarks stand in for it pre-merge.

Addresses #1002.

@github-actions

github-actions Bot commented Jun 1, 2026

Contributor

🤖 Claude Code Review

Status: Complete

No issues found. The PR implements a clean performance optimization with excellent test coverage and proper safety mechanisms.

History:

  • ✅ Fixed: Previously flagged concern about nil previousTxIDHash handling has been addressed with correction logic and test coverage in commit 7cb5579

Comment thread stores/utxo/aerospike/create.go
@github-actions

github-actions Bot commented Jun 1, 2026

Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1011 (39aff4c)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.630µ 1.665µ ~ 0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.39n 71.08n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.28n 71.36n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.07n 71.31n ~ 0.300
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 32.77n 32.48n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 56.37n 54.62n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 127.4n 125.8n ~ 0.200
MiningCandidate_Stringify_Short-4 219.1n 219.2n ~ 1.000
MiningCandidate_Stringify_Long-4 1.631µ 1.629µ ~ 1.000
MiningSolution_Stringify-4 847.3n 852.3n ~ 0.100
BlockInfo_MarshalJSON-4 1.737µ 1.751µ ~ 0.200
NewFromBytes-4 124.6n 125.9n ~ 1.000
AddTxBatchColumnar_Validation-4 2.503µ 2.503µ ~ 1.000
OffsetValidationLoop-4 545.0n 550.2n ~ 1.000
Mine_EasyDifficulty-4 60.53µ 61.75µ ~ 0.200
Mine_WithAddress-4 7.204µ 6.933µ ~ 0.200
DiskTxMap_SetIfNotExists-4 3.521µ 3.599µ ~ 0.700
DiskTxMap_SetIfNotExists_Parallel-4 3.434µ 3.308µ ~ 0.100
DiskTxMap_ExistenceOnly-4 318.4n 312.1n ~ 0.100
Queue-4 196.3n 190.2n ~ 0.100
AtomicPointer-4 4.757n 4.421n ~ 0.400
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 931.5µ 879.4µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 793.5µ 801.5µ ~ 0.200
ReorgOptimizations/AllMarkFalse/Old/10K-4 106.9µ 108.4µ ~ 1.000
ReorgOptimizations/AllMarkFalse/New/10K-4 62.29µ 61.97µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/10K-4 58.91µ 61.29µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/10K-4 11.59µ 11.48µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/10K-4 4.980µ 4.639µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.838µ 1.577µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 10.188m 9.996m ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.22m 10.06m ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.141m 1.197m ~ 0.700
ReorgOptimizations/AllMarkFalse/New/100K-4 684.6µ 679.6µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/100K-4 625.8µ 560.0µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 298.5µ 302.4µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/100K-4 49.33µ 49.85µ ~ 1.000
ReorgOptimizations/NodeFlags/New/100K-4 17.43µ 18.26µ ~ 0.100
TxMapSetIfNotExists-4 52.36n 52.78n ~ 0.100
TxMapSetIfNotExistsDuplicate-4 40.64n 39.90n ~ 0.100
ChannelSendReceive-4 614.8n 644.3n ~ 0.100
BlockAssembler_AddTx-4 0.02634n 0.02754n ~ 0.700
AddNode-4 11.05 11.48 ~ 0.200
AddNodeWithMap-4 11.84 12.29 ~ 0.100
DirectSubtreeAdd/4_per_subtree-4 56.41n 59.78n ~ 0.700
DirectSubtreeAdd/64_per_subtree-4 29.48n 28.90n ~ 0.200
DirectSubtreeAdd/256_per_subtree-4 27.72n 27.82n ~ 0.300
DirectSubtreeAdd/1024_per_subtree-4 26.49n 26.53n ~ 0.400
DirectSubtreeAdd/2048_per_subtree-4 26.14n 26.14n ~ 0.800
SubtreeProcessorAdd/4_per_subtree-4 292.1n 293.7n ~ 1.000
SubtreeProcessorAdd/64_per_subtree-4 286.5n 286.3n ~ 0.700
SubtreeProcessorAdd/256_per_subtree-4 285.1n 284.3n ~ 0.700
SubtreeProcessorAdd/1024_per_subtree-4 276.8n 274.5n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 277.0n 276.7n ~ 0.700
SubtreeProcessorRotate/4_per_subtree-4 285.2n 280.3n ~ 0.400
SubtreeProcessorRotate/64_per_subtree-4 285.4n 279.9n ~ 0.400
SubtreeProcessorRotate/256_per_subtree-4 277.6n 278.9n ~ 0.700
SubtreeProcessorRotate/1024_per_subtree-4 280.1n 278.0n ~ 0.700
SubtreeNodeAddOnly/4_per_subtree-4 55.34n 55.00n ~ 0.100
SubtreeNodeAddOnly/64_per_subtree-4 36.19n 36.15n ~ 0.100
SubtreeNodeAddOnly/256_per_subtree-4 35.15n 35.21n ~ 1.000
SubtreeNodeAddOnly/1024_per_subtree-4 34.63n 34.55n ~ 0.100
SubtreeCreationOnly/4_per_subtree-4 110.3n 109.8n ~ 0.200
SubtreeCreationOnly/64_per_subtree-4 350.9n 347.3n ~ 1.000
SubtreeCreationOnly/256_per_subtree-4 1.228µ 1.235µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 3.792µ 3.824µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 6.793µ 6.840µ ~ 0.700
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 284.7n 279.4n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 281.7n 284.0n ~ 0.400
ParallelGetAndSetIfNotExists/1k_nodes-4 2.010m 2.023m ~ 0.700
ParallelGetAndSetIfNotExists/10k_nodes-4 5.221m 5.323m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 7.323m 7.287m ~ 1.000
ParallelGetAndSetIfNotExists/100k_nodes-4 10.005m 9.562m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 1.805m 1.821m ~ 1.000
SequentialGetAndSetIfNotExists/10k_nodes-4 4.473m 4.921m ~ 0.200
SequentialGetAndSetIfNotExists/50k_nodes-4 13.88m 13.45m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 25.34m 24.73m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.076m 2.065m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.451m 8.503m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.60m 13.46m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.828m 1.826m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 8.236m 8.308m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 46.27m 43.79m ~ 0.100
CalcBlockWork-4 544.3n 532.4n ~ 0.100
CalculateWork-4 737.2n 759.1n ~ 1.000
BuildBlockLocatorString_Helpers/Size_10-4 1.353µ 1.359µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 13.05µ 13.34µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_1000-4 159.7µ 130.5µ ~ 0.700
CatchupWithHeaderCache-4 104.5m 104.5m ~ 0.400
_prepareTxsPerLevel-4 406.5m 411.7m ~ 0.700
_prepareTxsPerLevelOrdered-4 3.528m 3.892m ~ 0.100
_prepareTxsPerLevel_Comparison/Original-4 412.5m 419.2m ~ 0.100
_prepareTxsPerLevel_Comparison/Optimized-4 3.450m 3.661m ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.415m 1.400m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 326.8µ 322.3µ ~ 0.400
SubtreeSizes/10k_tx_64_per_subtree-4 79.29µ 76.71µ ~ 0.100
SubtreeSizes/10k_tx_256_per_subtree-4 19.86µ 19.42µ ~ 0.100
SubtreeSizes/10k_tx_512_per_subtree-4 9.801µ 9.577µ ~ 0.100
SubtreeSizes/10k_tx_1024_per_subtree-4 4.848µ 4.773µ ~ 0.200
SubtreeSizes/10k_tx_2k_per_subtree-4 2.425µ 2.375µ ~ 0.200
BlockSizeScaling/10k_tx_64_per_subtree-4 77.70µ 76.84µ ~ 0.200
BlockSizeScaling/10k_tx_256_per_subtree-4 19.59µ 19.47µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.891µ 4.740µ ~ 0.100
BlockSizeScaling/50k_tx_64_per_subtree-4 410.3µ 395.2µ ~ 0.200
BlockSizeScaling/50k_tx_256_per_subtree-4 96.66µ 96.03µ ~ 0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4 24.01µ 23.88µ ~ 1.000
SubtreeAllocations/small_subtrees_exists_check-4 163.6µ 168.3µ ~ 0.100
SubtreeAllocations/small_subtrees_data_fetch-4 171.6µ 166.6µ ~ 0.100
SubtreeAllocations/small_subtrees_full_validation-4 337.8µ 332.7µ ~ 0.700
SubtreeAllocations/medium_subtrees_exists_check-4 9.559µ 9.579µ ~ 1.000
SubtreeAllocations/medium_subtrees_data_fetch-4 10.164µ 9.764µ ~ 0.100
SubtreeAllocations/medium_subtrees_full_validation-4 19.53µ 19.29µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.322µ 2.307µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.473µ 2.364µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 4.904µ 4.840µ ~ 0.100
_BufferPoolAllocation/16KB-4 4.277µ 5.506µ ~ 0.100
_BufferPoolAllocation/32KB-4 10.883µ 8.707µ ~ 0.200
_BufferPoolAllocation/64KB-4 17.35µ 16.88µ ~ 0.100
_BufferPoolAllocation/128KB-4 35.38µ 34.30µ ~ 0.100
_BufferPoolAllocation/512KB-4 144.3µ 141.7µ ~ 0.200
_BufferPoolConcurrent/32KB-4 23.83µ 20.55µ ~ 0.200
_BufferPoolConcurrent/64KB-4 31.06µ 31.11µ ~ 0.700
_BufferPoolConcurrent/512KB-4 157.7µ 159.3µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/16KB-4 728.3µ 664.3µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/32KB-4 745.3µ 660.1µ ~ 0.200
_SubtreeDeserializationWithBufferSizes/64KB-4 745.8µ 673.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 727.6µ 656.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 619.4µ 630.7µ ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4 37.91m 37.84m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/32KB-4 37.32m 37.18m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/64KB-4 37.23m 37.19m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/128KB-4 37.41m 36.74m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/512KB-4 36.97m 36.92m ~ 0.400
_PooledVsNonPooled/Pooled-4 837.9n 832.6n ~ 0.100
_PooledVsNonPooled/NonPooled-4 8.237µ 8.505µ ~ 1.000
_MemoryFootprint/Current_512KB_32concurrent-4 7.698µ 7.337µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 10.59µ 10.13µ ~ 0.200
_MemoryFootprint/Alternative_64KB_32concurrent-4 10.473µ 9.987µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 335.1µ 339.6µ ~ 0.200
StoreBlock_Sequential/AboveCSVHeight-4 343.1µ 341.1µ ~ 1.000
GetUtxoHashes-4 274.2n 282.9n ~ 0.100
GetUtxoHashes_ManyOutputs-4 45.31µ 46.96µ ~ 0.100
_NewMetaDataFromBytes-4 215.1n 214.8n ~ 1.000
_Bytes-4 399.4n 396.1n ~ 0.700
_MetaBytes-4 140.2n 137.5n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-06-01 20:52 UTC

@sonarqubecloud

sonarqubecloud Bot commented Jun 1, 2026


@ordishs ordishs left a comment

Collaborator

Approved. Verified byte-equivalence of all hand-rolled serializers against go-bt v2.6.4 source (appendOutputInto ≡ Output.Bytes, appendInputExtendedInto ≡ Input.Bytes + extended suffix, extendedTxSize ≡ len(ExtendedBytes) including the −32 nil-txid correction, VarInt.AppendTo ≡ Bytes, UTXOHashInto preimage unchanged).

Arena lifetime is safe: Arena.Alloc's grow path orphans the old slab (copies, doesn't overwrite), so earlier bins stay valid if the arena grows mid-batch; deferred putCreateArena fires only after BatchOperate returns; multi-record and RECORD_TOO_BIG escape paths rebuild with nil arena before handing bins to goroutines.
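
The lifetime argument can be illustrated with a toy bump allocator. This is a hypothetical `arena` type written for the example, not go-bt's `bt.Arena`: here the grow path simply switches to a fresh slab and leaves the old one untouched, which is one way to obtain the "earlier slices stay valid" property described above. It also shows why a reset must wait until nothing references the arena's bytes any more.

```go
package main

import "fmt"

// arena is a hypothetical bump allocator used only for illustration.
type arena struct {
	slab []byte
	off  int
}

// Alloc returns n bytes from the current slab. When the slab is full it
// switches to a fresh, larger slab; the old slab is not overwritten, so
// slices handed out earlier keep their contents.
func (a *arena) Alloc(n int) []byte {
	if a.off+n > len(a.slab) {
		size := 2 * len(a.slab)
		if size < n {
			size = n
		}
		a.slab = make([]byte, size)
		a.off = 0
	}
	b := a.slab[a.off : a.off+n : a.off+n]
	a.off += n
	return b
}

// Reset makes the current slab reusable. Slices handed out from it before
// the reset will be overwritten by later Alloc calls.
func (a *arena) Reset() { a.off = 0 }

func main() {
	a := &arena{slab: make([]byte, 8)}

	first := a.Alloc(6)
	copy(first, "bin-01")
	second := a.Alloc(6) // does not fit in the 8-byte slab, so the arena grows
	copy(second, "bin-02")
	fmt.Printf("%s %s\n", first, second) // prints "bin-01 bin-02"

	a.Reset()
	third := a.Alloc(6)
	copy(third, "bin-03")
	fmt.Printf("%s\n", second) // prints "bin-03": second now aliases third
}
```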

Local checks green: go build, go vet, gci diff (clean), go test -race on concurrent arena reuse, and the new encode/arena/hash unit tests.

Non-blocking: consider a one-line comment at the appendInputExtendedInto size calc noting the deliberate 32-byte over-allocation for nil previousTxIDHash (mirrors the note in extendedTxSize). Please ensure the full Aerospike testcontainer CI suite runs green before merge.

@oskarszoon oskarszoon merged commit b9c8cc4 into bsv-blockchain:main Jun 2, 2026
36 of 37 checks passed
freemans13 added a commit to freemans13/teranode that referenced this pull request Jun 2, 2026
Upstream bsv-blockchain#1011 (perf: eliminate encode-side allocations) added an
arena *bt.Arena parameter to GetBinsToStore. The batch-create path has no
per-batch arena to reuse, so it passes nil (heap-backed bins), matching
upstream's non-hot-path callers.
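
A rough sketch of the nil-arena convention mentioned in this commit message: when no arena is supplied, the encoder returns heap-owned bytes that stay valid regardless of any arena reset. The `arena` type and `encode` function below are assumptions made for illustration; they are not the signature of `GetBinsToStore` or of `bt.Arena`.

```go
package main

import "fmt"

// arena is a hypothetical append-only buffer used only for illustration.
type arena struct{ buf []byte }

// Reset makes the buffer reusable; earlier slices will be overwritten.
func (a *arena) Reset() { a.buf = a.buf[:0] }

// encode copies payload into the arena when one is supplied, and into a
// fresh heap slice when a is nil.
func encode(a *arena, payload string) []byte {
	if a == nil {
		return append([]byte(nil), payload...) // heap-owned copy
	}
	start := len(a.buf)
	a.buf = append(a.buf, payload...)
	return a.buf[start:len(a.buf):len(a.buf)]
}

func main() {
	a := &arena{buf: make([]byte, 0, 64)}
	pooled := encode(a, "tx-A")  // backed by the arena
	owned := encode(nil, "tx-A") // backed by its own heap slice

	done := make(chan string)
	release := make(chan struct{})
	go func() {
		<-release // runs only after the arena has been reset and reused
		done <- string(owned)
	}()

	a.Reset()
	encode(a, "tx-B") // overwrites the bytes that pooled still points at
	close(release)

	fmt.Println(<-done, string(pooled)) // prints "tx-A tx-B"
}
```

The goroutine stands in for work that outlives the batch: it only ever sees the heap-owned copy, so reusing the arena cannot change what it reads.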
@oskarszoon oskarszoon self-assigned this Jun 2, 2026

3 participants