
fix(blockvalidation): arena-backed tx decode to eliminate catch-up OOM #929

Merged
oskarszoon merged 10 commits into bsv-blockchain:main from oskarszoon:fix/teranode-blockvalid-mem on May 22, 2026

@oskarszoon oskarszoon commented May 21, 2026

Contributor

Summary

Closes #920. Eliminates the per-script make([]byte, l) allocation hotspot in go-bt tx decode, which accounted for ~70% of the heap during testnet catch-up sync and caused an OOM at ~13 GiB RSS around block 5000.

Uses the new bt.Arena + Tx.ReadFromWithArena + Tx.HashTxIDInto APIs that landed in go-bt v2.6.4 (bsv-blockchain/go-bt#146).

What's in this PR

  • Bumps go-bt to v2.6.4 (introduces bt.Arena, *.ReadFromWithArena, Tx.HashTxIDInto, and a 1 GiB script-length cap on existing ReadFrom paths).
  • New per-service sync.Pool of bt.Arena in subtreevalidation, blockpersister, and asset/repository. Arenas are taken with Get at decode start and returned with Put on completion; Put runs ResetAndShrink(64 MiB) so a one-off large decode doesn't bloat the idle pool footprint.
  • Migrated hot decode paths:
    • subtreevalidation/check_block_subtrees.go: readTransactionsFromSubtreeDataStream is the catch-up critical loop. Per-subtree arena allocated in each parallel goroutine, kept alive in a batchArenas slice, returned to the pool only after processTransactionsInLevels consumes the batch's txs.
    • blockpersister/streaming_process_subtree.go: per-subtree arena around ProcessSubtreeUTXOStreaming's decode loop.
    • asset/repository/GetLegacyBlock.go: per-block arena around the streaming-reconstruction loop, with arena.Reset() between successful tx.WriteTo calls — peak bounded by the largest single tx, not the cumulative block size.
  • Hashing migrated to tx.HashTxIDInto (reusable scratch buffer) in the bulk-stream sites — eliminates the tx.Bytes() per-call allocation that contributed 524 MB to the original heap profile.
  • Single-tx call sites that retain *bt.Tx past the function frame (subtreevalidation.readTxFromReader, legacy/netsync/handle_block.go coinbase decode, legacy/netsync/manager.go inbound tx decode) intentionally stay on the standard bt.NewTxFromBytes / tx.ReadFrom path with an inline comment explaining why.
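The pool contract described in the list above can be sketched as follows. This is a minimal illustration, not the go-bt implementation: slabArena, getArena, putArena and shrinkCap are hypothetical names, and the real bt.Arena may chain chunks or grow differently. Only the Get/Put lifecycle and the 64 MiB shrink cap come from the PR text.

```go
package main

import (
	"fmt"
	"sync"
)

// slabArena is a hypothetical stand-in for bt.Arena: a bump allocator that
// carves byte slices out of one backing slab.
type slabArena struct {
	slab []byte
	off  int
}

// Alloc returns n bytes from the slab, starting a larger slab when it is full.
// Slices handed out earlier keep the previous slab alive until they are dropped.
func (a *slabArena) Alloc(n int) []byte {
	if a.off+n > len(a.slab) {
		size := 2 * len(a.slab)
		if size < n {
			size = n
		}
		a.slab = make([]byte, size)
		a.off = 0
	}
	b := a.slab[a.off : a.off+n : a.off+n]
	a.off += n
	return b
}

// ResetAndShrink rewinds the arena and releases the slab if it exceeds maxBytes.
func (a *slabArena) ResetAndShrink(maxBytes int) {
	a.off = 0
	if len(a.slab) > maxBytes {
		a.slab = nil
	}
}

// shrinkCap mirrors the 64 MiB figure quoted in the PR description.
const shrinkCap = 64 << 20

var arenaPool = sync.Pool{New: func() any { return new(slabArena) }}

// getArena and putArena are hypothetical helper names.
func getArena() *slabArena { return arenaPool.Get().(*slabArena) }

func putArena(a *slabArena) {
	a.ResetAndShrink(shrinkCap)
	arenaPool.Put(a)
}

func main() {
	a := getArena()
	script := a.Alloc(25) // decode work would fill this
	fmt.Println(len(script), len(a.slab))
	putArena(a) // neither a nor script may be used after this point
}
```

Shrinking on Put rather than on Get keeps the cost off the decode path and means an idle pool never holds a slab larger than the cap.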

Performance

TestReadTransactionsFromSubtreeDataStream_MemoryBoundWithArena (synthetic; skipped under -short):

  • Wire size: 97 MiB (1000 txs × 100 KB OP_RETURN each)
  • Post-decode HeapInuse delta: 28 MiB (held in arena slab + tx struct overhead)
  • Without arena, the delta would scale 1:1 with wire size

Concrete heap bound: arena_pool uses ResetAndShrink(64 MiB). With up to 8 concurrent catch-up workers (per #911), the idle pool baseline is ≤ 8 × 64 MiB = 512 MiB; peak is ≤ N × max-subtree-script-bytes + struct overhead, where N is the number of concurrent workers.
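The arithmetic behind these figures, as a small sketch. Two assumptions are mine, not the PR's: that the pool retains at most one arena per worker, and that "100 KB" in the test description means 100 KiB (which is what makes the wire size come out near 97 MiB). The function names are hypothetical.

```go
package main

import "fmt"

const MiB = 1 << 20

// idlePoolBound is the worst-case idle footprint if every worker's arena is
// retained at exactly the shrink cap (assumes one arena per worker).
func idlePoolBound(workers, shrinkCapBytes int) int {
	return workers * shrinkCapBytes
}

// wireSizeMiB converts a count of equally sized payloads into MiB.
func wireSizeMiB(txCount, payloadBytes int) float64 {
	return float64(txCount*payloadBytes) / MiB
}

func main() {
	fmt.Println(idlePoolBound(8, 64*MiB) / MiB)    // 512
	fmt.Printf("%.1f\n", wireSizeMiB(1000, 100*1024)) // 97.7
}
```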

Backwards compatibility

  • go-bt v2.6.4 is additive — no breaking changes for any consumer.
  • Migrated functions in teranode: processSubtreeDataStream and extractAndCollectTransactions gained a *bt.Arena parameter; all internal callers updated. Test files pass nil to preserve the heap-allocating path.
  • All public method signatures on *Server remain stable.
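One way a nil arena can preserve the heap-allocating path is shown below. This is a sketch under assumptions: the arena type and allocScript are hypothetical, and the PR does not show how go-bt or teranode actually branch on a nil *bt.Arena.

```go
package main

import "fmt"

// arena is a hypothetical fixed-capacity bump allocator standing in for bt.Arena.
type arena struct {
	buf []byte
	off int
}

// allocScript returns n bytes from a when an arena is supplied and has room,
// and falls back to an ordinary heap allocation otherwise (including a == nil).
func allocScript(a *arena, n int) []byte {
	if a == nil || a.off+n > len(a.buf) {
		return make([]byte, n)
	}
	b := a.buf[a.off : a.off+n : a.off+n]
	a.off += n
	return b
}

func main() {
	a := &arena{buf: make([]byte, 64)}
	fromArena := allocScript(a, 16)
	fromHeap := allocScript(nil, 16) // what a test passing nil would exercise
	fmt.Println(len(fromArena), len(fromHeap), a.off)
}
```

Callers that pass nil get slices that are safe to retain indefinitely, which is why existing tests need no lifetime changes.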

Verification

  • go vet ./... on touched packages — clean
  • go test -count=1 -tags testtxmetacache ./... on touched packages — pass
  • go test -count=1 -race -tags testtxmetacache ./services/subtreevalidation/... ./services/blockpersister/... — pass
  • TestArenaPool_GetPutLifecycle, TestArenaPool_ShrinkAfterLargeUse — verify pool contract
  • TestReadTransactionsFromSubtreeDataStream_MemoryBoundWithArena — 28 MiB delta on 97 MiB stream
  • Manual: deploy to bsva-ovh-teranode-ttn-eu-3, restart fresh, catch up past block ~5000, confirm RSS < 8 GiB sustained and go-bt.(*Output).ReadFrom no longer in heap top-10 inuse.

Upstream

go-bt PR: bsv-blockchain/go-bt#146 (merged + released as v2.6.4)

Brings in bt.Arena, *.ReadFromWithArena, Tx.HashTxIDInto, and a
script-length cap (1 GiB) on existing ReadFrom paths. No teranode
code uses the new API yet; subsequent commits migrate hot decode
paths in subtreevalidation, blockpersister, asset, and legacy/netsync
to per-subtree/per-block arena pools.

sync.Pool of bt.Arena instances reused across subtree decode loops.
Get on decode start, Put on completion — putSubtreeArena runs
ResetAndShrink(64 MiB) so a one-off oversized decode doesn't bloat
the pool's idle footprint.

Hot-path migration in subsequent commits.

readTransactionsFromSubtreeDataStream now decodes via tx.ReadFromWithArena
and hashes via tx.HashTxIDInto with a reusable scratch buffer. Each
parallel subtree-decode goroutine gets its own arena via the sync.Pool
helper added in the previous commit; arenas are released after
processTransactionsInLevels consumes the batch's txs.
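The scratch-buffer hashing idea can be illustrated with the sketch below. It is not the Tx.HashTxIDInto implementation, whose signature is not shown in this PR: fakeTx, appendBytes and hashInto are hypothetical, and the only fact relied on is that a transaction hash is a double SHA-256 over the serialised bytes.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// fakeTx is a hypothetical stand-in for bt.Tx holding pre-serialised bytes.
type fakeTx struct{ payload []byte }

// appendBytes serialises the tx by appending to dst and returns the extended slice.
func (t *fakeTx) appendBytes(dst []byte) []byte { return append(dst, t.payload...) }

// hashInto computes the double SHA-256 of the serialised tx, reusing *scratch
// as the serialisation buffer so repeated calls need not allocate a new one.
func hashInto(t *fakeTx, scratch *[]byte) [32]byte {
	*scratch = t.appendBytes((*scratch)[:0])
	first := sha256.Sum256(*scratch)
	return sha256.Sum256(first[:])
}

func main() {
	var scratch []byte // grows to the largest tx seen, then stays put
	for _, p := range []string{"first tx bytes", "second"} {
		h := hashInto(&fakeTx{payload: []byte(p)}, &scratch)
		fmt.Printf("%x cap=%d\n", h[:4], cap(scratch))
	}
}
```

The buffer's capacity settles at the largest serialised tx in the stream, replacing one allocation per tx with one per high-water mark.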

processSubtreeDataStream and extractAndCollectTransactions gain a
*bt.Arena parameter to thread the per-batch arena through.

Covers the catch-up critical hot path that bsv-blockchain#920 identified.

The per-call arena pattern would force a defensive copy of script bytes
before return, defeating the win. Bulk-stream sites (the catch-up hot
loop in check_block_subtrees.go) are where the arena pays off; one-shot
missing-tx fetches stay on the standard ReadFrom path.

ProcessSubtreeUTXOStreaming now decodes via tx.ReadFromWithArena and
hashes the integrity check via tx.HashTxIDInto with a reusable scratch.
A new per-package sync.Pool of bt.Arena instances is allocated once at
loop start and Put when the loop returns — tx pointers are consumed
inside each iteration (utxoDiff.ProcessTx copies fields into a UTXO
struct that is then serialised + written), so the arena lifetime
contract is satisfied.

GetLegacyBlock's inner tx streaming loop now decodes via
tx.ReadFromWithArena and calls arena.Reset between txs so peak memory
is bounded by the largest single tx, not the cumulative block size.
The arena is acquired once at the top of the streaming goroutine and
returned to the pool on exit.
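Why a reset between transactions changes the bound can be seen in a toy simulation. peakBytes is a hypothetical helper that only tracks byte counts; it does not model the real arena's slab growth.

```go
package main

import "fmt"

// peakBytes simulates an arena's high-water mark while decoding transactions
// of the given sizes, optionally resetting the arena after each one is written out.
func peakBytes(txSizes []int, resetBetween bool) int {
	used, peak := 0, 0
	for _, n := range txSizes {
		used += n // decode: script bytes land in the arena
		if used > peak {
			peak = used
		}
		if resetBetween {
			used = 0 // tx already streamed to the writer, its bytes can be reused
		}
	}
	return peak
}

func main() {
	sizes := []int{300, 5000, 1200}
	fmt.Println(peakBytes(sizes, true), peakBytes(sizes, false)) // 5000 6500
}
```

With the reset the peak equals the largest tx; without it the peak equals the sum, which is the cumulative block size.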
handle_block.go's coinbase decode and manager.go's inbound tx decode
are one-shot operations where the *bt.Tx must outlive the function
frame. An arena allocated here would have to be Put before return,
aliasing the tx's script slices to soon-to-be-reused memory.
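The aliasing hazard is easy to reproduce with a plain byte slice standing in for a pooled slab. decodeInto and retainAcrossReuse are hypothetical helpers written for this illustration; no go-bt API is used.

```go
package main

import "fmt"

// decodeInto copies src into the front of buf and returns the aliasing
// sub-slice, mimicking a decoder that places script bytes in arena memory.
func decodeInto(buf []byte, src string) []byte {
	b := buf[:len(src)]
	copy(b, src)
	return b
}

// retainAcrossReuse keeps a decoded slice past the point where the slab is
// reused, alongside a defensive heap copy taken before the reuse.
func retainAcrossReuse() (aliased, copied string) {
	buf := make([]byte, 16) // stands in for a pooled arena slab

	script := decodeInto(buf, "coinbase")  // retained past the "Put"
	safe := append([]byte(nil), script...) // defensive heap copy

	decodeInto(buf, "next-txn") // slab reused by a later decode

	return string(script), string(safe)
}

func main() {
	aliased, copied := retainAcrossReuse()
	fmt.Println(aliased, copied) // next-txn coinbase
}
```

The retained slice silently reads the next decode's bytes, so a one-shot site would need the defensive copy anyway, which is the allocation the arena was meant to avoid.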

The per-block tx loops in legacy ingestion work with bsvutil.Tx (the
legacy wire wrapper, not bt.Tx) and never round-trip through go-bt
decode — the eventual heavy decode of those subtree-resident txs runs
in services/subtreevalidation, which already uses the per-subtree
arena from earlier in this PR series.

Synthetic ~97 MiB subtree (1000 txs × 100 KB OP_RETURN each) decoded
via the arena path. Asserts post-decode HeapInuse delta is well under
the wire size — proving that arena reuse bounds peak memory by the
slab cap + per-tx struct overhead, not by cumulative script bytes.

Observed: 97 MiB wire, 28 MiB HeapInuse delta (vs ~97 MiB without arena).

Skipped under -short.

github-actions Bot commented May 21, 2026

Contributor

🤖 Claude Code Review

Status: Complete


Current Review:

No new issues found. The PR implements arena-backed transaction decoding correctly with proper lifecycle management across all paths.

Previously Reported:

  • Arena lifetime concern in blockpersister/streaming_process_subtree.go (line 284) — addressed by developer with verification that UTXOSet.ProcessTx creates heap copies via append(b, u.Script...) in UTXO.Bytes():423

Notes:

  • All three arena pools (subtreevalidation, blockpersister, asset/repository) follow consistent patterns with proper ResetAndShrink on Put
  • Error paths correctly release arenas before early returns
  • Single-tx decode sites correctly stay on heap path with explanatory comments
  • Memory-bound test validates 97 MiB stream → 28 MiB held (arena working as designed)

Comment thread services/blockpersister/streaming_process_subtree.go
CI bot flagged a use-after-free concern keyed on UTXODiff.Add, but
ProcessSubtreeUTXOStreaming uses *utxopersister.UTXOSet (not the
unrelated services/blockpersister/utxoset/model.UTXODiff type, which
is only imported by its own internal tests). UTXOSet.ProcessTx
serialises via UTXOWrapper.Bytes -> UTXO.Bytes, which heap-copies
script bytes via append, then writes through a bufio.Writer that
copies into its own buffer. No arena-backed slice survives the
function frame. Adding the comment so the next reader doesn't have
to retrace the chain.
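The copy chain described in this commit message can be mimicked as follows. This is a sketch only: record and its toy one-byte length prefix are hypothetical and do not reproduce the UTXO wire format. What it shows is that serialising with append into a fresh slice and then writing through a bufio.Writer leaves the output independent of the source memory.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
)

// record is a hypothetical stand-in for a UTXO whose Script may alias arena memory.
type record struct{ Script []byte }

// Bytes serialises the record by appending into a fresh slice, so the result
// owns its bytes and does not alias r.Script.
func (r record) Bytes() []byte {
	b := make([]byte, 0, 1+len(r.Script))
	b = append(b, byte(len(r.Script))) // toy one-byte length prefix
	return append(b, r.Script...)
}

// writeThenReuse writes a record, overwrites the source memory before the
// flush, and returns what actually reached the sink.
func writeThenReuse() string {
	slab := []byte("script-A") // stands in for arena-backed memory
	var sink bytes.Buffer
	w := bufio.NewWriter(&sink)

	w.Write(record{Script: slab}.Bytes()) // bytes are copied into w's buffer
	copy(slab, "script-B")                // arena reused before the flush
	w.Flush()

	return sink.String()[1:] // skip the length prefix
}

func main() { fmt.Println(writeThenReuse()) }
```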

github-actions Bot commented May 21, 2026

Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-929 (b4c5760)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05

All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.759µ 1.829µ ~ 0.100
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 59.60n 59.62n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 59.35n 59.41n ~ 0.500
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 59.32n 59.42n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 33.16n 38.05n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 55.92n 60.26n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 157.2n 160.3n ~ 0.400
MiningCandidate_Stringify_Short-4 255.0n 259.1n ~ 1.000
MiningCandidate_Stringify_Long-4 1.814µ 1.809µ ~ 1.000
MiningSolution_Stringify-4 936.2n 957.1n ~ 0.700
BlockInfo_MarshalJSON-4 1.793µ 1.762µ ~ 0.700
NewFromBytes-4 127.6n 128.6n ~ 0.200
AddTxBatchColumnar_Validation-4 2.499µ 2.589µ ~ 0.100
OffsetValidationLoop-4 640.6n 638.2n ~ 0.400
Mine_EasyDifficulty-4 60.66µ 61.02µ ~ 0.200
Mine_WithAddress-4 7.605µ 6.825µ ~ 0.100
BlockAssembler_AddTx-4 0.02642n 0.03266n ~ 0.100
AddNode-4 10.48 11.19 ~ 0.100
AddNodeWithMap-4 11.77 11.79 ~ 1.000
DiskTxMap_SetIfNotExists-4 3.485µ 3.595µ ~ 0.400
DiskTxMap_SetIfNotExists_Parallel-4 3.249µ 3.278µ ~ 0.700
DiskTxMap_ExistenceOnly-4 314.6n 316.4n ~ 1.000
Queue-4 188.6n 189.0n ~ 1.000
AtomicPointer-4 4.836n 5.199n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 863.5µ 868.3µ ~ 1.000
ReorgOptimizations/DedupFilterPipeline/New/10K-4 817.3µ 804.3µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 103.2µ 103.7µ ~ 0.400
ReorgOptimizations/AllMarkFalse/New/10K-4 62.27µ 62.71µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/10K-4 56.18µ 55.47µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/10K-4 11.64µ 11.79µ ~ 0.400
ReorgOptimizations/NodeFlags/Old/10K-4 4.822µ 4.778µ ~ 1.000
ReorgOptimizations/NodeFlags/New/10K-4 1.623µ 1.604µ ~ 0.700
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 10.031m 9.767m ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.41m 10.13m ~ 0.400
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.085m 1.090m ~ 0.700
ReorgOptimizations/AllMarkFalse/New/100K-4 685.4µ 688.7µ ~ 1.000
ReorgOptimizations/HashSlicePool/Old/100K-4 559.3µ 544.7µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 305.4µ 340.4µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/100K-4 52.19µ 48.62µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 18.24µ 17.25µ ~ 0.100
TxMapSetIfNotExists-4 52.73n 52.59n ~ 1.000
TxMapSetIfNotExistsDuplicate-4 40.61n 40.32n ~ 0.100
ChannelSendReceive-4 629.5n 589.7n ~ 0.100
DirectSubtreeAdd/4_per_subtree-4 57.36n 56.90n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 29.17n 28.96n ~ 0.600
DirectSubtreeAdd/256_per_subtree-4 27.83n 27.79n ~ 0.700
DirectSubtreeAdd/1024_per_subtree-4 26.55n 26.50n ~ 0.700
DirectSubtreeAdd/2048_per_subtree-4 26.19n 26.03n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 295.0n 294.3n ~ 1.000
SubtreeProcessorAdd/64_per_subtree-4 285.2n 287.6n ~ 0.400
SubtreeProcessorAdd/256_per_subtree-4 287.3n 286.0n ~ 1.000
SubtreeProcessorAdd/1024_per_subtree-4 277.8n 278.5n ~ 0.400
SubtreeProcessorAdd/2048_per_subtree-4 279.9n 278.7n ~ 0.400
SubtreeProcessorRotate/4_per_subtree-4 287.9n 283.5n ~ 0.700
SubtreeProcessorRotate/64_per_subtree-4 279.1n 280.2n ~ 0.700
SubtreeProcessorRotate/256_per_subtree-4 280.2n 281.5n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 281.2n 281.9n ~ 0.600
SubtreeNodeAddOnly/4_per_subtree-4 55.47n 55.88n ~ 0.700
SubtreeNodeAddOnly/64_per_subtree-4 36.19n 36.13n ~ 0.200
SubtreeNodeAddOnly/256_per_subtree-4 35.29n 35.24n ~ 0.600
SubtreeNodeAddOnly/1024_per_subtree-4 34.51n 34.82n ~ 0.400
SubtreeCreationOnly/4_per_subtree-4 110.7n 112.2n ~ 0.100
SubtreeCreationOnly/64_per_subtree-4 355.1n 360.0n ~ 0.100
SubtreeCreationOnly/256_per_subtree-4 1.236µ 1.247µ ~ 0.300
SubtreeCreationOnly/1024_per_subtree-4 3.908µ 3.895µ ~ 1.000
SubtreeCreationOnly/2048_per_subtree-4 6.961µ 7.026µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 282.5n 280.7n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 280.7n 280.6n ~ 0.500
ParallelGetAndSetIfNotExists/1k_nodes-4 2.021m 2.030m ~ 1.000
ParallelGetAndSetIfNotExists/10k_nodes-4 5.165m 5.195m ~ 0.200
ParallelGetAndSetIfNotExists/50k_nodes-4 7.254m 7.605m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 10.05m 10.82m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 1.785m 1.803m ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 4.479m 4.686m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 14.69m 13.96m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 27.16m 27.23m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.092m 2.079m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.523m 8.420m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.89m 13.70m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.812m 1.872m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 8.414m 8.748m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 44.08m 47.31m ~ 0.400
CalcBlockWork-4 470.2n 468.5n ~ 1.000
CalculateWork-4 643.4n 638.3n ~ 0.400
BuildBlockLocatorString_Helpers/Size_10-4 1.331µ 1.670µ ~ 0.200
BuildBlockLocatorString_Helpers/Size_100-4 12.87µ 12.94µ ~ 0.200
BuildBlockLocatorString_Helpers/Size_1000-4 151.6µ 128.1µ ~ 0.100
CatchupWithHeaderCache-4 104.4m 104.5m ~ 0.400
_prepareTxsPerLevel-4 424.1m 430.0m ~ 0.200
_prepareTxsPerLevelOrdered-4 3.801m 4.013m ~ 0.100
_prepareTxsPerLevel_Comparison/Original-4 420.1m 423.2m ~ 0.400
_prepareTxsPerLevel_Comparison/Optimized-4 3.793m 3.890m ~ 0.100
_BufferPoolAllocation/16KB-4 3.727µ 3.807µ ~ 0.200
_BufferPoolAllocation/32KB-4 8.396µ 8.523µ ~ 0.700
_BufferPoolAllocation/64KB-4 17.54µ 14.33µ ~ 0.200
_BufferPoolAllocation/128KB-4 35.04µ 25.20µ ~ 0.100
_BufferPoolAllocation/512KB-4 97.06µ 102.89µ ~ 0.100
_BufferPoolConcurrent/32KB-4 18.23µ 18.26µ ~ 1.000
_BufferPoolConcurrent/64KB-4 29.15µ 27.98µ ~ 0.100
_BufferPoolConcurrent/512KB-4 145.1µ 146.3µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 699.0µ 689.4µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/32KB-4 710.1µ 701.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 687.2µ 678.4µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/128KB-4 626.4µ 621.2µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/512KB-4 624.4µ 669.4µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.29m 36.78m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.37m 36.91m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4 35.86m 36.76m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.11m 36.71m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.53m 36.77m ~ 0.100
_PooledVsNonPooled/Pooled-4 834.1n 832.2n ~ 1.000
_PooledVsNonPooled/NonPooled-4 7.149µ 7.754µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.558µ 7.517µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.201µ 11.681µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.132µ 11.988µ ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.276m 1.291m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 304.7µ 313.2µ ~ 0.100
SubtreeSizes/10k_tx_64_per_subtree-4 74.16µ 73.17µ ~ 1.000
SubtreeSizes/10k_tx_256_per_subtree-4 18.35µ 17.91µ ~ 0.100
SubtreeSizes/10k_tx_512_per_subtree-4 8.926µ 8.915µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.443µ 4.392µ ~ 0.200
SubtreeSizes/10k_tx_2k_per_subtree-4 2.214µ 2.178µ ~ 0.700
BlockSizeScaling/10k_tx_64_per_subtree-4 70.70µ 69.37µ ~ 0.700
BlockSizeScaling/10k_tx_256_per_subtree-4 17.72µ 17.66µ ~ 1.000
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.377µ 4.370µ ~ 1.000
BlockSizeScaling/50k_tx_64_per_subtree-4 371.9µ 367.7µ ~ 0.400
BlockSizeScaling/50k_tx_256_per_subtree-4 88.19µ 90.93µ ~ 0.100
BlockSizeScaling/50k_tx_1024_per_subtree-4 21.55µ 21.69µ ~ 1.000
SubtreeAllocations/small_subtrees_exists_check-4 152.4µ 152.2µ ~ 1.000
SubtreeAllocations/small_subtrees_data_fetch-4 162.2µ 162.7µ ~ 1.000
SubtreeAllocations/small_subtrees_full_validation-4 312.8µ 313.1µ ~ 0.700
SubtreeAllocations/medium_subtrees_exists_check-4 9.030µ 9.194µ ~ 0.200
SubtreeAllocations/medium_subtrees_data_fetch-4 9.457µ 9.509µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 17.76µ 18.43µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.107µ 2.163µ ~ 0.200
SubtreeAllocations/large_subtrees_data_fetch-4 2.286µ 2.290µ ~ 1.000
SubtreeAllocations/large_subtrees_full_validation-4 4.466µ 4.568µ ~ 0.400
StoreBlock_Sequential/BelowCSVHeight-4 333.4µ 329.3µ ~ 0.700
StoreBlock_Sequential/AboveCSVHeight-4 329.2µ 333.2µ ~ 0.100
GetUtxoHashes-4 274.4n 275.4n ~ 0.800
GetUtxoHashes_ManyOutputs-4 45.43µ 45.75µ ~ 0.700
_NewMetaDataFromBytes-4 216.1n 216.5n ~ 0.800
_Bytes-4 397.0n 404.7n ~ 0.100
_MetaBytes-4 138.4n 140.1n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-22 08:39 UTC


@oskarszoon changed the title from "fix(blockvalidation): arena-backed tx decode to eliminate catch-up OOM (#920)" to "fix(blockvalidation): arena-backed tx decode to eliminate catch-up OOM" on May 22, 2026

Development

Successfully merging this pull request may close these issues.

blockvalidation: OOM during ttn catch-up sync — 70% of heap in go-bt tx/output decode

3 participants