Skip to content

perf(legacy): drop cumulative-stats replay; right-size getdata InvList#876

Merged
oskarszoon merged 3 commits into
bsv-blockchain:mainfrom
oskarszoon:feat/legacy-perf
May 18, 2026
Merged

perf(legacy): drop cumulative-stats replay; right-size getdata InvList#876
oskarszoon merged 3 commits into
bsv-blockchain:mainfrom
oskarszoon:feat/legacy-perf

Conversation

@oskarszoon

Copy link
Copy Markdown
Contributor

Summary

Two perf fixes targeting legacy-sync CPU/alloc cost, found by analysing a probe captured on bsva-ovh-teranode-ttn-eu-4 mid-fat-block (height 1,276,417, 181,416 txs, v0.15.0-beta-13, 28h uptime). Captured profiles attributed ~42% of legacy CPU to a Prometheus histogram path that was replaying cumulative Aerospike bucket counts every 5s.

1. perf(aerospike): delta-record cumulative stats

util/aerospike.go:initStats polled client.Stats() every 5s and for each histogram bucket called histogram.Observe(value) int(count) times. count is Aerospike's cumulative-since-process-start bucket counter (see aerospike-client-go/v8/cluster.go:aggregateNodeStats — per-node getAndReset resets node-local but accumulates into a persistent clstr.stats[h]), so every refresh re-observed the entire history. Same bug for scalar counters (counter.Add(cumulative) instead of delta).

Fix tracks last reported value per Prometheus key (counter and per histogram bucket) and emits only the delta. Resets are tolerated (treat current as delta). The stats-reflection loop body was extracted into a unit-testable processAerospikeStats helper. Added a missing time.Sleep on the Stats() error path.

Side-by-side benchmark in util/aerospike_stats_test.go (10M-warmup cumulative, 50k per-refresh delta, 11 histograms × 24 buckets):

Bench ns/op
BenchmarkProcessAerospikeStatsSteadyState (delta) 127,200,370
BenchmarkProcessAerospikeStatsLegacyReplay (replay) 25,611,779,292

~200× per refresh. With a 5s configured interval the replay path took longer than the interval itself, so the goroutine never caught up.

Note: the underlying gocore.Stat.NewStat allocates &Stat{} before sync.Map.LoadOrStore lookup even when the stat already exists. This contributes ~18 GB cumulative alloc to tracing.UTracer.Start over a 28h run but is out of scope for this PR — would need an upstream ordishs/gocore patch or a local per-spanName cache in util/tracing.

2. perf(legacy): size getdata InvList to maxBlocks

fetchHeaderBlocks allocated MsgGetData.InvList with sm.headerList.Len() as the size hint (often 2000) but the loop already breaks at maxBlocks (1–20 via blockSizeTracker.calculateMaxInFlightBlocks — drops to 1 for blocks >2 GB). Over-sized cap meant ~16 KB pointer storage allocated when only one slot would ever be used. Switched the size hint to maxBlocks.

Test plan

  • go test -race ./util/ — 524 pass
  • go test -race ./services/legacy/netsync/ — 52 pass
  • go vet ./util/ ./services/legacy/netsync/
  • go build ./...
  • Three new correctness tests (TestProcessAerospikeStatsCumulativeCounterDelta, …HistogramBucketDelta, …HistogramResetHandling) verify counter and histogram deltas + reset tolerance
  • Benchmark BenchmarkProcessAerospikeStatsLegacyReplay preserves a copy of the previous loop body for ongoing regression checks
  • Deploy to TTN host and re-capture CPU profile to confirm the histogram path drops out of the top frames

The Stats() goroutine spawned by initStats was treating Aerospike's
cumulative bucket counts and counters as if they were per-refresh
samples. For each refresh (default 5s) it iterated every histogram
bucket and called histogram.Observe(value) int(count) times, where
count is monotonically growing since process start. On the legacy
service after 28h uptime this dominated CPU.

CPU profile from a probe captured mid-fat-block on testnet attributed
~42% of legacy CPU to prometheus.histogram.observe and friends along
this path. A side-by-side benchmark (10M-warmup cumulative, 50k
per-refresh delta, 11 histograms x 24 buckets) measures:

  BenchmarkProcessAerospikeStatsSteadyState    127 ms/op   (delta)
  BenchmarkProcessAerospikeStatsLegacyReplay   25,612 ms/op (replay)

i.e. ~200x faster per refresh; the previous path took longer than
the configured refresh interval so the goroutine never caught up.

Fix:
- Track the previously reported cumulative value per counter and per
  histogram bucket. Emit only (current - previous) to Prometheus.
- Tolerate cumulative drops by treating the new value as the delta
  (covers client restart).
- Extract the stats-reflection body into a unit-testable
  processAerospikeStats helper.
- Add a missing time.Sleep on the Stats() error path so transient
  errors do not produce a tight retry loop.

Tests: three correctness tests (counter delta, histogram delta,
reset handling) plus benchmarks demonstrating the saving.
fetchHeaderBlocks allocated the MsgGetData InvList with a size hint of
sm.headerList.Len() (often 2000) but the loop below breaks at
maxBlocks (1-20, scaled by dynamic in-flight tracker - down to 1 for
blocks >2 GB). The over-sized backing slice was ~16 KB of pointer
storage allocated repeatedly even when only one slot was ever used.

Use maxBlocks as the size hint, which matches the loop's actual upper
bound and shrinks the per-call allocation from ~16 KB to ~8-160 B.
@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Summary

This PR delivers two focused performance optimizations backed by profiling data and comprehensive testing. The Aerospike stats fix addresses a critical CPU bottleneck (200× improvement), and the InvList sizing change eliminates unnecessary allocations during legacy sync.

No blocking issues found. The implementation is correct, well-tested, and follows the repository's verification standards (race detection, comprehensive unit tests, benchmarks preserving legacy behavior for regression checks).

Code Quality

Correctness: Delta tracking correctly handles cumulative counters, including edge case of client restart/reset
Testing: Three targeted unit tests verify counter delta, histogram delta, and reset handling
Performance: Side-by-side benchmarks quantify the improvement (127ms vs 25,611ms)
Maintainability: Extracted processAerospikeStats into a testable function with clear documentation

The tests align with AGENTS.md test-first approach (behavioral tests written before refactoring), and the benchmark preserves the legacy code path for ongoing regression verification.

Notes

  • The PR correctly identifies the root cause: Aerospike returns cumulative-since-start counters, but the old code re-added the full cumulative value every refresh
  • InvList sizing: The comment at line 1458-1461 clearly explains why maxBlocks is correct (loop breaks early, so large initial capacity wastes memory)
  • Added missing time.Sleep on error path (line 423) prevents busy loop on stats fetch failure

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses two performance issues found in legacy-sync profiling: (1) Aerospike stats reflection was re-observing cumulative bucket counts each 5s refresh (accounting for ~40% of legacy CPU), and (2) fetchHeaderBlocks was over-allocating the getdata InvList slice. The first fix tracks last-reported values per Prometheus key and emits deltas; the second right-sizes the InvList allocation to maxBlocks. The stats-processing loop is also extracted into a unit-testable helper, and a missing time.Sleep is added on the Stats() error path.

Changes:

  • Track per-key/per-bucket last cumulative values in util/aerospike.go and emit only deltas to Prometheus counters/histograms; extract processAerospikeStats; sleep on error.
  • Size MsgGetData.InvList to maxBlocks instead of sm.headerList.Len() in fetchHeaderBlocks.
  • Add unit tests and benchmarks (delta vs. legacy replay) for processAerospikeStats.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
util/aerospike.go Refactors stats reflection into processAerospikeStats with per-key delta tracking via two new safemaps; adds error-path sleep.
util/aerospike_stats_test.go New tests covering counter/histogram delta and reset handling, plus side-by-side benchmarks.
services/legacy/netsync/manager.go Sizes getdata InvList to maxBlocks to avoid oversized allocations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@sonarqubecloud

Copy link
Copy Markdown

@github-actions

github-actions Bot commented May 15, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-876 (3319fba)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.665µ 1.668µ ~ 0.600
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.69n 61.56n ~ 0.200
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.53n 61.65n ~ 0.300
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.69n 61.61n ~ 0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 30.44n 29.45n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 49.13n 49.90n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 105.0n 106.6n ~ 0.400
MiningCandidate_Stringify_Short-4 262.3n 266.8n ~ 0.100
MiningCandidate_Stringify_Long-4 1.889µ 1.879µ ~ 0.700
MiningSolution_Stringify-4 977.2n 962.5n ~ 0.100
BlockInfo_MarshalJSON-4 1.741µ 1.759µ ~ 0.300
NewFromBytes-4 154.6n 125.4n ~ 0.100
Mine_EasyDifficulty-4 51.82µ 51.61µ ~ 0.700
Mine_WithAddress-4 5.354µ 5.453µ ~ 0.200
BlockAssembler_AddTx-4 0.02982n 0.02833n ~ 0.700
AddNode-4 11.12 10.41 ~ 0.200
AddNodeWithMap-4 10.58 10.12 ~ 0.700
DirectSubtreeAdd/4_per_subtree-4 61.59n 59.40n ~ 0.700
DirectSubtreeAdd/64_per_subtree-4 29.19n 29.19n ~ 1.000
DirectSubtreeAdd/256_per_subtree-4 27.74n 27.83n ~ 0.200
DirectSubtreeAdd/1024_per_subtree-4 26.53n 26.57n ~ 0.700
DirectSubtreeAdd/2048_per_subtree-4 26.23n 26.20n ~ 0.700
SubtreeProcessorAdd/4_per_subtree-4 280.1n 283.3n ~ 0.400
SubtreeProcessorAdd/64_per_subtree-4 273.7n 286.6n ~ 0.100
SubtreeProcessorAdd/256_per_subtree-4 277.9n 291.1n ~ 0.100
SubtreeProcessorAdd/1024_per_subtree-4 265.6n 289.1n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 268.2n 287.6n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 270.3n 297.0n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 269.0n 291.8n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 268.5n 289.9n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 270.3n 291.1n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 54.96n 55.57n ~ 0.700
SubtreeNodeAddOnly/64_per_subtree-4 36.06n 35.99n ~ 0.700
SubtreeNodeAddOnly/256_per_subtree-4 35.03n 35.04n ~ 1.000
SubtreeNodeAddOnly/1024_per_subtree-4 34.42n 34.57n ~ 0.200
SubtreeCreationOnly/4_per_subtree-4 112.6n 110.1n ~ 0.200
SubtreeCreationOnly/64_per_subtree-4 355.7n 344.4n ~ 0.100
SubtreeCreationOnly/256_per_subtree-4 1.212µ 1.213µ ~ 0.900
SubtreeCreationOnly/1024_per_subtree-4 3.748µ 3.788µ ~ 0.200
SubtreeCreationOnly/2048_per_subtree-4 6.739µ 6.968µ ~ 0.400
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 269.5n 292.9n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 270.5n 294.2n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 552.0µ 620.4µ ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 1.372m 1.381m ~ 0.700
ParallelGetAndSetIfNotExists/50k_nodes-4 6.694m 6.605m ~ 0.200
ParallelGetAndSetIfNotExists/100k_nodes-4 13.18m 13.37m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 623.3µ 700.9µ ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 2.909m 2.959m ~ 0.400
SequentialGetAndSetIfNotExists/50k_nodes-4 11.06m 11.09m ~ 1.000
SequentialGetAndSetIfNotExists/100k_nodes-4 21.35m 21.35m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 597.4µ 675.5µ ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 4.599m 4.642m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 17.00m 16.85m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 666.1µ 762.4µ ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 6.254m 6.292m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 39.36m 41.72m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.798µ 3.962µ ~ 0.300
DiskTxMap_SetIfNotExists_Parallel-4 3.486µ 3.750µ ~ 0.100
DiskTxMap_ExistenceOnly-4 317.3n 355.3n ~ 0.100
Queue-4 186.0n 187.9n ~ 0.300
AtomicPointer-4 3.251n 3.234n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 810.7µ 792.6µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 769.5µ 763.2µ ~ 0.400
ReorgOptimizations/AllMarkFalse/Old/10K-4 103.7µ 103.4µ ~ 0.400
ReorgOptimizations/AllMarkFalse/New/10K-4 64.40µ 64.47µ ~ 1.000
ReorgOptimizations/HashSlicePool/Old/10K-4 50.69µ 49.02µ ~ 0.200
ReorgOptimizations/HashSlicePool/New/10K-4 11.19µ 10.90µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/10K-4 4.647µ 4.395µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.603µ 1.469µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.754m 9.243m ~ 0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.22m 10.14m ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.091m 1.076m ~ 0.100
ReorgOptimizations/AllMarkFalse/New/100K-4 707.6µ 705.4µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/100K-4 516.7µ 472.8µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/100K-4 206.6µ 196.5µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/100K-4 48.26µ 45.58µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 16.81µ 15.92µ ~ 0.200
TxMapSetIfNotExists-4 47.23n 46.59n ~ 0.700
TxMapSetIfNotExistsDuplicate-4 39.00n 38.51n ~ 0.100
ChannelSendReceive-4 620.0n 589.2n ~ 0.100
CalcBlockWork-4 535.7n 521.2n ~ 0.700
CalculateWork-4 697.3n 697.5n ~ 1.000
BuildBlockLocatorString_Helpers/Size_10-4 1.293µ 1.286µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 12.21µ 12.59µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 133.3µ 127.9µ ~ 1.000
CatchupWithHeaderCache-4 104.5m 104.3m ~ 0.700
_prepareTxsPerLevel-4 411.2m 406.0m ~ 0.700
_prepareTxsPerLevelOrdered-4 3.613m 3.673m ~ 1.000
_prepareTxsPerLevel_Comparison/Original-4 422.8m 414.0m ~ 0.200
_prepareTxsPerLevel_Comparison/Optimized-4 3.569m 3.509m ~ 0.200
SubtreeSizes/10k_tx_4_per_subtree-4 1.420m 1.407m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 323.0µ 321.3µ ~ 1.000
SubtreeSizes/10k_tx_64_per_subtree-4 77.84µ 77.64µ ~ 0.700
SubtreeSizes/10k_tx_256_per_subtree-4 19.10µ 19.45µ ~ 0.200
SubtreeSizes/10k_tx_512_per_subtree-4 9.495µ 9.541µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.755µ 4.759µ ~ 1.000
SubtreeSizes/10k_tx_2k_per_subtree-4 2.349µ 2.364µ ~ 0.200
BlockSizeScaling/10k_tx_64_per_subtree-4 74.52µ 75.33µ ~ 0.400
BlockSizeScaling/10k_tx_256_per_subtree-4 18.82µ 19.22µ ~ 0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.695µ 4.705µ ~ 0.400
BlockSizeScaling/50k_tx_64_per_subtree-4 388.9µ 398.5µ ~ 0.200
BlockSizeScaling/50k_tx_256_per_subtree-4 94.66µ 95.03µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.02µ 23.46µ ~ 0.100
SubtreeAllocations/small_subtrees_exists_check-4 159.0µ 165.5µ ~ 0.100
SubtreeAllocations/small_subtrees_data_fetch-4 161.7µ 163.2µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 323.3µ 326.8µ ~ 0.100
SubtreeAllocations/medium_subtrees_exists_check-4 9.446µ 9.889µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 9.520µ 9.579µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 18.90µ 19.02µ ~ 0.200
SubtreeAllocations/large_subtrees_exists_check-4 2.247µ 2.377µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.324µ 2.327µ ~ 1.000
SubtreeAllocations/large_subtrees_full_validation-4 4.680µ 4.810µ ~ 0.100
_BufferPoolAllocation/16KB-4 3.468µ 3.274µ ~ 0.100
_BufferPoolAllocation/32KB-4 7.221µ 7.105µ ~ 0.700
_BufferPoolAllocation/64KB-4 19.41µ 15.95µ ~ 0.200
_BufferPoolAllocation/128KB-4 33.11µ 28.40µ ~ 0.100
_BufferPoolAllocation/512KB-4 114.2µ 113.9µ ~ 1.000
_BufferPoolConcurrent/32KB-4 19.02µ 18.33µ ~ 0.100
_BufferPoolConcurrent/64KB-4 26.69µ 29.97µ ~ 0.100
_BufferPoolConcurrent/512KB-4 161.4µ 162.2µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/16KB-4 682.8µ 666.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 676.9µ 656.5µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/64KB-4 685.4µ 652.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 693.6µ 644.7µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 705.0µ 678.7µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 37.78m 37.81m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 37.91m 38.44m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4 38.18m 37.60m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/128KB-4 38.18m 37.96m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/512KB-4 38.31m 37.93m ~ 0.100
_PooledVsNonPooled/Pooled-4 820.8n 820.4n ~ 1.000
_PooledVsNonPooled/NonPooled-4 7.572µ 7.232µ ~ 0.700
_MemoryFootprint/Current_512KB_32concurrent-4 8.908µ 8.324µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 10.92µ 10.55µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 10.50µ 10.10µ ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 248.2µ 249.5µ ~ 1.000
StoreBlock_Sequential/AboveCSVHeight-4 250.8µ 249.6µ ~ 0.700
GetUtxoHashes-4 258.8n 261.2n ~ 1.000
GetUtxoHashes_ManyOutputs-4 42.72µ 42.69µ ~ 1.000
_NewMetaDataFromBytes-4 238.8n 236.9n ~ 0.700
_Bytes-4 623.2n 625.4n ~ 1.000
_MetaBytes-4 563.3n 571.4n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-15 18:10 UTC

@oskarszoon oskarszoon self-assigned this May 15, 2026
@oskarszoon oskarszoon requested review from freemans13 and ordishs May 15, 2026 13:49
@oskarszoon oskarszoon merged commit 9799a1b into bsv-blockchain:main May 18, 2026
27 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants