perf(legacy): drop cumulative-stats replay; right-size getdata InvList by oskarszoon · Pull Request #876 · bsv-blockchain/teranode

oskarszoon · 2026-05-15T12:58:39Z

Summary

Two perf fixes targeting legacy-sync CPU/alloc cost, found by analysing a probe captured on bsva-ovh-teranode-ttn-eu-4 mid-fat-block (height 1,276,417, 181,416 txs, v0.15.0-beta-13, 28h uptime). Captured profiles attributed ~42% of legacy CPU to a Prometheus histogram path that was replaying cumulative Aerospike bucket counts every 5s.

1. `perf(aerospike)`: delta-record cumulative stats

util/aerospike.go:initStats polled client.Stats() every 5s and for each histogram bucket called histogram.Observe(value) int(count) times. count is Aerospike's cumulative-since-process-start bucket counter (see aerospike-client-go/v8/cluster.go:aggregateNodeStats — per-node getAndReset resets node-local but accumulates into a persistent clstr.stats[h]), so every refresh re-observed the entire history. Same bug for scalar counters (counter.Add(cumulative) instead of delta).

Fix tracks last reported value per Prometheus key (counter and per histogram bucket) and emits only the delta. Resets are tolerated (treat current as delta). The stats-reflection loop body was extracted into a unit-testable processAerospikeStats helper. Added a missing time.Sleep on the Stats() error path.

Side-by-side benchmark in util/aerospike_stats_test.go (10M-warmup cumulative, 50k per-refresh delta, 11 histograms × 24 buckets):

Bench	ns/op
`BenchmarkProcessAerospikeStatsSteadyState` (delta)	127,200,370
`BenchmarkProcessAerospikeStatsLegacyReplay` (replay)	25,611,779,292

~200× per refresh. With a 5s configured interval the replay path took longer than the interval itself, so the goroutine never caught up.

Note: the underlying gocore.Stat.NewStat allocates &Stat{} before sync.Map.LoadOrStore lookup even when the stat already exists. This contributes ~18 GB cumulative alloc to tracing.UTracer.Start over a 28h run but is out of scope for this PR — would need an upstream ordishs/gocore patch or a local per-spanName cache in util/tracing.

2. `perf(legacy)`: size getdata InvList to `maxBlocks`

fetchHeaderBlocks allocated MsgGetData.InvList with sm.headerList.Len() as the size hint (often 2000) but the loop already breaks at maxBlocks (1–20 via blockSizeTracker.calculateMaxInFlightBlocks — drops to 1 for blocks >2 GB). Over-sized cap meant ~16 KB pointer storage allocated when only one slot would ever be used. Switched the size hint to maxBlocks.

Test plan

go test -race ./util/ — 524 pass
go test -race ./services/legacy/netsync/ — 52 pass
go vet ./util/ ./services/legacy/netsync/
go build ./...
Three new correctness tests (TestProcessAerospikeStatsCumulativeCounterDelta, …HistogramBucketDelta, …HistogramResetHandling) verify counter and histogram deltas + reset tolerance
Benchmark BenchmarkProcessAerospikeStatsLegacyReplay preserves a copy of the previous loop body for ongoing regression checks
Deploy to TTN host and re-capture CPU profile to confirm the histogram path drops out of the top frames

The Stats() goroutine spawned by initStats was treating Aerospike's cumulative bucket counts and counters as if they were per-refresh samples. For each refresh (default 5s) it iterated every histogram bucket and called histogram.Observe(value) int(count) times, where count is monotonically growing since process start. On the legacy service after 28h uptime this dominated CPU. CPU profile from a probe captured mid-fat-block on testnet attributed ~42% of legacy CPU to prometheus.histogram.observe and friends along this path. A side-by-side benchmark (10M-warmup cumulative, 50k per-refresh delta, 11 histograms x 24 buckets) measures: BenchmarkProcessAerospikeStatsSteadyState 127 ms/op (delta) BenchmarkProcessAerospikeStatsLegacyReplay 25,612 ms/op (replay) i.e. ~200x faster per refresh; the previous path took longer than the configured refresh interval so the goroutine never caught up. Fix: - Track the previously reported cumulative value per counter and per histogram bucket. Emit only (current - previous) to Prometheus. - Tolerate cumulative drops by treating the new value as the delta (covers client restart). - Extract the stats-reflection body into a unit-testable processAerospikeStats helper. - Add a missing time.Sleep on the Stats() error path so transient errors do not produce a tight retry loop. Tests: three correctness tests (counter delta, histogram delta, reset handling) plus benchmarks demonstrating the saving.

fetchHeaderBlocks allocated the MsgGetData InvList with a size hint of sm.headerList.Len() (often 2000) but the loop below breaks at maxBlocks (1-20, scaled by dynamic in-flight tracker - down to 1 for blocks >2 GB). The over-sized backing slice was ~16 KB of pointer storage allocated repeatedly even when only one slot was ever used. Use maxBlocks as the size hint, which matches the loop's actual upper bound and shrinks the per-call allocation from ~16 KB to ~8-160 B.

github-actions · 2026-05-15T12:59:55Z

🤖 Claude Code Review

Status: Complete

Summary

This PR delivers two focused performance optimizations backed by profiling data and comprehensive testing. The Aerospike stats fix addresses a critical CPU bottleneck (200× improvement), and the InvList sizing change eliminates unnecessary allocations during legacy sync.

No blocking issues found. The implementation is correct, well-tested, and follows the repository's verification standards (race detection, comprehensive unit tests, benchmarks preserving legacy behavior for regression checks).

Code Quality

✅ Correctness: Delta tracking correctly handles cumulative counters, including edge case of client restart/reset
✅ Testing: Three targeted unit tests verify counter delta, histogram delta, and reset handling
✅ Performance: Side-by-side benchmarks quantify the improvement (127ms vs 25,611ms)
✅ Maintainability: Extracted processAerospikeStats into a testable function with clear documentation

The tests align with AGENTS.md test-first approach (behavioral tests written before refactoring), and the benchmark preserves the legacy code path for ongoing regression verification.

Notes

The PR correctly identifies the root cause: Aerospike returns cumulative-since-start counters, but the old code re-added the full cumulative value every refresh
InvList sizing: The comment at line 1458-1461 clearly explains why maxBlocks is correct (loop breaks early, so large initial capacity wastes memory)
Added missing time.Sleep on error path (line 423) prevents busy loop on stats fetch failure

Copilot

Pull request overview

This PR addresses two performance issues found in legacy-sync profiling: (1) Aerospike stats reflection was re-observing cumulative bucket counts each 5s refresh (accounting for ~40% of legacy CPU), and (2) fetchHeaderBlocks was over-allocating the getdata InvList slice. The first fix tracks last-reported values per Prometheus key and emits deltas; the second right-sizes the InvList allocation to maxBlocks. The stats-processing loop is also extracted into a unit-testable helper, and a missing time.Sleep is added on the Stats() error path.

Changes:

Track per-key/per-bucket last cumulative values in util/aerospike.go and emit only deltas to Prometheus counters/histograms; extract processAerospikeStats; sleep on error.
Size MsgGetData.InvList to maxBlocks instead of sm.headerList.Len() in fetchHeaderBlocks.
Add unit tests and benchmarks (delta vs. legacy replay) for processAerospikeStats.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
util/aerospike.go	Refactors stats reflection into `processAerospikeStats` with per-key delta tracking via two new safemaps; adds error-path sleep.
util/aerospike_stats_test.go	New tests covering counter/histogram delta and reset handling, plus side-by-side benchmarks.
services/legacy/netsync/manager.go	Sizes getdata InvList to `maxBlocks` to avoid oversized allocations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sonarqubecloud · 2026-05-15T13:10:51Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
81.1% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

github-actions · 2026-05-15T13:14:17Z

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-876 (3319fba)

Summary

Regressions: 0
Improvements: 0
Unchanged: 142
Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark	Baseline	Current	Change	p-value
_NewBlockFromBytes-4	1.665µ	1.668µ	~	0.600
SplitSyncedParentMap_SetIfNotExists/256_buckets-4	61.69n	61.56n	~	0.200
SplitSyncedParentMap_SetIfNotExists/16_buckets-4	61.53n	61.65n	~	0.300
SplitSyncedParentMap_SetIfNotExists/1_bucket-4	61.69n	61.61n	~	0.200
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets...	30.44n	29.45n	~	0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_...	49.13n	49.90n	~	0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa...	105.0n	106.6n	~	0.400
MiningCandidate_Stringify_Short-4	262.3n	266.8n	~	0.100
MiningCandidate_Stringify_Long-4	1.889µ	1.879µ	~	0.700
MiningSolution_Stringify-4	977.2n	962.5n	~	0.100
BlockInfo_MarshalJSON-4	1.741µ	1.759µ	~	0.300
NewFromBytes-4	154.6n	125.4n	~	0.100
Mine_EasyDifficulty-4	51.82µ	51.61µ	~	0.700
Mine_WithAddress-4	5.354µ	5.453µ	~	0.200
BlockAssembler_AddTx-4	0.02982n	0.02833n	~	0.700
AddNode-4	11.12	10.41	~	0.200
AddNodeWithMap-4	10.58	10.12	~	0.700
DirectSubtreeAdd/4_per_subtree-4	61.59n	59.40n	~	0.700
DirectSubtreeAdd/64_per_subtree-4	29.19n	29.19n	~	1.000
DirectSubtreeAdd/256_per_subtree-4	27.74n	27.83n	~	0.200
DirectSubtreeAdd/1024_per_subtree-4	26.53n	26.57n	~	0.700
DirectSubtreeAdd/2048_per_subtree-4	26.23n	26.20n	~	0.700
SubtreeProcessorAdd/4_per_subtree-4	280.1n	283.3n	~	0.400
SubtreeProcessorAdd/64_per_subtree-4	273.7n	286.6n	~	0.100
SubtreeProcessorAdd/256_per_subtree-4	277.9n	291.1n	~	0.100
SubtreeProcessorAdd/1024_per_subtree-4	265.6n	289.1n	~	0.100
SubtreeProcessorAdd/2048_per_subtree-4	268.2n	287.6n	~	0.100
SubtreeProcessorRotate/4_per_subtree-4	270.3n	297.0n	~	0.100
SubtreeProcessorRotate/64_per_subtree-4	269.0n	291.8n	~	0.100
SubtreeProcessorRotate/256_per_subtree-4	268.5n	289.9n	~	0.100
SubtreeProcessorRotate/1024_per_subtree-4	270.3n	291.1n	~	0.100
SubtreeNodeAddOnly/4_per_subtree-4	54.96n	55.57n	~	0.700
SubtreeNodeAddOnly/64_per_subtree-4	36.06n	35.99n	~	0.700
SubtreeNodeAddOnly/256_per_subtree-4	35.03n	35.04n	~	1.000
SubtreeNodeAddOnly/1024_per_subtree-4	34.42n	34.57n	~	0.200
SubtreeCreationOnly/4_per_subtree-4	112.6n	110.1n	~	0.200
SubtreeCreationOnly/64_per_subtree-4	355.7n	344.4n	~	0.100
SubtreeCreationOnly/256_per_subtree-4	1.212µ	1.213µ	~	0.900
SubtreeCreationOnly/1024_per_subtree-4	3.748µ	3.788µ	~	0.200
SubtreeCreationOnly/2048_per_subtree-4	6.739µ	6.968µ	~	0.400
SubtreeProcessorOverheadBreakdown/64_per_subtree-4	269.5n	292.9n	~	0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4	270.5n	294.2n	~	0.100
ParallelGetAndSetIfNotExists/1k_nodes-4	552.0µ	620.4µ	~	0.100
ParallelGetAndSetIfNotExists/10k_nodes-4	1.372m	1.381m	~	0.700
ParallelGetAndSetIfNotExists/50k_nodes-4	6.694m	6.605m	~	0.200
ParallelGetAndSetIfNotExists/100k_nodes-4	13.18m	13.37m	~	0.100
SequentialGetAndSetIfNotExists/1k_nodes-4	623.3µ	700.9µ	~	0.100
SequentialGetAndSetIfNotExists/10k_nodes-4	2.909m	2.959m	~	0.400
SequentialGetAndSetIfNotExists/50k_nodes-4	11.06m	11.09m	~	1.000
SequentialGetAndSetIfNotExists/100k_nodes-4	21.35m	21.35m	~	1.000
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4	597.4µ	675.5µ	~	0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4	4.599m	4.642m	~	0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4	17.00m	16.85m	~	0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4	666.1µ	762.4µ	~	0.100
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4	6.254m	6.292m	~	0.400
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4	39.36m	41.72m	~	0.100
DiskTxMap_SetIfNotExists-4	3.798µ	3.962µ	~	0.300
DiskTxMap_SetIfNotExists_Parallel-4	3.486µ	3.750µ	~	0.100
DiskTxMap_ExistenceOnly-4	317.3n	355.3n	~	0.100
Queue-4	186.0n	187.9n	~	0.300
AtomicPointer-4	3.251n	3.234n	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4	810.7µ	792.6µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4	769.5µ	763.2µ	~	0.400
ReorgOptimizations/AllMarkFalse/Old/10K-4	103.7µ	103.4µ	~	0.400
ReorgOptimizations/AllMarkFalse/New/10K-4	64.40µ	64.47µ	~	1.000
ReorgOptimizations/HashSlicePool/Old/10K-4	50.69µ	49.02µ	~	0.200
ReorgOptimizations/HashSlicePool/New/10K-4	11.19µ	10.90µ	~	0.200
ReorgOptimizations/NodeFlags/Old/10K-4	4.647µ	4.395µ	~	0.100
ReorgOptimizations/NodeFlags/New/10K-4	1.603µ	1.469µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4	9.754m	9.243m	~	0.200
ReorgOptimizations/DedupFilterPipeline/New/100K-4	10.22m	10.14m	~	1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4	1.091m	1.076m	~	0.100
ReorgOptimizations/AllMarkFalse/New/100K-4	707.6µ	705.4µ	~	0.100
ReorgOptimizations/HashSlicePool/Old/100K-4	516.7µ	472.8µ	~	0.700
ReorgOptimizations/HashSlicePool/New/100K-4	206.6µ	196.5µ	~	0.200
ReorgOptimizations/NodeFlags/Old/100K-4	48.26µ	45.58µ	~	0.100
ReorgOptimizations/NodeFlags/New/100K-4	16.81µ	15.92µ	~	0.200
TxMapSetIfNotExists-4	47.23n	46.59n	~	0.700
TxMapSetIfNotExistsDuplicate-4	39.00n	38.51n	~	0.100
ChannelSendReceive-4	620.0n	589.2n	~	0.100
CalcBlockWork-4	535.7n	521.2n	~	0.700
CalculateWork-4	697.3n	697.5n	~	1.000
BuildBlockLocatorString_Helpers/Size_10-4	1.293µ	1.286µ	~	0.100
BuildBlockLocatorString_Helpers/Size_100-4	12.21µ	12.59µ	~	0.100
BuildBlockLocatorString_Helpers/Size_1000-4	133.3µ	127.9µ	~	1.000
CatchupWithHeaderCache-4	104.5m	104.3m	~	0.700
_prepareTxsPerLevel-4	411.2m	406.0m	~	0.700
_prepareTxsPerLevelOrdered-4	3.613m	3.673m	~	1.000
_prepareTxsPerLevel_Comparison/Original-4	422.8m	414.0m	~	0.200
_prepareTxsPerLevel_Comparison/Optimized-4	3.569m	3.509m	~	0.200
SubtreeSizes/10k_tx_4_per_subtree-4	1.420m	1.407m	~	1.000
SubtreeSizes/10k_tx_16_per_subtree-4	323.0µ	321.3µ	~	1.000
SubtreeSizes/10k_tx_64_per_subtree-4	77.84µ	77.64µ	~	0.700
SubtreeSizes/10k_tx_256_per_subtree-4	19.10µ	19.45µ	~	0.200
SubtreeSizes/10k_tx_512_per_subtree-4	9.495µ	9.541µ	~	0.700
SubtreeSizes/10k_tx_1024_per_subtree-4	4.755µ	4.759µ	~	1.000
SubtreeSizes/10k_tx_2k_per_subtree-4	2.349µ	2.364µ	~	0.200
BlockSizeScaling/10k_tx_64_per_subtree-4	74.52µ	75.33µ	~	0.400
BlockSizeScaling/10k_tx_256_per_subtree-4	18.82µ	19.22µ	~	0.100
BlockSizeScaling/10k_tx_1024_per_subtree-4	4.695µ	4.705µ	~	0.400
BlockSizeScaling/50k_tx_64_per_subtree-4	388.9µ	398.5µ	~	0.200
BlockSizeScaling/50k_tx_256_per_subtree-4	94.66µ	95.03µ	~	0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4	23.02µ	23.46µ	~	0.100
SubtreeAllocations/small_subtrees_exists_check-4	159.0µ	165.5µ	~	0.100
SubtreeAllocations/small_subtrees_data_fetch-4	161.7µ	163.2µ	~	0.400
SubtreeAllocations/small_subtrees_full_validation-4	323.3µ	326.8µ	~	0.100
SubtreeAllocations/medium_subtrees_exists_check-4	9.446µ	9.889µ	~	0.100
SubtreeAllocations/medium_subtrees_data_fetch-4	9.520µ	9.579µ	~	0.700
SubtreeAllocations/medium_subtrees_full_validation-4	18.90µ	19.02µ	~	0.200
SubtreeAllocations/large_subtrees_exists_check-4	2.247µ	2.377µ	~	0.100
SubtreeAllocations/large_subtrees_data_fetch-4	2.324µ	2.327µ	~	1.000
SubtreeAllocations/large_subtrees_full_validation-4	4.680µ	4.810µ	~	0.100
_BufferPoolAllocation/16KB-4	3.468µ	3.274µ	~	0.100
_BufferPoolAllocation/32KB-4	7.221µ	7.105µ	~	0.700
_BufferPoolAllocation/64KB-4	19.41µ	15.95µ	~	0.200
_BufferPoolAllocation/128KB-4	33.11µ	28.40µ	~	0.100
_BufferPoolAllocation/512KB-4	114.2µ	113.9µ	~	1.000
_BufferPoolConcurrent/32KB-4	19.02µ	18.33µ	~	0.100
_BufferPoolConcurrent/64KB-4	26.69µ	29.97µ	~	0.100
_BufferPoolConcurrent/512KB-4	161.4µ	162.2µ	~	1.000
_SubtreeDeserializationWithBufferSizes/16KB-4	682.8µ	666.5µ	~	0.100
_SubtreeDeserializationWithBufferSizes/32KB-4	676.9µ	656.5µ	~	0.400
_SubtreeDeserializationWithBufferSizes/64KB-4	685.4µ	652.1µ	~	0.100
_SubtreeDeserializationWithBufferSizes/128KB-4	693.6µ	644.7µ	~	0.100
_SubtreeDeserializationWithBufferSizes/512KB-4	705.0µ	678.7µ	~	0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4	37.78m	37.81m	~	0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4	37.91m	38.44m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4	38.18m	37.60m	~	0.400
_SubtreeDataDeserializationWithBufferSizes/128KB-4	38.18m	37.96m	~	1.000
_SubtreeDataDeserializationWithBufferSizes/512KB-4	38.31m	37.93m	~	0.100
_PooledVsNonPooled/Pooled-4	820.8n	820.4n	~	1.000
_PooledVsNonPooled/NonPooled-4	7.572µ	7.232µ	~	0.700
_MemoryFootprint/Current_512KB_32concurrent-4	8.908µ	8.324µ	~	0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4	10.92µ	10.55µ	~	0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4	10.50µ	10.10µ	~	0.100
StoreBlock_Sequential/BelowCSVHeight-4	248.2µ	249.5µ	~	1.000
StoreBlock_Sequential/AboveCSVHeight-4	250.8µ	249.6µ	~	0.700
GetUtxoHashes-4	258.8n	261.2n	~	1.000
GetUtxoHashes_ManyOutputs-4	42.72µ	42.69µ	~	1.000
_NewMetaDataFromBytes-4	238.8n	236.9n	~	0.700
_Bytes-4	623.2n	625.4n	~	1.000
_MetaBytes-4	563.3n	571.4n	~	0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-15 18:10 UTC

oskarszoon added 2 commits May 15, 2026 14:57

icellan requested review from Copilot and icellan May 15, 2026 13:04

Copilot started reviewing on behalf of icellan May 15, 2026 13:05 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

oskarszoon self-assigned this May 15, 2026

oskarszoon requested review from freemans13 and ordishs May 15, 2026 13:49

icellan approved these changes May 15, 2026

View reviewed changes

Merge branch 'main' into feat/legacy-perf

b4f1f6c

ordishs approved these changes May 17, 2026

View reviewed changes

oskarszoon merged commit 9799a1b into bsv-blockchain:main May 18, 2026
27 of 28 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(legacy): drop cumulative-stats replay; right-size getdata InvList#876

perf(legacy): drop cumulative-stats replay; right-size getdata InvList#876
oskarszoon merged 3 commits into
bsv-blockchain:mainfrom
oskarszoon:feat/legacy-perf

oskarszoon commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

sonarqubecloud Bot commented May 15, 2026

Uh oh!

github-actions Bot commented May 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

oskarszoon commented May 15, 2026

Summary

1. perf(aerospike): delta-record cumulative stats

2. perf(legacy): size getdata InvList to maxBlocks

Test plan

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Code Quality

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

sonarqubecloud Bot commented May 15, 2026

Quality Gate passed

Uh oh!

github-actions Bot commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Comparison Report

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

1. `perf(aerospike)`: delta-record cumulative stats

2. `perf(legacy)`: size getdata InvList to `maxBlocks`

github-actions Bot commented May 15, 2026 •

edited

Loading

github-actions Bot commented May 15, 2026 •

edited

Loading