Skip to content

fix(model): O(1) base index in dedup, drop O(N^2) prefix-sum (#900)#910

Merged
ordishs merged 9 commits into
bsv-blockchain:mainfrom
ordishs:fix/900-dedup-prefix-sum-on2
May 20, 2026
Merged

fix(model): O(1) base index in dedup, drop O(N^2) prefix-sum (#900)#910
ordishs merged 9 commits into
bsv-blockchain:mainfrom
ordishs:fix/900-dedup-prefix-sum-on2

Conversation

@ordishs

@ordishs ordishs commented May 19, 2026

Copy link
Copy Markdown
Collaborator

Summary

Restores the O(1) baseIdx = subIdx * subtreeSize formula in model.Block.checkDuplicateTransactionsInSubtree. PR #198 had replaced it with a prefix-sum loop over Subtree.Size(), which scaled as O(N²) under the concurrent dedup workers because Subtree.Size() takes the subtree's RWMutex.

Fixes #900.

Why it was broken

PR #198's comment claimed the multiplication was wrong because "the last subtree may be smaller". But validateAndSetTransactionCount (Block.go:1293) enforces "all subtrees need to be the same size as the first tree, except the last one". The prefix-sum loop only sums subtrees strictly before subIdx, so every summed subtree is full-size — subIdx * subtreeSize is exact in every case.

Production impact

The cost is O(N²) atomic-add bus transactions on shared cache lines, with checkDuplicateTransactionsConcurrency workers thrashing in parallel. On bsva-ovh-teranode-ttn-eu-4 (height 1622352, 900,679 txs, 450,340 subtrees, 177 MB):

99.94%  model.(*Block).checkDuplicateTransactionsInSubtree
98.33%  → go-subtree.(*Subtree).Size
67.29%  → sync.(*RWMutex).RLock
96.56%  → sync/atomic.(*Int32).Add  (inline)

docker stats showed blockvalidation at 1161% CPU (11.6 cores) for 6+ minutes with effectively zero useful dedup work.

Changes

  • model/Block.go: replace the prefix-sum loop with baseIdx := subIdx * subtreeSize, updating the comment to explain the invariant being relied on and the contention pattern the loop produced.
  • model/Block_test.go: add TestBlock_CheckDuplicateTransactions_ManySubtrees — a regression guard that runs the concurrent dedup path on a 2,000-subtree block (last subtree smaller), asserts global-index correctness via probed nodes, and enforces a generous wall-time budget any reintroduced O(N²) would blow past.

Test plan

  • go test -race ./model/... — full model package green (17s, race detector on)
  • go vet ./... — no new warnings
  • golangci-lint run ./model/... — clean (pre-existing prealloc warnings only)
  • go build ./... — whole repo builds
  • New regression test passes in ~0.1s on the O(1) path; existing dedup tests from PR 4364: Test with blocks which have early duplicates which are fully sp… #198 (TestBlock_CheckDuplicateTransactionsInSubtree, TestCheckDuplicateTransactions, TestBlock_Valid_DupTxDetected_NilSubtreeStore) still catch duplicates correctly — the fix changes only the base-index computation, not the dedup logic itself.

Closes #900

ordishs added 8 commits May 13, 2026 15:38
…ckchain#900)

PR bsv-blockchain#198 replaced the O(1) baseIdx = subIdx*subtreeSize formula with
a prefix-sum loop summing Subtree.Size() across all previous subtrees.
Subtree.Size() takes the subtree's RWMutex internally, so under the
concurrent dedup workers the per-block cost scaled as O(N^2) atomic
ops on shared cache lines.

On a 450k-subtree block in production this pinned every available
core on lock contention with effectively zero useful dedup work being
done — blockvalidation sat at 1161% CPU for 6+ minutes.

The justification comment was incorrect: validateAndSetTransactionCount
enforces "all subtrees must be the same size as the first tree, except
the last one". Because the loop only sums subtrees strictly before
subIdx, every summed subtree is full-size, so subIdx*subtreeSize gives
the exact same baseIdx the loop would produce.

Adds a regression test that exercises the dedup path with 2000
subtrees (last one smaller) and asserts both index correctness at
scale and a generous wall-time budget any reintroduced O(N^2) would
trivially blow past.
@github-actions

github-actions Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

Current Review:

No issues found. The fix correctly addresses the O(N²) performance regression from PR #198.

Analysis:

  • Correctness: The O(1) formula baseIdx = subIdx * subtreeSize is mathematically correct. The invariant enforced at Block.go:1291-1293 guarantees all subtrees except the last have identical size, so summing previous subtrees is equivalent to multiplication.
  • Performance fix: Eliminates ~N²/2 calls to Subtree.Size(), each taking an RWMutex, which caused catastrophic contention (11.6 cores saturated for 6+ minutes on a 450k-subtree block in production).
  • Test coverage: The new regression test validates both correctness (via probe-point assertions) and performance (via 2s timeout that would catch any O(N²) reintroduction). Test exercises 2000 subtrees with a smaller last subtree, directly covering the edge case that motivated the incorrect prefix-sum in PR 4364: Test with blocks which have early duplicates which are fully sp… #198.
  • Documentation: Inline comments clearly explain the invariant being relied upon and the contention pattern that the prefix-sum loop produced.

@github-actions

github-actions Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-910 (d99ff3d)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.760µ 1.684µ ~ 0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 60.45n 60.91n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 59.41n 59.44n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 59.36n 59.33n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 36.02n 33.58n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 66.10n 59.02n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 184.0n 151.2n ~ 0.100
MiningCandidate_Stringify_Short-4 258.1n 249.5n ~ 0.100
MiningCandidate_Stringify_Long-4 1.773µ 1.743µ ~ 0.100
MiningSolution_Stringify-4 926.2n 910.1n ~ 0.100
BlockInfo_MarshalJSON-4 1.745µ 1.752µ ~ 0.600
NewFromBytes-4 127.2n 131.3n ~ 1.000
AddTxBatchColumnar_Validation-4 2.552µ 2.504µ ~ 0.100
OffsetValidationLoop-4 638.1n 635.6n ~ 0.800
Mine_EasyDifficulty-4 61.13µ 60.91µ ~ 1.000
Mine_WithAddress-4 7.664µ 7.011µ ~ 0.700
BlockAssembler_AddTx-4 0.02643n 0.02658n ~ 1.000
AddNode-4 10.54 10.70 ~ 0.400
AddNodeWithMap-4 11.33 11.54 ~ 0.200
DirectSubtreeAdd/4_per_subtree-4 57.17n 58.09n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 31.52n 28.52n ~ 0.100
DirectSubtreeAdd/256_per_subtree-4 30.19n 27.24n ~ 0.100
DirectSubtreeAdd/1024_per_subtree-4 28.90n 26.14n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 28.47n 25.73n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 280.8n 280.5n ~ 1.000
SubtreeProcessorAdd/64_per_subtree-4 277.3n 276.4n ~ 1.000
SubtreeProcessorAdd/256_per_subtree-4 280.2n 279.5n ~ 0.400
SubtreeProcessorAdd/1024_per_subtree-4 269.5n 268.3n ~ 1.000
SubtreeProcessorAdd/2048_per_subtree-4 272.0n 276.1n ~ 0.400
SubtreeProcessorRotate/4_per_subtree-4 274.6n 277.0n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 272.4n 274.1n ~ 0.500
SubtreeProcessorRotate/256_per_subtree-4 273.7n 277.4n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 273.4n 274.9n ~ 0.400
SubtreeNodeAddOnly/4_per_subtree-4 54.04n 53.91n ~ 0.400
SubtreeNodeAddOnly/64_per_subtree-4 34.22n 34.12n ~ 0.400
SubtreeNodeAddOnly/256_per_subtree-4 33.28n 33.19n ~ 0.500
SubtreeNodeAddOnly/1024_per_subtree-4 32.63n 32.54n ~ 0.700
SubtreeCreationOnly/4_per_subtree-4 113.8n 112.5n ~ 0.100
SubtreeCreationOnly/64_per_subtree-4 403.8n 391.1n ~ 0.100
SubtreeCreationOnly/256_per_subtree-4 1.368µ 1.318µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 4.478µ 4.386µ ~ 0.200
SubtreeCreationOnly/2048_per_subtree-4 8.322µ 7.895µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 275.8n 269.9n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 275.9n 271.8n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 2.219m 2.189m ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 5.468m 5.326m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 7.696m 7.163m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 10.76m 10.01m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 1.953m 1.978m ~ 0.200
SequentialGetAndSetIfNotExists/10k_nodes-4 4.576m 4.337m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 14.16m 12.19m ~ 0.100
SequentialGetAndSetIfNotExists/100k_nodes-4 25.38m 22.21m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.288m 2.232m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.395m 8.203m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.60m 12.84m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.968m 1.976m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 7.771m 7.488m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 40.05m 39.76m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.577µ 3.610µ ~ 0.400
DiskTxMap_SetIfNotExists_Parallel-4 3.422µ 3.303µ ~ 0.400
DiskTxMap_ExistenceOnly-4 331.4n 319.3n ~ 0.700
Queue-4 188.7n 187.0n ~ 0.400
AtomicPointer-4 3.875n 4.839n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 865.4µ 841.1µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 813.3µ 791.3µ ~ 0.200
ReorgOptimizations/AllMarkFalse/Old/10K-4 104.7µ 104.4µ ~ 0.700
ReorgOptimizations/AllMarkFalse/New/10K-4 97.54µ 62.69µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/10K-4 70.70µ 57.24µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/10K-4 11.88µ 11.55µ ~ 0.700
ReorgOptimizations/NodeFlags/Old/10K-4 5.585µ 4.916µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 2.169µ 1.700µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.726m 9.669m ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.19m 10.14m ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.104m 1.102m ~ 0.700
ReorgOptimizations/AllMarkFalse/New/100K-4 1164.8µ 685.8µ ~ 0.100
ReorgOptimizations/HashSlicePool/Old/100K-4 554.2µ 552.5µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/100K-4 339.5µ 338.6µ ~ 0.400
ReorgOptimizations/NodeFlags/Old/100K-4 51.16µ 50.83µ ~ 0.400
ReorgOptimizations/NodeFlags/New/100K-4 17.86µ 17.57µ ~ 1.000
TxMapSetIfNotExists-4 52.66n 52.30n ~ 0.700
TxMapSetIfNotExistsDuplicate-4 39.69n 40.42n ~ 0.100
ChannelSendReceive-4 574.3n 586.7n ~ 0.100
CalcBlockWork-4 496.3n 497.3n ~ 1.000
CalculateWork-4 693.6n 705.0n ~ 0.100
BuildBlockLocatorString_Helpers/Size_10-4 1.032µ 1.068µ ~ 0.200
BuildBlockLocatorString_Helpers/Size_100-4 9.893µ 10.003µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 109.9µ 114.8µ ~ 0.400
CatchupWithHeaderCache-4 103.8m 104.0m ~ 0.100
_prepareTxsPerLevel-4 424.0m 418.3m ~ 0.400
_prepareTxsPerLevelOrdered-4 3.755m 3.711m ~ 0.400
_prepareTxsPerLevel_Comparison/Original-4 413.9m 421.0m ~ 0.200
_prepareTxsPerLevel_Comparison/Optimized-4 3.643m 3.755m ~ 0.700
SubtreeSizes/10k_tx_4_per_subtree-4 1.367m 1.377m ~ 0.700
SubtreeSizes/10k_tx_16_per_subtree-4 327.1µ 319.9µ ~ 1.000
SubtreeSizes/10k_tx_64_per_subtree-4 76.75µ 76.63µ ~ 0.700
SubtreeSizes/10k_tx_256_per_subtree-4 19.15µ 19.21µ ~ 1.000
SubtreeSizes/10k_tx_512_per_subtree-4 9.423µ 9.511µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.770µ 4.781µ ~ 0.400
SubtreeSizes/10k_tx_2k_per_subtree-4 2.349µ 2.362µ ~ 0.700
BlockSizeScaling/10k_tx_64_per_subtree-4 74.77µ 74.45µ ~ 0.400
BlockSizeScaling/10k_tx_256_per_subtree-4 19.10µ 19.11µ ~ 1.000
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.692µ 4.777µ ~ 0.700
BlockSizeScaling/50k_tx_64_per_subtree-4 399.4µ 394.5µ ~ 0.100
BlockSizeScaling/50k_tx_256_per_subtree-4 94.38µ 93.54µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.42µ 23.27µ ~ 0.400
SubtreeAllocations/small_subtrees_exists_check-4 167.7µ 164.0µ ~ 0.200
SubtreeAllocations/small_subtrees_data_fetch-4 163.8µ 162.6µ ~ 0.100
SubtreeAllocations/small_subtrees_full_validation-4 326.4µ 328.6µ ~ 0.700
SubtreeAllocations/medium_subtrees_exists_check-4 9.691µ 9.381µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 9.496µ 9.477µ ~ 1.000
SubtreeAllocations/medium_subtrees_full_validation-4 18.96µ 18.99µ ~ 0.400
SubtreeAllocations/large_subtrees_exists_check-4 2.359µ 2.268µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.296µ 2.348µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 4.723µ 4.735µ ~ 0.200
_BufferPoolAllocation/16KB-4 3.834µ 3.724µ ~ 0.700
_BufferPoolAllocation/32KB-4 7.688µ 9.086µ ~ 0.700
_BufferPoolAllocation/64KB-4 15.66µ 15.65µ ~ 1.000
_BufferPoolAllocation/128KB-4 27.88µ 26.39µ ~ 0.200
_BufferPoolAllocation/512KB-4 114.8µ 100.5µ ~ 0.100
_BufferPoolConcurrent/32KB-4 18.35µ 18.18µ ~ 1.000
_BufferPoolConcurrent/64KB-4 28.75µ 28.78µ ~ 0.700
_BufferPoolConcurrent/512KB-4 140.4µ 140.8µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/16KB-4 603.7µ 630.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 603.0µ 639.6µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 603.6µ 599.0µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/128KB-4 582.3µ 588.4µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/512KB-4 584.3µ 585.1µ ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/16KB-4 35.51m 36.38m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/32KB-4 35.66m 36.02m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/64KB-4 35.27m 36.27m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4 35.30m 35.88m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.33m 35.99m ~ 0.100
_PooledVsNonPooled/Pooled-4 831.7n 831.0n ~ 0.400
_PooledVsNonPooled/NonPooled-4 7.405µ 7.411µ ~ 1.000
_MemoryFootprint/Current_512KB_32concurrent-4 6.670µ 6.968µ ~ 0.100
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.198µ 9.219µ ~ 1.000
_MemoryFootprint/Alternative_64KB_32concurrent-4 8.859µ 9.107µ ~ 0.200
StoreBlock_Sequential/BelowCSVHeight-4 328.2µ 328.9µ ~ 0.700
StoreBlock_Sequential/AboveCSVHeight-4 334.0µ 336.1µ ~ 1.000
GetUtxoHashes-4 275.3n 261.7n ~ 0.400
GetUtxoHashes_ManyOutputs-4 43.19µ 42.92µ ~ 0.200
_NewMetaDataFromBytes-4 237.5n 226.8n ~ 0.100
_Bytes-4 401.3n 397.8n ~ 1.000
_MetaBytes-4 339.8n 339.0n ~ 0.700

Threshold: >10% with p < 0.05 | Generated: 2026-05-19 23:26 UTC

@ordishs ordishs requested review from liam and oskarszoon May 19, 2026 23:08
The O(1) path completes the checkDuplicateTransactions call in well under
100 ms; 30s left no real fence against a re-introduced O(N^2) regression.
2s sits ~20x above measured headroom, comfortably accommodates slow CI
runners, and still trips immediately on the production-style scaling
issue from bsv-blockchain#900.
@sonarqubecloud

Copy link
Copy Markdown

@oskarszoon oskarszoon left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same root fix as my (now-closed) #904: baseIdx := subIdx * subtreeSize per the validateAndSetTransactionCount invariant. Closed mine in favour of this.

Comment explanation of the RWMutex/atomic contention mechanism reads better than mine. Perf-test budget at 2s is a meaningful improvement — tight enough to catch a regression in CI without the 30s budget I had. Probed-correctness via global indices including the smaller-last edge is clean.

Two edge cases worth carrying as a follow-up (not blocking):

  • subtreeSize=1 degenerate path (verifies the formula still holds when there's no "predecessor full-size" assumption to lean on).
  • Explicit cross-subtree duplicate detection (a duplicate where one occurrence is at subIdx=0 and another at subIdx>0 — exercises the global-index arithmetic in the dedup logic itself, not just at the baseIdx boundary).

Either backport onto this branch or open as a separate PR.

@ordishs ordishs merged commit 2a0a5fe into bsv-blockchain:main May 20, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

checkDuplicateTransactionsInSubtree O(N²) prefix sum pins all cores on big blocks (regression from #198)

3 participants