Summary
model.Block.checkDuplicateTransactionsInSubtree walks all previous subtrees on every invocation, calling Subtree.Size() for each — an O(N²) prefix-sum where N is the subtree count. On a large block (450k subtrees) this saturates every core in atomic contention on Subtree.Size's internal RWMutex, with no useful work done.
Regression source
Introduced in commit 58a68d376 (PR #198, "4364: Test with blocks which have early duplicates which are fully sp…", 2025-11-26).
Before:
idx64, err = safeconversion.IntToUint64((subIdx * subtreeSize) + txIdx)
After:
// Calculate the base index for this subtree by summing sizes of all previous subtrees.
// We cannot use (subIdx * subtreeSize) because the last subtree may be smaller.
// Per Block.go:1192: "all subtrees need to be the same size as the first tree, except the last one"
baseIdx := 0
for i := 0; i < subIdx; i++ {
baseIdx += b.SubtreeSlices[i].Size()
}
...
idx64, err = safeconversion.IntToUint64(baseIdx + txIdx)
The justification comment is incorrect: baseIdx is the sum of sizes of subtrees before subIdx. Even when subIdx is the last (smaller) subtree, the subtrees being summed are all full-size by the same "first tree, except the last one" invariant. So subIdx * subtreeSize is correct for all subIdx.
Cost
- N = number of subtrees in the block. Total
Size() calls across all goroutines = N × (N−1) / 2.
- Each
Size() call (in go-subtree):
// github.com/bsv-blockchain/go-subtree.(*Subtree).Size()
func (st *Subtree) Size() int {
st.mu.RLock() // atomic.AddInt32(&readerCount, +1)
size := cap(st.Nodes)
st.mu.RUnlock() // atomic.AddInt32(&readerCount, -1)
return size
}
Two atomic Int32 ops per call → ~N² atomic-add bus transactions, all on the same cache lines, with checkDuplicateTransactionsConcurrency workers thrashing in parallel.
Production reproduction
bsva-ovh-teranode-ttn-eu-4 today, height 1622352 (txCount = 900,679, 450,340 subtrees, 177 MB).
30 s CPU profile of blockvalidation while processing this block:
99.94% model.(*Block).checkDuplicateTransactionsInSubtree
98.33% → go-subtree.(*Subtree).Size
67.29% → sync.(*RWMutex).RLock
96.56% → sync/atomic.(*Int32).Add (inline)
docker stats showed blockvalidation at 1161% CPU (11.6 cores) for ~6+ minutes, all wasted on contention. Effective work done in the dedup step during the sample was zero.
Proposed fix
func (b *Block) checkDuplicateTransactionsInSubtree(subtree *subtreepkg.Subtree, subIdx, subtreeSize int) (err error) {
var idx64 uint64
baseIdx := subIdx * subtreeSize // O(1); valid because all subtrees before subIdx are full-size per Block.go:1192 invariant
for txIdx := 0; txIdx < len(subtree.Nodes); txIdx++ {
...
}
return nil
}
Optional defensiveness: compute the prefix sums once in checkDuplicateTransactions (the caller) and pass baseIdx directly to each goroutine, so the function doesn't have to reason about the invariant. Either way the cost drops from O(N²) atomic-ops to O(1) per subtree.
Test plan
Related
Summary
model.Block.checkDuplicateTransactionsInSubtreewalks all previous subtrees on every invocation, callingSubtree.Size()for each — an O(N²) prefix-sum where N is the subtree count. On a large block (450k subtrees) this saturates every core in atomic contention onSubtree.Size's internal RWMutex, with no useful work done.Regression source
Introduced in commit
58a68d376(PR #198, "4364: Test with blocks which have early duplicates which are fully sp…", 2025-11-26).Before:
After:
The justification comment is incorrect:
baseIdxis the sum of sizes of subtrees beforesubIdx. Even whensubIdxis the last (smaller) subtree, the subtrees being summed are all full-size by the same "first tree, except the last one" invariant. SosubIdx * subtreeSizeis correct for allsubIdx.Cost
Size()calls across all goroutines = N × (N−1) / 2.Size()call (ingo-subtree):Two atomic Int32 ops per call → ~N² atomic-add bus transactions, all on the same cache lines, with
checkDuplicateTransactionsConcurrencyworkers thrashing in parallel.Production reproduction
bsva-ovh-teranode-ttn-eu-4today, height 1622352 (txCount = 900,679, 450,340 subtrees, 177 MB).30 s CPU profile of
blockvalidationwhile processing this block:docker statsshowed blockvalidation at 1161% CPU (11.6 cores) for ~6+ minutes, all wasted on contention. Effective work done in the dedup step during the sample was zero.Proposed fix
Optional defensiveness: compute the prefix sums once in
checkDuplicateTransactions(the caller) and passbaseIdxdirectly to each goroutine, so the function doesn't have to reason about the invariant. Either way the cost drops from O(N²) atomic-ops to O(1) per subtree.Test plan
Related