perf(subtreeprocessor): optimize reorg with parallel bulk operations#526
Conversation
…InpointsMap with parallel bulk operations - Deleted hashSlicePool to streamline memory management. - Introduced parallel bulk operations in SplitTxInpointsMap for improved performance during transaction handling. - Added comprehensive tests for SplitTxInpointsMap functionalities, ensuring correctness under concurrent access. - Removed outdated benchmark tests and added new benchmarks for reorg operations.
|
🤖 Claude Code Review Status: Complete All previously reported issues have been verified as fixed. The PR demonstrates strong engineering discipline: Performance Optimizations (verified):
Behavioral Change (well-tested):
This is thoroughly covered by TestCheckMarkNotOnLongestChain, TestCheckMarkNotOnLongestChain_TxNotFound, and TestMarkNotOnLongestChain_InvalidGating. Code Quality:
History:
No new issues identified. |
…ocessing - Updated the calculation of numWorkers in processRemainderTxHashes to cap the maximum number of workers at 16, enhancing resource management during parallel processing.
…and improve error handling - Updated the moveBackBlock method to return fewer values, enhancing clarity and reducing complexity. - Adjusted related benchmark and test cases to reflect the new method signature. - Improved error handling for cases where a block is not provided, ensuring consistent error messaging.
…eorg-performance-optimisation # Conflicts: # services/blockassembly/subtreeprocessor/SubtreeProcessor.go
There was a problem hiding this comment.
Pull request overview
This PR focuses on improving subtreeprocessor reorg throughput by reducing lock contention and introducing parallel/bulk-style processing paths, alongside adding benchmarks to quantify the impact.
Changes:
- Reworks
SplitTxInpointsMapto useswiss.Mapwith per-bucket mutexes and addsParallelBulkSetIfNotExists. - Adds bulk/parallel subtree rebuild logic for reorg paths and runs mark-on-longest-chain operations concurrently.
- Adds production-scale reorg/marking benchmarks and removes older allocation-focused benchmark/pooling code.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
services/blockassembly/subtreeprocessor/SubtreeProcessor.go |
Adds bulk subtree construction + concurrent marking; refactors reorg/moveBack remainder processing to use bulk paths. |
services/blockassembly/subtreeprocessor/map.go |
Updates SplitTxInpointsMap implementation and introduces ParallelBulkSetIfNotExists. |
services/blockassembly/subtreeprocessor/map_test.go |
Adds unit tests for the new SplitTxInpointsMap behavior and bulk insert correctness/concurrency. |
services/blockassembly/subtreeprocessor/reorg_benchmark_test.go |
Adds large-scale reorg benchmarks and baseline test runner. |
stores/utxo/aerospike/longest_chain_production_bench_test.go |
Adds benchmark comparing sequential vs concurrent marking on disjoint sets. |
services/blockassembly/subtreeprocessor/reorg_alloc_benchmark_test.go |
Removes prior allocation-focused benchmark suite. |
services/blockassembly/subtreeprocessor/hash_slice_pool.go |
Removes hash slice pooling implementation. |
services/blockassembly/subtreeprocessor/SubtreeProcessor_test.go |
Updates tests to call moveBackBlock directly and uses a blockchain mock where now required. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix logic bug where continue only broke inner loop, causing CheckBlockIsInCurrentChain to always be called even when tx was found in recent blocks. Use flag variable instead. - Thread ctx through bulkBuildSubtrees to propagate cancellation and tracing spans via errgroup.WithContext(ctx). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use stp.newSubtree() instead of subtreepkg.NewTreeByLeafCount() in bulkBuildSubtrees to respect mmap configuration - Add SubtreeIndex bookkeeping for diskTxMap in bulkBuildSubtrees so removeTxFromSubtrees can do O(1) lookup instead of linear scan - Reset currentSubtree when completed subtree moves to chainedSubtrees to prevent dual-reference corruption - Add nil guard for previousCurrentSubtree in moveBackBlockBulkBuild - Fix error wrapping that included potentially nil err variable - Add precondition checks in ParallelBulkSetIfNotExists - Fix benchmark goroutine leak with done channel cleanup - Replace var _ = bt.NewTx() with var _ *bt.Tx Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Add DeletedTxs and OnStorageComplete to finalizeBulkBuildSubtrees NewSubtreeRequest to prevent race conditions with tx inpoints and cleanup leaked deletedTxs entries - Replace goroutine-per-subtree with batch send/wait pattern using buffered errChans to bound concurrency - Use explicit per-iteration copy for hash pointer in checkMarkNotOnLongestChain for clarity Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-05 16:34 UTC |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } | ||
| return true | ||
| }) | ||
| rawLosingTxHashes = append(rawLosingTxHashes, losingTxHashesMap.Keys()...) |
There was a problem hiding this comment.
In reorgBlocks, building rawLosingTxHashes via losingTxHashesMap.Keys() reintroduces a full-slice allocation/copy of all losing hashes. Since txmap.TxMap already supports Iter(...) (used previously), this can be avoided by iterating the map directly and building the deduped/filtered losing set without the intermediate []chainhash.Hash.
This is in the critical reorg hot path and appears to negate the intended “avoid Keys()” optimization described in the PR summary.
| rawLosingTxHashes = append(rawLosingTxHashes, losingTxHashesMap.Keys()...) | |
| losingTxHashesMap.Iter(func(hash chainhash.Hash, _ struct{}) bool { | |
| rawLosingTxHashes = append(rawLosingTxHashes, hash) | |
| return true | |
| }) |
There was a problem hiding this comment.
Good catch — switched from losingTxHashesMap.Keys() to losingTxHashesMap.Iter() to avoid the intermediate slice allocation, consistent with how transactionMap.Iter() is used just above for the winning tx set.
| // Pre-allocate result arrays indexed by position | ||
| existedInTxMap := make([]bool, n) // true if SetIfExists found the key | ||
| existsInLosingMap := make([]bool, n) // true if in losingTxHashesMap | ||
| isRemoveMap := make([]bool, n) // true if in removeMap (to delete) |
There was a problem hiding this comment.
processRemainderTxHashes switched from a packed []byte bitset to three separate []bool arrays (existedInTxMap, existsInLosingMap, isRemoveMap). For large subtrees this is a 3× memory increase (plus GC pressure) in a very hot reorg path.
If the previous bitset approach was correct, consider restoring it (or another packed representation) to keep per-node state to ~1 byte rather than ~3 bytes.
There was a problem hiding this comment.
I looked into this and the 3x memory difference (3 bytes vs 1 byte per node for []bool vs packed bitset) is negligible relative to the node data itself (each subtreepkg.Node is ~72 bytes with hash + fee + size). At 100K nodes, it's ~300KB vs ~100KB for the flags, while the node data is ~7MB. The clarity benefit of separate named bools (existedInTxMap, existsInLosingMap, isRemoveMap) outweighs the marginal memory saving in this case. Happy to revisit if profiling shows this becoming a bottleneck at larger scales.
| // TestReorgBenchmarkBaseline runs the full reorg at multiple scales and prints a summary table. | ||
| // This is a test (not benchmark) for easy "go test -run" invocation with human-readable output. | ||
| func TestReorgBenchmarkBaseline(t *testing.T) { | ||
| if testing.Short() { | ||
| t.Skip("skipping baseline benchmark in short mode") | ||
| } | ||
|
|
There was a problem hiding this comment.
TestReorgBenchmarkBaseline is a regular test (not a benchmark) and will run in normal go test runs unless -short is enabled. Given it performs multi-scale reorg runs up to 100K transactions, this is likely to make CI/unit test runs very slow or flaky.
Consider moving this into a Benchmark... (so it only runs under -bench), gating it behind a build tag (e.g., //go:build longtest), or requiring an explicit env var to run.
There was a problem hiding this comment.
The test already has a testing.Short() guard that skips it in short mode (line 758). Our CI uses make test which passes -short, so this won't run in normal CI. For local go test without -short, the 100K scale takes ~15s total which is reasonable for a correctness baseline. I think the current gating is sufficient, but happy to add a build tag if it causes issues in practice.
- Replace losingTxHashesMap.Keys() with Iter() to avoid intermediate slice allocation in the reorg hot path Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| h := hash // per-iteration copy for pointer safety | ||
| g.Go(func() error { | ||
| txMeta, err := stp.utxoStore.Get(gCtx, &h, fields.BlockIDs) | ||
| if err != nil { | ||
| return errors.NewProcessingError("[reorgBlocks] error getting transaction from utxo store", err) | ||
| } | ||
|
|
||
| txMetas[idx] = txMeta |
There was a problem hiding this comment.
The goroutine closure inside the for idx, hash := range markNotOnLongestChain loop captures the loop variable idx. Because range variables are reused, concurrent goroutines can write to the wrong index in txMetas, leading to nil entries or out-of-bounds behavior and incorrect marking decisions. Capture idx per iteration (e.g., idx := idx or pass it as a parameter to the closure) before assigning txMetas[idx] = txMeta.
| h := hash // per-iteration copy for pointer safety | |
| g.Go(func() error { | |
| txMeta, err := stp.utxoStore.Get(gCtx, &h, fields.BlockIDs) | |
| if err != nil { | |
| return errors.NewProcessingError("[reorgBlocks] error getting transaction from utxo store", err) | |
| } | |
| txMetas[idx] = txMeta | |
| h := hash // per-iteration copy for pointer safety | |
| idxCopy := idx // per-iteration copy to avoid closure capturing loop variable | |
| g.Go(func() error { | |
| txMeta, err := stp.utxoStore.Get(gCtx, &h, fields.BlockIDs) | |
| if err != nil { | |
| return errors.NewProcessingError("[reorgBlocks] error getting transaction from utxo store", err) | |
| } | |
| txMetas[idxCopy] = txMeta |
There was a problem hiding this comment.
No longer applies on current HEAD (Go 1.26 per go.mod). Since Go 1.22 loop variables are per-iteration scoped, so the closure capturing idx is race-free, and each goroutine writes a distinct txMetas[idx]. The code also keeps the defensive h := hash copy for the &h pointer. No shared-index write race exists.
| for i := 0; i < count; i++ { | ||
| if isFirst && i == 0 { | ||
| // Skip first slot for coinbase in first subtree | ||
| continue | ||
| } | ||
| err = subtree.AddSubtreeNode(subtreepkg.Node{ | ||
| Hash: txHashes[idx+i], |
There was a problem hiding this comment.
In the first subtree case, the loop for i := 0; i < count; i++ { if isFirst && i == 0 { continue } ... txHashes[idx+i] } skips txHashes[idx] entirely, but idx is later advanced by subtree.Length()-1. This drops the first tx hash and then causes the last tx of the first subtree to be duplicated at the start of the next subtree, skewing the benchmark scenario (and potentially introducing duplicates). Adjust the indexing/count logic for the coinbase slot (e.g., iterate over tx hashes separately from leaf positions, or set count to min(remaining, subtreeSize-1) for the first subtree).
There was a problem hiding this comment.
Fixed in commit dc3e520. buildAndStoreSubtrees now tracks a separate consumed counter and indexes txHashes[idx+consumed], and advances idx by consumed — so txHashes[idx] is no longer dropped and no tx is duplicated across subtree boundaries.
| for i := 0; i < count; i++ { | ||
| if isFirst && i == 0 { | ||
| continue | ||
| } | ||
| err = subtree.AddSubtreeNode(subtreepkg.Node{ | ||
| Hash: txHashes[idx+i], | ||
| Fee: 100, |
There was a problem hiding this comment.
Same off-by-one issue as buildAndStoreSubtrees: when isFirst && i == 0 is skipped, txHashes[idx] is never added but idx is only advanced by subtree.Length()-1, causing the last tx from the first subtree to be reused as the first tx in the next subtree. This makes the benchmark data contain unintended gaps/duplicates. Fix by aligning tx-hash indexing with the coinbase leaf handling (e.g., reduce count for the first subtree or increment a separate tx index when adding nodes).
There was a problem hiding this comment.
Fixed in commit dc3e520, same approach as buildAndStoreSubtrees: buildAndStoreSubtreesT uses a consumed counter for txHashes indexing and advances idx by consumed, eliminating the dropped-first-tx / duplicated-last-tx off-by-one.
…eorg-performance-optimisation
…eorg-performance-optimisation
…erge The merge from upstream/main changed the markNotOnLongestChain function signature to include moveBackBlocks and moveForwardBlocks parameters. The reset path passes nil for both since it only needs to unmark assembly tx hashes directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ck test TestHandleReorgWithInvalidBlock_Integration (added in bsv-blockchain#545) fails on this branch and on main because the test setup does not run the BlockValidation service. InvalidateBlock sets mined_set=false and emits BlockMinedUnset, which BlockValidation normally consumes to re-set mined_set=true. With no BlockValidation running, BA's reset() blocks in waitForBlockMinedSet until its retry budget exhausts — past the test's 15s Eventually window. Capture the hashes returned by InvalidateBlock and call SetBlockMinedSet on each to simulate BlockValidation's async reaction. This change duplicates the fix on main in bsv-blockchain#714 to unblock CI on this PR before bsv-blockchain#714 is merged and rolled forward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d Idxs field go-subtree v1.4.2 removed the public TxInpoints.Idxs [][]uint32 field in favour of a packed layout; external callers must construct via NewTxInpointsFromPacked. The reorg benchmark/test was still using the struct-literal Idxs field, failing golangci-lint typecheck.
… bulk build Review follow-ups on the parallel reorg optimization: - Extract partitionLongestChainMarks so the concurrent MarkTransactionsOnLongestChain(true/false) calls operate on provably disjoint, de-duplicated sets and can never race on the same UTXO record. Overlapping hashes resolve to mark(false), preserving the prior sequential last-write-wins semantics. - Bound bulkBuildSubtrees goroutine fan-out at GOMAXPROCS. - Remove dead finalizeBulkBuildSubtrees (never called). - Tests: partition disjointness/dedup (incl. heavy-overlap stress) and checkMarkNotOnLongestChain branch coverage + invalidation gating. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
[Critical] Race condition found at services/blockassembly/subtreeprocessor/SubtreeProcessor.go:3511 The loop variable idx is captured by reference in the goroutine closure. All goroutines will see the final value of idx after the loop completes, causing multiple goroutines to write to the same txMetas[idx] position and leaving other positions uninitialized. Fix: Add idx := idx inside the loop body before g.Go() to capture the loop variable, just like the existing h := hash capture on line 3504. |
|
Re: the flagged [Critical] race at The module is on Verified empirically: |
Follow-up commit
|
| Scale | Total | Alloc |
|---|---|---|
| 1K | 11 ms | 155 MB |
| 10K | 23 ms | 168 MB |
| 50K | 37 ms | 213 MB |
| 100K | 47 ms | 263 MB |
BenchmarkFullReorg 200k-tx reorg: 41.8 ms / 275 MB / 594k allocs. For contrast, the per-node addNode path this replaces costs 24.7 ms for 100k nodes alone (BenchmarkAddNodeSequential) — so the parallel bulk-build win is fully intact.
oskarszoon
left a comment
There was a problem hiding this comment.
First full review across consensus, Go, perf, and QA (my earlier note was just the rebase). The parallelization is consensus-faithful — reorg tx-set/ordering/conflicting-cascade/coinbase semantics are preserved vs the serial path, concurrent marks are race-safe by disjoint-set construction (partitionLongestChainMarks, stress-tested), and the full package passes -race. The perf win is real (100K-block reorg ~62ms, normal tx-ingress path untouched). On the deleted reverse-cascade dequeue: it's redundant, not a bug — stp.Remove routes those hashes into removeMap, which persists across the reorg and the dequeue drops them; but the comment at :3066-3068 still describes the deleted call and is now stale.
Two things before merge:
- P1 —
reorgBlockscan hang on cancellation (SubtreeProcessor.go:3360-3372). ThenewSubtreeChansend loop has noselect/ctx; if the consumer is cancelled while the loop is blocked on a full buffer during a >1000-subtree reorg, the reorg goroutine blocks forever. Your other sends (:669/731/874) all useselect { case chan <- …; case <-processorCtx.Done(): }— match that here. - P1 — reorg tests don't pin the correctness this PR claims.
reset_reorg_test.go:1074/1247assert only the chain-tip header; the tx-set checks aret.Logf("✅/❌")that never fail (a wrongbulkBuildSubtreesset passes green). EveryReorg()test uses emptySubtrees, so the newmoveBackBlockBulkBuilddeserialization path is never exercised by a correctness test, and there's no serial-vs-parallel equivalence test nor a double-spend-across-reorg case. Promote thet.Logfassertions torequire, add a reorg through real subtrees, and a competing-spend-across-switch case.
Cleanup: int16(baseIdx+i+1) at :2263 overflows past 32767 subtrees (use int32); delete the now-dead addMoveBackBlockNodesToSubtrees/flatIndexToSubtreeNode/etc.; fix the stale :3066-3068 comment; revert winningTxSet/losingTxSet to map[…]struct{}. Details in the review notes.
…, cleanup) Addresses oskarszoon's CHANGES_REQUESTED review on PR bsv-blockchain#526: - P1 cancellation hang: the reorg newSubtreeChan batch send/wait loops had no ctx awareness; a full buffer during a >1000-subtree reorg could block the reorg goroutine forever after the consumer was cancelled. Both the send and the response-wait are now wrapped in ctx-aware selects (matching the :669/731/874 pattern), preserving the send/receive overlap. - P1 test correctness: the Reorg tests asserted only via non-failing t.Logf("✅/❌") guarded by `if err == nil`, so a failing reorg (or a wrong bulkBuildSubtrees set) passed green. Promoted those three tests to require.NoError + require tx-set assertions (and fixed their headers so the reorg actually succeeds), and added TestSubtreeProcessor_ReorgThroughReal- Subtrees with two cases that exercise the moveBackBlockBulkBuild deserialization path through REAL serialized subtrees: tx recovery, and a competing-spend-across-switch (double-spend) case. - Cleanup: reverted winningTxSet/losingTxSet to map[chainhash.Hash]struct{}; deleted the now-dead addMoveBackBlockNodesToSubtrees / flatIndexToSubtreeNode (and their test); and fixed the stale reverseCascadedConflictingSet comment that described a deleted final dequeueDuringBlockMovement call. The int16 SubtreeIndex cast is left unchanged: SubtreeIndex is int16 in the external go-subtree TxInpoints type, so the cast matches the storage width and the three pre-existing sibling sites; a true int32 widening requires an upstream go-subtree change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough review, @oskarszoon. Addressed in bfd429f: P1 — cancellation hang ( P1 — reorg tests don't pin correctness:
Cleanup:
On the Verification: full |
…eorg-performance-optimisation
| if len(inpoints) != n { | ||
| panic("SplitTxInpointsMap.ParallelBulkSetIfNotExists: len(inpoints) must equal len(hashes)") | ||
| } | ||
| if wasSet == nil || len(wasSet) != n { | ||
| panic("SplitTxInpointsMap.ParallelBulkSetIfNotExists: len(wasSet) must equal len(hashes)") | ||
| } | ||
|
|
There was a problem hiding this comment.
Fixed in the upcoming commit: ParallelBulkSetIfNotExists now explicitly sets wasSet[idx] = false on the existing-key path, so the slice no longer carries stale true values if a caller reuses it. The sole caller allocates a fresh slice each call, but this makes the documented contract hold unconditionally.
| for i := 0; i < count; i++ { | ||
| if isFirst && i == 0 { | ||
| // Skip first slot for coinbase in first subtree | ||
| continue | ||
| } | ||
| err = subtree.AddSubtreeNode(subtreepkg.Node{ | ||
| Hash: txHashes[idx+i], | ||
| Fee: 100, | ||
| SizeInBytes: 250, | ||
| }) | ||
| require.NoError(b, err) | ||
|
|
||
| if subtree.IsComplete() { | ||
| break | ||
| } | ||
| } |
There was a problem hiding this comment.
Fixed. buildAndStoreSubtrees now tracks a separate consumed counter and indexes txHashes[idx+consumed] instead of txHashes[idx+i], so the coinbase slot in the first subtree no longer drops txHashes[idx]. idx now advances by the number of tx hashes actually consumed.
| for i := 0; i < count; i++ { | ||
| if isFirst && i == 0 { | ||
| continue | ||
| } | ||
| err = subtree.AddSubtreeNode(subtreepkg.Node{ | ||
| Hash: txHashes[idx+i], | ||
| Fee: 100, | ||
| SizeInBytes: 250, | ||
| }) | ||
| require.NoError(t, err) | ||
|
|
||
| if subtree.IsComplete() { | ||
| break | ||
| } | ||
| } |
There was a problem hiding this comment.
Fixed with the same approach as buildAndStoreSubtrees: buildAndStoreSubtreesT now uses a consumed counter for txHashes indexing so no tx hash is skipped, keeping TestReorgBenchmarkBaseline operating on the configured tx count (verified passing).
| for i := 0; i < b.N; i++ { | ||
| start := time.Now() | ||
|
|
||
| err := store.MarkTransactionsOnLongestChain(ctx, markTrueHashes, true) | ||
| require.NoError(b, err) | ||
|
|
||
| err = store.MarkTransactionsOnLongestChain(ctx, markFalseHashes, false) | ||
| require.NoError(b, err) | ||
|
|
||
| elapsed := time.Since(start) | ||
| b.Logf(" Iter %d: %v (%.0f tx/sec total)", i+1, elapsed, | ||
| float64(tc.count*2)/elapsed.Seconds()) | ||
| } |
There was a problem hiding this comment.
Fixed. Removed the per-iteration time.Now/time.Since and b.Logf from the timed b.N loop in the sequential benchmark; throughput is reported from b.Elapsed()/ReportMetric after StopTimer, which is the authoritative measurement.
| for i := 0; i < b.N; i++ { | ||
| start := time.Now() | ||
|
|
||
| g, gCtx := errgroup.WithContext(ctx) | ||
| g.Go(func() error { | ||
| return store.MarkTransactionsOnLongestChain(gCtx, markTrueHashes, true) | ||
| }) | ||
| g.Go(func() error { | ||
| return store.MarkTransactionsOnLongestChain(gCtx, markFalseHashes, false) | ||
| }) | ||
| err := g.Wait() | ||
| require.NoError(b, err) | ||
|
|
||
| elapsed := time.Since(start) | ||
| b.Logf(" Iter %d: %v (%.0f tx/sec total)", i+1, elapsed, | ||
| float64(tc.count*2)/elapsed.Seconds()) | ||
| } |
There was a problem hiding this comment.
Fixed in the concurrent loop the same way (and also in the third b.N loop in BenchmarkMarkTransactionsOnLongestChain which had the identical pattern). The timed sections no longer include logging or time bookkeeping.
- ParallelBulkSetIfNotExists: explicitly set wasSet[idx]=false on existing keys so the result slice honors its contract under caller reuse - reorg benchmark helpers: fix off-by-one that dropped txHashes[idx] in the first subtree (coinbase slot); track consumed tx count explicitly - aerospike longest-chain bench: remove per-iteration time.Now/time.Since and b.Logf from the timed b.N loops so logging overhead is not measured Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| var wg sync.WaitGroup | ||
| for bIdx := uint16(0); bIdx < s.nrOfBuckets; bIdx++ { | ||
| indices := bucketIndices[bIdx] | ||
| if len(indices) == 0 { | ||
| continue | ||
| } | ||
| wg.Add(1) | ||
| go func(b *txInpointsBucket, indices []int) { | ||
| defer wg.Done() | ||
| b.mu.Lock() | ||
| for _, idx := range indices { | ||
| if !b.m.Has(hashes[idx]) { | ||
| b.m.Put(hashes[idx], inpoints[idx]) | ||
| wasSet[idx] = true | ||
| } else { | ||
| // Explicitly record the existing-key result so the contract | ||
| // (wasSet[i] reflects the outcome for hashes[i]) holds even if | ||
| // the caller reuses a slice with stale true values. | ||
| wasSet[idx] = false | ||
| } | ||
| } | ||
| b.mu.Unlock() | ||
| }(&s.buckets[bIdx], indices) | ||
| } | ||
| wg.Wait() |
There was a problem hiding this comment.
Good catch — fixed. ParallelBulkSetIfNotExists now uses a bounded pool of stride workers (numWorkers = min(GOMAXPROCS, nrOfBuckets)) instead of one goroutine per non-empty bucket, matching the existing SplitSwissMap.Clear / bulkBuildSubtrees pattern. Each worker strides over buckets, so a 100K-entry insert no longer spawns ~16K goroutines. Verified passing under -race.
| func (s *SplitTxInpointsMap) Clear() { | ||
| for _, syncedMap := range s.m { | ||
| syncedMap.Clear() | ||
| for i := uint16(0); i < s.nrOfBuckets; i++ { | ||
| b := &s.buckets[i] | ||
| b.mu.Lock() | ||
| b.m = swiss.NewMap[chainhash.Hash, *subtreepkg.TxInpoints](64) | ||
| b.mu.Unlock() | ||
| } |
There was a problem hiding this comment.
Fixed. SplitTxInpointsMap.Clear now calls b.m.Clear() in place (swiss.Map.Clear zeroes the control/group arrays and retains capacity) instead of allocating a fresh swiss.NewMap(64), matching SplitSwissMap.Clear. This is on the currentTxMap double-buffer reuse path, so it avoids repeated allocation + rehashing on refill. Double-buffer and map tests pass under -race.
- ParallelBulkSetIfNotExists: bound fan-out with a stride-worker pool (min(GOMAXPROCS, nrOfBuckets)) instead of one goroutine per non-empty bucket, avoiding ~16K goroutines on large bulk inserts - SplitTxInpointsMap.Clear: clear each bucket in place via swiss.Map.Clear to retain capacity on the currentTxMap double-buffer reuse path instead of reallocating a fresh map per Clear Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| func NewSplitTxInpointsMap(nrOfBuckets uint16) *SplitTxInpointsMap { | ||
| m := make(map[uint16]*txmap.SyncedMap[chainhash.Hash, *subtreepkg.TxInpoints], nrOfBuckets) | ||
| buckets := make([]txInpointsBucket, nrOfBuckets) | ||
| for i := uint16(0); i < nrOfBuckets; i++ { | ||
| m[i] = txmap.NewSyncedMap[chainhash.Hash, *subtreepkg.TxInpoints]() | ||
| buckets[i].m = swiss.NewMap[chainhash.Hash, *subtreepkg.TxInpoints](64) | ||
| } |
There was a problem hiding this comment.
This is a false positive. Go's make accepts any integer type for the length argument (per the spec, the size arguments must be of integer type), so make([]txInpointsBucket, nrOfBuckets) with a uint16 compiles fine — and this is pre-existing code (introduced in #385, unchanged by this PR). The package builds and go vet passes. No change needed.
| // Phase 1: Group indices by bucket (O(N), no locks) | ||
| bucketIndices := make([][]int, s.nrOfBuckets) | ||
| for i := 0; i < n; i++ { | ||
| bucket := txmap.Bytes2Uint16Buckets(hashes[i], s.nrOfBuckets) | ||
| bucketIndices[bucket] = append(bucketIndices[bucket], i) |
There was a problem hiding this comment.
Same false positive: make([][]int, s.nrOfBuckets) with a uint16 length is valid Go and compiles (the make size argument may be any integer type). This line is pre-existing in ParallelBulkSetIfNotExists and was not changed by this PR; the package builds and vets cleanly. No change needed.
| txMetas := make([]*meta.Data, len(markNotOnLongestChain)) | ||
| g, gCtx := errgroup.WithContext(ctx) | ||
| g.SetLimit(max(stp.settings.UtxoStore.MaxMinedRoutines, stp.settings.UtxoStore.GetBatcherSize*2)) |
There was a problem hiding this comment.
Addressed. Strictly, errgroup.SetLimit(0) does not panic (n==0 creates a zero-capacity semaphore; only n<0 disables the limit) — the failure mode would be a g.Go deadlock, and only if both MaxMinedRoutines and GetBatcherSize were 0 (defaults are 128, so max(...) is always >= 1). To match the convention used by every other SetLimit call in this file and surface the misconfiguration loudly, I switched this to util.SafeSetLimit, which panics at setup if the computed limit is 0 instead of deadlocking later.
- checkMarkNotOnLongestChain: use util.SafeSetLimit for the errgroup concurrency cap (matching every other SetLimit call in the file) so a 0 limit panics loudly at setup instead of risking a g.Go deadlock The two reported make([]T, uint16) 'won't compile' comments are false positives — make accepts any integer-type length and those lines are pre-existing; no change made there. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
oskarszoon
left a comment
There was a problem hiding this comment.
Re-review @ d6e16b6 — approving; both blocking asks resolved.
- Cancellation hang fixed: both the send and response-wait loops in
reorgBlocks(SubtreeProcessor.go:3358-3391) now have ctx-awareselect/ctx.Done()guards on the sameprocessorCtxas the canonical sends. - Reorg tests pin correctness: the soft
t.Logf("✅/❌")checks are promoted to hardrequire, the silent else-branch error is fixed, andTestSubtreeProcessor_ReorgThroughRealSubtreesexercises the realmoveBackBlockBulkBuilddeserialization path including a competing-spend-across-switch (require.False(doubleSpend mined)/require.True(backOnly recovered)). Full package-racegreen, 246 tests.
The parallelization checks out: consensus-faithful (tx-set/ordering/cascade preserved, double-spend invariant enforced, merkle untouched), the concurrent longest-chain marks are provably disjoint at slice + UTXO-record level (partitionLongestChainMarks), and the perf win is real (~41ms for a 100K reorg vs ~25ms for the old per-node build alone). Copilot's ":3511 loopvar race" is a false positive (go 1.26 per-iteration scoping + explicit copy).
One non-blocking follow-up: the int16(...) casts at SubtreeProcessor.go:2198,2263 have no overflow guard. Not a consensus or corruption risk — SubtreeIndex is runtime-only (never serialized into any block/merkle/UTXO commitment) and the consumer re-validates with a linear-scan fallback, so a wrap only degrades O(1)→O(N) — and it's unreachable at the default 1M-item subtree size (~34B txs needed in one moved-back block). A cheap > math.MaxInt16 guard + warn (or safeconversion.IntToInt16) would harden a misconfigured tiny-subtree node, but it's not a blocker.
…y CI job The subtreeprocessor-heavy benchmark-compare job timed out (>30m). Root cause was test-only, not production code: the new BenchmarkAddNodeSequential matched the job's unanchored `BenchmarkAddNode` filter, ignored `-short`, and called runtime.GC() every b.N iteration. With tiny per-iteration timed work, b.N ramped into the thousands, so thousands of full GCs ran (~4min on a 16-core machine, far worse on the CI runner). The production map rewrite is unaffected and is actually slightly faster on BenchmarkAddNode. - Gate all 6 heavy reorg benchmarks behind testing.Short() (CI passes -short), so they skip in CI while remaining runnable locally. - Remove the per-iteration runtime.GC() from the 5 throughput benchmarks; it inflated wall-clock without affecting B/op accuracy. Kept the GC in BenchmarkReorgMemoryProfile where it establishes the MemStats baseline. - Anchor the subtreeprocessor-heavy bench_filter to the three intended benchmarks so future additions can't be swept in by substring matching. Heavy filter under -short now completes in ~1m (was projected >25m); all six benchmarks still run correctly without -short. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|



Summary
Profiling reorg operations at 100K transaction scale revealed three main bottlenecks: per-node lock acquisition in
SplitTxInpointsMap(O(N) sequential lock/unlock cycles), sequential UTXO marking wheremark(true)andmark(false)operate on disjoint hash sets but ran one after the other, and sequential subtree construction throughaddNode()calls.This PR addresses each bottleneck:
Replace
txmap.SyncedMapwithswiss.Map+sync.Mutexper bucket inSplitTxInpointsMap, matching the proven pattern already used bySplitSwissMap,SplitSyncedParentMap, and the block persister UTXO maps. Contiguous[]txInpointsBucketslice instead ofmap[uint16]*for cache-friendly access.Add
ParallelBulkSetIfNotExiststhat groups entries by bucket, then processes all non-empty buckets in parallel with a single lock acquisition per bucket. Reduces lock overhead from O(N) sequential to O(N/buckets) parallel. Wired intomoveBackBlockBulkBuildandprocessRemainderTxHashes.Introduce
bulkBuildSubtreesfor parallel subtree construction, replacing per-nodeaddNode()calls (which each acquire a mutex and checkIsComplete). Full subtrees are built concurrently viaerrgroup.Run
MarkTransactionsOnLongestChain(true)andMarkTransactionsOnLongestChain(false)concurrently viaerrgroup— they operate on disjoint hash sets.Parallel subtree announcements: batch send to
newSubtreeChan, then batch wait for responses, overlapping send/receive.Remove
hash_slice_pool.go—sync.Pooloverhead exceeded the allocation savings at reorg scale.Benchmark Results
Reorg performance (mock UTXO store, no race detector):
Behavioral change (not purely perf)
Beyond the performance work above, this PR also introduces new invalidation-gating logic that does not exist on
main. When the reorg is a single-block invalidation (len(moveBackBlocks) == 1 && len(moveForwardBlocks) == 0) and the moved-back block's header is markedInvalid,markNotOnLongestChainno longer unconditionally marks every passed tx as not-on-longest-chain. Instead, the newcheckMarkNotOnLongestChainfilters the set:BlockIDis the invalid block → marked not-on-longest-chain.blockchainClient.CheckBlockIsInCurrentChain, marking only those not on the current chain.This is a consensus-adjacent change to observable behavior during block invalidations, not a performance optimization. Flagging it explicitly so it is not overlooked under the
perf(...)title.Covered by new tests:
TestCheckMarkNotOnLongestChain,TestCheckMarkNotOnLongestChain_TxNotFound, andTestMarkNotOnLongestChain_InvalidGating.Test plan
-raceSplitTxInpointsMapunit tests (9 tests covering CRUD, concurrency, bucket distribution, bulk operations)🤖 Generated with Claude Code