Skip to content

perf(subtreeprocessor): eliminate SplitSwissMap contention in moveForwardBlock#1078

Merged
icellan merged 4 commits into
bsv-blockchain:mainfrom
icellan:perf/splitswissmap-contention
Jun 12, 2026
Merged

perf(subtreeprocessor): eliminate SplitSwissMap contention in moveForwardBlock#1078
icellan merged 4 commits into
bsv-blockchain:mainfrom
icellan:perf/splitswissmap-contention

Conversation

@icellan

@icellan icellan commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Problem

A CPU profile captured on the scaling cluster (scale-2 block-assembly, 192-core node) during a foreign 420M-tx block's moveForwardBlock showed both hot phases plateauing at ~15M map-ops/s with only ~27/192 cores busy:

  • CreateTransactionMap: ~375 concurrent per-subtree goroutines each spawned one goroutine per non-empty bucket — all contending on the same 1024 bucket RWMutexes (mutex queueing + 375×1024 goroutine churn).
  • processRemainderTxHashes / dequeueDuringBlockMovement: ~440M Exists() probes each paying RLock/RUnlock11% of all CPU was sync/atomic.(*Int32).Add (readerCount cache-line ping-pong) on a map that is strictly read-only during those phases.

This is the dominant constant cost behind non-mining nodes lagging the miner by a median ~55s per block in "mining on top of the new tip".

Changes

  • bucketInserter (new): single-writer-per-bucket streaming insert pipeline for CreateTransactionMap. Each bucket is exclusively owned by one worker; submits are batched one message per (submitter, worker). The per-bucket handoff variant left 39% of profile samples in runtime.runqgrab scheduler churn — batching removed it.
  • SplitSwissMap.Freeze(): called once the insert pipeline drains (the errgroup join + worker join is the happens-before edge). Exists then skips the per-bucket RLock; Put/PutMultiBucket fail loudly on a frozen map; Clear() un-freezes for pooled reuse across blocks.
  • Slice-indexed buckets: m/mu are []*swiss.Map/[]*sync.RWMutex instead of map[uint16]… — the per-op Go-map lookup cost ~5% of insert CPU.

TDD: baseline benchmarks were written and recorded first (txmap_contention_benchmark_test.go), then map_freeze_test.go / bucket_inserter_test.go red→green, then the wire-in. All new concurrency tests run under -race.

Benchmarks (M3 Max, 16 cores)

Benchmark Before After Delta
PhaseA insert 4M / 64 subtrees 165M tx/s 279M tx/s 1.7×
PhaseA insert 16M / 256 subtrees 126M tx/s 227M tx/s 1.8×
PhaseA insert 40M / 256 subtrees 147M tx/s 195M tx/s 1.3×
PhaseB Exists 16M (all cores) 15.2 ns/op 6.0 ns/op 2.6×
CreateTransactionMap e2e 8.13 ms 3.43 ms 2.4×

A 16-core desktop does not reproduce the 192-core contention collapse (~15M ops/s in production vs ~130M+ locally), so the production win should be substantially larger. Post-deploy verification plan: re-capture a 30s CPU profile on a scaling node during a foreign-block moveForwardBlock and compare the "processing subtrees into transaction map DONE" / "remainder tx hashes DONE" log durations against the current 28s/30s.

Also includes a standalone commit pinning pruner_utxoDefensiveEnabled=false in the sql utxostore test setup — TestMinedThenSpendAllPrunes asserts default-mode pruning and flipped when a developer's settings_local.conf set the flag; CI is unchanged.

Verification

  • go test -race ./services/blockassembly/... — pass
  • Full repo unit suite (-race -tags testtxmetacache, all packages) — pass after the pruner-test pin
  • go vet, golangci-lint run (0 issues), staticcheck — clean
  • No regression at GOMAXPROCS=4 (insert 1.1× faster, Exists strictly fewer ops/probe)

icellan added 2 commits June 12, 2026 11:50
…wardBlock

A CPU profile of a foreign 420M-tx block on a 192-core scaling node showed
moveForwardBlock plateauing at ~15M map-ops/s with only ~27 cores busy:
CreateTransactionMap's ~375 concurrent subtree goroutines each spawned one
goroutine per bucket, all contending on the 1024 bucket RWMutexes, and the
~440M Exists probes in processRemainderTxHashes each paid RLock/RUnlock
readerCount atomics (11% of all CPU) on a map that is read-only by then.

Changes:
- bucketInserter: single-writer-per-bucket streaming insert pipeline for
  CreateTransactionMap. Each bucket is owned by exactly one worker, so the
  per-bucket mutex is never contended. Submits are batched one message per
  (submitter, worker) - the per-bucket variant showed 39% of samples in
  runtime.runqgrab scheduler churn.
- SplitSwissMap.Freeze(): after the insert pipeline drains, the map is
  frozen; Exists skips the per-bucket RLock, writes fail loudly, Clear()
  un-freezes for pooled reuse across blocks.
- SplitSwissMap buckets are slice-indexed instead of map[uint16]-keyed
  (the per-op Go-map lookup cost ~5% of insert CPU).

Benchmarks (M3 Max, 16 cores; baselines in txmap_contention_benchmark_test.go):
- insert 16M hashes/256 subtrees: 126M -> 227M tx/s (1.8x)
- insert 40M hashes/256 subtrees: 147M -> 195M tx/s (1.3x)
- Exists 16M, all cores: 15.2ns -> 6.0ns per probe (2.6x)
- CreateTransactionMap e2e: 8.13ms -> 3.43ms (2.4x)
The production win is larger: the old shape collapsed under contention at
192 cores in a way a 16-core box does not reproduce.
…est setup

TestMinedThenSpendAllPrunes asserts that the pruner deletes a fully-spent
mined tx in default pruning mode. A developer settings_local.conf setting
pruner_utxoDefensiveEnabled=true leaks into local test runs and flips the
assertion. Pin the flag in the two store setup helpers so the suite is
deterministic regardless of local config; CI behaviour is unchanged
(default is already false).
Comment thread services/blockassembly/subtreeprocessor/bucket_inserter.go
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

Current Review: No new issues found.

I focused on the concurrency-critical parts of this perf change and the design holds up:

  • bucketInserter (single-writer-per-bucket): worker w exclusively owns buckets {b : b%numWorkers==w}, so the per-bucket mutex inside PutMultiBucket is taken but never contended. Concurrent submit() → channel sends are safe; workers drain until close, so submitters cannot deadlock (the frozen-error drain path is explicitly tested). closeAndWait is idempotent via CompareAndSwap.
  • Freeze() + lock-free Exists: the happens-before edge is sound — g.Wait() joins all submitters, closeAndWait’s wg.Wait() joins all writer workers, then Freeze() stores the atomic. A reader observing frozen==true transitively sees every bucket write. The non-frozen Exists path still RLocks, so existing concurrent-with-writer callers are unaffected.
  • Clear() un-freeze for pool reuse: correct given the established single-goroutine, sequential block-movement lifecycle (read phase of block N completes before block N+1’s CreateTransactionMapClear). Writes after freeze fail loudly rather than silently corrupting the lock-free read path.
  • Slice-indexed buckets: safe — the bucket router was already the indexing contract; out-of-range would have panicked under the old map[uint16] shape too.
  • sql test pins (pruner_utxoDefensiveEnabled=false): sensible determinism fix; CI default is already false, so no behavior change.

The lock-free read’s safety rests on the freeze→read→clear ordering invariant (call-site discipline, not a type guarantee). This is well-documented in map.go and already raised in @freemans13’s open thread — worth considering the suggested debug assertion/metric on rejected frozen writes as future insurance, but non-blocking.

History:

  • ✅ Fixed: Clear() worker pool now sized by GOMAXPROCS(0) instead of NumCPU() (commit 36ff3dd9f) — resolves my earlier consistency note.
  • ✅ Addressed: test goroutine requiret.Errorf fix and bucketInserter GOMAXPROCS sizing landed in follow-up commits.

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1078 (9cb590e)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 132
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.813µ 1.792µ ~ 1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.69n 61.64n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 61.63n 61.80n ~ 0.100
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.60n 61.65n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 30.79n 31.00n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 53.10n 52.26n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 107.1n 105.6n ~ 0.100
MiningCandidate_Stringify_Short-4 261.4n 262.4n ~ 0.800
MiningCandidate_Stringify_Long-4 1.903µ 1.906µ ~ 0.300
MiningSolution_Stringify-4 971.6n 971.1n ~ 0.700
BlockInfo_MarshalJSON-4 1.790µ 1.789µ ~ 1.000
NewFromBytes-4 125.5n 123.7n ~ 0.100
AddTxBatchColumnar_Validation-4 2.414µ 2.370µ ~ 0.700
OffsetValidationLoop-4 725.7n 719.9n ~ 0.400
Mine_EasyDifficulty-4 56.54µ 55.60µ ~ 1.000
Mine_WithAddress-4 5.797µ 5.818µ ~ 1.000
DirectSubtreeAdd/4_per_subtree-4 58.30n 59.21n ~ 0.400
DirectSubtreeAdd/64_per_subtree-4 30.11n 29.89n ~ 0.400
DirectSubtreeAdd/256_per_subtree-4 29.07n 28.89n ~ 0.100
DirectSubtreeAdd/1024_per_subtree-4 28.21n 28.08n ~ 0.700
DirectSubtreeAdd/2048_per_subtree-4 27.69n 27.57n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 239.4n 240.7n ~ 0.300
SubtreeProcessorAdd/64_per_subtree-4 236.1n 236.5n ~ 0.400
SubtreeProcessorAdd/256_per_subtree-4 236.4n 238.1n ~ 0.100
SubtreeProcessorAdd/1024_per_subtree-4 235.4n 236.3n ~ 1.000
SubtreeProcessorAdd/2048_per_subtree-4 230.5n 231.1n ~ 0.700
SubtreeProcessorRotate/4_per_subtree-4 234.8n 232.2n ~ 0.100
SubtreeProcessorRotate/64_per_subtree-4 232.8n 231.0n ~ 0.400
SubtreeProcessorRotate/256_per_subtree-4 232.6n 229.5n ~ 0.700
SubtreeProcessorRotate/1024_per_subtree-4 232.9n 232.9n ~ 0.700
SubtreeNodeAddOnly/4_per_subtree-4 56.26n 54.13n ~ 0.100
SubtreeNodeAddOnly/64_per_subtree-4 34.43n 34.24n ~ 0.100
SubtreeNodeAddOnly/256_per_subtree-4 33.40n 33.34n ~ 1.000
SubtreeNodeAddOnly/1024_per_subtree-4 32.62n 32.63n ~ 0.800
SubtreeCreationOnly/4_per_subtree-4 114.9n 113.6n ~ 0.100
SubtreeCreationOnly/64_per_subtree-4 399.0n 393.5n ~ 0.100
SubtreeCreationOnly/256_per_subtree-4 1.337µ 1.431µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 4.380µ 4.680µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 7.960µ 8.198µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 230.8n 232.7n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 230.7n 233.7n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 9.040m 10.364m ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 13.17m 13.09m ~ 1.000
ParallelGetAndSetIfNotExists/50k_nodes-4 15.96m 16.02m ~ 0.700
ParallelGetAndSetIfNotExists/100k_nodes-4 19.28m 19.54m ~ 0.400
SequentialGetAndSetIfNotExists/1k_nodes-4 9.278m 11.542m ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 13.40m 15.98m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 24.31m 21.45m ~ 0.200
SequentialGetAndSetIfNotExists/100k_nodes-4 29.88m 29.04m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 9.368m 12.319m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 16.87m 19.28m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 20.91m 21.48m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 12.94m 13.11m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 16.36m 16.79m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 61.06m 56.28m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.567µ 3.614µ ~ 0.700
DiskTxMap_SetIfNotExists_Parallel-4 3.351µ 3.384µ ~ 1.000
DiskTxMap_ExistenceOnly-4 326.7n 322.9n ~ 0.700
Queue-4 188.9n 190.7n ~ 0.400
AtomicPointer-4 4.730n 4.483n ~ 0.700
TxMapSetIfNotExists-4 52.92n 52.81n ~ 1.000
TxMapSetIfNotExistsDuplicate-4 40.01n 40.29n ~ 0.200
ChannelSendReceive-4 584.5n 579.5n ~ 0.100
BlockAssembler_AddTx-4 0.02933n 0.03126n ~ 0.400
AddNode-4 11.05 11.11 ~ 0.400
AddNodeWithMap-4 11.31 11.59 ~ 0.700
CalcBlockWork-4 473.2n 469.7n ~ 0.700
CalculateWork-4 641.9n 643.4n ~ 1.000
CheckOldBlockIDs/on-chain-prefetch/1000-4 63.91µ 57.75µ ~ 0.400
CheckOldBlockIDs/off-chain-prefetch/1000-4 47.99µ 48.40µ ~ 0.200
CheckOldBlockIDs/on-chain-prefetch/10000-4 428.6µ 436.9µ ~ 0.400
CheckOldBlockIDs/off-chain-prefetch/10000-4 343.4µ 350.9µ ~ 0.200
BuildBlockLocatorString_Helpers/Size_10-4 1.393µ 1.500µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 13.26µ 14.29µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 130.6µ 136.5µ ~ 0.100
CatchupWithHeaderCache-4 104.6m 104.7m ~ 0.700
_prepareTxsPerLevel-4 409.9m 432.0m ~ 0.200
_prepareTxsPerLevelOrdered-4 3.858m 4.402m ~ 0.400
_prepareTxsPerLevel_Comparison/Original-4 425.0m 421.5m ~ 0.200
_prepareTxsPerLevel_Comparison/Optimized-4 3.824m 4.015m ~ 0.200
SubtreeSizes/10k_tx_4_per_subtree-4 1.340m 1.322m ~ 0.700
SubtreeSizes/10k_tx_16_per_subtree-4 313.5µ 312.7µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 75.46µ 74.62µ ~ 0.400
SubtreeSizes/10k_tx_256_per_subtree-4 18.57µ 18.63µ ~ 0.700
SubtreeSizes/10k_tx_512_per_subtree-4 9.249µ 9.246µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.584µ 4.613µ ~ 0.200
SubtreeSizes/10k_tx_2k_per_subtree-4 2.308µ 2.315µ ~ 1.000
BlockSizeScaling/10k_tx_64_per_subtree-4 73.27µ 73.00µ ~ 0.700
BlockSizeScaling/10k_tx_256_per_subtree-4 18.44µ 18.34µ ~ 0.700
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.623µ 4.595µ ~ 1.000
BlockSizeScaling/50k_tx_64_per_subtree-4 386.8µ 399.2µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 92.66µ 92.91µ ~ 0.400
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.08µ 22.73µ ~ 0.700
SubtreeAllocations/small_subtrees_exists_check-4 155.1µ 156.5µ ~ 0.400
SubtreeAllocations/small_subtrees_data_fetch-4 163.0µ 161.1µ ~ 0.100
SubtreeAllocations/small_subtrees_full_validation-4 324.3µ 321.0µ ~ 0.400
SubtreeAllocations/medium_subtrees_exists_check-4 9.258µ 9.193µ ~ 0.400
SubtreeAllocations/medium_subtrees_data_fetch-4 9.642µ 9.595µ ~ 1.000
SubtreeAllocations/medium_subtrees_full_validation-4 18.72µ 18.84µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.207µ 2.205µ ~ 1.000
SubtreeAllocations/large_subtrees_data_fetch-4 2.333µ 2.351µ ~ 1.000
SubtreeAllocations/large_subtrees_full_validation-4 4.676µ 4.681µ ~ 1.000
_BufferPoolAllocation/16KB-4 3.827µ 3.838µ ~ 0.700
_BufferPoolAllocation/32KB-4 8.821µ 8.027µ ~ 0.700
_BufferPoolAllocation/64KB-4 17.25µ 15.63µ ~ 0.400
_BufferPoolAllocation/128KB-4 33.52µ 26.53µ ~ 0.100
_BufferPoolAllocation/512KB-4 111.7µ 112.6µ ~ 0.400
_BufferPoolConcurrent/32KB-4 18.70µ 18.84µ ~ 0.700
_BufferPoolConcurrent/64KB-4 29.99µ 29.14µ ~ 1.000
_BufferPoolConcurrent/512KB-4 146.0µ 150.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/16KB-4 629.2µ 679.8µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 613.5µ 604.1µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/64KB-4 651.2µ 616.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 656.7µ 608.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 663.4µ 635.7µ ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.38m 36.83m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.68m 36.73m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.53m 36.47m ~ 1.000
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.16m 36.82m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 35.95m 36.39m ~ 0.100
_PooledVsNonPooled/Pooled-4 831.1n 838.5n ~ 0.100
_PooledVsNonPooled/NonPooled-4 8.429µ 8.009µ ~ 0.100
_MemoryFootprint/Current_512KB_32concurrent-4 6.720µ 6.756µ ~ 1.000
_MemoryFootprint/Proposed_32KB_32concurrent-4 10.30µ 10.94µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 10.20µ 10.06µ ~ 0.400
StoreBlock_Sequential/BelowCSVHeight-4 329.9µ 327.6µ ~ 0.200
StoreBlock_Sequential/AboveCSVHeight-4 330.7µ 335.1µ ~ 0.200
GetUtxoHashes-4 256.6n 257.6n ~ 1.000
GetUtxoHashes_ManyOutputs-4 52.03µ 49.48µ ~ 0.100
_NewMetaDataFromBytes-4 224.2n 222.7n ~ 0.100
_Bytes-4 425.2n 421.1n ~ 0.700
_MetaBytes-4 136.2n 137.1n ~ 0.700

Threshold: >10% with p < 0.05 | Generated: 2026-06-12 11:10 UTC

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR targets a major scalability bottleneck in subtreeprocessor during foreign-block moveForwardBlock by reducing contention and lock overhead in the transaction-hash membership map used across Phase A (insertion) and Phase B (mass Exists() probing).

Changes:

  • Introduces a bucketInserter pipeline to enforce single-writer-per-bucket inserts during CreateTransactionMap, avoiding RWMutex contention and excessive goroutine churn.
  • Adds SplitSwissMap.Freeze() + pooled reuse behavior (Clear() unfreezes) so Phase B Exists() probes can skip per-bucket read locks when the map is known read-only.
  • Adds benchmarks and concurrency tests to validate and quantify the contention fixes; also pins Pruner.UTXODefensiveEnabled=false in SQL UTXO store tests to avoid developer-local config affecting assertions.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
stores/utxo/sql/unlock_batcher_postgres_test.go Pins pruning mode in test settings to prevent local config from flipping pruning behavior.
stores/utxo/sql/sql_test.go Same pruning-mode pin for shared SQL store test setup.
services/blockassembly/subtreeprocessor/txmap_contention_benchmark_test.go Adds contention-reproduction + comparison benchmarks for Phase A inserts and Phase B Exists().
services/blockassembly/subtreeprocessor/SubtreeProcessor.go Wires in bucketInserter, joins inserter workers reliably, and freezes the map post-build.
services/blockassembly/subtreeprocessor/map.go Switches buckets to slice-indexed storage and adds Freeze()/frozen fast path for Exists() plus write rejection.
services/blockassembly/subtreeprocessor/map_freeze_test.go Adds tests for freeze correctness and lock-free concurrent read behavior.
services/blockassembly/subtreeprocessor/bucket_inserter.go Implements the single-writer-per-bucket insert pipeline with batched submit semantics.
services/blockassembly/subtreeprocessor/bucket_inserter_test.go Adds concurrency and misuse tests for the inserter.
services/blockassembly/subtreeprocessor/benchmark.go Updates remainder benchmark to Freeze() the map like production.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread services/blockassembly/subtreeprocessor/SubtreeProcessor.go Outdated
Comment thread services/blockassembly/subtreeprocessor/bucket_inserter_test.go
Comment thread services/blockassembly/subtreeprocessor/bucket_inserter_test.go
Comment thread services/blockassembly/subtreeprocessor/map_freeze_test.go

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the single-writer-per-bucket inserter, the Freeze/Clear lifecycle, and the slice-indexed buckets.

Verified independently:

  • No production writes to the returned transactionMap after Freeze() — only Exists and Iter (reorg loop), so the freeze contract holds.
  • The happens-before chain is sound: g.Wait() → closeAndWait() (channel close + worker join) → Freeze(), so lock-free readers see fully published bucket contents.
  • closeAndWait() runs even when g.Wait() errors, and workers drain after an insert error, so no goroutine leak or submitter deadlock — both covered by tests.
  • Ran go vet and the new concurrency tests under -race locally: pass.

Minor, non-blocking: submit()'s closed.Load() check only catches sequential misuse — a truly concurrent closeAndWait() would still panic on send-to-closed-channel. The doc comment states the contract correctly; a one-liner noting the guard's limits would help future readers. Also worth remembering that frozen reads make any future reader/Clear() overlap a data race rather than just a logic bug — the serialized block-movement loop is the invariant protecting this.

Benchmark methodology is solid (old and new shapes side by side, shared cases, -short gating) and the post-deploy profile re-capture plan is the right way to confirm the 192-core win.

…oroutine asserts

Size the bucketInserter worker pool with runtime.GOMAXPROCS(0) instead of
runtime.NumCPU(). The workers are CPU-bound, so the pool should track
runnable parallelism; under a cgroup CPU quota (the normal container/k8s
deployment) NumCPU() reports host cores and oversubscribes. Matches the
existing GOMAXPROCS sizing in map.go and the errgroup limit in moveBlock.

Also replace goroutine-local require.NoError with t.Errorf in the inserter
and freeze tests: require's FailNow is only safe on the test goroutine.
Comment thread services/blockassembly/subtreeprocessor/map.go Outdated
Clear()'s parallel per-bucket clear sized its worker pool with
runtime.NumCPU(), oversubscribing under a cgroup CPU quota. Use
runtime.GOMAXPROCS(0) to match ParallelBulkSetIfNotExists and the
bucketInserter sizing — the clear is bandwidth-bound and should track
runnable parallelism, not host cores.
Comment thread services/blockassembly/subtreeprocessor/map.go

@freemans13 freemans13 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Correct, well-tested, and unusually well-documented concurrency work — the single-writer-per-bucket insert pipeline plus the freeze-based lock-free read path is the right shape for the profiled 192-core contention collapse, and the happens-before reasoning (g.Wait → closeAndWait join → Freeze) is sound. Build passes; freeze lifecycle verified against all live callers and pool-reuse paths.

One non-blocking inline note on hardening the freeze invariant. Recommend landing with the production profile re-capture tracked as follow-up, and optionally splitting out the unrelated pruner-test pin.

@sonarqubecloud

Copy link
Copy Markdown

@icellan icellan merged commit 840a8a1 into bsv-blockchain:main Jun 12, 2026
34 checks passed
icellan added a commit to icellan/teranode that referenced this pull request Jun 12, 2026
…hase

Bump go-tx-map to v1.4.0, which adds the Freeze/Clear lifecycle to the map
interfaces, and use it in Block.Valid.

Block.Valid is two-phase over b.txMap: checkDuplicateTransactions fills it
(one Put per tx), then validOrderAndBlessed only reads it — one Get per tx
plus one per parent, fanned out across every core. Each Get takes a per-bucket
RWMutex.RLock, whose reader-counter atomic add/sub cache-line ping-pongs across
cores and dominates the read phase on many-core validation nodes (the same
contention bsv-blockchain#1078 removed in block assembly's moveForwardBlock).

Block.Valid now calls b.txMap.Freeze() after the duplicate-check phase has
completed and flushed, before validOrderAndBlessed. On the pooled in-memory
SplitSwissMapUint64 this makes every read skip the RLock; releaseTxMap's
PutTxMap -> Clear path un-freezes the map on its way back to the pool.
checkDuplicateTransactions' internal errgroup.Wait is the happens-before edge
guaranteeing all writes precede the Freeze.

DiskTxMapUint64 (the disk-backed alternative, used when DiskMapDirs is set)
implements the two new TxMap methods: Freeze is a write guard (Put returns
txmap.ErrMapFrozen); Clear only lifts the freeze, since the disk map is
single-use (released via Close, not pooled) and content-clearing is neither
needed nor cheaply implementable over its multi-disk Badger + cuckoo layout.

Also fixes a pre-existing data race (present on main, independent of this
change but in the same file and surfaced by go test -race ./model): Stats reads
dtmDiskShard.bytesWritten from the deferred releaseTxMap on the dup-detected
early-return path while a writer goroutine is still incrementing it. Made
bytesWritten an atomic.Int64.
icellan added a commit to icellan/teranode that referenced this pull request Jun 12, 2026
…hase

Bump go-tx-map to v1.4.0, which adds the Freeze/Clear lifecycle to the map
interfaces, and use it in Block.Valid.

Block.Valid is two-phase over b.txMap: checkDuplicateTransactions fills it
(one Put per tx), then validOrderAndBlessed only reads it — one Get per tx
plus one per parent, fanned out across every core. Each Get takes a per-bucket
RWMutex.RLock, whose reader-counter atomic add/sub cache-line ping-pongs across
cores and dominates the read phase on many-core validation nodes (the same
contention bsv-blockchain#1078 removed in block assembly's moveForwardBlock).

Block.Valid now calls b.txMap.Freeze() after the duplicate-check phase has
completed and flushed, before validOrderAndBlessed. On the pooled in-memory
SplitSwissMapUint64 this makes every read skip the RLock; releaseTxMap's
PutTxMap -> Clear path un-freezes the map on its way back to the pool.
checkDuplicateTransactions' internal errgroup.Wait is the happens-before edge
guaranteeing all writes precede the Freeze.

DiskTxMapUint64 (the mmap-backed alternative from bsv-blockchain#1053, used when DiskMapDirs
is set) implements the two new TxMap methods: Freeze is a write guard (Put
returns txmap.ErrMapFrozen); Clear only lifts the freeze, since the disk map is
ephemeral and single-use (released via Close, not pooled). Its mmap reads are
not made lock-free — the in-memory split map is the contended read path Freeze
targets.
@icellan icellan self-assigned this Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants