perf(subtreeprocessor): eliminate SplitSwissMap contention in moveForwardBlock#1078
Conversation
…wardBlock A CPU profile of a foreign 420M-tx block on a 192-core scaling node showed moveForwardBlock plateauing at ~15M map-ops/s with only ~27 cores busy: CreateTransactionMap's ~375 concurrent subtree goroutines each spawned one goroutine per bucket, all contending on the 1024 bucket RWMutexes, and the ~440M Exists probes in processRemainderTxHashes each paid RLock/RUnlock readerCount atomics (11% of all CPU) on a map that is read-only by then. Changes: - bucketInserter: single-writer-per-bucket streaming insert pipeline for CreateTransactionMap. Each bucket is owned by exactly one worker, so the per-bucket mutex is never contended. Submits are batched one message per (submitter, worker) - the per-bucket variant showed 39% of samples in runtime.runqgrab scheduler churn. - SplitSwissMap.Freeze(): after the insert pipeline drains, the map is frozen; Exists skips the per-bucket RLock, writes fail loudly, Clear() un-freezes for pooled reuse across blocks. - SplitSwissMap buckets are slice-indexed instead of map[uint16]-keyed (the per-op Go-map lookup cost ~5% of insert CPU). Benchmarks (M3 Max, 16 cores; baselines in txmap_contention_benchmark_test.go): - insert 16M hashes/256 subtrees: 126M -> 227M tx/s (1.8x) - insert 40M hashes/256 subtrees: 147M -> 195M tx/s (1.3x) - Exists 16M, all cores: 15.2ns -> 6.0ns per probe (2.6x) - CreateTransactionMap e2e: 8.13ms -> 3.43ms (2.4x) The production win is larger: the old shape collapsed under contention at 192 cores in a way a 16-core box does not reproduce.
…est setup TestMinedThenSpendAllPrunes asserts that the pruner deletes a fully-spent mined tx in default pruning mode. A developer settings_local.conf setting pruner_utxoDefensiveEnabled=true leaks into local test runs and flips the assertion. Pin the flag in the two store setup helpers so the suite is deterministic regardless of local config; CI behaviour is unchanged (default is already false).
|
🤖 Claude Code Review Status: Complete Current Review: No new issues found. I focused on the concurrency-critical parts of this perf change and the design holds up:
The lock-free read’s safety rests on the freeze→read→clear ordering invariant (call-site discipline, not a type guarantee). This is well-documented in History:
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-12 11:10 UTC |
There was a problem hiding this comment.
Pull request overview
This PR targets a major scalability bottleneck in subtreeprocessor during foreign-block moveForwardBlock by reducing contention and lock overhead in the transaction-hash membership map used across Phase A (insertion) and Phase B (mass Exists() probing).
Changes:
- Introduces a
bucketInserterpipeline to enforce single-writer-per-bucket inserts duringCreateTransactionMap, avoiding RWMutex contention and excessive goroutine churn. - Adds
SplitSwissMap.Freeze()+ pooled reuse behavior (Clear()unfreezes) so Phase BExists()probes can skip per-bucket read locks when the map is known read-only. - Adds benchmarks and concurrency tests to validate and quantify the contention fixes; also pins
Pruner.UTXODefensiveEnabled=falsein SQL UTXO store tests to avoid developer-local config affecting assertions.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| stores/utxo/sql/unlock_batcher_postgres_test.go | Pins pruning mode in test settings to prevent local config from flipping pruning behavior. |
| stores/utxo/sql/sql_test.go | Same pruning-mode pin for shared SQL store test setup. |
| services/blockassembly/subtreeprocessor/txmap_contention_benchmark_test.go | Adds contention-reproduction + comparison benchmarks for Phase A inserts and Phase B Exists(). |
| services/blockassembly/subtreeprocessor/SubtreeProcessor.go | Wires in bucketInserter, joins inserter workers reliably, and freezes the map post-build. |
| services/blockassembly/subtreeprocessor/map.go | Switches buckets to slice-indexed storage and adds Freeze()/frozen fast path for Exists() plus write rejection. |
| services/blockassembly/subtreeprocessor/map_freeze_test.go | Adds tests for freeze correctness and lock-free concurrent read behavior. |
| services/blockassembly/subtreeprocessor/bucket_inserter.go | Implements the single-writer-per-bucket insert pipeline with batched submit semantics. |
| services/blockassembly/subtreeprocessor/bucket_inserter_test.go | Adds concurrency and misuse tests for the inserter. |
| services/blockassembly/subtreeprocessor/benchmark.go | Updates remainder benchmark to Freeze() the map like production. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ordishs
left a comment
There was a problem hiding this comment.
Reviewed the single-writer-per-bucket inserter, the Freeze/Clear lifecycle, and the slice-indexed buckets.
Verified independently:
- No production writes to the returned transactionMap after Freeze() — only Exists and Iter (reorg loop), so the freeze contract holds.
- The happens-before chain is sound: g.Wait() → closeAndWait() (channel close + worker join) → Freeze(), so lock-free readers see fully published bucket contents.
- closeAndWait() runs even when g.Wait() errors, and workers drain after an insert error, so no goroutine leak or submitter deadlock — both covered by tests.
- Ran go vet and the new concurrency tests under -race locally: pass.
Minor, non-blocking: submit()'s closed.Load() check only catches sequential misuse — a truly concurrent closeAndWait() would still panic on send-to-closed-channel. The doc comment states the contract correctly; a one-liner noting the guard's limits would help future readers. Also worth remembering that frozen reads make any future reader/Clear() overlap a data race rather than just a logic bug — the serialized block-movement loop is the invariant protecting this.
Benchmark methodology is solid (old and new shapes side by side, shared cases, -short gating) and the post-deploy profile re-capture plan is the right way to confirm the 192-core win.
…oroutine asserts Size the bucketInserter worker pool with runtime.GOMAXPROCS(0) instead of runtime.NumCPU(). The workers are CPU-bound, so the pool should track runnable parallelism; under a cgroup CPU quota (the normal container/k8s deployment) NumCPU() reports host cores and oversubscribes. Matches the existing GOMAXPROCS sizing in map.go and the errgroup limit in moveBlock. Also replace goroutine-local require.NoError with t.Errorf in the inserter and freeze tests: require's FailNow is only safe on the test goroutine.
Clear()'s parallel per-bucket clear sized its worker pool with runtime.NumCPU(), oversubscribing under a cgroup CPU quota. Use runtime.GOMAXPROCS(0) to match ParallelBulkSetIfNotExists and the bucketInserter sizing — the clear is bandwidth-bound and should track runnable parallelism, not host cores.
freemans13
left a comment
There was a problem hiding this comment.
Approve. Correct, well-tested, and unusually well-documented concurrency work — the single-writer-per-bucket insert pipeline plus the freeze-based lock-free read path is the right shape for the profiled 192-core contention collapse, and the happens-before reasoning (g.Wait → closeAndWait join → Freeze) is sound. Build passes; freeze lifecycle verified against all live callers and pool-reuse paths.
One non-blocking inline note on hardening the freeze invariant. Recommend landing with the production profile re-capture tracked as follow-up, and optionally splitting out the unrelated pruner-test pin.
|
…hase Bump go-tx-map to v1.4.0, which adds the Freeze/Clear lifecycle to the map interfaces, and use it in Block.Valid. Block.Valid is two-phase over b.txMap: checkDuplicateTransactions fills it (one Put per tx), then validOrderAndBlessed only reads it — one Get per tx plus one per parent, fanned out across every core. Each Get takes a per-bucket RWMutex.RLock, whose reader-counter atomic add/sub cache-line ping-pongs across cores and dominates the read phase on many-core validation nodes (the same contention bsv-blockchain#1078 removed in block assembly's moveForwardBlock). Block.Valid now calls b.txMap.Freeze() after the duplicate-check phase has completed and flushed, before validOrderAndBlessed. On the pooled in-memory SplitSwissMapUint64 this makes every read skip the RLock; releaseTxMap's PutTxMap -> Clear path un-freezes the map on its way back to the pool. checkDuplicateTransactions' internal errgroup.Wait is the happens-before edge guaranteeing all writes precede the Freeze. DiskTxMapUint64 (the disk-backed alternative, used when DiskMapDirs is set) implements the two new TxMap methods: Freeze is a write guard (Put returns txmap.ErrMapFrozen); Clear only lifts the freeze, since the disk map is single-use (released via Close, not pooled) and content-clearing is neither needed nor cheaply implementable over its multi-disk Badger + cuckoo layout. Also fixes a pre-existing data race (present on main, independent of this change but in the same file and surfaced by go test -race ./model): Stats reads dtmDiskShard.bytesWritten from the deferred releaseTxMap on the dup-detected early-return path while a writer goroutine is still incrementing it. Made bytesWritten an atomic.Int64.
…hase Bump go-tx-map to v1.4.0, which adds the Freeze/Clear lifecycle to the map interfaces, and use it in Block.Valid. Block.Valid is two-phase over b.txMap: checkDuplicateTransactions fills it (one Put per tx), then validOrderAndBlessed only reads it — one Get per tx plus one per parent, fanned out across every core. Each Get takes a per-bucket RWMutex.RLock, whose reader-counter atomic add/sub cache-line ping-pongs across cores and dominates the read phase on many-core validation nodes (the same contention bsv-blockchain#1078 removed in block assembly's moveForwardBlock). Block.Valid now calls b.txMap.Freeze() after the duplicate-check phase has completed and flushed, before validOrderAndBlessed. On the pooled in-memory SplitSwissMapUint64 this makes every read skip the RLock; releaseTxMap's PutTxMap -> Clear path un-freezes the map on its way back to the pool. checkDuplicateTransactions' internal errgroup.Wait is the happens-before edge guaranteeing all writes precede the Freeze. DiskTxMapUint64 (the mmap-backed alternative from bsv-blockchain#1053, used when DiskMapDirs is set) implements the two new TxMap methods: Freeze is a write guard (Put returns txmap.ErrMapFrozen); Clear only lifts the freeze, since the disk map is ephemeral and single-use (released via Close, not pooled). Its mmap reads are not made lock-free — the in-memory split map is the contended read path Freeze targets.



Problem
A CPU profile captured on the scaling cluster (scale-2 block-assembly, 192-core node) during a foreign 420M-tx block's
moveForwardBlockshowed both hot phases plateauing at ~15M map-ops/s with only ~27/192 cores busy:CreateTransactionMap: ~375 concurrent per-subtree goroutines each spawned one goroutine per non-empty bucket — all contending on the same 1024 bucketRWMutexes (mutex queueing + 375×1024 goroutine churn).processRemainderTxHashes/dequeueDuringBlockMovement: ~440MExists()probes each payingRLock/RUnlock— 11% of all CPU wassync/atomic.(*Int32).Add(readerCount cache-line ping-pong) on a map that is strictly read-only during those phases.This is the dominant constant cost behind non-mining nodes lagging the miner by a median ~55s per block in "mining on top of the new tip".
Changes
bucketInserter(new): single-writer-per-bucket streaming insert pipeline forCreateTransactionMap. Each bucket is exclusively owned by one worker; submits are batched one message per (submitter, worker). The per-bucket handoff variant left 39% of profile samples inruntime.runqgrabscheduler churn — batching removed it.SplitSwissMap.Freeze(): called once the insert pipeline drains (the errgroup join + worker join is the happens-before edge).Existsthen skips the per-bucket RLock;Put/PutMultiBucketfail loudly on a frozen map;Clear()un-freezes for pooled reuse across blocks.m/muare[]*swiss.Map/[]*sync.RWMutexinstead ofmap[uint16]…— the per-op Go-map lookup cost ~5% of insert CPU.TDD: baseline benchmarks were written and recorded first (
txmap_contention_benchmark_test.go), thenmap_freeze_test.go/bucket_inserter_test.gored→green, then the wire-in. All new concurrency tests run under-race.Benchmarks (M3 Max, 16 cores)
CreateTransactionMape2eA 16-core desktop does not reproduce the 192-core contention collapse (~15M ops/s in production vs ~130M+ locally), so the production win should be substantially larger. Post-deploy verification plan: re-capture a 30s CPU profile on a scaling node during a foreign-block
moveForwardBlockand compare the"processing subtrees into transaction map DONE"/"remainder tx hashes DONE"log durations against the current 28s/30s.Also includes a standalone commit pinning
pruner_utxoDefensiveEnabled=falsein the sql utxostore test setup —TestMinedThenSpendAllPrunesasserts default-mode pruning and flipped when a developer'ssettings_local.confset the flag; CI is unchanged.Verification
go test -race ./services/blockassembly/...— pass-race -tags testtxmetacache, all packages) — pass after the pruner-test pingo vet,golangci-lint run(0 issues),staticcheck— cleanGOMAXPROCS=4(insert 1.1× faster, Exists strictly fewer ops/probe)