fix(utxo/aerospike): stop caller-side goroutine/connection leak in batchers by oskarszoon · Pull Request #1025 · bsv-blockchain/teranode

oskarszoon · 2026-06-02T19:33:08Z

Problem

A production node (bsva-ovh-teranode-eu-2) accumulated 13,693 goroutines — 8,664 parked in (*Store).get's select and 4,096 in validator errgroups — that persisted after the underlying Aerospike wedge cleared and client_connections had returned to 0. A leak that survives the wedge is not the wedge; it's a caller-side bug holding the resources.

Root cause (teranode-side, not the v8 client)

Orphaned completion channels on panic. go-batcher recovers panics raised inside the batch dispatch fn (dispatchAndRecord wraps b.fn(batch) in defer recover()). A panic part-way through a sendXxxBatch — e.g. an unchecked type assertion in getTxFromBins on a malformed/missing bin — left every not-yet-signalled per-item completion channel orphaned. The worker survived; the submitters parked forever, because the contexts threaded down from legacy sync / validation carry no deadline, so the get select's ctx.Done() arm never fires.
Retry amplification. sendGetBatch wrapped the already-retrying v8 batch call (MaxRetries within TotalTimeout=5m) in another 3× retry loop, stacking the worst-case stall to ~3× TotalTimeout (~15m) and growing the submitter backlog faster than it drained under sustained server slowness.

The v8 client itself bounds the batch read at TotalTimeout (5m) and releases the connection — verified against v8.7.1-bsv3. No v8 change is required to stop this leak. (One latent v8 robustness bug — connection.go resets the socket deadline on every partial read, so SocketTimeout is per-syscall and a slow-drip server evades it; only bites callers with TotalTimeout=0, which teranode is not — flagged separately, out of scope here.)

Changes

Panic safety net — signalBatchPanic + trySignal (new batch_completion.go); a recover-defer on all 7 dispatch fns (sendGetBatch, sendOutpointBatch, sendStoreBatch, sendSpendBatchLua, sendIncrementBatch, sendSetDAHBatch, setLockedBatch) re-signals every per-item channel on panic. Non-blocking, so it never double-delivers or wedges the worker.
Bounded submitter waits (keystone) — new cached Store.batcherWait (= batch policy TotalTimeout + 30s grace) on (*Store).get, PreviousOutputsDecorate, Create, SetDAHForChildRecords, SetLocked. A wedged batcher now releases the caller after the bound instead of for the life of the process.
Drop nested retry in sendGetBatch — rely on the v8 policy's own retries.
Per-fn correctness bugs:
- setLockedBatch: os.Exit(1) on a key error → per-item error; signal the previously-silent missing-LuaSuccess-bin path; bound the same-pool child-record recursion with a timeout — this converts a potential pool-exhaustion hang into a bounded ServiceUnavailable; fully eliminating the recursion is tracked in setLockedBatch: same-pool child-record recursion can exhaust the lockedBatcher worker pool #1033.
- sendSetDAHBatch: signal + NOOP placeholder on key-skip (was a nil batchRecords slot → nil-deref panic + orphaned errCh); fix a %s-without-arg format bug.
- sendIncrementBatch: fix index desync after a key-creation skip (placeholder + handled[]).
- sendSpendBatchLua / executeSpendBatch: replace double-signal on already-completed buffered-1 items with non-blocking trySignal.
- sendOutpointBatch: bounds-check the previous-output index.
Route BatchDecorate's BatchOperate through the existing batchOperateFn test seam.

Verification

go build ./..., go vet, staticcheck, golangci-lint (0 issues), gofmt all clean. New unit tests pass under -race. Live-container Aerospike integration green: increment, setDAH/drift, decorate, split-tx (SetLocked child recursion), TestAerospike, spend (dup-spender / multi-record / unspend / nil-panic). Full package unit suite passes.

Review follow-ups (@ordishs)

Applied: Create's errCh is now buffered-1, matching every other completion channel (it was the lone unbuffered one; the new bounded wait meant Create could depart before sendStoreBatch sends).
Per-dispatch-fn panic-guard unit tests added (batch_dispatch_leak_test.go).
Tracked as follow-up: full elimination of setLockedBatch's same-pool child recursion — setLockedBatch: same-pool child-record recursion can exhaust the lockedBatcher worker pool #1033.

…tchers A production node accumulated 13,693 goroutines (8,664 parked in (*Store).get's select, 4,096 in validator errgroups) that persisted after the underlying aerospike wedge cleared and client_connections returned to 0. Root cause is teranode-side, not the v8 client: - go-batcher recovers panics raised in the batch dispatch fn, so a panic part-way through a sendXxxBatch (e.g. an unchecked type assertion in getTxFromBins on a malformed bin) left every not-yet-signalled per-item completion channel orphaned. The submitters then parked forever because the contexts threaded down from legacy sync/validation carry no deadline. - sendGetBatch wrapped the already-retrying v8 batch call in another 3x retry loop, stacking the worst-case stall to ~3x TotalTimeout (~15m). Fixes: - Panic safety net (signalBatchPanic + trySignal) on all 7 batch dispatch fns: a recovered panic now re-signals every per-item channel instead of orphaning waiters. - Bound every submitter wait with Store.batcherWait (batch policy TotalTimeout + grace): get, PreviousOutputsDecorate, Create, SetDAHForChildRecords, SetLocked. A wedged batcher releases the caller after the bound rather than for the life of the process. - sendGetBatch: drop the redundant outer 3x retry; rely on the v8 policy. - Per-fn correctness: - setLockedBatch: replace os.Exit(1) on key error with a per-item error; signal the previously-silent missing-LuaSuccess-bin path; bound the same-pool child-record recursion (breaks a worker deadlock). - sendSetDAHBatch: signal + NOOP placeholder on key-skip (was a nil batchRecords slot -> nil-deref panic and an orphaned errCh); fix a %s-without-arg format bug. - sendIncrementBatch: fix index desync after a key-creation skip via placeholder + handled[] tracking. - sendSpendBatchLua/executeSpendBatch: replace double-signal on already-completed buffered-1 items with non-blocking trySignal. - sendOutpointBatch: bounds-check the previous-output index. - Route BatchDecorate's BatchOperate through the batchOperateFn test seam. Tests: signalBatchPanic unit test; sendGetBatch panic-orphan and bounded-wait regression tests. build/vet/staticcheck/golangci-lint/gofmt clean; new tests pass under -race; aerospike integration tests (increment, setDAH, decorate, split-tx, TestAerospike, spend) green.

github-actions · 2026-06-02T19:34:30Z

🤖 Claude Code Review

Status: Complete

Summary

Production-critical goroutine leak fix with comprehensive panic safety guarantees and bounded waits. The changes directly address the root cause identified in the production incident (13,693 leaked goroutines).

Key Strengths

Correct root cause diagnosis: Panic recovery in go-batcher orphaned completion channels; unbounded waits parked submitters indefinitely
Multi-layered defense: Panic guards + bounded waits + non-blocking signals prevent permanent leaks
Consistent application: All 7 dispatch functions receive the same treatment (sendGetBatch, sendStoreBatch, sendSpendBatchLua, sendOutpointBatch, sendIncrementBatch, sendSetDAHBatch, setLockedBatch)
Comprehensive test coverage: Unit tests cover panic/error/timeout paths using the batchOperateFn seam
Well-documented: Inline comments explain the "why" behind each guard

Implementation Quality

batch_completion.go (new):

signalBatchPanic: Generic panic safety net with logging and metrics
trySignal: Non-blocking send primitive for buffered-1 channels
batcherWaitTimeout: Derived from TotalTimeout + 30s grace

All dispatch functions now have:

Deferred panic guard calling signalBatchPanic
trySignal for non-blocking completion delivery
Bounded waits with s.batcherWait timeout
Proper handling of already-signalled channels

Additional fixes:

setLockedBatch (locked.go:98): Replaced os.Exit(1) on key error with per-item error signaling
sendOutpointBatch (get.go:1390): Added bounds-check for previous-output index
sendGetBatch (get.go:1597): Removed nested 3× retry loop (v8 already retries)
sendIncrementBatch (spend.go:1159): Fixed index desync with placeholder + handled[] guard
sendSetDAHBatch (spend.go:1251): Fixed nil batchRecords slot panic

Observations

No blocking issues. Code is safe to merge pending human review and passing tests.

Note on same-pool recursion (locked.go:187-202):
The recursive child-record submission in setLockedBatch uses the same batcher pool. Under high extraRecords counts, dispatch workers could exhaust themselves waiting on waitForLockedResult for child records queued in the same pool. The bounded timeout prevents permanent deadlock, but workers may block for ~batcherWait (5m) per child when the pool is saturated. Consider capping extraRecords or using direct BatchOperate for children if worker blocking becomes observable under production load.

Review methodology: Static analysis only (fork PR security constraints)

github-actions · 2026-06-02T19:48:28Z

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1025 (3ae0f72)

Summary

Regressions: 0
Improvements: 0
Unchanged: 144
Significance level: p < 0.05

All benchmark results (sec/op)

Benchmark	Baseline	Current	Change	p-value
_NewBlockFromBytes-4	1.591µ	1.624µ	~	0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4	71.14n	71.34n	~	0.800
SplitSyncedParentMap_SetIfNotExists/16_buckets-4	71.33n	71.39n	~	0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4	71.16n	71.15n	~	1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets...	33.83n	33.54n	~	0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_...	57.49n	58.12n	~	0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa...	136.1n	143.0n	~	0.200
MiningCandidate_Stringify_Short-4	219.2n	222.2n	~	0.400
MiningCandidate_Stringify_Long-4	1.648µ	1.655µ	~	0.300
MiningSolution_Stringify-4	858.9n	855.8n	~	0.400
BlockInfo_MarshalJSON-4	1.744µ	1.738µ	~	0.100
NewFromBytes-4	132.5n	148.0n	~	0.400
AddTxBatchColumnar_Validation-4	2.460µ	2.482µ	~	1.000
OffsetValidationLoop-4	641.1n	640.0n	~	0.700
Mine_EasyDifficulty-4	63.69µ	61.47µ	~	0.200
Mine_WithAddress-4	6.897µ	6.988µ	~	0.100
BlockAssembler_AddTx-4	0.02917n	0.02681n	~	0.400
AddNode-4	10.67	10.71	~	1.000
AddNodeWithMap-4	12.23	11.58	~	0.700
DirectSubtreeAdd/4_per_subtree-4	57.77n	58.82n	~	0.700
DirectSubtreeAdd/64_per_subtree-4	30.01n	30.92n	~	0.300
DirectSubtreeAdd/256_per_subtree-4	28.84n	29.05n	~	0.100
DirectSubtreeAdd/1024_per_subtree-4	27.77n	28.04n	~	0.200
DirectSubtreeAdd/2048_per_subtree-4	27.36n	27.61n	~	0.100
SubtreeProcessorAdd/4_per_subtree-4	283.8n	279.1n	~	0.600
SubtreeProcessorAdd/64_per_subtree-4	276.8n	277.5n	~	1.000
SubtreeProcessorAdd/256_per_subtree-4	280.1n	277.8n	~	0.400
SubtreeProcessorAdd/1024_per_subtree-4	270.6n	268.5n	~	0.100
SubtreeProcessorAdd/2048_per_subtree-4	270.5n	267.4n	~	0.100
SubtreeProcessorRotate/4_per_subtree-4	275.7n	273.7n	~	0.200
SubtreeProcessorRotate/64_per_subtree-4	276.6n	276.3n	~	1.000
SubtreeProcessorRotate/256_per_subtree-4	272.7n	274.5n	~	0.100
SubtreeProcessorRotate/1024_per_subtree-4	273.4n	274.7n	~	0.600
SubtreeNodeAddOnly/4_per_subtree-4	54.26n	54.46n	~	0.200
SubtreeNodeAddOnly/64_per_subtree-4	34.23n	34.29n	~	0.600
SubtreeNodeAddOnly/256_per_subtree-4	33.11n	33.39n	~	0.200
SubtreeNodeAddOnly/1024_per_subtree-4	32.63n	32.57n	~	1.000
SubtreeCreationOnly/4_per_subtree-4	115.7n	114.2n	~	0.100
SubtreeCreationOnly/64_per_subtree-4	408.8n	403.4n	~	0.400
SubtreeCreationOnly/256_per_subtree-4	1.389µ	1.414µ	~	0.100
SubtreeCreationOnly/1024_per_subtree-4	4.437µ	4.438µ	~	0.700
SubtreeCreationOnly/2048_per_subtree-4	8.021µ	8.395µ	~	0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4	273.8n	272.2n	~	0.200
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4	273.1n	270.8n	~	0.100
ParallelGetAndSetIfNotExists/1k_nodes-4	2.213m	2.215m	~	1.000
ParallelGetAndSetIfNotExists/10k_nodes-4	5.388m	5.408m	~	0.100
ParallelGetAndSetIfNotExists/50k_nodes-4	7.404m	7.463m	~	0.700
ParallelGetAndSetIfNotExists/100k_nodes-4	10.57m	10.44m	~	0.100
SequentialGetAndSetIfNotExists/1k_nodes-4	1.951m	1.938m	~	0.400
SequentialGetAndSetIfNotExists/10k_nodes-4	4.410m	4.445m	~	1.000
SequentialGetAndSetIfNotExists/50k_nodes-4	12.26m	12.38m	~	0.400
SequentialGetAndSetIfNotExists/100k_nodes-4	22.29m	22.43m	~	0.400
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4	2.239m	2.228m	~	0.400
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4	8.171m	8.162m	~	0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4	13.18m	13.11m	~	0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4	1.978m	1.962m	~	0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4	7.914m	7.612m	~	0.200
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4	39.91m	40.33m	~	0.100
DiskTxMap_SetIfNotExists-4	3.945µ	3.893µ	~	0.400
DiskTxMap_SetIfNotExists_Parallel-4	3.579µ	3.548µ	~	0.400
DiskTxMap_ExistenceOnly-4	444.2n	517.6n	~	0.100
Queue-4	191.9n	190.9n	~	0.700
AtomicPointer-4	3.621n	3.627n	~	0.300
ReorgOptimizations/DedupFilterPipeline/Old/10K-4	876.1µ	829.0µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4	796.7µ	763.6µ	~	0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4	116.8µ	110.7µ	~	0.200
ReorgOptimizations/AllMarkFalse/New/10K-4	64.19µ	64.06µ	~	0.700
ReorgOptimizations/HashSlicePool/Old/10K-4	58.19µ	62.72µ	~	0.200
ReorgOptimizations/HashSlicePool/New/10K-4	10.99µ	11.40µ	~	0.100
ReorgOptimizations/NodeFlags/Old/10K-4	4.660µ	4.718µ	~	0.700
ReorgOptimizations/NodeFlags/New/10K-4	1.604µ	1.655µ	~	0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4	10.33m	10.76m	~	0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4	10.73m	10.75m	~	1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4	1.170m	1.129m	~	0.700
ReorgOptimizations/AllMarkFalse/New/100K-4	706.4µ	706.9µ	~	1.000
ReorgOptimizations/HashSlicePool/Old/100K-4	498.3µ	627.0µ	~	0.100
ReorgOptimizations/HashSlicePool/New/100K-4	198.6µ	207.2µ	~	0.200
ReorgOptimizations/NodeFlags/Old/100K-4	49.09µ	50.23µ	~	0.700
ReorgOptimizations/NodeFlags/New/100K-4	16.93µ	17.57µ	~	0.100
TxMapSetIfNotExists-4	49.82n	49.84n	~	1.000
TxMapSetIfNotExistsDuplicate-4	41.59n	42.00n	~	0.100
ChannelSendReceive-4	604.0n	583.6n	~	0.100
CalcBlockWork-4	506.1n	507.8n	~	1.000
CalculateWork-4	689.8n	689.1n	~	0.700
BuildBlockLocatorString_Helpers/Size_10-4	1.351µ	1.354µ	~	0.800
BuildBlockLocatorString_Helpers/Size_100-4	12.83µ	13.04µ	~	0.400
BuildBlockLocatorString_Helpers/Size_1000-4	159.3µ	126.5µ	~	0.100
CatchupWithHeaderCache-4	104.8m	104.8m	~	0.700
_BufferPoolAllocation/16KB-4	3.917µ	4.053µ	~	0.100
_BufferPoolAllocation/32KB-4	8.675µ	10.531µ	~	0.100
_BufferPoolAllocation/64KB-4	19.17µ	17.96µ	~	0.700
_BufferPoolAllocation/128KB-4	36.51µ	36.75µ	~	0.200
_BufferPoolAllocation/512KB-4	105.8µ	127.2µ	~	0.100
_BufferPoolConcurrent/32KB-4	19.16µ	23.33µ	~	0.100
_BufferPoolConcurrent/64KB-4	33.95µ	32.85µ	~	0.700
_BufferPoolConcurrent/512KB-4	156.2µ	152.6µ	~	0.700
_SubtreeDeserializationWithBufferSizes/16KB-4	632.1µ	679.7µ	~	0.100
_SubtreeDeserializationWithBufferSizes/32KB-4	635.3µ	684.7µ	~	0.100
_SubtreeDeserializationWithBufferSizes/64KB-4	638.8µ	685.5µ	~	0.100
_SubtreeDeserializationWithBufferSizes/128KB-4	638.8µ	681.2µ	~	0.100
_SubtreeDeserializationWithBufferSizes/512KB-4	666.0µ	657.0µ	~	0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4	36.76m	36.82m	~	0.400
_SubtreeDataDeserializationWithBufferSizes/32KB-4	36.84m	36.50m	~	0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4	36.65m	36.55m	~	0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4	36.59m	36.49m	~	0.700
_SubtreeDataDeserializationWithBufferSizes/512KB-4	36.35m	36.28m	~	1.000
_PooledVsNonPooled/Pooled-4	741.1n	740.3n	~	0.700
_PooledVsNonPooled/NonPooled-4	8.750µ	8.532µ	~	0.400
_MemoryFootprint/Current_512KB_32concurrent-4	6.786µ	6.533µ	~	0.200
_MemoryFootprint/Proposed_32KB_32concurrent-4	11.84µ	10.66µ	~	0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4	10.78µ	10.67µ	~	1.000
_prepareTxsPerLevel-4	410.6m	409.2m	~	0.700
_prepareTxsPerLevelOrdered-4	3.597m	3.667m	~	0.200
_prepareTxsPerLevel_Comparison/Original-4	411.8m	413.9m	~	0.700
_prepareTxsPerLevel_Comparison/Optimized-4	3.672m	3.652m	~	0.700
SubtreeSizes/10k_tx_4_per_subtree-4	1.351m	1.355m	~	1.000
SubtreeSizes/10k_tx_16_per_subtree-4	316.3µ	318.2µ	~	1.000
SubtreeSizes/10k_tx_64_per_subtree-4	76.57µ	75.85µ	~	0.700
SubtreeSizes/10k_tx_256_per_subtree-4	19.17µ	19.03µ	~	0.100
SubtreeSizes/10k_tx_512_per_subtree-4	9.584µ	9.399µ	~	0.100
SubtreeSizes/10k_tx_1024_per_subtree-4	4.773µ	4.680µ	~	0.100
SubtreeSizes/10k_tx_2k_per_subtree-4	2.349µ	2.304µ	~	0.100
BlockSizeScaling/10k_tx_64_per_subtree-4	74.73µ	74.01µ	~	0.400
BlockSizeScaling/10k_tx_256_per_subtree-4	18.93µ	18.84µ	~	0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4	4.666µ	4.677µ	~	1.000
BlockSizeScaling/50k_tx_64_per_subtree-4	391.8µ	389.1µ	~	0.100
BlockSizeScaling/50k_tx_256_per_subtree-4	94.28µ	93.32µ	~	0.200
BlockSizeScaling/50k_tx_1024_per_subtree-4	23.02µ	23.09µ	~	0.700
SubtreeAllocations/small_subtrees_exists_check-4	156.0µ	156.2µ	~	1.000
SubtreeAllocations/small_subtrees_data_fetch-4	172.6µ	167.4µ	~	0.400
SubtreeAllocations/small_subtrees_full_validation-4	325.2µ	323.2µ	~	0.400
SubtreeAllocations/medium_subtrees_exists_check-4	9.138µ	9.200µ	~	0.400
SubtreeAllocations/medium_subtrees_data_fetch-4	9.862µ	9.941µ	~	0.700
SubtreeAllocations/medium_subtrees_full_validation-4	18.97µ	18.78µ	~	0.700
SubtreeAllocations/large_subtrees_exists_check-4	2.182µ	2.194µ	~	0.700
SubtreeAllocations/large_subtrees_data_fetch-4	2.421µ	2.401µ	~	0.300
SubtreeAllocations/large_subtrees_full_validation-4	4.800µ	4.718µ	~	0.200
StoreBlock_Sequential/BelowCSVHeight-4	344.2µ	336.9µ	~	0.400
StoreBlock_Sequential/AboveCSVHeight-4	336.9µ	337.9µ	~	0.200
GetUtxoHashes-4	270.1n	265.6n	~	0.700
GetUtxoHashes_ManyOutputs-4	50.92µ	46.35µ	~	0.100
_NewMetaDataFromBytes-4	234.1n	233.4n	~	0.700
_Bytes-4	411.8n	411.8n	~	1.000
_MetaBytes-4	140.3n	140.6n	~	0.700

Threshold: >10% with p < 0.05 | Generated: 2026-06-04 12:09 UTC

ordishs

Approve with comments. The leak diagnosis is sound and the layered defense (panic-signal + bounded wait) is the right shape. Findings below were verified against the source, not just the diff.

Strengths

Excellent root-cause writeup; correctly separates the caller-side bug from the v8 wedge and scopes out the latent v8 connection.go deadline-reset issue.
trySignal semantics are correct for the buffered-1 channels (and the doc comment correctly warns against unbuffered use).
Real collateral bugs fixed: os.Exit(1) -> per-item error in setLockedBatch; index desync after key-skip in sendIncrementBatch/sendSetDAHBatch (placeholder + handled[] keeps batchRecords 1:1 with batch); stale-err in the increment result loop; new batchRecord.Record == nil guard before .Bins.
Buffering audit holds: get.done, outpoint errChan, increment res, spend errCh, locked errCh are all buffered-1, and SetDAHForChildRecords was correctly upgraded unbuffered -> buffered-1 here.

Concerns

1. Create's errCh is the lone unbuffered completion channel (Medium - consistency/robustness).
create.go:180 is make(chan error) + defer close(errCh), while every other completion channel in the package is buffered-1. The new ctx.Done()/timeoutCh arms mean Create can now depart before sendStoreBatch sends. I traced this: it does not permanently wedge, because worker sends go through util.SafeSend (has defer recover()) and the deferred close turns a post-departure send into a recovered "send on closed channel" panic. But that relies on recover-on-closed plus a benign select race, whereas everywhere else the send simply lands in the buffer. Recommend make(chan error, 1) for consistency (the resultHandledElsewhere guard ensures at most one send per item, so buffer-1 suffices).

2. setLockedBatch child recursion onto the same pool (Medium - latent, now bounded not eliminated).
setLockedBatch runs on a lockedBatcher worker and re-submits child items to the same lockedBatcher, then waits. Under enough concurrency this is pool-exhaustion deadlock. waitForLockedResult converts the previous permanent hang into a ServiceUnavailable after batcherWait (a strict improvement), but the "breaks a worker deadlock" wording overstates it - multi-record SetLocked can still fail with timeouts under load. Good follow-up: dispatch child records onto a separate pool or process inline. Not a blocker.

3. Dropping the nested retry in sendGetBatch (Low - confirm intent).
Well-justified by the amplification math. Only semantic shift: the old loop retried on any BatchDecorate error, including classes the v8 policy may not retry internally; those now surface immediately. Acceptable since get callers re-request.

4. Test coverage is partial (Low).
New unit tests cover signalBatchPanic, the get panic fan-out, and the get bounded wait. The bounded-wait/panic-guard paths for Create, PreviousOutputsDecorate, sendOutpointBatch, setLockedBatch, sendSpendBatchLua, sendIncrementBatch, sendSetDAHBatch rest on the live-Aerospike integration run. Shared signalBatchPanic/trySignal mitigate this; a table test per dispatch fn's panic guard would lock in the per-fn closures (e.g. the it != nil guard in spend).

Nits

var sendErr error = errors.New...() is redundant - channels are already typed; pass the *errors.Error directly to trySignal.

Only ask before merge is #1 (buffered-1 Create errCh). #2 and #4 are good follow-ups, not blockers.

Note: I verified channel buffering, placeholderKey/sendErrorAndClose, batcher pool wiring, and SafeSend semantics against source. I did not run the test suite myself; the PR reports it green.

…e BatchOperate seam Adds no-container unit tests that drive each dispatch fn through its panic, BatchOperate-error, and result paths so the leak-fix branches are covered by the unit-test run that feeds Sonar (the aerospike integration tests run in a separate gate that does not contribute coverage): - sendIncrementBatch, sendSetDAHBatch, setLockedBatch, sendSpendBatchLua, sendStoreBatch: assert every per-item channel is signalled on panic and on BatchOperate error (no orphaned submitters); setLockedBatch also exercises the previously-silent missing-LuaSuccess-bin path. To enable this without a live Aerospike, route the remaining direct s.client.BatchOperate calls through a single Store.batchOperate seam that honours the existing test-only batchOperateFn override (get and create already used it). Behaviour is unchanged when the override is nil.

…chain#1025 review) Make Create's errCh buffered-1, matching every other completion channel in the package. With the new bounded/cancellable wait, Create can depart before sendStoreBatch sends; a buffered channel lets that send land in the buffer rather than relying on the deferred close turning it into a recovered send-on-closed. The resultHandledElsewhere guard already bounds it to one send.

sonarqubecloud · 2026-06-04T12:19:32Z

Quality Gate failed

Failed conditions
74.4% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

ordishs

Re-review (after `9a21b5b` + `a1ac5c0`)

Both pre-merge items from the prior review are resolved, plus an unrequested improvement:

#1 (Create errCh) — fixed. create.go:185 is now make(chan error, 1). With buffered-1 the timeout/ctx-departure race no longer relies on send-on-closed-channel recovery: a late worker send either lands in the buffer or is discarded by the deferred close without panicking. Consistent with every other completion channel now.
Seam centralized (nice-to-have, done). All dispatch fns route through a single (*Store).batchOperate (aerospike.go:162) instead of scattered s.client.BatchOperate calls. This is cleaner than the original BatchDecorate-only seam and makes the new failure-path tests possible across every batcher.
#4 (dispatch failure-path coverage) — largely addressed. batch_dispatch_leak_test.go drives sendIncrementBatch, sendSetDAHBatch, setLockedBatch, sendSpendBatchLua, and sendStoreBatch through panic / BatchOperate-error / ok branches via the seam, asserting every item is signalled (no orphans). Ran the new suite locally under -race — green; go vet clean.

Residual (non-blocking) gaps:

No unit test for sendOutpointBatch's panic guard + output-index bounds check (it reads via a different API, so the batchOperate seam doesn't reach it).
Caller-side bounded-wait timeout tests remain get-only; Create/PreviousOutputsDecorate/SetLocked/SetDAHForChildRecords timeout arms are covered only by the live-Aerospike run.
#2 (setLockedBatch same-pool child recursion) still a latent deadlock now bounded to a timeout — good follow-up, not a blocker.

Approval stands.

ordishs

Approve. Both pre-merge items from the earlier review are resolved and verified.

Create errCh is now buffered-1 (create.go:185), consistent with every other completion channel; the timeout/ctx-departure path no longer relies on send-on-closed-channel recovery.
BatchOperate seam centralized into (*Store).batchOperate (aerospike.go:162) — all dispatch fns route through it.
Dispatch failure-path coverage added in batch_dispatch_leak_test.go for increment / setDAH / locked / spend / store across panic / BatchOperate-error / ok branches, asserting no item is orphaned.

Verified locally: the new dispatch/panic/bounded-wait suite passes under -race, go vet clean.

Residual follow-ups (non-blocking): no unit test for sendOutpointBatch's panic guard + index bounds-check; caller-side bounded-wait tests remain get-only; setLockedBatch same-pool child recursion is now bounded to a timeout rather than eliminated.

LGTM to merge.

freemans13

Approving — root-cause analysis is precise and the fix is correctly layered (panic safety net + bounded batcherWait keystone + nested-retry removal), with strong race-tested coverage across every dispatch fn. Verified locally: go vet clean, 20 new tests pass under -race.

Genuine bugs fixed alongside the leak (not just symptoms): setLockedBatch os.Exit(1)→per-item error, sendIncrementBatch index desync + wrong-err logging, sendSetDAHBatch nil-slot nil-deref + format bug, sendOutpointBatch output-index bounds check.

Non-blocking follow-ups to consider:

batcherWait clocks from enqueue while TotalTimeout bounds only the batch op — under a deep backlog a healthy-but-slow batch could trip a spurious ServiceUnavailable. Worth sanity-checking the 30s grace against peak queueing latency.
Create now returns on ctx-cancel/timeout while sendStoreBatch completes the write independently — callers must tolerate 'error returned, write may still land' (likely fine given TxExists idempotency). Worth a doc note.
setLockedBatch child recursion still pins a worker for up to batcherWait — correctly bounded now, full elimination tracked in #1033.

Minor: the bounded-wait block is hand-inlined in 4 places while SetLocked got a waitForLockedResult helper — a single shared helper would prevent drift.

Resolves a semantic conflict in stores/utxo/aerospike/locked.go where both branches rewrote setLockedBatch: - upstream bsv-blockchain#1025 ("stop caller-side leak"): trySignal, handled[] tracking, panic-recovery defer (signalBatchPanic), placeholderKey key-error handling, batchOperate wrapper, and bounded waitForLockedResult — but kept the child self-requeue (lockedBatcher.PutCtx from inside the callback). - this branch (bsv-blockchain#928): removed the self-requeue, handling child records inline, because bsv-blockchain#928 closes the lockedBatcher on shutdown and the self-requeue then Puts into a closed channel (panic in an errgroup goroutine -> process crash), which upstream's bounded wait does not prevent. Resolution: keep ALL of upstream's scaffolding and replace the child self-requeue with the inline child-record BatchOperate (using batchOperate + trySignal). Verified: build + go vet (incl. -tags aerospike); aerospike close-drain + lock/multirecord/duplicate-spend tests pass against a real container; daemon closeStores tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

oskarszoon requested review from icellan and ordishs June 2, 2026 19:34

ordishs approved these changes Jun 2, 2026

View reviewed changes

oskarszoon added 4 commits June 4, 2026 09:59

Merge remote-tracking branch 'upstream/main' into fix/aerospike-conns

69ff92f

test(utxo/aerospike): fix gci import grouping in dispatch leak test

54b1f1b

oskarszoon mentioned this pull request Jun 4, 2026

setLockedBatch: same-pool child-record recursion can exhaust the lockedBatcher worker pool #1033

Open

oskarszoon requested a review from ordishs June 4, 2026 11:56

oskarszoon requested a review from freemans13 June 4, 2026 14:09

ordishs reviewed Jun 4, 2026

View reviewed changes

ordishs approved these changes Jun 4, 2026

View reviewed changes

freemans13 approved these changes Jun 4, 2026

View reviewed changes

oskarszoon merged commit c011f52 into bsv-blockchain:main Jun 4, 2026
36 of 37 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(utxo/aerospike): stop caller-side goroutine/connection leak in batchers#1025

fix(utxo/aerospike): stop caller-side goroutine/connection leak in batchers#1025
oskarszoon merged 5 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/aerospike-conns

oskarszoon commented Jun 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

ordishs left a comment

Uh oh!

sonarqubecloud Bot commented Jun 4, 2026

Uh oh!

ordishs left a comment

Uh oh!

ordishs left a comment

Uh oh!

freemans13 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

oskarszoon commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause (teranode-side, not the v8 client)

Changes

Verification

Review follow-ups (@ordishs)

Uh oh!

github-actions Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key Strengths

Implementation Quality

Observations

Uh oh!

github-actions Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark Comparison Report

Summary

Uh oh!

ordishs left a comment

Choose a reason for hiding this comment

Strengths

Concerns

Nits

Uh oh!

sonarqubecloud Bot commented Jun 4, 2026

Quality Gate failed

Uh oh!

ordishs left a comment

Choose a reason for hiding this comment

Re-review (after 9a21b5b + a1ac5c0)

Uh oh!

ordishs left a comment

Choose a reason for hiding this comment

Uh oh!

freemans13 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

oskarszoon commented Jun 2, 2026 •

edited

Loading

github-actions Bot commented Jun 2, 2026 •

edited

Loading

github-actions Bot commented Jun 2, 2026 •

edited

Loading

Re-review (after `9a21b5b` + `a1ac5c0`)