Skip to content

fix(utxo/aerospike): stop caller-side goroutine/connection leak in batchers#1025

Merged
oskarszoon merged 5 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/aerospike-conns
Jun 4, 2026
Merged

fix(utxo/aerospike): stop caller-side goroutine/connection leak in batchers#1025
oskarszoon merged 5 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/aerospike-conns

Conversation

@oskarszoon

@oskarszoon oskarszoon commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Problem

A production node (bsva-ovh-teranode-eu-2) accumulated 13,693 goroutines — 8,664 parked in (*Store).get's select and 4,096 in validator errgroups — that persisted after the underlying Aerospike wedge cleared and client_connections had returned to 0. A leak that survives the wedge is not the wedge; it's a caller-side bug holding the resources.

Root cause (teranode-side, not the v8 client)

  1. Orphaned completion channels on panic. go-batcher recovers panics raised inside the batch dispatch fn (dispatchAndRecord wraps b.fn(batch) in defer recover()). A panic part-way through a sendXxxBatch — e.g. an unchecked type assertion in getTxFromBins on a malformed/missing bin — left every not-yet-signalled per-item completion channel orphaned. The worker survived; the submitters parked forever, because the contexts threaded down from legacy sync / validation carry no deadline, so the get select's ctx.Done() arm never fires.
  2. Retry amplification. sendGetBatch wrapped the already-retrying v8 batch call (MaxRetries within TotalTimeout=5m) in another 3× retry loop, stacking the worst-case stall to ~3× TotalTimeout (~15m) and growing the submitter backlog faster than it drained under sustained server slowness.

The v8 client itself bounds the batch read at TotalTimeout (5m) and releases the connection — verified against v8.7.1-bsv3. No v8 change is required to stop this leak. (One latent v8 robustness bug — connection.go resets the socket deadline on every partial read, so SocketTimeout is per-syscall and a slow-drip server evades it; only bites callers with TotalTimeout=0, which teranode is not — flagged separately, out of scope here.)

Changes

  • Panic safety netsignalBatchPanic + trySignal (new batch_completion.go); a recover-defer on all 7 dispatch fns (sendGetBatch, sendOutpointBatch, sendStoreBatch, sendSpendBatchLua, sendIncrementBatch, sendSetDAHBatch, setLockedBatch) re-signals every per-item channel on panic. Non-blocking, so it never double-delivers or wedges the worker.
  • Bounded submitter waits (keystone) — new cached Store.batcherWait (= batch policy TotalTimeout + 30s grace) on (*Store).get, PreviousOutputsDecorate, Create, SetDAHForChildRecords, SetLocked. A wedged batcher now releases the caller after the bound instead of for the life of the process.
  • Drop nested retry in sendGetBatch — rely on the v8 policy's own retries.
  • Per-fn correctness bugs:
    • setLockedBatch: os.Exit(1) on a key error → per-item error; signal the previously-silent missing-LuaSuccess-bin path; bound the same-pool child-record recursion with a timeout — this converts a potential pool-exhaustion hang into a bounded ServiceUnavailable; fully eliminating the recursion is tracked in setLockedBatch: same-pool child-record recursion can exhaust the lockedBatcher worker pool #1033.
    • sendSetDAHBatch: signal + NOOP placeholder on key-skip (was a nil batchRecords slot → nil-deref panic + orphaned errCh); fix a %s-without-arg format bug.
    • sendIncrementBatch: fix index desync after a key-creation skip (placeholder + handled[]).
    • sendSpendBatchLua / executeSpendBatch: replace double-signal on already-completed buffered-1 items with non-blocking trySignal.
    • sendOutpointBatch: bounds-check the previous-output index.
  • Route BatchDecorate's BatchOperate through the existing batchOperateFn test seam.

Verification

go build ./..., go vet, staticcheck, golangci-lint (0 issues), gofmt all clean. New unit tests pass under -race. Live-container Aerospike integration green: increment, setDAH/drift, decorate, split-tx (SetLocked child recursion), TestAerospike, spend (dup-spender / multi-record / unspend / nil-panic). Full package unit suite passes.

Review follow-ups (@ordishs)

…tchers

A production node accumulated 13,693 goroutines (8,664 parked in
(*Store).get's select, 4,096 in validator errgroups) that persisted after
the underlying aerospike wedge cleared and client_connections returned to
0. Root cause is teranode-side, not the v8 client:

- go-batcher recovers panics raised in the batch dispatch fn, so a panic
  part-way through a sendXxxBatch (e.g. an unchecked type assertion in
  getTxFromBins on a malformed bin) left every not-yet-signalled per-item
  completion channel orphaned. The submitters then parked forever because
  the contexts threaded down from legacy sync/validation carry no deadline.
- sendGetBatch wrapped the already-retrying v8 batch call in another 3x
  retry loop, stacking the worst-case stall to ~3x TotalTimeout (~15m).

Fixes:
- Panic safety net (signalBatchPanic + trySignal) on all 7 batch dispatch
  fns: a recovered panic now re-signals every per-item channel instead of
  orphaning waiters.
- Bound every submitter wait with Store.batcherWait (batch policy
  TotalTimeout + grace): get, PreviousOutputsDecorate, Create,
  SetDAHForChildRecords, SetLocked. A wedged batcher releases the caller
  after the bound rather than for the life of the process.
- sendGetBatch: drop the redundant outer 3x retry; rely on the v8 policy.
- Per-fn correctness:
  - setLockedBatch: replace os.Exit(1) on key error with a per-item error;
    signal the previously-silent missing-LuaSuccess-bin path; bound the
    same-pool child-record recursion (breaks a worker deadlock).
  - sendSetDAHBatch: signal + NOOP placeholder on key-skip (was a nil
    batchRecords slot -> nil-deref panic and an orphaned errCh); fix a
    %s-without-arg format bug.
  - sendIncrementBatch: fix index desync after a key-creation skip via
    placeholder + handled[] tracking.
  - sendSpendBatchLua/executeSpendBatch: replace double-signal on
    already-completed buffered-1 items with non-blocking trySignal.
  - sendOutpointBatch: bounds-check the previous-output index.
- Route BatchDecorate's BatchOperate through the batchOperateFn test seam.

Tests: signalBatchPanic unit test; sendGetBatch panic-orphan and
bounded-wait regression tests. build/vet/staticcheck/golangci-lint/gofmt
clean; new tests pass under -race; aerospike integration tests (increment,
setDAH, decorate, split-tx, TestAerospike, spend) green.
@oskarszoon oskarszoon requested review from icellan and ordishs June 2, 2026 19:34
@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete


Summary

Production-critical goroutine leak fix with comprehensive panic safety guarantees and bounded waits. The changes directly address the root cause identified in the production incident (13,693 leaked goroutines).

Key Strengths

  • Correct root cause diagnosis: Panic recovery in go-batcher orphaned completion channels; unbounded waits parked submitters indefinitely
  • Multi-layered defense: Panic guards + bounded waits + non-blocking signals prevent permanent leaks
  • Consistent application: All 7 dispatch functions receive the same treatment (sendGetBatch, sendStoreBatch, sendSpendBatchLua, sendOutpointBatch, sendIncrementBatch, sendSetDAHBatch, setLockedBatch)
  • Comprehensive test coverage: Unit tests cover panic/error/timeout paths using the batchOperateFn seam
  • Well-documented: Inline comments explain the "why" behind each guard

Implementation Quality

batch_completion.go (new):

  • signalBatchPanic: Generic panic safety net with logging and metrics
  • trySignal: Non-blocking send primitive for buffered-1 channels
  • batcherWaitTimeout: Derived from TotalTimeout + 30s grace

All dispatch functions now have:

  1. Deferred panic guard calling signalBatchPanic
  2. trySignal for non-blocking completion delivery
  3. Bounded waits with s.batcherWait timeout
  4. Proper handling of already-signalled channels

Additional fixes:

  • setLockedBatch (locked.go:98): Replaced os.Exit(1) on key error with per-item error signaling
  • sendOutpointBatch (get.go:1390): Added bounds-check for previous-output index
  • sendGetBatch (get.go:1597): Removed nested 3× retry loop (v8 already retries)
  • sendIncrementBatch (spend.go:1159): Fixed index desync with placeholder + handled[] guard
  • sendSetDAHBatch (spend.go:1251): Fixed nil batchRecords slot panic

Observations

No blocking issues. Code is safe to merge pending human review and passing tests.

Note on same-pool recursion (locked.go:187-202):
The recursive child-record submission in setLockedBatch uses the same batcher pool. Under high extraRecords counts, dispatch workers could exhaust themselves waiting on waitForLockedResult for child records queued in the same pool. The bounded timeout prevents permanent deadlock, but workers may block for ~batcherWait (5m) per child when the pool is saturated. Consider capping extraRecords or using direct BatchOperate for children if worker blocking becomes observable under production load.


Review methodology: Static analysis only (fork PR security constraints)

@github-actions

github-actions Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1025 (3ae0f72)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 144
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.591µ 1.624µ ~ 0.700
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 71.14n 71.34n ~ 0.800
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 71.33n 71.39n ~ 0.700
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 71.16n 71.15n ~ 1.000
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 33.83n 33.54n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 57.49n 58.12n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 136.1n 143.0n ~ 0.200
MiningCandidate_Stringify_Short-4 219.2n 222.2n ~ 0.400
MiningCandidate_Stringify_Long-4 1.648µ 1.655µ ~ 0.300
MiningSolution_Stringify-4 858.9n 855.8n ~ 0.400
BlockInfo_MarshalJSON-4 1.744µ 1.738µ ~ 0.100
NewFromBytes-4 132.5n 148.0n ~ 0.400
AddTxBatchColumnar_Validation-4 2.460µ 2.482µ ~ 1.000
OffsetValidationLoop-4 641.1n 640.0n ~ 0.700
Mine_EasyDifficulty-4 63.69µ 61.47µ ~ 0.200
Mine_WithAddress-4 6.897µ 6.988µ ~ 0.100
BlockAssembler_AddTx-4 0.02917n 0.02681n ~ 0.400
AddNode-4 10.67 10.71 ~ 1.000
AddNodeWithMap-4 12.23 11.58 ~ 0.700
DirectSubtreeAdd/4_per_subtree-4 57.77n 58.82n ~ 0.700
DirectSubtreeAdd/64_per_subtree-4 30.01n 30.92n ~ 0.300
DirectSubtreeAdd/256_per_subtree-4 28.84n 29.05n ~ 0.100
DirectSubtreeAdd/1024_per_subtree-4 27.77n 28.04n ~ 0.200
DirectSubtreeAdd/2048_per_subtree-4 27.36n 27.61n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 283.8n 279.1n ~ 0.600
SubtreeProcessorAdd/64_per_subtree-4 276.8n 277.5n ~ 1.000
SubtreeProcessorAdd/256_per_subtree-4 280.1n 277.8n ~ 0.400
SubtreeProcessorAdd/1024_per_subtree-4 270.6n 268.5n ~ 0.100
SubtreeProcessorAdd/2048_per_subtree-4 270.5n 267.4n ~ 0.100
SubtreeProcessorRotate/4_per_subtree-4 275.7n 273.7n ~ 0.200
SubtreeProcessorRotate/64_per_subtree-4 276.6n 276.3n ~ 1.000
SubtreeProcessorRotate/256_per_subtree-4 272.7n 274.5n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 273.4n 274.7n ~ 0.600
SubtreeNodeAddOnly/4_per_subtree-4 54.26n 54.46n ~ 0.200
SubtreeNodeAddOnly/64_per_subtree-4 34.23n 34.29n ~ 0.600
SubtreeNodeAddOnly/256_per_subtree-4 33.11n 33.39n ~ 0.200
SubtreeNodeAddOnly/1024_per_subtree-4 32.63n 32.57n ~ 1.000
SubtreeCreationOnly/4_per_subtree-4 115.7n 114.2n ~ 0.100
SubtreeCreationOnly/64_per_subtree-4 408.8n 403.4n ~ 0.400
SubtreeCreationOnly/256_per_subtree-4 1.389µ 1.414µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 4.437µ 4.438µ ~ 0.700
SubtreeCreationOnly/2048_per_subtree-4 8.021µ 8.395µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 273.8n 272.2n ~ 0.200
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 273.1n 270.8n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 2.213m 2.215m ~ 1.000
ParallelGetAndSetIfNotExists/10k_nodes-4 5.388m 5.408m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 7.404m 7.463m ~ 0.700
ParallelGetAndSetIfNotExists/100k_nodes-4 10.57m 10.44m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 1.951m 1.938m ~ 0.400
SequentialGetAndSetIfNotExists/10k_nodes-4 4.410m 4.445m ~ 1.000
SequentialGetAndSetIfNotExists/50k_nodes-4 12.26m 12.38m ~ 0.400
SequentialGetAndSetIfNotExists/100k_nodes-4 22.29m 22.43m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.239m 2.228m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.171m 8.162m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.18m 13.11m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.978m 1.962m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 7.914m 7.612m ~ 0.200
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 39.91m 40.33m ~ 0.100
DiskTxMap_SetIfNotExists-4 3.945µ 3.893µ ~ 0.400
DiskTxMap_SetIfNotExists_Parallel-4 3.579µ 3.548µ ~ 0.400
DiskTxMap_ExistenceOnly-4 444.2n 517.6n ~ 0.100
Queue-4 191.9n 190.9n ~ 0.700
AtomicPointer-4 3.621n 3.627n ~ 0.300
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 876.1µ 829.0µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/New/10K-4 796.7µ 763.6µ ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/10K-4 116.8µ 110.7µ ~ 0.200
ReorgOptimizations/AllMarkFalse/New/10K-4 64.19µ 64.06µ ~ 0.700
ReorgOptimizations/HashSlicePool/Old/10K-4 58.19µ 62.72µ ~ 0.200
ReorgOptimizations/HashSlicePool/New/10K-4 10.99µ 11.40µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/10K-4 4.660µ 4.718µ ~ 0.700
ReorgOptimizations/NodeFlags/New/10K-4 1.604µ 1.655µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 10.33m 10.76m ~ 0.700
ReorgOptimizations/DedupFilterPipeline/New/100K-4 10.73m 10.75m ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.170m 1.129m ~ 0.700
ReorgOptimizations/AllMarkFalse/New/100K-4 706.4µ 706.9µ ~ 1.000
ReorgOptimizations/HashSlicePool/Old/100K-4 498.3µ 627.0µ ~ 0.100
ReorgOptimizations/HashSlicePool/New/100K-4 198.6µ 207.2µ ~ 0.200
ReorgOptimizations/NodeFlags/Old/100K-4 49.09µ 50.23µ ~ 0.700
ReorgOptimizations/NodeFlags/New/100K-4 16.93µ 17.57µ ~ 0.100
TxMapSetIfNotExists-4 49.82n 49.84n ~ 1.000
TxMapSetIfNotExistsDuplicate-4 41.59n 42.00n ~ 0.100
ChannelSendReceive-4 604.0n 583.6n ~ 0.100
CalcBlockWork-4 506.1n 507.8n ~ 1.000
CalculateWork-4 689.8n 689.1n ~ 0.700
BuildBlockLocatorString_Helpers/Size_10-4 1.351µ 1.354µ ~ 0.800
BuildBlockLocatorString_Helpers/Size_100-4 12.83µ 13.04µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_1000-4 159.3µ 126.5µ ~ 0.100
CatchupWithHeaderCache-4 104.8m 104.8m ~ 0.700
_BufferPoolAllocation/16KB-4 3.917µ 4.053µ ~ 0.100
_BufferPoolAllocation/32KB-4 8.675µ 10.531µ ~ 0.100
_BufferPoolAllocation/64KB-4 19.17µ 17.96µ ~ 0.700
_BufferPoolAllocation/128KB-4 36.51µ 36.75µ ~ 0.200
_BufferPoolAllocation/512KB-4 105.8µ 127.2µ ~ 0.100
_BufferPoolConcurrent/32KB-4 19.16µ 23.33µ ~ 0.100
_BufferPoolConcurrent/64KB-4 33.95µ 32.85µ ~ 0.700
_BufferPoolConcurrent/512KB-4 156.2µ 152.6µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/16KB-4 632.1µ 679.7µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/32KB-4 635.3µ 684.7µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 638.8µ 685.5µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 638.8µ 681.2µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 666.0µ 657.0µ ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.76m 36.82m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.84m 36.50m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.65m 36.55m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.59m 36.49m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/512KB-4 36.35m 36.28m ~ 1.000
_PooledVsNonPooled/Pooled-4 741.1n 740.3n ~ 0.700
_PooledVsNonPooled/NonPooled-4 8.750µ 8.532µ ~ 0.400
_MemoryFootprint/Current_512KB_32concurrent-4 6.786µ 6.533µ ~ 0.200
_MemoryFootprint/Proposed_32KB_32concurrent-4 11.84µ 10.66µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 10.78µ 10.67µ ~ 1.000
_prepareTxsPerLevel-4 410.6m 409.2m ~ 0.700
_prepareTxsPerLevelOrdered-4 3.597m 3.667m ~ 0.200
_prepareTxsPerLevel_Comparison/Original-4 411.8m 413.9m ~ 0.700
_prepareTxsPerLevel_Comparison/Optimized-4 3.672m 3.652m ~ 0.700
SubtreeSizes/10k_tx_4_per_subtree-4 1.351m 1.355m ~ 1.000
SubtreeSizes/10k_tx_16_per_subtree-4 316.3µ 318.2µ ~ 1.000
SubtreeSizes/10k_tx_64_per_subtree-4 76.57µ 75.85µ ~ 0.700
SubtreeSizes/10k_tx_256_per_subtree-4 19.17µ 19.03µ ~ 0.100
SubtreeSizes/10k_tx_512_per_subtree-4 9.584µ 9.399µ ~ 0.100
SubtreeSizes/10k_tx_1024_per_subtree-4 4.773µ 4.680µ ~ 0.100
SubtreeSizes/10k_tx_2k_per_subtree-4 2.349µ 2.304µ ~ 0.100
BlockSizeScaling/10k_tx_64_per_subtree-4 74.73µ 74.01µ ~ 0.400
BlockSizeScaling/10k_tx_256_per_subtree-4 18.93µ 18.84µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.666µ 4.677µ ~ 1.000
BlockSizeScaling/50k_tx_64_per_subtree-4 391.8µ 389.1µ ~ 0.100
BlockSizeScaling/50k_tx_256_per_subtree-4 94.28µ 93.32µ ~ 0.200
BlockSizeScaling/50k_tx_1024_per_subtree-4 23.02µ 23.09µ ~ 0.700
SubtreeAllocations/small_subtrees_exists_check-4 156.0µ 156.2µ ~ 1.000
SubtreeAllocations/small_subtrees_data_fetch-4 172.6µ 167.4µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 325.2µ 323.2µ ~ 0.400
SubtreeAllocations/medium_subtrees_exists_check-4 9.138µ 9.200µ ~ 0.400
SubtreeAllocations/medium_subtrees_data_fetch-4 9.862µ 9.941µ ~ 0.700
SubtreeAllocations/medium_subtrees_full_validation-4 18.97µ 18.78µ ~ 0.700
SubtreeAllocations/large_subtrees_exists_check-4 2.182µ 2.194µ ~ 0.700
SubtreeAllocations/large_subtrees_data_fetch-4 2.421µ 2.401µ ~ 0.300
SubtreeAllocations/large_subtrees_full_validation-4 4.800µ 4.718µ ~ 0.200
StoreBlock_Sequential/BelowCSVHeight-4 344.2µ 336.9µ ~ 0.400
StoreBlock_Sequential/AboveCSVHeight-4 336.9µ 337.9µ ~ 0.200
GetUtxoHashes-4 270.1n 265.6n ~ 0.700
GetUtxoHashes_ManyOutputs-4 50.92µ 46.35µ ~ 0.100
_NewMetaDataFromBytes-4 234.1n 233.4n ~ 0.700
_Bytes-4 411.8n 411.8n ~ 1.000
_MetaBytes-4 140.3n 140.6n ~ 0.700

Threshold: >10% with p < 0.05 | Generated: 2026-06-04 12:09 UTC

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve with comments. The leak diagnosis is sound and the layered defense (panic-signal + bounded wait) is the right shape. Findings below were verified against the source, not just the diff.

Strengths

  • Excellent root-cause writeup; correctly separates the caller-side bug from the v8 wedge and scopes out the latent v8 connection.go deadline-reset issue.
  • trySignal semantics are correct for the buffered-1 channels (and the doc comment correctly warns against unbuffered use).
  • Real collateral bugs fixed: os.Exit(1) -> per-item error in setLockedBatch; index desync after key-skip in sendIncrementBatch/sendSetDAHBatch (placeholder + handled[] keeps batchRecords 1:1 with batch); stale-err in the increment result loop; new batchRecord.Record == nil guard before .Bins.
  • Buffering audit holds: get.done, outpoint errChan, increment res, spend errCh, locked errCh are all buffered-1, and SetDAHForChildRecords was correctly upgraded unbuffered -> buffered-1 here.

Concerns

1. Create's errCh is the lone unbuffered completion channel (Medium - consistency/robustness).
create.go:180 is make(chan error) + defer close(errCh), while every other completion channel in the package is buffered-1. The new ctx.Done()/timeoutCh arms mean Create can now depart before sendStoreBatch sends. I traced this: it does not permanently wedge, because worker sends go through util.SafeSend (has defer recover()) and the deferred close turns a post-departure send into a recovered "send on closed channel" panic. But that relies on recover-on-closed plus a benign select race, whereas everywhere else the send simply lands in the buffer. Recommend make(chan error, 1) for consistency (the resultHandledElsewhere guard ensures at most one send per item, so buffer-1 suffices).

2. setLockedBatch child recursion onto the same pool (Medium - latent, now bounded not eliminated).
setLockedBatch runs on a lockedBatcher worker and re-submits child items to the same lockedBatcher, then waits. Under enough concurrency this is pool-exhaustion deadlock. waitForLockedResult converts the previous permanent hang into a ServiceUnavailable after batcherWait (a strict improvement), but the "breaks a worker deadlock" wording overstates it - multi-record SetLocked can still fail with timeouts under load. Good follow-up: dispatch child records onto a separate pool or process inline. Not a blocker.

3. Dropping the nested retry in sendGetBatch (Low - confirm intent).
Well-justified by the amplification math. Only semantic shift: the old loop retried on any BatchDecorate error, including classes the v8 policy may not retry internally; those now surface immediately. Acceptable since get callers re-request.

4. Test coverage is partial (Low).
New unit tests cover signalBatchPanic, the get panic fan-out, and the get bounded wait. The bounded-wait/panic-guard paths for Create, PreviousOutputsDecorate, sendOutpointBatch, setLockedBatch, sendSpendBatchLua, sendIncrementBatch, sendSetDAHBatch rest on the live-Aerospike integration run. Shared signalBatchPanic/trySignal mitigate this; a table test per dispatch fn's panic guard would lock in the per-fn closures (e.g. the it != nil guard in spend).

Nits

  • var sendErr error = errors.New...() is redundant - channels are already typed; pass the *errors.Error directly to trySignal.

Only ask before merge is #1 (buffered-1 Create errCh). #2 and #4 are good follow-ups, not blockers.

Note: I verified channel buffering, placeholderKey/sendErrorAndClose, batcher pool wiring, and SafeSend semantics against source. I did not run the test suite myself; the PR reports it green.

…e BatchOperate seam

Adds no-container unit tests that drive each dispatch fn through its panic,
BatchOperate-error, and result paths so the leak-fix branches are covered by
the unit-test run that feeds Sonar (the aerospike integration tests run in a
separate gate that does not contribute coverage):

- sendIncrementBatch, sendSetDAHBatch, setLockedBatch, sendSpendBatchLua,
  sendStoreBatch: assert every per-item channel is signalled on panic and on
  BatchOperate error (no orphaned submitters); setLockedBatch also exercises
  the previously-silent missing-LuaSuccess-bin path.

To enable this without a live Aerospike, route the remaining direct
s.client.BatchOperate calls through a single Store.batchOperate seam that
honours the existing test-only batchOperateFn override (get and create already
used it). Behaviour is unchanged when the override is nil.
…chain#1025 review)

Make Create's errCh buffered-1, matching every other completion channel in
the package. With the new bounded/cancellable wait, Create can depart before
sendStoreBatch sends; a buffered channel lets that send land in the buffer
rather than relying on the deferred close turning it into a recovered
send-on-closed. The resultHandledElsewhere guard already bounds it to one send.
@sonarqubecloud

sonarqubecloud Bot commented Jun 4, 2026

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
74.4% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

@oskarszoon oskarszoon requested a review from freemans13 June 4, 2026 14:09

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review (after 9a21b5b + a1ac5c0)

Both pre-merge items from the prior review are resolved, plus an unrequested improvement:

  • #1 (Create errCh) — fixed. create.go:185 is now make(chan error, 1). With buffered-1 the timeout/ctx-departure race no longer relies on send-on-closed-channel recovery: a late worker send either lands in the buffer or is discarded by the deferred close without panicking. Consistent with every other completion channel now.
  • Seam centralized (nice-to-have, done). All dispatch fns route through a single (*Store).batchOperate (aerospike.go:162) instead of scattered s.client.BatchOperate calls. This is cleaner than the original BatchDecorate-only seam and makes the new failure-path tests possible across every batcher.
  • #4 (dispatch failure-path coverage) — largely addressed. batch_dispatch_leak_test.go drives sendIncrementBatch, sendSetDAHBatch, setLockedBatch, sendSpendBatchLua, and sendStoreBatch through panic / BatchOperate-error / ok branches via the seam, asserting every item is signalled (no orphans). Ran the new suite locally under -race — green; go vet clean.

Residual (non-blocking) gaps:

  • No unit test for sendOutpointBatch's panic guard + output-index bounds check (it reads via a different API, so the batchOperate seam doesn't reach it).
  • Caller-side bounded-wait timeout tests remain get-only; Create/PreviousOutputsDecorate/SetLocked/SetDAHForChildRecords timeout arms are covered only by the live-Aerospike run.
  • #2 (setLockedBatch same-pool child recursion) still a latent deadlock now bounded to a timeout — good follow-up, not a blocker.

Approval stands.

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve. Both pre-merge items from the earlier review are resolved and verified.

  • Create errCh is now buffered-1 (create.go:185), consistent with every other completion channel; the timeout/ctx-departure path no longer relies on send-on-closed-channel recovery.
  • BatchOperate seam centralized into (*Store).batchOperate (aerospike.go:162) — all dispatch fns route through it.
  • Dispatch failure-path coverage added in batch_dispatch_leak_test.go for increment / setDAH / locked / spend / store across panic / BatchOperate-error / ok branches, asserting no item is orphaned.

Verified locally: the new dispatch/panic/bounded-wait suite passes under -race, go vet clean.

Residual follow-ups (non-blocking): no unit test for sendOutpointBatch's panic guard + index bounds-check; caller-side bounded-wait tests remain get-only; setLockedBatch same-pool child recursion is now bounded to a timeout rather than eliminated.

LGTM to merge.

@freemans13 freemans13 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving — root-cause analysis is precise and the fix is correctly layered (panic safety net + bounded batcherWait keystone + nested-retry removal), with strong race-tested coverage across every dispatch fn. Verified locally: go vet clean, 20 new tests pass under -race.

Genuine bugs fixed alongside the leak (not just symptoms): setLockedBatch os.Exit(1)→per-item error, sendIncrementBatch index desync + wrong-err logging, sendSetDAHBatch nil-slot nil-deref + format bug, sendOutpointBatch output-index bounds check.

Non-blocking follow-ups to consider:

  1. batcherWait clocks from enqueue while TotalTimeout bounds only the batch op — under a deep backlog a healthy-but-slow batch could trip a spurious ServiceUnavailable. Worth sanity-checking the 30s grace against peak queueing latency.
  2. Create now returns on ctx-cancel/timeout while sendStoreBatch completes the write independently — callers must tolerate 'error returned, write may still land' (likely fine given TxExists idempotency). Worth a doc note.
  3. setLockedBatch child recursion still pins a worker for up to batcherWait — correctly bounded now, full elimination tracked in #1033.

Minor: the bounded-wait block is hand-inlined in 4 places while SetLocked got a waitForLockedResult helper — a single shared helper would prevent drift.

@oskarszoon oskarszoon merged commit c011f52 into bsv-blockchain:main Jun 4, 2026
36 of 37 checks passed
freemans13 added a commit to freemans13/teranode that referenced this pull request Jun 5, 2026
Resolves a semantic conflict in stores/utxo/aerospike/locked.go where both
branches rewrote setLockedBatch:

- upstream bsv-blockchain#1025 ("stop caller-side leak"): trySignal, handled[] tracking,
  panic-recovery defer (signalBatchPanic), placeholderKey key-error handling,
  batchOperate wrapper, and bounded waitForLockedResult — but kept the child
  self-requeue (lockedBatcher.PutCtx from inside the callback).
- this branch (bsv-blockchain#928): removed the self-requeue, handling child records inline,
  because bsv-blockchain#928 closes the lockedBatcher on shutdown and the self-requeue then
  Puts into a closed channel (panic in an errgroup goroutine -> process crash),
  which upstream's bounded wait does not prevent.

Resolution: keep ALL of upstream's scaffolding and replace the child
self-requeue with the inline child-record BatchOperate (using batchOperate +
trySignal). Verified: build + go vet (incl. -tags aerospike); aerospike
close-drain + lock/multirecord/duplicate-spend tests pass against a real
container; daemon closeStores tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants