fix(utxo/aerospike): stop caller-side goroutine/connection leak in batchers#1025
Conversation
…tchers
A production node accumulated 13,693 goroutines (8,664 parked in
(*Store).get's select, 4,096 in validator errgroups) that persisted after
the underlying aerospike wedge cleared and client_connections returned to
0. Root cause is teranode-side, not the v8 client:
- go-batcher recovers panics raised in the batch dispatch fn, so a panic
part-way through a sendXxxBatch (e.g. an unchecked type assertion in
getTxFromBins on a malformed bin) left every not-yet-signalled per-item
completion channel orphaned. The submitters then parked forever because
the contexts threaded down from legacy sync/validation carry no deadline.
- sendGetBatch wrapped the already-retrying v8 batch call in another 3x
retry loop, stacking the worst-case stall to ~3x TotalTimeout (~15m).
Fixes:
- Panic safety net (signalBatchPanic + trySignal) on all 7 batch dispatch
fns: a recovered panic now re-signals every per-item channel instead of
orphaning waiters.
- Bound every submitter wait with Store.batcherWait (batch policy
TotalTimeout + grace): get, PreviousOutputsDecorate, Create,
SetDAHForChildRecords, SetLocked. A wedged batcher releases the caller
after the bound rather than for the life of the process.
- sendGetBatch: drop the redundant outer 3x retry; rely on the v8 policy.
- Per-fn correctness:
- setLockedBatch: replace os.Exit(1) on key error with a per-item error;
signal the previously-silent missing-LuaSuccess-bin path; bound the
same-pool child-record recursion (breaks a worker deadlock).
- sendSetDAHBatch: signal + NOOP placeholder on key-skip (was a nil
batchRecords slot -> nil-deref panic and an orphaned errCh); fix a
%s-without-arg format bug.
- sendIncrementBatch: fix index desync after a key-creation skip via
placeholder + handled[] tracking.
- sendSpendBatchLua/executeSpendBatch: replace double-signal on
already-completed buffered-1 items with non-blocking trySignal.
- sendOutpointBatch: bounds-check the previous-output index.
- Route BatchDecorate's BatchOperate through the batchOperateFn test seam.
Tests: signalBatchPanic unit test; sendGetBatch panic-orphan and
bounded-wait regression tests. build/vet/staticcheck/golangci-lint/gofmt
clean; new tests pass under -race; aerospike integration tests (increment,
setDAH, decorate, split-tx, TestAerospike, spend) green.
|
🤖 Claude Code Review Status: Complete SummaryProduction-critical goroutine leak fix with comprehensive panic safety guarantees and bounded waits. The changes directly address the root cause identified in the production incident (13,693 leaked goroutines). Key Strengths
Implementation Qualitybatch_completion.go (new):
All dispatch functions now have:
Additional fixes:
ObservationsNo blocking issues. Code is safe to merge pending human review and passing tests. Note on same-pool recursion (locked.go:187-202): Review methodology: Static analysis only (fork PR security constraints) |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-04 12:09 UTC |
ordishs
left a comment
There was a problem hiding this comment.
Approve with comments. The leak diagnosis is sound and the layered defense (panic-signal + bounded wait) is the right shape. Findings below were verified against the source, not just the diff.
Strengths
- Excellent root-cause writeup; correctly separates the caller-side bug from the v8 wedge and scopes out the latent v8
connection.godeadline-reset issue. trySignalsemantics are correct for the buffered-1 channels (and the doc comment correctly warns against unbuffered use).- Real collateral bugs fixed:
os.Exit(1)-> per-item error insetLockedBatch; index desync after key-skip insendIncrementBatch/sendSetDAHBatch(placeholder +handled[]keepsbatchRecords1:1 withbatch); stale-errin the increment result loop; newbatchRecord.Record == nilguard before.Bins. - Buffering audit holds:
get.done, outpointerrChan, incrementres, spenderrCh, lockederrChare all buffered-1, andSetDAHForChildRecordswas correctly upgraded unbuffered -> buffered-1 here.
Concerns
1. Create's errCh is the lone unbuffered completion channel (Medium - consistency/robustness).
create.go:180 is make(chan error) + defer close(errCh), while every other completion channel in the package is buffered-1. The new ctx.Done()/timeoutCh arms mean Create can now depart before sendStoreBatch sends. I traced this: it does not permanently wedge, because worker sends go through util.SafeSend (has defer recover()) and the deferred close turns a post-departure send into a recovered "send on closed channel" panic. But that relies on recover-on-closed plus a benign select race, whereas everywhere else the send simply lands in the buffer. Recommend make(chan error, 1) for consistency (the resultHandledElsewhere guard ensures at most one send per item, so buffer-1 suffices).
2. setLockedBatch child recursion onto the same pool (Medium - latent, now bounded not eliminated).
setLockedBatch runs on a lockedBatcher worker and re-submits child items to the same lockedBatcher, then waits. Under enough concurrency this is pool-exhaustion deadlock. waitForLockedResult converts the previous permanent hang into a ServiceUnavailable after batcherWait (a strict improvement), but the "breaks a worker deadlock" wording overstates it - multi-record SetLocked can still fail with timeouts under load. Good follow-up: dispatch child records onto a separate pool or process inline. Not a blocker.
3. Dropping the nested retry in sendGetBatch (Low - confirm intent).
Well-justified by the amplification math. Only semantic shift: the old loop retried on any BatchDecorate error, including classes the v8 policy may not retry internally; those now surface immediately. Acceptable since get callers re-request.
4. Test coverage is partial (Low).
New unit tests cover signalBatchPanic, the get panic fan-out, and the get bounded wait. The bounded-wait/panic-guard paths for Create, PreviousOutputsDecorate, sendOutpointBatch, setLockedBatch, sendSpendBatchLua, sendIncrementBatch, sendSetDAHBatch rest on the live-Aerospike integration run. Shared signalBatchPanic/trySignal mitigate this; a table test per dispatch fn's panic guard would lock in the per-fn closures (e.g. the it != nil guard in spend).
Nits
var sendErr error = errors.New...()is redundant - channels are already typed; pass the*errors.Errordirectly totrySignal.
Only ask before merge is #1 (buffered-1 Create errCh). #2 and #4 are good follow-ups, not blockers.
Note: I verified channel buffering, placeholderKey/sendErrorAndClose, batcher pool wiring, and SafeSend semantics against source. I did not run the test suite myself; the PR reports it green.
…e BatchOperate seam Adds no-container unit tests that drive each dispatch fn through its panic, BatchOperate-error, and result paths so the leak-fix branches are covered by the unit-test run that feeds Sonar (the aerospike integration tests run in a separate gate that does not contribute coverage): - sendIncrementBatch, sendSetDAHBatch, setLockedBatch, sendSpendBatchLua, sendStoreBatch: assert every per-item channel is signalled on panic and on BatchOperate error (no orphaned submitters); setLockedBatch also exercises the previously-silent missing-LuaSuccess-bin path. To enable this without a live Aerospike, route the remaining direct s.client.BatchOperate calls through a single Store.batchOperate seam that honours the existing test-only batchOperateFn override (get and create already used it). Behaviour is unchanged when the override is nil.
…chain#1025 review) Make Create's errCh buffered-1, matching every other completion channel in the package. With the new bounded/cancellable wait, Create can depart before sendStoreBatch sends; a buffered channel lets that send land in the buffer rather than relying on the deferred close turning it into a recovered send-on-closed. The resultHandledElsewhere guard already bounds it to one send.
|
ordishs
left a comment
There was a problem hiding this comment.
Re-review (after 9a21b5b + a1ac5c0)
Both pre-merge items from the prior review are resolved, plus an unrequested improvement:
- #1 (Create errCh) — fixed.
create.go:185is nowmake(chan error, 1). With buffered-1 the timeout/ctx-departure race no longer relies on send-on-closed-channel recovery: a late worker send either lands in the buffer or is discarded by the deferred close without panicking. Consistent with every other completion channel now. - Seam centralized (nice-to-have, done). All dispatch fns route through a single
(*Store).batchOperate(aerospike.go:162) instead of scattereds.client.BatchOperatecalls. This is cleaner than the originalBatchDecorate-only seam and makes the new failure-path tests possible across every batcher. - #4 (dispatch failure-path coverage) — largely addressed.
batch_dispatch_leak_test.godrivessendIncrementBatch,sendSetDAHBatch,setLockedBatch,sendSpendBatchLua, andsendStoreBatchthrough panic / BatchOperate-error / ok branches via the seam, asserting every item is signalled (no orphans). Ran the new suite locally under-race— green;go vetclean.
Residual (non-blocking) gaps:
- No unit test for
sendOutpointBatch's panic guard + output-index bounds check (it reads via a different API, so thebatchOperateseam doesn't reach it). - Caller-side bounded-wait timeout tests remain
get-only;Create/PreviousOutputsDecorate/SetLocked/SetDAHForChildRecordstimeout arms are covered only by the live-Aerospike run. - #2 (
setLockedBatchsame-pool child recursion) still a latent deadlock now bounded to a timeout — good follow-up, not a blocker.
Approval stands.
ordishs
left a comment
There was a problem hiding this comment.
Approve. Both pre-merge items from the earlier review are resolved and verified.
- Create errCh is now buffered-1 (
create.go:185), consistent with every other completion channel; the timeout/ctx-departure path no longer relies on send-on-closed-channel recovery. - BatchOperate seam centralized into
(*Store).batchOperate(aerospike.go:162) — all dispatch fns route through it. - Dispatch failure-path coverage added in
batch_dispatch_leak_test.gofor increment / setDAH / locked / spend / store across panic / BatchOperate-error / ok branches, asserting no item is orphaned.
Verified locally: the new dispatch/panic/bounded-wait suite passes under -race, go vet clean.
Residual follow-ups (non-blocking): no unit test for sendOutpointBatch's panic guard + index bounds-check; caller-side bounded-wait tests remain get-only; setLockedBatch same-pool child recursion is now bounded to a timeout rather than eliminated.
LGTM to merge.
freemans13
left a comment
There was a problem hiding this comment.
Approving — root-cause analysis is precise and the fix is correctly layered (panic safety net + bounded batcherWait keystone + nested-retry removal), with strong race-tested coverage across every dispatch fn. Verified locally: go vet clean, 20 new tests pass under -race.
Genuine bugs fixed alongside the leak (not just symptoms): setLockedBatch os.Exit(1)→per-item error, sendIncrementBatch index desync + wrong-err logging, sendSetDAHBatch nil-slot nil-deref + format bug, sendOutpointBatch output-index bounds check.
Non-blocking follow-ups to consider:
batcherWaitclocks from enqueue whileTotalTimeoutbounds only the batch op — under a deep backlog a healthy-but-slow batch could trip a spurious ServiceUnavailable. Worth sanity-checking the 30s grace against peak queueing latency.Createnow returns on ctx-cancel/timeout whilesendStoreBatchcompletes the write independently — callers must tolerate 'error returned, write may still land' (likely fine given TxExists idempotency). Worth a doc note.setLockedBatchchild recursion still pins a worker for up tobatcherWait— correctly bounded now, full elimination tracked in #1033.
Minor: the bounded-wait block is hand-inlined in 4 places while SetLocked got a waitForLockedResult helper — a single shared helper would prevent drift.
Resolves a semantic conflict in stores/utxo/aerospike/locked.go where both branches rewrote setLockedBatch: - upstream bsv-blockchain#1025 ("stop caller-side leak"): trySignal, handled[] tracking, panic-recovery defer (signalBatchPanic), placeholderKey key-error handling, batchOperate wrapper, and bounded waitForLockedResult — but kept the child self-requeue (lockedBatcher.PutCtx from inside the callback). - this branch (bsv-blockchain#928): removed the self-requeue, handling child records inline, because bsv-blockchain#928 closes the lockedBatcher on shutdown and the self-requeue then Puts into a closed channel (panic in an errgroup goroutine -> process crash), which upstream's bounded wait does not prevent. Resolution: keep ALL of upstream's scaffolding and replace the child self-requeue with the inline child-record BatchOperate (using batchOperate + trySignal). Verified: build + go vet (incl. -tags aerospike); aerospike close-drain + lock/multirecord/duplicate-spend tests pass against a real container; daemon closeStores tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


Problem
A production node (
bsva-ovh-teranode-eu-2) accumulated 13,693 goroutines — 8,664 parked in(*Store).get'sselectand 4,096 in validator errgroups — that persisted after the underlying Aerospike wedge cleared andclient_connectionshad returned to 0. A leak that survives the wedge is not the wedge; it's a caller-side bug holding the resources.Root cause (teranode-side, not the v8 client)
dispatchAndRecordwrapsb.fn(batch)indefer recover()). A panic part-way through asendXxxBatch— e.g. an unchecked type assertion ingetTxFromBinson a malformed/missing bin — left every not-yet-signalled per-item completion channel orphaned. The worker survived; the submitters parked forever, because the contexts threaded down from legacy sync / validation carry no deadline, so thegetselect'sctx.Done()arm never fires.sendGetBatchwrapped the already-retrying v8 batch call (MaxRetries withinTotalTimeout=5m) in another 3× retry loop, stacking the worst-case stall to ~3× TotalTimeout (~15m) and growing the submitter backlog faster than it drained under sustained server slowness.The v8 client itself bounds the batch read at
TotalTimeout(5m) and releases the connection — verified againstv8.7.1-bsv3. No v8 change is required to stop this leak. (One latent v8 robustness bug —connection.goresets the socket deadline on every partial read, soSocketTimeoutis per-syscall and a slow-drip server evades it; only bites callers withTotalTimeout=0, which teranode is not — flagged separately, out of scope here.)Changes
signalBatchPanic+trySignal(newbatch_completion.go); a recover-defer on all 7 dispatch fns (sendGetBatch,sendOutpointBatch,sendStoreBatch,sendSpendBatchLua,sendIncrementBatch,sendSetDAHBatch,setLockedBatch) re-signals every per-item channel on panic. Non-blocking, so it never double-delivers or wedges the worker.Store.batcherWait(= batch policyTotalTimeout+ 30s grace) on(*Store).get,PreviousOutputsDecorate,Create,SetDAHForChildRecords,SetLocked. A wedged batcher now releases the caller after the bound instead of for the life of the process.sendGetBatch— rely on the v8 policy's own retries.setLockedBatch:os.Exit(1)on a key error → per-item error; signal the previously-silent missing-LuaSuccess-bin path; bound the same-pool child-record recursion with a timeout — this converts a potential pool-exhaustion hang into a boundedServiceUnavailable; fully eliminating the recursion is tracked in setLockedBatch: same-pool child-record recursion can exhaust the lockedBatcher worker pool #1033.sendSetDAHBatch: signal + NOOP placeholder on key-skip (was a nilbatchRecordsslot → nil-deref panic + orphanederrCh); fix a%s-without-arg format bug.sendIncrementBatch: fix index desync after a key-creation skip (placeholder +handled[]).sendSpendBatchLua/executeSpendBatch: replace double-signal on already-completed buffered-1 items with non-blockingtrySignal.sendOutpointBatch: bounds-check the previous-output index.BatchDecorate'sBatchOperatethrough the existingbatchOperateFntest seam.Verification
go build ./...,go vet,staticcheck,golangci-lint(0 issues),gofmtall clean. New unit tests pass under-race. Live-container Aerospike integration green: increment, setDAH/drift, decorate, split-tx (SetLocked child recursion),TestAerospike, spend (dup-spender / multi-record / unspend / nil-panic). Full package unit suite passes.Review follow-ups (@ordishs)
Create'serrChis now buffered-1, matching every other completion channel (it was the lone unbuffered one; the new bounded wait meantCreatecould depart beforesendStoreBatchsends).batch_dispatch_leak_test.go).setLockedBatch's same-pool child recursion — setLockedBatch: same-pool child-record recursion can exhaust the lockedBatcher worker pool #1033.