fix(blockassembly): guard subtree worker ErrChan send against shutdown deadlock#1028
Conversation
…n deadlock The subtree storage worker sends its result on the caller's ErrChan with a bare, unbuffered send. The caller (SubtreeProcessor) abandons the matching receive on its own context cancellation (select with processorCtx.Done), so on shutdown the worker's send blocks forever — which hangs runNewSubtreeListener's wg.Wait() and deadlocks block-assembly shutdown. The resultChan send just above it is already ctx-guarded; the ErrChan send was not. Extract the send into sendCallerErr, which selects on ctx.Done so it releases on shutdown. Dropping the result once ctx is cancelled is safe — the caller is no longer waiting for it. Add tests: sendCallerErr releases on ctx cancel with no reader (reverting it to a bare send fails this with the deadlock), delivers to a present reader, and is a no-op for a nil channel.
|
🤖 Claude Code Review Status: Complete Current Review: No issues found. This is a well-designed fix for a subtle shutdown deadlock. Summary: The PR correctly addresses a shutdown deadlock where the subtree storage worker blocked indefinitely on an unbuffered ErrChan send when the SubtreeProcessor receiver abandoned its receive due to context cancellation. The fix extracts the send into sendCallerErr, which uses a select statement to abandon the send when ctx is cancelled, mirroring the existing pattern for resultChan. Strengths:
|
|
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-03 11:01 UTC |
ordishs
left a comment
There was a problem hiding this comment.
Approve. Verified the full deadlock chain: the worker's bare unbuffered send on work.request.ErrChan had no receiver during shutdown (the SubtreeProcessor abandons the matching receive on processorCtx.Done()), hanging runNewSubtreeListener's wg.Wait(). The sendCallerErr ctx-guard faithfully mirrors the existing resultChan pattern. Minimal, well-documented, and the AbandonsOnCancel test genuinely reproduces the bug. go test -race passes locally.



Bug (shutdown deadlock)
The subtree storage worker (
subtreeStorageWorker) sends its result back to the caller onwork.request.ErrChanwith a bare, unbuffered send:But the caller — the
SubtreeProcessor— abandons the matching receive on its own context cancellation:So on shutdown the worker's send has no receiver and blocks forever. That keeps the worker's
for work := range workChanloop from returning, sorunNewSubtreeListener'swg.Wait()(in thectx.Done()shutdown path) never completes — block-assembly shutdown deadlocks. TheresultChansend immediately above is alreadyselect-guarded onctx.Done(); thisErrChansend was the one that wasn't.Fix
Extract the send into
sendCallerErr, whichselects onctx.Done()and so releases on shutdown. Dropping the result once ctx is cancelled is safe — the caller has already abandoned the receive and isn't waiting for it. This mirrors the existing ctx-guardedresultChansend.Tests
send_caller_err_test.go:sendCallerErrblocks until ctx is cancelled, then returns. Reverting the helper to a bare send makes this fail (reproduces the deadlock: "blocked after ctx cancel").Verification
AbandonsOnCancel(hangs → fatal); guarded version passes.go test -raceon the new tests: pass.go build,go vet,gofmt,gci: clean.Second fix from the concurrency/lifecycle audit (after #1027, the kafka final-drain). The audit's remaining items: the netsync
Queue*shutdown hang and theDiskTxMap.bytesWrittenrace.