fix(legacy): prevent shutdown-race panic in QueueInv (send on closed channel)#1009
Conversation
…channel) Inv delivery runs on peer read-loop goroutines (Peer.inHandler -> OnInv -> SyncManager.QueueInv), but the channels QueueInv sends to are torn down by a different goroutine during shutdown: - legacyKafkaInvCh is closed by the kafka async producer's Stop() (util/kafka/kafka_producer_async.go), which fires on context cancellation / SIGTERM; and - msgChan stops being drained once the block handler exits on quit. QueueInv guards with an atomic shutdown-flag check, but a flag check and a channel send are not atomic against a concurrent close, so a late inv could still hit a closed channel and panic — and "send on closed channel" is an unrecoverable panic that crashes the whole process. In practice this fired on nearly every SIGTERM taken while syncing (peers are continuously announcing invs), turning routine restarts into hard crashes mid-shutdown. Route QueueInv's sends through sendDuringShutdown, a tiny generic helper that recovers from the send-on-closed-channel panic and drops the inv. Dropping an inv during shutdown is safe: inv is an advisory announcement, re-sent by the peer (or a later session) on reconnect. The shutdown-flag check is kept as the cheap common-case short-circuit. Adds tests: the helper returns false (no panic) on a closed channel, and QueueInv no longer panics when legacyKafkaInvCh / msgChan are closed underneath it (kafka, block, and non-kafka paths). Note: the deeper cause is the kafka producer closing its *input* channel (a consumer closing a channel that still has senders); that lives in shared infra and is left for a separate change. This fix makes the legacy inv path robust to it regardless.
|
🤖 Claude Code Review Status: Complete Current Review: This PR fixes a critical shutdown-race panic in Issues found:
Positive observations:
Recommendation: Address the other Queue* functions before merge to prevent the same crash through different code paths. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-02 09:05 UTC |
| @@ -2097,6 +2097,28 @@ func (sm *SyncManager) QueueBlock(block *bsvutil.Block, peer *peerpkg.Peer, done | |||
| sm.msgChan <- &blockMsg{block: block, peer: peer, reply: done} | |||
There was a problem hiding this comment.
[Major] The same shutdown race exists in QueueBlock (line 2097), QueueTx (line 2084), and QueueHeaders (line 2186).
All three functions follow the identical pattern:
- Check
atomic.LoadInt32(&sm.shutdown) - Perform channel send to
sm.msgChan
Like QueueInv, these are called from peer read-loop goroutines while msgChan is closed/drained on shutdown by a different goroutine. The atomic check and channel send are not atomic against concurrent close.
Current code at manager.go:2097:
sm.msgChan <- &blockMsg{block: block, peer: peer, reply: done}Should be:
sendDuringShutdown[interface{}](sm.msgChan, &blockMsg{block: block, peer: peer, reply: done})Same fix needed for QueueTx:2084 and QueueHeaders:2186.
| // write to Kafka | ||
| sm.logger.Debugf("writing INV message to Kafka from peer %s, length: %d", peer.String(), len(value)) | ||
| sm.legacyKafkaInvCh <- &kafka.Message{ | ||
| sendDuringShutdown(sm.legacyKafkaInvCh, &kafka.Message{ |
There was a problem hiding this comment.
[Minor] The return value of sendDuringShutdown is not used. While dropping invs during shutdown is safe (as documented), the function returns bool to indicate whether the send succeeded, which might be useful for logging or metrics.
Consider either:
- Using the return value for observability:
if !sendDuringShutdown(sm.legacyKafkaInvCh, &kafka.Message{Value: value}) {
sm.logger.Debugf("dropped inv to closed kafka channel during shutdown")
}- Or changing the signature to not return a value if it will never be used.
|
There was a problem hiding this comment.
Pull request overview
Prevents a legacy sync shutdown race from crashing the process with panic: send on closed channel when SyncManager.QueueInv sends to channels that may be closed concurrently during teardown (notably the Kafka async producer input channel).
Changes:
- Added a small
sendDuringShutdownhelper that recovers fromsend on closed channeland reports whether the send succeeded. - Routed
QueueInv’s sends (tomsgChanandlegacyKafkaInvCh) through the helper to avoid process-killing panics. - Added tests covering helper behavior and ensuring
QueueInvdoes not panic when target channels are closed.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| services/legacy/netsync/manager.go | Introduces sendDuringShutdown and uses it in QueueInv to prevent shutdown-time panics when channels are closed concurrently. |
| services/legacy/netsync/manager_test.go | Adds unit tests to validate the helper and regression tests to ensure QueueInv doesn’t panic when channels are closed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func sendDuringShutdown[T any](ch chan T, v T) (sent bool) { | ||
| defer func() { | ||
| if recover() != nil { | ||
| sent = false | ||
| } |
ordishs
left a comment
There was a problem hiding this comment.
Approve. Solid, well-reasoned fix for a real process-wide crash, with an excellent write-up of root cause and downstream impact.
Verified the core claim: KafkaAsyncProducer.Stop() does close(ch) on the same channel passed to Start() (i.e. legacyKafkaInvCh), so that path is a genuine send-on-closed panic source and sendDuringShutdown correctly neutralizes it. The recover() is appropriately narrow in practice — a channel send can only panic for send-on-closed, so it won't mask unrelated bugs.
One non-blocking observation: msgChan is never close()d in production (only in the tests) — the block handler stops draining it but doesn't close it. So on the msgChan paths the send blocks once the buffer fills rather than panicking, and recover() never fires there. The wrapping is harmless belt-and-suspenders, but the doc comment slightly overstates "the block handler stops draining msgChan" as a panic source on par with the kafka close — the only real panic vector is legacyKafkaInvCh. Worth tightening the comment (and noting the msgChan-closed tests exercise helper behaviour rather than a faithful prod scenario). Not a blocker.



Problem
Legacy sync crashes the whole process on shutdown with
panic: send on closed channel, originating inSyncManager.QueueInv:Inv delivery runs on peer read-loop goroutines (
Peer.inHandler → OnInv → QueueInv), but the channelsQueueInvsends to are torn down by a different goroutine during shutdown:legacyKafkaInvChis closed by the kafka async producer'sStop()(util/kafka/kafka_producer_async.go), which fires on context cancellation / SIGTERM;msgChanstops being drained once the block handler exits onquit.QueueInvguards with an atomic shutdown-flag check, but a flag check and a channel send are not atomic against a concurrentclose— so a late inv slips past the check and sends on an already-closed channel.send on closed channelis an unrecoverable panic, so it takes the whole node down. Because peers announce invs continuously while syncing, this fires on nearly every SIGTERM taken mid-sync, turning routine restarts into hard crashes.Observed downstream impact
The crash happens mid-shutdown, so graceful teardown never completes: in-flight UTXO-store postgres queries are abandoned with their server-side backends still alive holding locks (e.g. on
txs_p00). On the next boot, the UTXO store's schema step (ALTER TABLE/CREATE INDEXontxs_p00) blocks on that orphaned lock → the node won't boot until the stuck backend is manually terminated. So a single shutdown-race panic can wedge the node.Fix
Route
QueueInv's sends throughsendDuringShutdown, a small generic helper that recovers from the send-on-closed-channel panic and drops the inv. Dropping an inv during shutdown is safe — inv is an advisory announcement, re-sent by the peer (or a later session) on reconnect. The atomic shutdown-flag check is kept as the cheap common-case short-circuit.Tests
TestSendDuringShutdown: helper delivers on an open channel and returnsfalse(no panic) on a closed one.TestQueueInv_NoPanicWhenChannelsClosedDuringShutdown:QueueInvno longer panics whenlegacyKafkaInvCh/msgChanare closed underneath it (kafka, block, and non-kafka paths).services/legacy/netsyncsuite passes under-race(120 tests);go vet+go build ./...clean.Note / follow-up
The deeper cause is the kafka producer closing its input channel (a consumer closing a channel that still has senders,
kafka_producer_async.goStop()→close(ch)). That lives in shared infra used by many services, so it's left for a separate change; this fix makes the legacy inv path robust regardless.