eth, eth/protocols/eth: fix goroutine leaks in witness request path#2157
Conversation
The parallel-stateless-import feature leaked goroutines in
RequestWitnessesWithVerification when peers disconnected or stopped
responding. Each stalled witness request leaked 2-4 goroutines that
pinned witness data and channels in memory indefinitely.
Three root causes:
1. doWitnessRequest spawned a goroutine that blocked forever on the
per-request response channel (<-witResCh) with no timeout or
cancellation. When a peer disconnected, the wit dispatcher exited
without signaling pending sink channels, so the goroutine never
returned. This cascaded: the WaitGroup never completed, witReqResCh
never closed, and the adapter goroutine also leaked.
2. eth.Request.Close() returned early without closing the Cancel
channel when peer was nil. The witness adapter returns a shim
eth.Request (peer == nil) to the concurrent fetcher. When the
fetcher called Close() on timeout or disconnect, Cancel was never
closed, so the adapter goroutine's select on Cancel never fired.
3. The adapter spawned fire-and-forget goroutines (go func(){<-doneCh}())
to drain the Done channel on synthetic eth.Response objects. Nothing
ever wrote to these channels, so the goroutines leaked unconditionally.
Fixes:
- eth.Request.Close() now always closes Cancel even when peer is nil.
- doWitnessRequest selects on a cancel channel in both receive and
forward paths, with proper WaitGroup/semaphore cleanup on exit.
- Replace doneCh drainer goroutines with buffered channels.
- Adapter defers closing underlying wit.Request objects on exit.
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review.
Tip: disable this comment in your organization's Code Review settings.
|
@claude review |
|
Claude finished @cffls's task in 1m 24s —— View job Code Review: Fix goroutine leaks in witness request path
Overall this is a well-motivated fix for a real goroutine leak problem. The three root causes are clearly identified and the fixes are targeted. I have a few observations, one of which is a potential correctness issue: Issue:
|
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (70.96%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #2157 +/- ##
===========================================
+ Coverage 51.51% 51.57% +0.05%
===========================================
Files 882 882
Lines 154140 154164 +24
===========================================
+ Hits 79403 79504 +101
+ Misses 69559 69494 -65
+ Partials 5178 5166 -12
... and 23 files with indirect coverage changes
🚀 New features to boost your workflow:
|
668179d to
75d4eb8
Compare
|
kamuikatsurgi
left a comment
There was a problem hiding this comment.
LGTM - I would test that a node can do a graceful shutdown. We've faced issues with go routines blocking bor from doing a graceful shutdown.
Thanks - tested multiple times locally and the node was able to shutdown gracefully. |
…t path (0xPolygon#2157)" This reverts commit 0d6ee4b.


Description
The parallel-stateless-import feature leaked goroutines in RequestWitnessesWithVerification when peers disconnected or stopped responding. Each stalled witness request leaked 2-4 goroutines that pinned witness data and channels in memory indefinitely.
Three root causes:
doWitnessRequest spawned a goroutine that blocked forever on the per-request response channel (<-witResCh) with no timeout or cancellation. When a peer disconnected, the wit dispatcher exited without signaling pending sink channels, so the goroutine never returned. This cascaded: the WaitGroup never completed, witReqResCh never closed, and the adapter goroutine also leaked.
eth.Request.Close() returned early without closing the Cancel channel when peer was nil. The witness adapter returns a shim eth.Request (peer == nil) to the concurrent fetcher. When the fetcher called Close() on timeout or disconnect, Cancel was never closed, so the adapter goroutine's select on Cancel never fired.
The adapter spawned fire-and-forget goroutines (go func(){<-doneCh}()) to drain the Done channel on synthetic eth.Response objects. Nothing ever wrote to these channels, so the goroutines leaked unconditionally.
Fixes:
Changes
Breaking changes
Please complete this section if any breaking changes have been made, otherwise delete it
Nodes audience
In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)
Checklist
Cross repository changes
Testing
Manual tests
Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it
Additional comments
Please post additional comments in this section if you have them, otherwise delete it