Skip to content

eth, eth/protocols/eth: fix goroutine leaks in witness request path#2157

Merged
cffls merged 5 commits into
developfrom
witness_goroutine_fix
Mar 24, 2026
Merged

eth, eth/protocols/eth: fix goroutine leaks in witness request path#2157
cffls merged 5 commits into
developfrom
witness_goroutine_fix

Conversation

@cffls

@cffls cffls commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

Description

The parallel-stateless-import feature leaked goroutines in RequestWitnessesWithVerification when peers disconnected or stopped responding. Each stalled witness request leaked 2-4 goroutines that pinned witness data and channels in memory indefinitely.

Three root causes:

  1. doWitnessRequest spawned a goroutine that blocked forever on the per-request response channel (<-witResCh) with no timeout or cancellation. When a peer disconnected, the wit dispatcher exited without signaling pending sink channels, so the goroutine never returned. This cascaded: the WaitGroup never completed, witReqResCh never closed, and the adapter goroutine also leaked.

  2. eth.Request.Close() returned early without closing the Cancel channel when peer was nil. The witness adapter returns a shim eth.Request (peer == nil) to the concurrent fetcher. When the fetcher called Close() on timeout or disconnect, Cancel was never closed, so the adapter goroutine's select on Cancel never fired.

  3. The adapter spawned fire-and-forget goroutines (go func(){<-doneCh}()) to drain the Done channel on synthetic eth.Response objects. Nothing ever wrote to these channels, so the goroutines leaked unconditionally.

Fixes:

  • eth.Request.Close() now always closes Cancel even when peer is nil.
  • doWitnessRequest selects on a cancel channel in both receive and forward paths, with proper WaitGroup/semaphore cleanup on exit.
  • Replace doneCh drainer goroutines with buffered channels.
  • Adapter defers closing underlying wit.Request objects on exit.

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

  • I have added at least 2 reviewer or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
    • In case link the PR here:
  • This PR requires changes to matic-cli
    • In case link the PR here:

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

The parallel-stateless-import feature leaked goroutines in
RequestWitnessesWithVerification when peers disconnected or stopped
responding. Each stalled witness request leaked 2-4 goroutines that
pinned witness data and channels in memory indefinitely.

Three root causes:

1. doWitnessRequest spawned a goroutine that blocked forever on the
   per-request response channel (<-witResCh) with no timeout or
   cancellation. When a peer disconnected, the wit dispatcher exited
   without signaling pending sink channels, so the goroutine never
   returned. This cascaded: the WaitGroup never completed, witReqResCh
   never closed, and the adapter goroutine also leaked.

2. eth.Request.Close() returned early without closing the Cancel
   channel when peer was nil. The witness adapter returns a shim
   eth.Request (peer == nil) to the concurrent fetcher. When the
   fetcher called Close() on timeout or disconnect, Cancel was never
   closed, so the adapter goroutine's select on Cancel never fired.

3. The adapter spawned fire-and-forget goroutines (go func(){<-doneCh}())
   to drain the Done channel on synthetic eth.Response objects. Nothing
   ever wrote to these channels, so the goroutines leaked unconditionally.

Fixes:

- eth.Request.Close() now always closes Cancel even when peer is nil.
- doWitnessRequest selects on a cancel channel in both receive and
  forward paths, with proper WaitGroup/semaphore cleanup on exit.
- Replace doneCh drainer goroutines with buffered channels.
- Adapter defers closing underlying wit.Request objects on exit.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review.

Tip: disable this comment in your organization's Code Review settings.

@cffls

cffls commented Mar 23, 2026

Copy link
Copy Markdown
Contributor Author

@claude review

@claude

claude Bot commented Mar 23, 2026

Copy link
Copy Markdown

Claude finished @cffls's task in 1m 24s —— View job


Code Review: Fix goroutine leaks in witness request path

  • Gather context and read changed files
  • Review eth/protocols/eth/dispatcher.go changes
  • Review eth/peer.go changes
  • Review eth/peer_test.go changes
  • Post final review

Overall this is a well-motivated fix for a real goroutine leak problem. The three root causes are clearly identified and the fixes are targeted. I have a few observations, one of which is a potential correctness issue:

Issue: doWitnessRequest — double wg.Done() / semaphore release on cancel during forward

eth/peer.go:750-769 — In the goroutine spawned by doWitnessRequest, when cancel fires during the forward path (witReqResCh <- ...), the goroutine calls witReqsWg.Done() and <-witReqSem (lines 767-768). But the adapter goroutine in RequestWitnessesWithVerification (lines 383-384) also does <-witReqSem and witReqsWg.Done() for every item received from witReqResCh.

If cancel fires after the response was successfully received from witResCh but before forwarding to witReqResCh, the cancel branch at line 766 fires and does cleanup. This is correct since the message was never sent to the adapter — no double-free occurs. However, there's a subtle race: if witReqResCh <- ... and <-cancel are both ready simultaneously, Go's select picks one randomly. If the send wins, the adapter will also do Done()/semaphore release. If cancel wins, the goroutine does it. The random selection means at most one path runs, so this is actually safe. Just worth noting for reviewers.

Issue: witRes.Done <- nil can block if Done is unbuffered

eth/peer.go:760-762 — After receiving the wit response, the code sends witRes.Done <- nil. If the upstream wit.Response.Done channel happens to be unbuffered and nobody is reading the other end, this will block and the goroutine won't exit even when cancel fires. The code does check witRes.Done != nil but doesn't check if cancel is needed here too:

if witRes != nil && witRes.Done != nil {
    witRes.Done <- nil  // can block indefinitely
}

Consider wrapping this in a select with cancel:

if witRes != nil && witRes.Done != nil {
    select {
    case witRes.Done <- nil:
    case <-cancel:
        witReqsWg.Done()
        <-witReqSem
        return
    }
}

This would make the cancellation path fully robust even if upstream misbehaves. Fix this →

Observation: Request.Close() double-close protection via select

eth/protocols/eth/dispatcher.go:79-84 — The double-close protection using select { case <-r.Cancel: default: close(r.Cancel) } is a well-known Go pattern and works correctly. Good change.

One note: the original code closed Cancel only inside the reqCancel success path (i.e., only for real peers). Now it always closes Cancel, even for real peers where the dispatcher already saw the cancellation. This is fine since the select-based guard prevents double-close panics, and it ensures the channel is always signaled regardless of path.

Observation: Buffered doneCh replacing drainer goroutines

eth/peer.go:400,430 — Replacing go func(){ <-doneCh }() with make(chan error, 1) is the correct fix. The old goroutines would block forever since nothing wrote to them. Buffered channels let the consumer write without blocking and without a drainer. Clean fix.

Observation: Adapter defers closing wit requests

eth/peer.go:370-376 — Deferring witReq.Close() for all underlying wit requests ensures cleanup on any exit path from the adapter goroutine. This is important since the concurrent fetcher only has access to the *eth.Request shim and can't close the wit-level requests directly.

Note: iterating over wrapperReq.witReqs in the defer is safe because witReqs is only appended to (under lock) by buildWitnessRequests, and by the time the adapter exits (after witReqResCh is closed → after witReqsWg.Wait() completes), all request-building is done.

Tests

The test coverage is thorough:

  • TestRequestCloseShimClosesCancelChannel — covers the dispatcher fix including double-close safety
  • TestDoWitnessRequestExitsOnCancel — directly verifies goroutine cleanup on cancel
  • TestRequestWitnessesNoGoroutineLeakOnCancel — end-to-end test for the full flow
  • TestRequestWitnessesDoneChannelBuffered — verifies the buffered channel fix

The goroutine-counting approach with slack (goroutinesBefore+2) is pragmatic for leak detection tests.

Summary

The PR correctly addresses all three root causes of goroutine leaks. The changes are well-scoped and the test coverage is good. The one actionable item is wrapping witRes.Done <- nil in a select with cancel to prevent a potential indefinite block if Done is ever unbuffered upstream.


@codecov

codecov Bot commented Mar 23, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 70.96774% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.57%. Comparing base (2a0bfe7) to head (75d4eb8).
⚠️ Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
eth/protocols/eth/dispatcher.go 0.00% 18 Missing ⚠️

❌ Your patch status has failed because the patch coverage (70.96%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2157      +/-   ##
===========================================
+ Coverage    51.51%   51.57%   +0.05%     
===========================================
  Files          882      882              
  Lines       154140   154164      +24     
===========================================
+ Hits         79403    79504     +101     
+ Misses       69559    69494      -65     
+ Partials      5178     5166      -12     
Files with missing lines Coverage Δ
eth/peer.go 95.80% <100.00%> (+6.89%) ⬆️
eth/protocols/eth/dispatcher.go 5.64% <0.00%> (-0.05%) ⬇️

... and 23 files with indirect coverage changes

Files with missing lines Coverage Δ
eth/peer.go 95.80% <100.00%> (+6.89%) ⬆️
eth/protocols/eth/dispatcher.go 5.64% <0.00%> (-0.05%) ⬇️

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@cffls cffls force-pushed the witness_goroutine_fix branch from 668179d to 75d4eb8 Compare March 24, 2026 00:14
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
4.9% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@kamuikatsurgi kamuikatsurgi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I would test that a node can do a graceful shutdown. We've faced issues with go routines blocking bor from doing a graceful shutdown.

@cffls

cffls commented Mar 24, 2026

Copy link
Copy Markdown
Contributor Author

LGTM - I would test that a node can do a graceful shutdown. We've faced issues with go routines blocking bor from doing a graceful shutdown.

Thanks - tested multiple times locally and the node was able to shutdown gracefully.

@cffls cffls merged commit 0d6ee4b into develop Mar 24, 2026
19 of 21 checks passed
milando12 added a commit to milando12/bor that referenced this pull request Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants