Skip to content
This repository was archived by the owner on Jan 16, 2026. It is now read-only.

fix(node/sysgo-tests): attempt to fix race condition in p2p disconnect rpc call#2761

Merged
theochap merged 7 commits intomainfrom
theo/fix-p2p-race-condition
Aug 26, 2025
Merged

fix(node/sysgo-tests): attempt to fix race condition in p2p disconnect rpc call#2761
theochap merged 7 commits intomainfrom
theo/fix-p2p-race-condition

Conversation

@theochap
Copy link
Copy Markdown
Member

Description

It seems the opp2p_disconnect rpc call returns too early. This method ensures that we're checking the peers are effectively not connected to each other in the opp2p_peers call.

Copilot AI review requested due to automatic review settings August 25, 2025 12:21
@theochap theochap self-assigned this Aug 25, 2025
@theochap theochap added M-tests Meta: Testing related A-node Area: cl node (eq. Go op-node) handles single-chain consensus labels Aug 25, 2025
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Aug 25, 2025

Claude finished @theochap's task —— View job


PR Review Complete ✅

Todo List:

  • Read and analyze the modified file crates/node/rpc/src/p2p.rs
  • Review the race condition fix implementation
  • Assess code quality and best practices
  • Check for potential bugs or issues
  • Evaluate performance considerations
  • Review security concerns
  • Assess test coverage needs
  • Provide comprehensive feedback

Code Review Feedback

Overall Assessment

This is a solid fix for a legitimate race condition issue. The implementation follows existing patterns in the codebase and addresses the core problem effectively.

🔧 Code Quality & Best Practices

Strengths:

  • Consistent Pattern: The fix mirrors the existing retry logic in opp2p_connect_peer (lines 224-252), maintaining consistency across the codebase
  • Proper Error Handling: Uses ErrorObject consistently with appropriate error codes
  • Clear Intent: The comment at lines 270-272 clearly explains why the retry mechanism is needed
  • Good Error Messages: Descriptive error message "Peers are still connected" helps with debugging

Minor Issues:

  • Comment Inconsistency: The comment at line 272 mentions "connect to the peer" but should say "disconnect from the peer" to match the context
  • Missing Metrics: The opp2p_connect_peer method has metrics tracking while opp2p_disconnect_peer doesn't, creating inconsistency

🐛 Potential Bugs/Issues

Low Risk:

  • Timeout vs Network State: The 10-second timeout is reasonable but could potentially be insufficient for very slow networks or heavily loaded systems
  • Concurrent Modifications: If another process connects to the same peer during the disconnect retry loop, it could cause false negatives

Performance Considerations

Good:

  • Bounded Retry: Uses exponential backoff with a total delay limit of 10 seconds, preventing infinite loops
  • Efficient Check: Uses contains_key() for peer lookup, which is O(1) average case
  • Async Operation: Non-blocking implementation using tokio channels

Considerations:

  • Resource Usage: Each disconnect call now creates additional async tasks for retry logic
  • Backpressure: Multiple concurrent disconnect calls could create many retry tasks

🔒 Security Concerns

No Major Issues Identified:

  • The retry mechanism doesn't introduce security vulnerabilities
  • Proper input validation is maintained (peer ID parsing)
  • No exposure of sensitive data in error messages

🧪 Test Coverage Assessment

Current State:

  • No specific tests found for the disconnect functionality
  • The existing test at line 305-316 only covers multiaddr parsing

Recommendations:

  1. Unit Tests: Add tests for the new retry logic in opp2p_disconnect_peer
  2. Integration Tests: Add tests for race condition scenarios
  3. Timeout Tests: Verify behavior when peers don't disconnect within the timeout

📋 Suggested Improvements

  1. Fix Comment (p2p.rs:272):

    // method. We try with an exponential backoff and return an error if we fail to
    // disconnect from the peer.
  2. Add Metrics Consistency:

    async fn opp2p_disconnect_peer(&self, peer_id: String) -> RpcResult<()> {
        kona_macros::inc!(gauge, kona_gossip::Metrics::RPC_CALLS, "method" => "opp2p_disconnectPeer");
        // ... rest of implementation
  3. Consider Configurable Timeout: Make the 10-second timeout configurable via a constant or parameter

  4. Add Test Coverage: Consider adding tests for the retry mechanism

Recommendation

APPROVE with minor suggestions. This PR successfully addresses the race condition issue with a well-implemented solution that follows existing patterns. The suggested improvements are non-blocking and can be addressed in follow-up PRs if desired.


Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a race condition in the opp2p_disconnect RPC call by ensuring the disconnection is fully completed before returning. The fix implements retry logic with exponential backoff to wait for peer disconnection confirmation.

Key Changes

  • Added retry mechanism with exponential backoff to verify peer disconnection
  • Implemented peer connection status checking after disconnect request
  • Added proper error handling for timeout scenarios

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +288 to +290
if !is_not_connected
.retry(ExponentialBuilder::default().with_total_delay(Some(Duration::from_secs(10))))
.await?
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry logic imports are missing. This code uses ExponentialBuilder and Duration which require imports that aren't visible in this diff. Ensure the necessary imports are added at the top of the file.

Copilot uses AI. Check for mistakes.
Comment on lines +273 to +288
let is_not_connected = async || {
let (tx, rx) = tokio::sync::oneshot::channel();

self.sender
.send(P2pRpcRequest::Peers { out: tx, connected: true })
.await
.map_err(|_| ErrorObject::from(ErrorCode::InternalError))?;

let peers = rx.await.map_err(|_| {
ErrorObject::borrowed(ErrorCode::InternalError.code(), "Failed to get peers", None)
})?;

Ok::<bool, ErrorObject<'_>>(!peers.peers.contains_key(&peer_id.to_string()))
};

if !is_not_connected
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The closure creates a new oneshot channel on every retry attempt. Consider moving the peer checking logic to a separate method to avoid repeated channel creation overhead during retries.

Suggested change
let is_not_connected = async || {
let (tx, rx) = tokio::sync::oneshot::channel();
self.sender
.send(P2pRpcRequest::Peers { out: tx, connected: true })
.await
.map_err(|_| ErrorObject::from(ErrorCode::InternalError))?;
let peers = rx.await.map_err(|_| {
ErrorObject::borrowed(ErrorCode::InternalError.code(), "Failed to get peers", None)
})?;
Ok::<bool, ErrorObject<'_>>(!peers.peers.contains_key(&peer_id.to_string()))
};
if !is_not_connected
let peer_id_str = peer_id.to_string();
let is_not_connected = || async {
self.is_peer_disconnected(&peer_id_str).await
};
if !is_not_connected()

Copilot uses AI. Check for mistakes.
Comment on lines +292 to +293
return Err(ErrorObject::borrowed(
ErrorCode::InvalidParams.code(),
Copy link

Copilot AI Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error code InvalidParams is misleading for a timeout scenario. This should use InternalError or a more appropriate error code since the parameters were valid but the operation timed out.

Suggested change
return Err(ErrorObject::borrowed(
ErrorCode::InvalidParams.code(),
ErrorCode::InternalError.code(),

Copilot uses AI. Check for mistakes.
@codecov
Copy link
Copy Markdown

codecov Bot commented Aug 25, 2025

Codecov Report

❌ Patch coverage is 0% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.0%. Comparing base (2de7891) to head (ab3456c).
⚠️ Report is 3 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
crates/node/rpc/src/p2p.rs 0.0% 12 Missing ⚠️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@refcell refcell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this!

@theochap theochap enabled auto-merge August 25, 2025 14:43
@theochap theochap added this pull request to the merge queue Aug 25, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Aug 25, 2025
…t rpc call (#2761)

## Description

It seems the `opp2p_disconnect` rpc call returns too early. This method
ensures that we're checking the peers are effectively not connected to
each other in the `opp2p_peers` call.
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Aug 25, 2025
@theochap theochap added this pull request to the merge queue Aug 25, 2025
github-merge-queue Bot pushed a commit that referenced this pull request Aug 25, 2025
…t rpc call (#2761)

## Description

It seems the `opp2p_disconnect` rpc call returns too early. This method
ensures that we're checking the peers are effectively not connected to
each other in the `opp2p_peers` call.
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Aug 25, 2025
@theochap theochap force-pushed the theo/fix-p2p-race-condition branch from 41529ba to ab3456c Compare August 26, 2025 09:49
@theochap theochap added this pull request to the merge queue Aug 26, 2025
Merged via the queue into main with commit 643dd49 Aug 26, 2025
33 of 35 checks passed
@theochap theochap deleted the theo/fix-p2p-race-condition branch August 26, 2025 11:37
@github-project-automation github-project-automation Bot moved this from In Review to Done in Project Tracking Aug 26, 2025
theochap added a commit to ethereum-optimism/optimism that referenced this pull request Dec 10, 2025
…t rpc call (op-rs/kona#2761)

## Description

It seems the `opp2p_disconnect` rpc call returns too early. This method
ensures that we're checking the peers are effectively not connected to
each other in the `opp2p_peers` call.
theochap added a commit to ethereum-optimism/optimism that referenced this pull request Jan 14, 2026
…t rpc call (op-rs/kona#2761)

## Description

It seems the `opp2p_disconnect` rpc call returns too early. This method
ensures that we're checking the peers are effectively not connected to
each other in the `opp2p_peers` call.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

A-node Area: cl node (eq. Go op-node) handles single-chain consensus M-tests Meta: Testing related

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants