rpc: nodes fail to connect to peer even after the peer is up

I believe I've run into the following race:
- node 1 goes down
- some node 2 code path tries to dial node 1. Let's call it dial attempt 1.
- node 1 comes back, opens its ports
- another code path on node 2 tries to dial node 1. This is dial attempt 2. Dial attempt 1 is still in progress, so dial attempt 2 blocks on [`conn.initOnce`](https://github.com/cockroachdb/cockroach/blob/b41ac60eb627f074486e1e31e3eefd019851bff7/pkg/rpc/context.go#L979)
- dial attempt 1 was already in the process of failing; it eventually fails and releases that lock
- every attempt that was convoyed behind attempt 1 (e.g. dial attempt 2) fails with the same error, even though node 1 is now reachable
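
The sequence above can be reproduced with a minimal sketch. This is not the actual `pkg/rpc` code: the `conn` type, `connect` method and `simulateConvoy` helper are hypothetical stand-ins that only assume the dial result is computed under a `sync.Once` and shared by every caller.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for a cached connection whose dial
// result is computed once and then shared by every caller.
type conn struct {
	initOnce sync.Once
	err      error
}

// connect runs dial at most once. Callers that arrive while the first
// dial is in flight block in Do and then observe the first dial's result.
func (c *conn) connect(dial func() error) error {
	c.initOnce.Do(func() { c.err = dial() })
	return c.err
}

// simulateConvoy returns the errors observed by dial attempts 1 and 2.
func simulateConvoy() (err1, err2 error) {
	var c conn
	started := make(chan struct{})
	peerUp := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		// Attempt 1 starts while the peer is down and fails slowly.
		err1 = c.connect(func() error {
			close(started)
			<-peerUp // the peer comes back while attempt 1 is still failing
			return errors.New("connection refused")
		})
	}()

	<-started
	close(peerUp)
	// Attempt 2 starts after the peer is up, but its dial function never
	// runs: it inherits the error from attempt 1.
	err2 = c.connect(func() error { return nil })
	<-done
	return err1, err2
}

func main() {
	err1, err2 := simulateConvoy()
	fmt.Println("attempt 1:", err1)
	fmt.Println("attempt 2:", err2)
}
```

In this sketch attempt 2 would have succeeded had its own dial function been run, which is the behavior the issue describes.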

This is particularly unfortunate when dial attempt 2 is causally related to node 1 coming back, for example when node 1 has sent a `SetupFlow` RPC that asks node 2 to connect back to it. If node 2 fails to connect back, the respective SQL query waits in vain for a long time.

One possible fix is to have node 2 track network availability signals (incoming connections or successful heartbeats) and make sure that no error from a dial attempt that started before a signal is propagated to any dial attempt that starts after the signal.

cc @ajwerner @bdarnell 

Epic: CRDB-8500

Jira issue: CRDB-5255

rpc: nodes fail to connect to peer even after the peer is up #44101
