rpc: automatically maintain RPC connections across cluster

Currently, each component must dial remote nodes itself whenever it interacts with them, to make sure a connection is established. This mostly applies to Raft, DistSender, DistSQL, and the closed timestamp side transport. These components also interact with the RPC health checks and circuit breakers.

This is problematic because it can introduce very high latency (tens of seconds) when interacting with unresponsive nodes (e.g. when the server/VM is shut down or during network unavailability), which is unacceptable for many performance-critical code paths (see #70017). It also muddies the waters with respect to who is responsible for running health checks and circuit breaker probes (see #68419).

Instead, we should have a single actor (i.e. goroutine) responsible for maintaining RPC connections to other nodes and performing health checks. The connection health should be exposed in such a way that RPC clients can fail fast whenever they try to interact with a known-bad node.

@tbg has some thoughts on the health check/breaker issues in https://github.com/cockroachdb/cockroach/issues/68419#issuecomment-906267904, but we should consider extending that proposal to also manage all RPC connections by dialing remote nodes as appropriate.

/cc @cockroachdb/kv 

Jira issue: CRDB-9948

gz#13169


Epic CRDB-32137

rpc: automatically maintain RPC connections across cluster #70111

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

rpc: automatically maintain RPC connections across cluster #70111

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions