rpc: automatically maintain RPC connections across cluster #70111
Description
Currently, components must themselves dial remote nodes whenever they interact with them, to make sure a connection is established. This mostly applies to Raft, DistSender, DistSQL, and the closed timestamp side transport. These components will also interact with the RPC health checks and circuit breakers.
This is problematic because it can introduce very high latency (tens of seconds) when interacting with unresponsive nodes (e.g. when the server/VM is shut down or during network unavailability), which is unacceptable for many performance-critical code paths (see #70017). It also makes it unclear which component is responsible for running health checks and circuit breaker probes (see #68419).
Instead, we should have a single actor (i.e. goroutine) responsible for maintaining RPC connections to other nodes and performing health checks. The connection health should be exposed in such a way that RPC clients can fail fast whenever they try to interact with a known-bad node.
@tbg has some thoughts on the health check/breaker issues in #68419 (comment), but we should consider extending that proposal to also manage all RPC connections by dialing remote nodes as appropriate.
/cc @cockroachdb/kv
Jira issue: CRDB-9948
gz#13169
Epic CRDB-32137