rpc: automatically maintain RPC connections across cluster #70111

@erikgrinaker

Currently, components must themselves dial remote nodes whenever they interact with them, to make sure a connection is established. This mostly applies to Raft, DistSender, DistSQL, and the closed timestamp side transport. These components will also interact with the RPC health checks and circuit breakers.

This is problematic because it can introduce very high latency (tens of seconds) when interacting with unresponsive nodes (e.g. when the server/VM is shut down or during network unavailability), which is not acceptable for many performance-critical code paths (see #70017). It also muddies the waters as to who is responsible for running health checks and circuit breaker probes (see #68419).

Instead, we should have a single actor (i.e. goroutine) responsible for maintaining RPC connections to other nodes and performing health checks. The connection health should be exposed in such a way that RPC clients can fail fast whenever they try to interact with a known-bad node.

@tbg has some thoughts on the health check/breaker issues in #68419 (comment), but we should consider extending that proposal to also manage all RPC connections by dialing remote nodes as appropriate.

/cc @cockroachdb/kv

Jira issue: CRDB-9948

gz#13169

Epic CRDB-32137

Metadata

Assignees

No one assigned

    Labels

    A-kv-server: Relating to the KV-level RPC server
    A-server-networking: Pertains to network addressing, routing, initialization
    C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
    T-kv: KV Team
