rpc: Dialer.ConnHealth must not cause network IO #69888

@erikgrinaker

Dialer.ConnHealth is used to check whether a healthy RPC connection exists to a given node. This is often done to avoid interacting with unhealthy nodes. However, internally this method actually dials the remote node if no connection already exists:

conn := n.rpcContext.GRPCDialNode(addr.String(), nodeID, class)
return conn.Health()

The method is called synchronously in several performance-critical code paths, including the main Replica.handleRaftReady Raft processing path (inside Replica.updateProposalQuotaRaftMuLocked) and the Store.raftTickLoop Raft tick path (inside Store.updateLivenessMap). This is problematic because the node dial can hang for a long time. In particular, if the remote IP address or DNS server does not respond at all (which can happen with power loss, VM shutdown, network connectivity problems, etc.), the TCP/IP stack or DNS client will keep retrying until it times out, often for tens of seconds.

ConnHealth must not cause any synchronous network IO at all in order for it to be safe to use in these code paths. We must also make sure that no code implicitly relies on ConnHealth dialing the remote node.
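One possible shape for this, as a minimal sketch rather than the actual nodedialer code: ConnHealth only looks up an already-cached connection and reports its last known health, while dialing is a separate, explicit call that can run off the hot path. All names below (the simplified `Dialer`, `conn`, `ErrNotConnected`, `NewDialer`, and the injected `dial` function) are hypothetical stand-ins for illustration and are not CockroachDB's real types or signatures.

```go
package main

import (
	"errors"
	"sync"
)

// NodeID is a simplified stand-in for the real node identifier type.
type NodeID int

// ErrNotConnected is a hypothetical sentinel returned when no connection
// to the node has been established yet.
var ErrNotConnected = errors.New("rpc: no existing connection to node")

// conn is a hypothetical cached connection. Its health is recorded by
// whoever dialed it, never by the caller of ConnHealth.
type conn struct {
	mu  sync.Mutex
	err error // nil when the connection is considered healthy
}

// Health returns the last recorded health without doing any IO.
func (c *conn) Health() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.err
}

// Dialer is a simplified stand-in that caches one connection per node.
type Dialer struct {
	mu    sync.Mutex
	conns map[NodeID]*conn
	dial  func(NodeID) error // potentially slow, blocking dial (assumed)
}

// NewDialer builds a Dialer around an injected blocking dial function.
func NewDialer(dial func(NodeID) error) *Dialer {
	return &Dialer{conns: make(map[NodeID]*conn), dial: dial}
}

// ConnHealth performs only an in-memory lookup. It never dials, so it is
// safe to call from latency-sensitive paths.
func (d *Dialer) ConnHealth(id NodeID) error {
	d.mu.Lock()
	c, ok := d.conns[id]
	d.mu.Unlock()
	if !ok {
		return ErrNotConnected
	}
	return c.Health()
}

// Dial runs the blocking dial and caches the outcome. Callers that need a
// connection invoke it explicitly, for example from a background goroutine.
func (d *Dialer) Dial(id NodeID) error {
	err := d.dial(id) // the cache lock is not held during the slow dial
	d.mu.Lock()
	d.conns[id] = &conn{err: err}
	d.mu.Unlock()
	return err
}
```

With this split, a caller in a Raft path would see `ErrNotConnected` immediately for a node that was never dialed, instead of blocking on the network. Any code that previously relied on ConnHealth to establish the connection would need to call the explicit dial path itself.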

Related to #53410 and #68419 (comment).

Labels

A-server-networking: Pertains to network addressing, routing, initialization
C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv: KV Team
