rpc: Dialer.ConnHealth must not cause network IO #69888
Description
Dialer.ConnHealth is used to check whether a healthy RPC connection exists to a given node. This is often done to avoid interacting with unhealthy nodes. However, internally this method actually dials the remote node if no connection already exists:
cockroach/pkg/rpc/nodedialer/nodedialer.go
Lines 223 to 224 in 793b4c8
    conn := n.rpcContext.GRPCDialNode(addr.String(), nodeID, class)
    return conn.Health()
The method is called synchronously in several performance-critical code paths, including the main Replica.handleRaftReady Raft processing path (inside Replica.updateProposalQuotaRaftMuLocked) and the Store.raftTickLoop Raft tick path (inside Store.updateLivenessMap). This is problematic because the node dial can hang for a long time. In particular, if the remote IP address or DNS server does not respond at all (which can happen with power loss, VM shutdown, or network connectivity problems, for example), the TCP/IP stack or DNS client will keep retrying until it times out, often for tens of seconds.
ConnHealth must not cause any synchronous network IO at all in order for it to be safe to use in these code paths. We must also make sure that no code implicitly relies on ConnHealth dialing the remote node.
Related to #53410 and #68419 (comment).