distsql: improve unhealthy node detection in SQL instances-based planning #120774
Description
Currently, there are two main differences between gossip-based planning (used by single-tenant deployments) and instances-based planning (used by UA and serverless deployments, as well as by locality-filtered jobs in single-tenant deployments):
- one is about not scheduling DistSQL work on nodes that are being drained; this is tracked in #100578 (sql: distsql planning should avoid sending flows to SQL servers in the process of starting up, and in the process of shutting down).
- another is about unhealthy-node detection.
This issue focuses on the latter. In particular, in gossip-based planning, for each SQL instance we maintain the current node status (i.e. "OK" or "unhealthy"). The gateway SQL instance is always considered healthy, but all other instances are checked explicitly on each DistSQL planning attempt via nodedialer.Dialer.ConnHealthTryDial (to see whether there is a healthy gRPC connection) and via NodeLiveness.GetNodeVitality (to see whether the node is "available").
(Checking the connection health also includes looking at gossiped draining info, so #100578 could be thought of as a subset of this issue.)
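To make the gossip-based check concrete, here is a minimal, self-contained sketch of the per-instance health decision described above. The `connHealth` and `vitality` interfaces, `isInstanceHealthy`, and the fakes are all hypothetical stand-ins; the real `nodedialer.Dialer` and `NodeLiveness` types have different signatures.

```go
package main

import "fmt"

// connHealth stands in for nodedialer.Dialer's connection-health check
// (hypothetical interface, not the real CockroachDB type).
type connHealth interface {
	ConnHealthTryDial(nodeID int) error
}

// vitality stands in for NodeLiveness.GetNodeVitality-style availability
// (hypothetical interface).
type vitality interface {
	IsAvailable(nodeID int) bool
}

// isInstanceHealthy mirrors the logic described above: the gateway is
// always considered healthy; every other node must have a healthy
// connection and be "available" per liveness.
func isInstanceHealthy(gatewayID, nodeID int, d connHealth, v vitality) bool {
	if nodeID == gatewayID {
		return true
	}
	if err := d.ConnHealthTryDial(nodeID); err != nil {
		return false
	}
	return v.IsAvailable(nodeID)
}

// Fakes for illustration only.
type fakeDialer struct{ down map[int]bool }

func (f fakeDialer) ConnHealthTryDial(id int) error {
	if f.down[id] {
		return fmt.Errorf("n%d: connection unhealthy", id)
	}
	return nil
}

type fakeLiveness struct{ dead map[int]bool }

func (f fakeLiveness) IsAvailable(id int) bool { return !f.dead[id] }

func main() {
	d := fakeDialer{down: map[int]bool{3: true}}
	v := fakeLiveness{dead: map[int]bool{4: true}}
	for _, n := range []int{1, 2, 3, 4} {
		fmt.Printf("n%d healthy=%t\n", n, isInstanceHealthy(1, n, d, v))
	}
}
```

Note that the gateway short-circuits before any dialing, matching the "always considered healthy" behavior in the description.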
Both of these checks are currently missing in instances-based planning. In other words, we always assume that all instances present in the cache on top of the sql_instances table are reachable and available. If that assumption happens to be incorrect, then we will either retry the query locally (on the main query path) or return an error. The duration of the window when this can happen is determined by how long dead instances remain in sql_instances (TODO: how long is this?), how long it takes for the instance cache to be updated (TODO: 45s?), and how long network updates (like updating firewall rules) take.
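One possible shape of the fix is to filter the cached instance list through a health check before planning, falling back to the gateway alone when everything else looks unhealthy. The sketch below is self-contained and hypothetical: `instance`, `filterHealthy`, and the `healthCheck` hook are illustrative names, not existing CockroachDB APIs; in a real change the hook would be backed by the ConnHealthTryDial/vitality checks discussed above.

```go
package main

import "fmt"

// instance stands in for a row cached on top of the sql_instances table
// (hypothetical type for illustration).
type instance struct{ id int }

// filterHealthy keeps the gateway plus every instance that passes the
// supplied health check. If all remote instances look unhealthy, it
// falls back to planning on the gateway alone rather than scheduling
// flows on unreachable nodes.
func filterHealthy(gatewayID int, all []instance, healthCheck func(id int) bool) []instance {
	var healthy []instance
	for _, in := range all {
		if in.id == gatewayID || healthCheck(in.id) {
			healthy = append(healthy, in)
		}
	}
	if len(healthy) == 0 {
		return []instance{{id: gatewayID}}
	}
	return healthy
}

func main() {
	all := []instance{{1}, {2}, {3}}
	// Pretend n3 fails its health check.
	got := filterHealthy(1, all, func(id int) bool { return id != 3 })
	fmt.Println(got) // prints [{1} {2}]
}
```

Filtering at planning time would shrink the window described above to the staleness of the health signal itself, rather than the lifetime of dead rows in sql_instances plus the cache refresh interval.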
Jira issue: CRDB-36868
Epic: CRDB-39091