distsql: improve unhealthy node detection in SQL instances-based planning #120774
Description
Currently, there are two main differences between gossip-based planning (used by single-tenant deployments) and instances-based planning (used by UA and serverless deployments, as well as by locality-filtered jobs in single-tenant deployments):
- one is about not scheduling DistSQL work on nodes that are being drained; this is tracked in #100578 (sql: distsql planning should avoid sending flows to SQL servers in the process of starting up, and in the process of shutting down).
- another is about unhealthy-node detection.
This issue focuses on the latter. In particular, in gossip-based planning, for each SQL instance we maintain the current node status (i.e. "OK" or "unhealthy"). The gateway SQL instance is always considered healthy, but all other instances are checked explicitly on each DistSQL planning attempt via nodedialer.Dialer.ConnHealthTryDial (to see whether there is a healthy gRPC connection) and via NodeLiveness.GetNodeVitality (to see whether the node is "available").
(Checking the connection health also includes looking at gossiped draining info, so #100578 could be thought of as a subset of this issue.)
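To make the gossip-based check concrete, here is a minimal, self-contained sketch of the per-instance health decision described above. The `connHealth` and `vitality` interfaces, `isInstanceHealthy`, and the fakes are all hypothetical stand-ins; the real `nodedialer.Dialer` and `NodeLiveness` types have different signatures.

```go
package main

import "fmt"

// connHealth stands in for nodedialer.Dialer's connection-health check
// (hypothetical interface, not the real CockroachDB type).
type connHealth interface {
	ConnHealthTryDial(nodeID int) error
}

// vitality stands in for NodeLiveness.GetNodeVitality-style availability
// (hypothetical interface).
type vitality interface {
	IsAvailable(nodeID int) bool
}

// isInstanceHealthy mirrors the logic described above: the gateway is
// always considered healthy; every other node must have a healthy
// connection and be "available" per liveness.
func isInstanceHealthy(gatewayID, nodeID int, d connHealth, v vitality) bool {
	if nodeID == gatewayID {
		return true
	}
	if err := d.ConnHealthTryDial(nodeID); err != nil {
		return false
	}
	return v.IsAvailable(nodeID)
}

// Fakes for illustration only.
type fakeDialer struct{ down map[int]bool }

func (f fakeDialer) ConnHealthTryDial(id int) error {
	if f.down[id] {
		return fmt.Errorf("n%d: connection unhealthy", id)
	}
	return nil
}

type fakeLiveness struct{ dead map[int]bool }

func (f fakeLiveness) IsAvailable(id int) bool { return !f.dead[id] }

func main() {
	d := fakeDialer{down: map[int]bool{3: true}}
	v := fakeLiveness{dead: map[int]bool{4: true}}
	for _, n := range []int{1, 2, 3, 4} {
		fmt.Printf("n%d healthy=%t\n", n, isInstanceHealthy(1, n, d, v))
	}
}
```

Note that the gateway short-circuits before any dialing, matching the "always considered healthy" behavior in the description.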
Both of these checks are currently missing in instances-based planning. In other words, we always assume that all instances present in the cache on top of the sql_instances table are reachable and available. If that assumption happens to be incorrect, then we will either retry the query locally (on the main query path) or return an error. The duration of the window when this can happen is determined by how long dead instances remain in sql_instances (TODO: how long is this?), how long it takes for the instance cache to be updated (TODO: 45s?), and how long network updates (like updating firewall rules) take.
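One possible shape of the fix is to filter the cached instance list through a health check before planning, falling back to the gateway alone when everything else looks unhealthy. The sketch below is self-contained and hypothetical: `instance`, `filterHealthy`, and the `healthCheck` hook are illustrative names, not existing CockroachDB APIs; in a real change the hook would be backed by the ConnHealthTryDial/vitality checks discussed above.

```go
package main

import "fmt"

// instance stands in for a row cached on top of the sql_instances table
// (hypothetical type for illustration).
type instance struct{ id int }

// filterHealthy keeps the gateway plus every instance that passes the
// supplied health check. If all remote instances look unhealthy, it
// falls back to planning on the gateway alone rather than scheduling
// flows on unreachable nodes.
func filterHealthy(gatewayID int, all []instance, healthCheck func(id int) bool) []instance {
	var healthy []instance
	for _, in := range all {
		if in.id == gatewayID || healthCheck(in.id) {
			healthy = append(healthy, in)
		}
	}
	if len(healthy) == 0 {
		return []instance{{id: gatewayID}}
	}
	return healthy
}

func main() {
	all := []instance{{1}, {2}, {3}}
	// Pretend n3 fails its health check.
	got := filterHealthy(1, all, func(id int) bool { return id != 3 })
	fmt.Println(got) // prints [{1} {2}]
}
```

Filtering at planning time would shrink the window described above to the staleness of the health signal itself, rather than the lifetime of dead rows in sql_instances plus the cache refresh interval.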
Jira issue: CRDB-36868
Epic: CRDB-39091