distsql: improve unhealthy node detection in SQL instances-based planning #120774

@yuzefovich

Description

Currently, there are two main differences between gossip-based planning (used by single-tenant deployments) and instances-based planning (used by UA and serverless deployments, as well as by locality-filtered jobs in single-tenant deployments).

This issue focuses on the latter difference: unhealthy node detection. In particular, in gossip-based planning we maintain the current status (i.e. "OK" or "unhealthy") of each SQL instance. The gateway SQL instance is always considered healthy, but every other instance is checked explicitly on each DistSQL planning attempt via nodedialer.Dialer.ConnHealthTryDial (to see whether there is a healthy gRPC connection) and via NodeLiveness.GetNodeVitality (to see whether the node is "available").
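As a rough illustration of the gossip-based check described above, here is a minimal Go sketch. The `HealthChecker` interface and `checkNodeStatus` function are hypothetical stand-ins for the actual calls to `nodedialer.Dialer.ConnHealthTryDial` and `NodeLiveness.GetNodeVitality`, not CockroachDB's real APIs:

```go
package main

// NodeStatus mirrors the two states tracked per SQL instance in
// gossip-based planning: "OK" or "unhealthy".
type NodeStatus int

const (
	StatusOK NodeStatus = iota
	StatusUnhealthy
)

// HealthChecker abstracts the two checks the issue describes:
// a healthy gRPC connection (ConnHealthTryDial) and node
// availability (GetNodeVitality). Hypothetical interface.
type HealthChecker interface {
	ConnHealthy(nodeID int) bool
	NodeAvailable(nodeID int) bool
}

// checkNodeStatus models the per-planning-attempt check: the gateway
// is always considered healthy; every other node must pass both
// checks to be considered OK.
func checkNodeStatus(hc HealthChecker, gatewayID, nodeID int) NodeStatus {
	if nodeID == gatewayID {
		return StatusOK
	}
	if !hc.ConnHealthy(nodeID) || !hc.NodeAvailable(nodeID) {
		return StatusUnhealthy
	}
	return StatusOK
}

// fakeChecker is a test double mapping nodeID to fixed answers.
type fakeChecker struct{ conn, avail map[int]bool }

func (f fakeChecker) ConnHealthy(id int) bool   { return f.conn[id] }
func (f fakeChecker) NodeAvailable(id int) bool { return f.avail[id] }
```

Note that both checks must pass: a node with a live connection that is not "available" per liveness is still treated as unhealthy.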

(Checking the connection health also includes looking at gossiped draining info, so #100578 could be thought of as a subset of this issue.)

Both of these checks are currently missing in instances-based planning. In other words, we always assume that all instances present in the cache on top of the sql_instances table are reachable and available. If that assumption happens to be incorrect, we will either retry the query locally (on the main query path) or return an error. The duration of the window during which this can happen is determined by how long dead instances remain in sql_instances (TODO: how long is this?), how long it takes for the instance cache to be updated (TODO: 45s?), and how long network updates (like updating firewall rules) take.
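One way the missing checks could slot into instances-based planning is to filter the instance list read from the cache before assigning spans. The sketch below is a hypothetical illustration of that idea, assuming a simplified `Instance` type and a `healthProbe` callback standing in for the dialer/liveness checks; it is not the actual CockroachDB implementation:

```go
package main

// Instance models a row from the cache on top of the sql_instances
// table; the field is illustrative, not CockroachDB's actual type.
type Instance struct {
	ID int
}

// healthProbe stands in for the checks the issue notes are missing in
// instances-based planning: connection health plus availability. A
// real implementation would consult the node dialer and liveness the
// way gossip-based planning does.
type healthProbe func(id int) bool

// filterHealthyInstances keeps the gateway unconditionally (it is
// always considered healthy) and drops any other instance that fails
// the probe, so planning never assigns spans to an instance believed
// to be unreachable or unavailable.
func filterHealthyInstances(instances []Instance, gatewayID int, healthy healthProbe) []Instance {
	out := make([]Instance, 0, len(instances))
	for _, in := range instances {
		if in.ID == gatewayID || healthy(in.ID) {
			out = append(out, in)
		}
	}
	return out
}
```

Filtering up front would shrink the window described above: instead of discovering a dead instance only when the query fails (and then retrying locally or erroring), planning would simply route around it.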

Jira issue: CRDB-36868

Epic CRDB-39091

Labels

A-multitenancy (Related to multi-tenancy), A-sql-execution (Relating to SQL execution), C-bug (Code not up to spec/doc, specs & docs deemed correct; solution expected to change code/behavior), T-db-server, T-server-and-security (DB Server & Security)
