sql: distsql planning should avoid sending flows to SQL servers in the process of starting up, and in the process of shutting down

New description: we would like a new field in the SQL instances table that indicates node readiness.

This field would not be set to "ready" until server startup has completed, and it would be set to a different value during graceful shutdown.

DistSQL planning should avoid nodes based on that field.
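
To make the proposal concrete, here is a minimal sketch of how such a field could gate planning. All names here (`Readiness`, `Instance`, `eligibleForFlows`) and the addresses are hypothetical and are not taken from the CockroachDB codebase. Always keeping the gateway is also an assumption of this sketch.

```go
package main

import "fmt"

// Readiness is a hypothetical encoding of the proposed readiness field.
type Readiness int

const (
	Starting Readiness = iota // row written, but startup has not completed
	Ready                     // startup completed; may accept DistSQL flows
	Draining                  // graceful shutdown in progress
)

// Instance is a simplified stand-in for a row in the SQL instances table.
type Instance struct {
	ID        int
	Addr      string
	Readiness Readiness
}

// eligibleForFlows returns the instances the planner may schedule flows on.
// Assumption: the gateway is always kept, since it runs the query itself.
func eligibleForFlows(gatewayID int, all []Instance) []Instance {
	var out []Instance
	for _, inst := range all {
		if inst.ID == gatewayID || inst.Readiness == Ready {
			out = append(out, inst)
		}
	}
	return out
}

func main() {
	instances := []Instance{
		{ID: 1, Addr: "10.0.0.1:26257", Readiness: Starting},
		{ID: 2, Addr: "10.0.0.2:26257", Readiness: Starting}, // gateway
		{ID: 3, Addr: "10.0.0.3:26257", Readiness: Draining},
		{ID: 4, Addr: "10.0.0.4:26257", Readiness: Ready},
	}
	for _, inst := range eligibleForFlows(2, instances) {
		fmt.Printf("n%d is eligible\n", inst.ID)
	}
}
```

With this shape, a server that is still starting up or already draining stays visible in the table but is skipped by the planner.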


----

Original title: **jobs,upgrades: deadlock during migrations when initializing multiple servers for 1 tenant**

**Describe the problem**

While investigating #100436 I discovered the following deadlock:

1. Create a new tenant record, with no server running yet. At this point the tenant keyspace is empty *and it has not run its migrations yet*.
2. Start **two or more servers for that tenant simultaneously** (this is important).
3. Observe: often, the tenant servers deadlock.

The underlying problem is the following:

- When the tenant servers **begin** their startup, they immediately populate a SQL instance row in the liveness table. **However, at that time they are not yet ready to process SQL queries because their migrations have not run yet.**
- Then, one of the two will start running the upgrade migrations.
- One of these migrations will need a DistSQL distributed query. That, in turn, will fail because the other server(s) are not yet ready to accept DistSQL execution:

```
I230404 10:23:32.796367 15064 upgrade/upgrademanager/manager.go:235 ⋮ [T2,n2] 534  the last permanent upgrade (v1000022.2-94) does not appear to have completed; attempting to run all upgrades
I230404 10:23:32.796659 15064 upgrade/upgrademanager/manager.go:278 ⋮ [T2,n2] 535  running permanent upgrade for version 0.0-2
I230404 10:23:32.824530 15776 1@circuitbreaker/circuitbreaker.go:322 ⋮ [T2,n2] 545  circuitbreaker: ‹rpc 127.0.0.1:64403 [n1]› tripped: unable to look up descriptor for n1: non existent SQL instance
I230404 10:23:32.824557 15776 1@circuitbreaker/circuitbreaker.go:447 ⋮ [T2,n2] 546  circuitbreaker: ‹rpc 127.0.0.1:64403 [n1]› event: ‹BreakerTripped›
W230404 10:23:32.824573 15776 sql/colflow/colrpc/outbox.go:189 ⋮ [T2,n2,intExec=‹select-job›,f‹0f758d2f›,distsql.stmt=‹WITH latestpayload AS (SELECT job_id, value FROM system.job_info AS payload WHERE (info_key = '_') AND (job_id = $1) ORDER BY written DESC LIMIT _), latestprogress AS (SELECT job_id, value FROM system.job_info AS progress WHERE (info_key = '_') AND (job_id = $1) ORDER BY written DESC LIMIT _) SELECT status, payload.value AS payload, progress.value AS progress, claim_session_id, COALESCE(last_run, created), COALESCE(num_runs, _) FROM system.jobs AS j INNER JOIN latestpayload AS payload ON j.id =›,distsql.gateway=‹2›,distsql.appname=‹$ internal-select-job›,distsql.txn=‹b8a5af83-f1f1-4364-9074-5e4bc41d66f5›,streamID=‹9›] 547  Outbox Dial connection error, distributed query will fail: unable to look up descriptor for n1: non existent SQL instance
```
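
The sequence above can be reduced to a toy model. This is only an illustration of the described failure mode, not CockroachDB code: the `server` type and `runMigration` function are hypothetical, and the model assumes the planner considers every registered instance.

```go
package main

import "fmt"

// server is a toy stand-in for a tenant SQL server.
type server struct {
	id         int
	registered bool // instance row written at the beginning of startup
	ready      bool // true only once migrations have completed
}

// runMigration models a migration that issues a distributed query from the
// given gateway. Assumption: the query is planned across every registered
// instance, and it fails if any remote instance cannot yet accept flows.
func runMigration(gateway int, servers []*server) error {
	for _, s := range servers {
		if s.id == gateway || !s.registered {
			continue
		}
		if !s.ready {
			return fmt.Errorf("n%d: cannot dial n%d: instance not ready", gateway, s.id)
		}
	}
	return nil
}

func main() {
	// Both servers register immediately; neither is ready before migrations run.
	servers := []*server{
		{id: 1, registered: true},
		{id: 2, registered: true},
	}
	// n2 runs the migrations, but n1 only becomes ready after they complete.
	if err := runMigration(2, servers); err != nil {
		fmt.Println("migration blocked:", err)
	}
}
```

In this model the migration on n2 needs n1 to be ready, while n1 can only become ready once the migration has completed, which matches the circular wait described above. With a single server the loop has no remote instance to dial and the migration proceeds.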

**Expected behavior**

DistSQL should recognize when other SQL instances are not yet ready to process queries and not attempt to use them.

Alternatively, we could disable DistSQL entirely during upgrade migrations; however, that seems undesirable because certain migrations rewrite tables and would benefit from the extra concurrency.

Jira issue: CRDB-26509


Epic CRDB-39091

sql: distsql planning should avoid sending flows to SQL servers in the process of starting up, and in the process of shutting down #100578
