sql: distsql planning should avoid sending flows to SQL servers in the process of starting up, and in the process of shutting down #100578
Description
New description: we would like a new field in the SQL instances table that indicates node readiness.
The field would not be set to "ready" until server startup has completed, and it would be set to a different value during graceful shutdown.
DistSQL planning should avoid nodes based on that field.
Original title: jobs,upgrades: deadlock during migrations when initializing multiple servers for 1 tenant
Describe the problem
While investigating #100436 I discovered the following deadlock:
- create a new tenant record; no server running yet. At this point the tenant keyspace is empty and it has not run its migrations yet
- start two or more servers for that tenant simultaneously (this is important)
- observe: often, the tenant servers deadlock.
The underlying problem is the following:
- when the tenant servers begin their startup, they immediately populate a SQL instance row in the liveness table. However, at that time they are not yet ready to process SQL queries because their migrations have not run yet
- then, one of the two will start running the upgrade migrations.
- one of these migrations needs to run a distributed (DistSQL) query. That, in turn, fails because the other server(s) are not yet ready to accept DistSQL execution:
I230404 10:23:32.796367 15064 upgrade/upgrademanager/manager.go:235 ⋮ [T2,n2] 534 the last permanent upgrade (v1000022.2-94) does not appear to have completed; attempting to run all upgrades
I230404 10:23:32.796659 15064 upgrade/upgrademanager/manager.go:278 ⋮ [T2,n2] 535 running permanent upgrade for version 0.0-2
I230404 10:23:32.824530 15776 1@circuitbreaker/circuitbreaker.go:322 ⋮ [T2,n2] 545 circuitbreaker: ‹rpc 127.0.0.1:64403 [n1]› tripped: unable to look up descriptor for n1: non existent SQL instance
I230404 10:23:32.824557 15776 1@circuitbreaker/circuitbreaker.go:447 ⋮ [T2,n2] 546 circuitbreaker: ‹rpc 127.0.0.1:64403 [n1]› event: ‹BreakerTripped›
W230404 10:23:32.824573 15776 sql/colflow/colrpc/outbox.go:189 ⋮ [T2,n2,intExec=‹select-job›,f‹0f758d2f›,distsql.stmt=‹WITH latestpayload AS (SELECT job_id, value FROM system.job_info AS payload WHERE (info_key = '_') AND (job_id = $1) ORDER BY written DESC LIMIT _), latestprogress AS (SELECT job_id, value FROM system.job_info AS progress WHERE (info_key = '_') AND (job_id = $1) ORDER BY written DESC LIMIT _) SELECT status, payload.value AS payload, progress.value AS progress, claim_session_id, COALESCE(last_run, created), COALESCE(num_runs, _) FROM system.jobs AS j INNER JOIN latestpayload AS payload ON j.id =›,distsql.gateway=‹2›,distsql.appname=‹$ internal-select-job›,distsql.txn=‹b8a5af83-f1f1-4364-9074-5e4bc41d66f5›,streamID=‹9›] 547 Outbox Dial connection error, distributed query will fail: unable to look up descriptor for n1: non existent SQL instance
Expected behavior
DistSQL should recognize when other SQL instances are not yet ready to process queries and not attempt to use them.
Alternatively, we could disable DistSQL entirely during upgrade migrations, but that seems undesirable because certain migrations rewrite tables and would benefit from the extra concurrency.
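The trade-off between the two options can be sketched as a toy model. This is an illustration under stated assumptions, not planner code: `Policy`, `flowTargets`, and the instance counts are all hypothetical, and the model only counts how many instances a query would be spread across.

```go
package main

import "fmt"

// Policy names the two options discussed above (hypothetical type).
type Policy int

const (
	// DisableDuringUpgrades plans every query locally while upgrades run.
	DisableDuringUpgrades Policy = iota
	// SkipUnreadyInstances distributes, but only to ready instances.
	SkipUnreadyInstances
)

// flowTargets returns the number of instances, gateway included, that a
// query would be spread across under the given policy.
func flowTargets(p Policy, upgrading bool, readyRemotes, unreadyRemotes int) int {
	switch p {
	case DisableDuringUpgrades:
		if upgrading {
			return 1 // gateway only: safe, but no extra concurrency
		}
		// No readiness check in this policy: unready remotes still get flows.
		return 1 + readyRemotes + unreadyRemotes
	case SkipUnreadyInstances:
		return 1 + readyRemotes // unready remotes are never dialed
	}
	return 1
}

func main() {
	// A table-rewriting migration with 3 ready remotes and 1 still starting.
	fmt.Println(flowTargets(DisableDuringUpgrades, true, 3, 1))
	fmt.Println(flowTargets(SkipUnreadyInstances, true, 3, 1))
}
```

In this model, the readiness-based policy keeps the migration distributed across the ready instances, while disabling DistSQL confines it to the gateway.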
Jira issue: CRDB-26509
Epic CRDB-39091