sql: distsql planning should avoid sending flows to SQL servers in the process of starting up, and in the process of shutting down #100578

@knz

Description

New description: we would like a new field in the sql instances table that indicates node readiness.

The field would not be set to "ready" until server startup has completed, and it would be set to a different value during graceful shutdown.

DistSQL planning should then skip any node that this field does not mark as ready.
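
To make the proposal concrete, here is a minimal sketch of how such a field could be consulted at planning time. Everything in it is hypothetical: the `Readiness` type, its three values, the `Instance` struct and the `PlannableInstances` helper are illustrative names, not existing CockroachDB code. It also assumes the gateway instance is always kept, since it is already executing the query locally.

```go
package main

// Readiness is a hypothetical lifecycle state stored with each SQL instance row.
// The zero value is Starting, so a row written before the field is set is
// treated as not ready.
type Readiness int

const (
	Starting Readiness = iota // row exists, but startup (including migrations) has not completed
	Ready                     // startup completed; the instance may accept DistSQL flows
	Draining                  // graceful shutdown in progress
)

// Instance is a simplified stand-in for a row in the SQL instances table.
type Instance struct {
	ID        int
	Addr      string
	Readiness Readiness
}

// PlannableInstances returns the instances that may be assigned flows.
// Remote instances are kept only if they are Ready. The gateway is always
// kept (an assumption of this sketch) because it runs the query itself,
// even while it is still starting up.
func PlannableInstances(all []Instance, gatewayID int) []Instance {
	var out []Instance
	for _, inst := range all {
		if inst.ID == gatewayID || inst.Readiness == Ready {
			out = append(out, inst)
		}
	}
	return out
}
```

With this shape, a server that is still running its migrations would plan only on itself and on instances that have finished startup, and an instance that has begun draining would stop receiving new flows.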


Original title: jobs,upgrades: deadlock during migrations when initializing multiple servers for 1 tenant

Describe the problem

While investigating #100436 I discovered the following deadlock:

  1. create a new tenant record; no server running yet. At this point the tenant keyspace is empty and the tenant has not run its migrations yet.
  2. start two or more servers for that tenant simultaneously (this is important)
  3. observe: often, the tenant servers deadlock.

The underlying problem is the following:

  • When the tenant servers begin their startup, they immediately populate a SQL instance row in the liveness table. However, at that time they are not yet ready to process SQL queries because their migrations have not run yet.
  • Then, one of the servers starts running the upgrade migrations.
  • One of these migrations needs a distributed (DistSQL) query. That query fails because the other server(s) are not yet ready to accept DistSQL execution:
I230404 10:23:32.796367 15064 upgrade/upgrademanager/manager.go:235 ⋮ [T2,n2] 534  the last permanent upgrade (v1000022.2-94) does not appear to have completed; attempting to run all upgrades
I230404 10:23:32.796659 15064 upgrade/upgrademanager/manager.go:278 ⋮ [T2,n2] 535  running permanent upgrade for version 0.0-2
I230404 10:23:32.824530 15776 1@circuitbreaker/circuitbreaker.go:322 ⋮ [T2,n2] 545  circuitbreaker: ‹rpc 127.0.0.1:64403 [n1]› tripped: unable to look up descriptor for n1: non existent SQL instance
I230404 10:23:32.824557 15776 1@circuitbreaker/circuitbreaker.go:447 ⋮ [T2,n2] 546  circuitbreaker: ‹rpc 127.0.0.1:64403 [n1]› event: ‹BreakerTripped›
W230404 10:23:32.824573 15776 sql/colflow/colrpc/outbox.go:189 ⋮ [T2,n2,intExec=‹select-job›,f‹0f758d2f›,distsql.stmt=‹WITH latestpayload AS (SELECT job_id, value FROM system.job_info AS payload WHERE (info_key = '_') AND (job_id = $1) ORDER BY written DESC LIMIT _), latestprogress AS (SELECT job_id, value FROM system.job_info AS progress WHERE (info_key = '_') AND (job_id = $1) ORDER BY written DESC LIMIT _) SELECT status, payload.value AS payload, progress.value AS progress, claim_session_id, COALESCE(last_run, created), COALESCE(num_runs, _) FROM system.jobs AS j INNER JOIN latestpayload AS payload ON j.id =›,distsql.gateway=‹2›,distsql.appname=‹$ internal-select-job›,distsql.txn=‹b8a5af83-f1f1-4364-9074-5e4bc41d66f5›,streamID=‹9›] 547  Outbox Dial connection error, distributed query will fail: unable to look up descriptor for n1: non existent SQL instance

Expected behavior

DistSQL should recognize when other SQL instances are not yet ready to process queries and not attempt to use them.

Alternatively, we could disable DistSQL entirely during upgrade migrations, but that seems undesirable: certain migrations rewrite tables, and those rewrites would benefit from the extra concurrency.

Jira issue: CRDB-26509

Epic CRDB-39091

Metadata

Labels

  • A-cluster-upgrades
  • A-jobs
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • T-db-server
  • branch-master: Failures and bugs on the master branch.
  • branch-release-23.1: Used to mark GA and release blockers, technical advisories, and bugs for 23.1.
