-
Notifications
You must be signed in to change notification settings - Fork 4.1k
sql: schema changes can be very slow #47790
Description
Describe the problem
We've had a number of reports of schema changes being slow lately (##38111, #47607, #47512). This is a meta-issue for a problem which is theorized here and to track reproduction and evaluation. It also includes proposed steps.
Theory
We’ve exposed the job adoption cycle to some schema changes even if they’d happen super fast.
- Run just sends on the channel N times for N jobs.
- It might pick up jobs other than the ones we wanted to pick
- Especially bad is we might pick up GC jobs which jobs now create from other jobs.
- Also might pick up queued schema changes which will fail immediately due to unmet dependencies.
Alternative Consideration
We scan the entire jobs table in the job adoption loop, as this table gets big, that's probably slow.
Proposals
-
1. Understand the exact implications with regards to user-visible latency for schema changes
-
We should write some tests which are hella slow and fix that
- N goroutines, each M time creating a table, doing some queued schema changes, drop the table
- Also randomized syntax tests seem slow
-
2. Prioritize the jobs we adopt in the adoption loop
-
We really want to adopt just about anything over top of GC Jobs
-
We really don't want to adopt jobs that have unmet mutation dependencies
-
We could create a simple ranking mechanism where we prioritize
- Type - Do GC jobs last maybe
- Key - choose uniformly randomly over the jobs with the same key
- Value - choose in value order for jobs with the same key
I'm not exactly sold on the type prioritization but at least it's easy.
For the others, we could inject a function per type that controls its ranking.
It doesn't need to be very tightly coupled.
-
3. Start jobs immediately after finishing a job.
-
Right now we wait for the adoption loop before picking up another job. If you have a number of queued mutations, it may take a long time for the subsequent jobs to get picked up.
-
Really easy
-
4. Stop creating jobs that just wait.
-
Right now we create jobs that make no sense as jobs.
-
There is no value if another node adopts the job except for coordinating the waiting.
-
These excess jobs probably exacerbate the above problems.
-
5. Create an index on the status column of jobs.
-
Junk piles up and then the query gets very slow.
-
Adding an index should be an easy migration.
Environment:
- Applies especially to CockroachDB 20.1