storage: allow queues to process replicas in parallel, increase splitQueue's concurrency #21562
Conversation
A clock was already accessible through the Store, so passing one separately was just distracting, crowding the method signatures. Release note: None
To expand on the note about auxiliary reasons for prioritizing this change: this is a WIP of what I picture back-pressuring writes on splits looking like. It relies on the changes to
I did some testing of the impact parallelizing splits will have on workloads. To do this, I first dropped range_max_bytes to 4MB:

range_max_bytes=4MB split_concurrency=1
[graph]

range_max_bytes=4MB split_concurrency=4
[graph]

As the test shows, at a concurrency of 1, the split queue was unable to keep up*.

* it might eventually, once the range count grew large enough that the load was sufficiently

I then dropped range_max_bytes to 2MB:

range_max_bytes=2MB split_concurrency=1
[graph] (~30x the range max bytes)

range_max_bytes=2MB split_concurrency=4
[graph] (~3x the range max bytes)

I also tried the second config with my WIP backpressure feature on. The graph looked

Finally, I dropped range_max_bytes to 1MB:

range_max_bytes=1MB split_concurrency=1
So bad I stopped testing right away.

range_max_bytes=1MB split_concurrency=4, with and without backpressure
[graph]

While the graphs looked about the same for concurrency=4 with and without backpressure,

Finally, I ran an experiment with a much larger amount of split concurrency. Even

range_max_bytes=1MB split_concurrency=16 with backpressure
[graph]
Nice writeup!
Dynamically increasing split concurrency sounds a bit much. Splits are quick with these small ranges because the stats computation is so fast. With larger ranges increased concurrency might be a problem. Or maybe not. Rather than dynamically adjusting the concurrency, I think a cluster setting plus the back pressure mechanism would be sufficient in the near term.
Agreed. When would a dynamic adjustment be better than just a higher constant concurrency?

Review status: 0 of 12 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.

pkg/storage/queue.go, line 288 at r2 (raw file):
The semaphore pattern we use elsewhere (e.g. Stopper.RunLimitedAsyncTask) has the channel initially empty (and then we "acquire" by sending to the channel and "release" by receiving).

pkg/storage/queue.go, line 803 at r2 (raw file):

This smells bad to me. Why do we need to shut down everything instead of just locking

Comments from Reviewable
Up until this point, `baseQueue` has always made the assumption that only one replica will be processed at any given time. This has been fine until now, but there are valid cases where we may want to allow a bounded amount of concurrency within the queues. An example of this is the `splitQueue`, where we may want to allow a few replicas to be processed in parallel. This change adds a `maxConcurrency` option to `queueConfig`. `baseQueue` then changes to work off a process semaphore instead of the previous `processMu`. The handling of `processMu` was pretty awkward before, so this change also allows us to clean up some code, remove race conditions that allowed for poorly specified behavior, and address a TODO. One of the auxiliary reasons for prioritizing this change is that until now, queues have always removed their handle to replicas completely once they began processing. This meant that it was impossible to determine which replicas were processing at any given time. It was also impossible to register a callback for the completion of a processing attempt. Both of these were things I wanted to do in relation to back-pressuring writes on splits (cockroachdb#21357). Release note: None
Splits should be relatively isolated, other than requiring expensive RocksDB scans over part of the splitting range to recompute stats. Because of this, we allow a limited number of splits to be processed at once. This prevents slow splits from stalling all other splits from ever running and also improves the splitQueue's ability to keep up with heavy load. Release note: None
`baseQueue.DrainQueue` no longer races with `baseQueue.processLoop`, so we can address an existing TODO and rely on `DrainQueue` to fully process all enqueued replicas before returning. This allows us to simplify a few tests. For each of these tests, I verified that they were still flaky without the `SucceedsSoon` if `DrainQueue` did not call `lockProcessing` and then verified that the flake went away when the `lockProcessing` call was added back in. I'm sure there are other tests that could be improved now, but nothing stuck out to me when I looked. Release note: None
TFTR!
That's a good point, we might as well just increase the max concurrency to start with. Based on the experiments I ran above and on Peter's point about larger ranges being more resource intensive, I think a concurrency of 4 for the

Review status: 0 of 12 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.

pkg/storage/queue.go, line 288 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…

Done.

pkg/storage/queue.go, line 803 at r2 (raw file): Previously, bdarnell (Ben Darnell) wrote…

Good point, we only needed to acquire from the semaphore once, since we were only running one task at a time. Done.

Comments from Reviewable
Reviewed 6 of 9 files at r1, 2 of 3 files at r2, 1 of 1 files at r4, 3 of 3 files at r5.

Comments from Reviewable
nit: this would have benefited from a release note mentioning the performance improvement.
The change in cockroachdb#21562 made it possible for queues to process replicas in parallel. In doing so, it changed how `baseQueue` handled the queuing of replicas that were already processing when `MaybeAdd` was called. This was not a problem before because only a single replica could ever be processed at a time, so it was safe to add the replica that was currently processing back into the queue. That said, the previous approach could result in replicas being queued with the wrong priority, which could cause queuing inefficiencies. With the changes made in cockroachdb#21562, `baseQueue` now needs to be careful that the same replica doesn't get processed twice at the same time. This is accomplished by using a `requeue` flag on the `replicaItem`. If the flag is set when a replica finishes processing, we requeue it. This ensures that we never skip a necessary processing attempt, and is safe because all queue processing is idempotent. This change makes sure that we call `ShouldQueue` again before placing replicas with the `requeue` flag back in the queue. This isn't strictly required for correctness (again, because processing is idempotent), but it ensures that we don't add replicas back into the queue needlessly. It also ensures that we place the replica back in the queue using the priority provided by `ShouldQueue` *after* the earlier processing attempt finishes, instead of the priority it had *before*. This fixes a subtle issue that we've actually always had. I suspect that this would have helped keep the queue size down in some of the testing from cockroachdb#21562. It would have also helped us keep the max range size down, because we would have better optimized which ranges we prioritized as the queue began to fill. Release note (performance improvement): Multiple ranges can now split at the same time, improving our ability to handle hotspot workloads.