db: automatically tune compaction concurrency based on available CPU/disk headroom and read-amp #1329
Description
This issue was originally about the need to reduce tuning knobs in Pebble that govern performance more generally. Over time, some of these have been adjusted, or have had adjustments considered. The major one that remains is MaxConcurrentCompactions, which is set to min(NumCPU(), 3). That is fewer concurrent compactions than many beefy nodes with fast NVMe drives can handle. Pebble, together with CockroachDB's admission control, should be able to schedule additional compactions when there is CPU and disk I/O headroom, unlocking greater compaction concurrency without manual operator intervention.
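For concreteness, the static policy described above amounts to a one-liner. The helper below is a hypothetical restatement of that policy, not a Pebble API:

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultConcurrency mirrors the static policy described above:
// min(NumCPU(), 3). On a 32-vCPU machine with fast NVMe drives this
// still yields only 3 concurrent compactions.
func defaultConcurrency() int {
	n := runtime.NumCPU()
	if n < 3 {
		return n
	}
	return 3
}

func main() {
	fmt.Println(defaultConcurrency())
}
```

The point of the issue is that this constant ceiling leaves CPU and disk headroom unused on large machines.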
There's a similar case for increasing compaction concurrency even under heavy foreground write traffic, since compacting now reduces the work needed to incorporate future foreground writes into a well-formed LSM. Even in that case, adaptively increasing compaction concurrency yields better foreground write performance in the long run, and is worth considering even though diverting more disk and CPU bandwidth to a background operation may seem counterintuitive in the moment.
Original text of the issue follows below the horizontal line.

---
Pebble has various tuning knobs that affect its CPU and disk IOPS/bandwidth utilization. Examples include L0CompactionConcurrency, CompactionDebtConcurrency, MaxConcurrentCompactions, DeleteRangeFlushDelay, MinDeletionRate.
Some of these, like MinDeletionRate and DeleteRangeFlushDelay, affect how quickly we reclaim disk space, so they are probably not very interesting in most circumstances: it is comparatively easy to provision enough disk byte capacity.
The compaction knobs affect how many compactions can run concurrently, and running more concurrent compactions would let Pebble absorb a higher write throughput without developing a misshapen LSM. CockroachDB accepts the default values for these knobs except MaxConcurrentCompactions, which it sets to min(NumCPU(), 3). These settings are likely sub-optimal in many deployments: (a) the limit of 3 is likely too low on large machines, and (b) even 3 compactions could be too many on a large machine during a spike in read-only user-facing traffic that wants to consume all CPU, or when disk IOPS/bandwidth are saturated by user-facing traffic.
We have discussed ways to speed up compactions in the past by parallelizing them into multiple sub-compactions that partition the key span of the compaction. More recently, there are ideas to parallelize the compression/decompression of sstable blocks within a compaction.
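As a sketch of the key-span partitioning idea: given the ordered boundary keys of a compaction's input, splitting them into contiguous sub-spans yields one span per sub-compaction. The function below is illustrative only, not Pebble's actual splitting logic (which operates on sstable boundaries and sizes):

```go
package main

import "fmt"

// partitionSpans splits an ordered list of boundary keys into n contiguous
// [start, end) sub-spans, one per would-be sub-compaction. Adjacent spans
// share a boundary, so their union covers the whole input range.
func partitionSpans(bounds []string, n int) [][2]string {
	if len(bounds) < 2 || n < 1 {
		return nil
	}
	spans := make([][2]string, 0, n)
	step := float64(len(bounds)-1) / float64(n)
	for i := 0; i < n; i++ {
		lo := bounds[int(float64(i)*step)]
		hi := bounds[int(float64(i+1)*step)]
		spans = append(spans, [2]string{lo, hi})
	}
	return spans
}

func main() {
	// Two sub-compactions over the span [a, e): [a, c) and [c, e).
	fmt.Println(partitionSpans([]string{"a", "b", "c", "d", "e"}, 2))
}
```

In practice the split points would be chosen so each sub-span carries roughly equal bytes, not an equal count of boundaries.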
These compaction-related ideas share a common need: detect current resource utilization (or overload of a resource), and use it to adjust the amount of concurrent background work so that foreground traffic is not affected. Such adjustments may need to be fine-grained, even slowing down a compaction that has already started.
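One way such an adjustment loop could work is an additive-increase/multiplicative-decrease policy over the concurrency limit. The thresholds and utilization inputs below are illustrative assumptions, not measured Pebble values:

```go
package main

import "fmt"

// limiter holds an adjustable concurrency limit for background compactions.
type limiter struct {
	limit, min, max int
}

// adjust implements additive-increase/multiplicative-decrease: probe for
// more concurrency while both CPU and disk have headroom, and back off
// quickly when either resource looks overloaded. The 0.60/0.85 thresholds
// are assumptions for illustration.
func (l *limiter) adjust(cpuUtil, diskUtil float64) int {
	const lo, hi = 0.60, 0.85
	switch {
	case cpuUtil > hi || diskUtil > hi:
		l.limit /= 2 // overload: shed background work quickly
	case cpuUtil < lo && diskUtil < lo:
		l.limit++ // headroom: probe for more concurrency
	}
	if l.limit < l.min {
		l.limit = l.min
	}
	if l.limit > l.max {
		l.limit = l.max
	}
	return l.limit
}

func main() {
	l := limiter{limit: 3, min: 1, max: 12}
	fmt.Println(l.adjust(0.2, 0.2)) // headroom: raise the limit
	fmt.Println(l.adjust(0.9, 0.1)) // CPU overload: halve it
}
```

A real controller would also need the fine-grained lever mentioned above, e.g. pacing an already-running compaction rather than only gating new ones.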
Detecting CPU utilization or overload is comparatively easy: the CockroachDB context already has hooks that inspect goroutine scheduler state. Disk IOPS/bandwidth is harder: log commit latency is one signal, but it may not always be a good indicator, so we would need to run some overload experiments to understand this better.
Jira issue: PEBBLE-119