perf: investigate N-level compactions (N > 2) #136
Description
Compactions are currently limited to inputs from 2 levels, but that is a small limitation in the code and the compaction picking heuristics, not a fundamental limitation of compactions. The problem with limiting compactions to 2 levels is that doing so increases write-amplification and complicates policies for picking which compaction to perform. Consider a write-only workload with default options and all new data (so that compactions simply do not result in any space savings, which is a reasonable assumption for any non-bottom-level compaction). Below I'm showing only non-empty levels:
L0 64 MB
L5 64 MB
L6 64 MB
The compaction picker will see this state and perform an L0->L5 compaction resulting in:
L0 0 MB
L5 128 MB
L6 64 MB
If the write workload is turned off, there will be a subsequent compaction at this point from L5->L6:
L0 0 MB
L5 0 MB
L6 192 MB
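The extra write-amplification from the two-step path can be made concrete with a little arithmetic: each compaction rewrites the combined size of its inputs. The sketch below (hypothetical helper, not Pebble code) tallies bytes written for the two separate 2-level compactions versus a single 3-level compaction over the same data:

```go
package main

import "fmt"

// bytesWritten sums the data rewritten by a sequence of compactions.
// Each compaction writes out the combined size of its input levels.
func bytesWritten(compactions [][]int64) int64 {
	var total int64
	for _, inputs := range compactions {
		for _, sz := range inputs {
			total += sz
		}
	}
	return total
}

func main() {
	const mb = int64(1 << 20)
	// Two separate 2-level compactions: L0->L5 (64+64 MB in),
	// then L5->L6 (128+64 MB in).
	twoStep := bytesWritten([][]int64{{64 * mb, 64 * mb}, {128 * mb, 64 * mb}})
	// One 3-level compaction over L0+L5+L6 (64+64+64 MB in).
	multi := bytesWritten([][]int64{{64 * mb, 64 * mb, 64 * mb}})
	fmt.Println(twoStep/mb, multi/mb) // 320 192
}
```

For the same 64 MB of ingested data, the two-step path writes 320 MB while the single 3-level compaction writes 192 MB.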
By doing 2 compactions we've increased write-amplification. Even worse, if the write workload wasn't turned off, we could "starve" the L5->L6 compaction. Let's back up a step to the original L0->L5 compaction and imagine that while it is taking place, another 64 MB L0 table is being flushed. The result is:
L0 64 MB
L5 128 MB
L6 64 MB
The compaction picking heuristics might pick the L5->L6 compaction at this point, but even if they do, pretty soon we'd get into a situation where L0 increases in size faster than L5->L6 compactions can clean it out. Back-pressure is definitely needed at some point (see #7), but I suspect that allowing multi-level compactions will also help.
The heuristic that I propose to explore is to expand a compaction to contain the next level if the current output level would become over-full due to the compaction. We estimate the number of new bytes added to the output level as the number of incoming bytes from the other input levels (this doesn't take into account updated keys, but I believe that is ok). So back to our original state:
L0 64 MB
L5 64 MB
L6 64 MB
We'd decide that an L0->L5 compaction needs to take place, and estimate the new size of L5 as 128 MB. This would exceed the L5 max-size threshold (64 MB since it is the base level) which would cause us to include L6 in the compaction.
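The proposed expansion rule can be sketched as a small loop: starting from a startLevel->outputLevel compaction, keep pulling in the next level while the estimated post-compaction size of the output level exceeds its max size. The function and per-level slices below are hypothetical illustrations, not Pebble's actual compaction-picker API:

```go
package main

import "fmt"

// pickCompactionLevels expands a startLevel->outputLevel compaction to
// include successive levels while the estimated post-compaction size of
// the output level would exceed that level's max size. levelSize and
// maxSize are hypothetical per-level state, indexed by level number.
func pickCompactionLevels(levelSize, maxSize []int64, startLevel, outputLevel int) []int {
	levels := []int{startLevel, outputLevel}
	// Estimate incoming bytes as the sum of the input level sizes;
	// updated keys that would be elided are ignored, per the heuristic.
	incoming := levelSize[startLevel]
	for outputLevel+1 < len(levelSize) {
		estimated := levelSize[outputLevel] + incoming
		if estimated <= maxSize[outputLevel] {
			break
		}
		// Output level would become over-full: expand the compaction
		// to include the next level as well.
		incoming += levelSize[outputLevel]
		outputLevel++
		levels = append(levels, outputLevel)
	}
	return levels
}

func main() {
	const mb = int64(1 << 20)
	// The example state from the issue: 64 MB each in L0, L5, L6.
	levelSize := []int64{64 * mb, 0, 0, 0, 0, 64 * mb, 64 * mb}
	// L5 is the base level with a 64 MB max size; L6's cap is effectively unbounded.
	maxSize := []int64{256 * mb, 0, 0, 0, 0, 64 * mb, 1 << 40}
	fmt.Println(pickCompactionLevels(levelSize, maxSize, 0, 5)) // [0 5 6]
}
```

On the example state, the estimated new L5 size (128 MB) exceeds the 64 MB threshold, so the loop expands the compaction once, yielding an L0+L5+L6 compaction, and then stops because L6 is the last level.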
The end result should be a decrease in starvation of Lbase->Lbase+1 compactions because we're automatically including them in the L0->Lbase compaction.
There is a major caveat to implementing this with the current Pebble compaction setup and mechanisms: L0 might span the entire key space, so an L0->Lbase compaction might involve compacting all of Lbase, which would in turn pull in all of Lbase+1. There are a variety of ways in which this could be addressed. For example, L0 tables could be split at boundaries that line up with the sstables in Lbase.
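The boundary-splitting mitigation mentioned above can be sketched as follows: given the start keys of the existing Lbase sstables, split the keys destined for L0 at those boundaries, so that a later L0->Lbase compaction can pick up only the overlapping Lbase tables rather than all of them. The function below is purely illustrative (string keys, in-memory slices), not Pebble's flush path:

```go
package main

import (
	"fmt"
	"sort"
)

// splitAtBoundaries partitions a sorted run of keys into groups, starting
// a new group whenever a boundary key is crossed. boundaries would be the
// (hypothetical) start keys of the existing Lbase sstables.
func splitAtBoundaries(keys, boundaries []string) [][]string {
	sort.Strings(keys)
	sort.Strings(boundaries)
	var out [][]string
	var cur []string
	for _, k := range keys {
		// Advance past any boundary at or before k, closing the
		// current group so k starts a new output table.
		for len(boundaries) > 0 && boundaries[0] <= k {
			if len(cur) > 0 {
				out = append(out, cur)
				cur = nil
			}
			boundaries = boundaries[1:]
		}
		cur = append(cur, k)
	}
	if len(cur) > 0 {
		out = append(out, cur)
	}
	return out
}

func main() {
	keys := []string{"a", "c", "f", "m", "x"}
	boundaries := []string{"d", "n"} // start keys of two Lbase sstables
	fmt.Println(splitAtBoundaries(keys, boundaries)) // [[a c] [f m] [x]]
}
```

With L0 tables split this way, an L0->Lbase compaction over one group overlaps at most one Lbase sstable, so expanding it into Lbase+1 no longer drags in the whole level.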
Jira issue: PEBBLE-182