-
Notifications
You must be signed in to change notification settings - Fork 556
perf: generalized "move" compactions #518
Description
See #517 for background on sstable boundary signals. Consider the same scenario mentioned in that issue:
L0 a------------------------z
L1 ab de gh jk mn pq st vw yz
In this scenario, L0 only contains two keys: a and z. Currently, a compaction from L0 to L1 will end up rewriting all of the sstables in L1. This is quite wasteful as we really only need to rewrite 2 sstables. The rest could be reused. Pebble already supports the notion of a "move" compaction which is performed if a compaction from Ln to Lm involves only a single input table. In that case, the input table is simply moved to the output table level: a pure metadata operation that is very fast. The generalization of "move" compactions is to recognize when the output records for a compaction are identical to the records in a single input table.
I have a vague sense of how this could be done. We could create separate levelIters on the input tables and seek on them to see where there are gaps in the key space of the inputs. In the scenario here, we'd create an inputIter[0] and inputIter[1]. When the compactionIter steps into one of the L1 input tables, we'd check the inputIter[0] to see if there are any keys which overlap in L0. If there are, we proceed with compaction until the next L1 input table. If there is no overlap with keys in L0, we "move" the L1 input table directly to output. Reminder that we need to check for both point and range tombstone overlap.
One downside to moving sstables like this, is that we lose the snapshot and tombstone processing that takes place in compactionIter. This is only a problem if we're actually moving an sstable from L0 to L1.
I don't see any fundamental obstacles to this mechanism, though the compaction logic is already dauntingly complex. As a prerequisite, we need to figure out better abstractions, and to break up the already too complex DB.runCompaction method.
Jira issue: PEBBLE-177