compaction: propose multi level compaction heuristics #1746

msbutler wants to merge 2 commits into cockroachdb:master
Conversation
Previously, leveled compaction only compacted SSTs from at most 2 levels. This change introduces the mechanics to conduct a 3-level compaction as an experimental feature. No multi-level compactions will actually occur, as the new `initMultiLevelCompaction` function always returns false. Future work includes:

- consider heuristics to initiate multi-level compactions
- add mechanics to conduct multi-level compactions with more than 3 levels

Informs cockroachdb#136
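The shape of this approach, a decision function that is wired in but always declines so the new mechanics stay dormant, can be sketched as follows. Everything here except the name `initMultiLevelCompaction` is a hypothetical stand-in, not pebble's actual types:

```go
package main

import "fmt"

// compactionLevel pairs a level number with the bytes a compaction
// would read from that level. Hypothetical type for illustration;
// pebble's real compaction structs differ.
type compactionLevel struct {
	level int
	bytes uint64
}

// initMultiLevelCompaction mirrors the PR's approach: the decision
// function exists and is consulted, but always declines, so the new
// multi-level mechanics stay dormant until heuristics are added.
func initMultiLevelCompaction(inputs []compactionLevel) bool {
	return false
}

// expandInputs sketches where the gate might be consulted once a
// 2-level compaction has been picked (hypothetical name).
func expandInputs(inputs []compactionLevel) []compactionLevel {
	if initMultiLevelCompaction(inputs) {
		// Future work: append the next output level's overlapping files.
	}
	return inputs
}

func main() {
	inputs := []compactionLevel{{level: 1, bytes: 10}, {level: 2, bytes: 40}}
	fmt.Println(initMultiLevelCompaction(inputs), len(expandInputs(inputs))) // prints "false 2"
}
```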
Before posting some updated benchmarking results with additional low-level metrics (hopefully next week), I wanted to respond to some questions raised about the theoretical benefits of multi-level compaction and describe two more heuristics I'll experiment with.

**Modeling WriteAmp with a MultiLevel Compaction Policy**

In the first two paragraphs of @sumeerbhola's comment here, he describes a toy model used to think about write amp when multi-level compaction is possible. What if we modeled per-level write amp as an expected value, allowing us to cleanly model it as a function of multi-level compaction probability? I.e., roughly:

`E[per-level w-amp] = P(multi-level) * w-amp(multi-level) + (1 - P(multi-level)) * w-amp(2-level)`

where I think that first term can be set to 0, implying that a level skipped over by a multi-level compaction contributes no rewrite cost. If the formula above is valid, I think we can model the probability of a multi-level compaction as a function of per-level file size and per-level fan out / write amp. On the other hand, maybe my logic is totally off :D.

@petermattis also pointed me to his toy model where he works out write amp in a multi-level compaction setting. Like him, I'm quite puzzled by the math. Specifically, I don't understand where the per-level write amp definition comes from. From the bit of reading I've done, it seems that per-level fanout (the size ratio between adjacent levels) is what drives per-level write amp.

**Adjusting the Multi-Level Compaction Heuristic**

Currently, the compaction picker only triggers a multi-level compaction if the original output level is about to fill up or if over 50% of it is involved in the compaction. I will consider adding the following heuristics:

1. Triggering a multi-level compaction only when there's at least some overlap in the multi-level output level.
sumeerbhola left a comment
I'm having a hard time understanding anything in https://github.com/petermattis/compacttoy/blob/master/multilevel.go -- I'm probably thinking about this with a different mental model. We should ask Peter what he does recall -- maybe he is puzzled by only some later bits of the math so can help get us to at least the same level of understanding.
I was roughly thinking of the following -- want to understand how close this is to what you have in mind.
Do a multi-level compaction when all the following conditions are satisfied
- The output level will fill up due to this compaction (I am not convinced about a disjunction with 50% involved in the compaction, since that may result in excessive compactions when the output level is close to empty, e.g. when a level first starts getting populated, which can happen when the LSM is new-ish.)
- Calculate `w-amp-2-level = compaction-output-level-bytes / compaction-input-level-bytes` and `w-amp-3-level = compaction-output-level-plus-1-bytes / (compaction-output-level-bytes + compaction-input-level-bytes)`. If `w-amp-3-level + delta < w-amp-2-level` (I don't know what delta should be -- perhaps 0), then do a 3 level compaction. The idea here is that even though we are not choosing the best keyspan from the perspective of the `kMinOverlappingRatio` heuristic, we are choosing something where the write amp for this 3 level compaction is not worse than the 2 level compaction.
I have of course no idea whether the above will work well -- I am making things up. But the rough principle I'd like to follow is:
- Use conservative heuristics that we are somewhat convinced will do no damage.
- Add metrics (I mentioned some in my comments on the previous PR) that help us deeply understand how these heuristics are behaving.
- Run experiments on two kinds of LSMs, and say add 50GB (a) starting from empty, (b) starting with 300GB. For (b) we can copy the LSM state after it reaches 300GB and use that as the starting state for each subsequent experiment. That will help us understand the behavior of heuristics both on a "young" LSM and an "old" LSM.
- Iterate over the previous steps.
None of this requires production quality code, so we can optimize on experimentation speed. I'm happy to promptly look at new heuristics and the metrics that help us understand the experiment results (I was ooo for a couple of weeks), in case you want an extra pair of eyes through the process (which will also help in the eventual code review of the heuristics we settle on, since the why-this-heuristic question will already be answered).
btw, in your earlier experiments, how were you ensuring that compactions were not falling behind? Was the compaction concurrency set very high, or was the write concurrency low? We want to run this with a high amount of provisioned bandwidth and high enough compaction concurrency that none of the experiments have a compaction backlog when the writes end, so that we are getting write amp numbers that are actually comparable.
I spent some time working out some examples on paper with the toy model from #1603 (comment), with 3 levels that are initially populated with (0, 0, 10) bytes respectively. The end state is to get to (0, 0, x) where x is close to 100. With 2-level compactions we go through intermediate states like (1, 0, 10) => (2, 0, 10) => ... (10, 0, 10) => (0, 10, 10) => (0, 0, 20) ... => (0, 0, 100), which will eventually cause a rewrite cost of 45*9 for the second level and 45*10 for the third level. Let us summarize this rewrite cost as a tuple, (405, 450) in this case, for a total of 855. So far there is nothing new here -- we are following the same line of thinking.

But the reason for this exercise is that I am bothered by the notion that we would do 3-level compactions only when the middle level is full. In practice we have multiple files per level, and I think we should be deciding on a 3-level compaction based on a local property of the part of the key space this compaction is touching, and not the whole level. Also, in steady state each level is always close to full with our current compaction setup -- this is again because we have multiple files in each level and we move a small fraction of a level down to the next level in a compaction. So the fact that level L is now "exactly full" is a good signal for compaction scoring, i.e., for deciding when to initiate a compaction with level L as the top level, but it doesn't seem a good signal for deciding whether a compaction starting at L-1 that will write to L should also involve L+1 and write to L+1 instead. This latter decision should be based on the relative costs of compaction(L-1, L) and compaction(L-1, L, L+1), and not on the fullness of L.

Based on this, I tried to work out examples with the same toy model, using the `w-amp-3-level + delta < w-amp-2-level` rule from the previous comment as an independent criterion for deciding on a 3-level compaction, regardless of the fullness of the middle level.
What this does is that the rewrite cost tuple (which was (324, 450) when doing 3-level compactions only when the middle level is full) starts changing such that the first element of the tuple decreases and the second element increases, because we can start doing more 3-level compactions.

I can't construct a closed form solution (and anyway this is a toy model), but with delta=0 we get to (0, 0, 97) with a rewrite cost tuple of (220, 518). Compared with (324, 450), note the lower first number and the higher second number. The total is 738 vs 774, though some of this improvement is because we got to 97 and not to 100. With delta=-1, we get a rewrite cost tuple of (239, 473) (note that the first element has increased compared to delta=0, since we are running fewer 3-level compactions, and the second element has decreased), again to get to (0, 0, 97). The total is 712, which is better than delta=0. Note that delta values don't need to be integers, but since I was working this out by hand I only tried those. delta=-2 was not necessarily better.

The exact numbers in this toy model don't matter, but what I am trying to get at is the intuition that by fiddling with this delta we make a tradeoff between very aggressive 3-level compactions (that may increase overall write amp) and reasonable 3-level compactions (that can decrease write amp), without involving level fullness as a criterion. Experiments with real data would inform what a good delta value is in practice.
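A runnable version of this toy model might look like the following. This is my own reconstruction of the rules implied above (one byte arrives at a time, a compaction's rewrite cost is the pre-existing bytes in its output level, and the middle level holds 10 bytes); it reproduces the (405, 450) 2-level baseline, but the delta-rule results need not match the hand-computed numbers exactly:

```go
package main

import "fmt"

// simulate adds n one-byte units to an LSM starting as (0, 0, 10) and
// returns the rewrite-cost tuple (into middle level, into bottom level)
// plus the final middle/bottom level sizes. A compaction's cost is the
// pre-existing bytes in its output level. If useDelta, a 3-level
// compaction runs whenever w3 + delta < w2; otherwise only 2-level
// compactions run, with the middle level flushed down when it hits 10.
func simulate(n int, useDelta bool, delta float64) (intoL2, intoL3, l2, l3 int) {
	l2, l3 = 0, 10
	for i := 0; i < n; i++ {
		w2 := float64(l2)                  // middle-level bytes / 1 input byte
		w3 := float64(l3) / float64(l2+1)  // bottom-level bytes / (middle + input)
		if useDelta && w3+delta < w2 {
			// 3-level: the byte plus all of the middle level go to the bottom.
			intoL3 += l3
			l3 += l2 + 1
			l2 = 0
			continue
		}
		// 2-level: the byte compacts into the middle level, rewriting it.
		intoL2 += l2
		l2++
		if l2 == 10 { // middle level full: flush it into the bottom level
			intoL3 += l3
			l3 += 10
			l2 = 0
		}
	}
	return intoL2, intoL3, l2, l3
}

func main() {
	a, b, _, _ := simulate(90, false, 0)
	fmt.Printf("2-level only: (%d, %d) total %d\n", a, b, a+b) // (405, 450) total 855
	for _, d := range []float64{0, -1, -2} {
		a, b, l2, l3 := simulate(90, true, d)
		fmt.Printf("delta=%.0f: (%d, %d) total %d, end state (0, %d, %d)\n",
			d, a, b, a+b, l2, l3)
	}
}
```

Adding debugging output to a simulation like this is one way to explore delta values beyond what is practical by hand.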
I don't understand the math I wrote in that code either. I think it's better to start with the comment:
I believe the complexity in the math is in determining when a level is "too large". Rather than using the level sizing heuristics, the multi-level compaction strategy aims to minimize total write amp, which occurs when all levels experience the same write amp. I very well might have butchered something in the math there. It could be useful to add some debugging statements to that simulation to see how it behaves as data is written.
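For what it's worth, the "same write amp at every level" claim has a short standard justification via AM-GM (my own sketch, not from the toy model):

```latex
% Let r_i be the fanout of level i, so per-level write amp is roughly
% r_i, and total write amp is \sum_i r_i. The product \prod_i r_i = F
% is fixed by the ratio of total data size to the top level's size.
% By the AM-GM inequality,
\sum_{i=1}^{n} r_i \;\ge\; n \left(\prod_{i=1}^{n} r_i\right)^{1/n} = n\,F^{1/n},
% with equality iff r_1 = \dots = r_n, i.e. total write amp is
% minimized exactly when every level has the same fanout, and hence
% (in this model) the same write amp.
```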
closing in favor of #2787
This PR proposes heuristics to start a multi-level compaction, leveraging the multi-level compaction mechanics introduced in #1603.

WIP: still experimenting with compaction mechanics.