
compaction: propose multi level compaction heuristics #1746

Closed
msbutler wants to merge 2 commits into cockroachdb:master from
msbutler:butler-multi-level-heuristics

Conversation

@msbutler
Contributor

@msbutler msbutler commented Jun 8, 2022

This PR proposes heuristics to start a multilevel compaction, leveraging the multilevel compaction mechanics introduced in #1603.

WIP: still experimenting with compaction mechanics.

msbutler added 2 commits June 8, 2022 17:06
Previously, leveled compaction only compacted SSTs from at most 2 levels. This
change introduces the mechanics to conduct a 3 level compaction as an
experimental feature. No multi level compactions will actually occur, as the
new `initMultiLevelCompaction` function always returns false. Future work
includes:
  - consider heuristics to init multilevel compactions
  - add mechanics to init multilevel compactions with more than 3 levels

Informs cockroachdb#136
@msbutler msbutler self-assigned this Jun 8, 2022
@cockroach-teamcity
Member

This change is Reviewable

@msbutler
Contributor Author

msbutler commented Jun 24, 2022

Before posting some updated benchmarking results with additional low-level metrics (hopefully next week), I wanted to respond to some questions raised about the theoretical benefits of multi-level compaction and describe two more heuristics I'll experiment with.

Modeling WriteAmp with a MultiLevel Compaction Policy

In the first two paragraphs of @sumeerbhola's comment here, he describes a toy model for thinking about write amp when multilevel compaction is possible. What if we modeled per-level write amp as an expected value, allowing us to cleanly model it as a function of multilevel compaction probability? I.e.:

PerLevelWriteAmp = Probability_MultiLevel * Wamp_MultiLevel + (1 - Probability_MultiLevel) * Wamp_SingleLevel

where Wamp_SingleLevel = (bytesFlushed + bytesCompacted) / bytesWritten, and Probability_MultiLevel is the probability that the given level is an intermediate level in a multilevel compaction -- data will be read from the level, but no data will be written to it.

I think the first term can be set to 0, implying that PerLevelWriteAmp = (1 - Probability_MultiLevel) * Wamp_SingleLevel. Why? When a level is an intermediate level, the compaction picker avoids some contribution to write amp at that level. After all, if multilevel compaction didn't exist, that compaction event would have contributed some write amp at the given level.

If the formula above is valid, I think we can model the probability of a multilevel compaction as a function of per-level file size and per-level fan out / write amp. On the other hand, maybe my logic is totally off :D.
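To make the model concrete, here is a minimal Go sketch of the expected-value formula above. The function and variable names are mine, purely for illustration; none of this is Pebble code.

```go
package main

import "fmt"

// expectedWriteAmp returns the modeled per-level write amp, treating the
// multilevel case as contributing zero: an intermediate level is read from
// but never written to.
func expectedWriteAmp(pMultiLevel, wampSingleLevel float64) float64 {
	const wampMultiLevel = 0 // assumption argued for above
	return pMultiLevel*wampMultiLevel + (1-pMultiLevel)*wampSingleLevel
}

func main() {
	// E.g. a level whose single-level write amp is 10 and that serves as an
	// intermediate level in 25% of the compactions touching it.
	fmt.Println(expectedWriteAmp(0.25, 10)) // 7.5
}
```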

@petermattis also pointed me to his toy model where he works out write amp in a multilevel compaction setting. Like him, I'm quite puzzled by the math. Specifically, I don't understand where the per-level write amp definition comes from (below). From the bit of reading I've done, it seems that per-level fanout (s_(x+1)/s_x) is a decent heuristic for per-level write amp, but his model is doing something more complicated.

// Write amplification is minimized when it is the same at each
// level. The formulas below define the write amplification at each
// level based on the size of each level.
//
// s(x) = size of level x
// T(y) = sum of level sizes [1,y]
// T(y+1) = T(y) + s(y+1)
//
// w(1) = s(1) / 2               ; write-amp at level 1
// w(n) = T(n) / (T(n-1) + 1)    ; write-amp at level n
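To make the quoted definition easier to poke at, here is a direct Go transcription of those formulas. This is my own illustrative sketch, not code from the toy model.

```go
// writeAmps transcribes the comment above: w(1) = s(1)/2 and
// w(n) = T(n) / (T(n-1) + 1), where T(y) is the cumulative size of
// levels [1, y].
func writeAmps(sizes []float64) []float64 {
	w := make([]float64, len(sizes))
	var t float64 // running T(y)
	for i, s := range sizes {
		t += s // t is now T(i+1)
		if i == 0 {
			w[i] = s / 2
		} else {
			w[i] = t / (t - s + 1) // t-s is the previous total, i.e. T(n-1)
		}
	}
	return w
}
```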

Adjusting the Multi-Level Compaction Heuristic

Currently, the compaction picker only triggers a multilevel compaction if the original output level is about to fill up or if over 50% of it is involved in the compaction. I will consider adding the following heuristics:

1) Triggering a multilevel compaction only when there's at least some overlap in the multilevel output level (a la kMinOverlappingRatio).

As @sumeerbhola points out in the last paragraph of his comment here, we should only trigger a multilevel compaction if the output level is sufficiently involved.

Perhaps we should revert the multiLevelCompaction, within compaction_picker.go, if the chosen files in the multi-level output level (L3, in Sumeer's model) do not satisfy some minimum overlap threshold. I think this reversion would prevent a multilevel compaction that would cause a ton of write amp, even if the output level has a ton of data.

2) Triggering a multilevel compaction if very few/zero bytes are involved in the original output level

As @itsbilal points out:

we could schedule multi-level compactions [...] where we're merging into very few files/bytes in the original output level (but possibly still nonzero bytes) compared to the input level. It probably means we can make those output level files the middle level and compact into one level further down.

Given Sumeer's concerns, we should probably only do this if we also add the first additional heuristic I describe above; a rough sketch of how the two might compose is below.
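Every identifier and threshold in this sketch is invented for illustration; it is not the actual compaction_picker.go code.

```go
// shouldExtendCompaction sketches heuristics 1 and 2 combined. All names and
// thresholds are hypothetical placeholders.
func shouldExtendCompaction(inputBytes, outputBytes, outputPlus1OverlapBytes int64) bool {
	if inputBytes == 0 {
		return false
	}
	// Heuristic 2: the original output level holds few/zero bytes relative
	// to the input level.
	const maxOutputFraction = 0.25 // assumed; tune experimentally
	// Heuristic 1: the new (multilevel) output level is sufficiently
	// involved, a la kMinOverlappingRatio.
	const minOverlapFraction = 0.10 // assumed; tune experimentally

	outputFrac := float64(outputBytes) / float64(inputBytes)
	overlapFrac := float64(outputPlus1OverlapBytes) / float64(inputBytes)
	return outputFrac <= maxOutputFraction && overlapFrac >= minOverlapFraction
}
```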

Collaborator

@sumeerbhola sumeerbhola left a comment


I'm having a hard time understanding anything in https://github.com/petermattis/compacttoy/blob/master/multilevel.go -- I'm probably thinking about this with a different mental model. We should ask Peter what he recalls -- maybe he is puzzled by only some later bits of the math, so he can help get us to at least the same level of understanding.

I was roughly thinking of the following -- I want to understand how close this is to what you have in mind.
Do a multi-level compaction when all of the following conditions are satisfied:

  • The output level will fill up due to this compaction. (I am not convinced about a disjunction with 50% involved in the compaction, since that may result in excessive compactions when the output level is close to empty, e.g. when a level first starts getting populated, which can happen when the LSM is new-ish.)
  • Calculate w-amp-2-level = compaction-output-level-bytes / compaction-input-level-bytes and w-amp-3-level = compaction-output-level-plus-1-bytes / (compaction-output-level-bytes + compaction-input-level-bytes). If w-amp-3-level + delta < w-amp-2-level (I don't know what delta should be -- perhaps 0), then do a 3 level compaction. The idea here is that even though we are not choosing the best keyspan from the perspective of kMinOverlappingRatio, we are choosing something where the write amp for this 3 level compaction is not worse than the 2 level compaction's. (A sketch of this check follows the list.)
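A minimal Go sketch of the comparison in the second bullet; the names are placeholders (not the pebble compaction picker API), and delta is deliberately left as a parameter.

```go
// preferThreeLevel implements the check above: do a 3 level compaction when
// w-amp-3-level + delta < w-amp-2-level.
func preferThreeLevel(inputBytes, outputBytes, outputPlus1Bytes, delta float64) bool {
	wAmp2 := outputBytes / inputBytes
	wAmp3 := outputPlus1Bytes / (outputBytes + inputBytes)
	return wAmp3+delta < wAmp2
}
```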

I have of course no idea whether the above will work well -- I am making things up. But the rough principle I'd like to follow is:

  • Use conservative heuristics that we are somewhat convinced will do no damage.
  • Add metrics (I mentioned some in my comments on the previous PR) that help us deeply understand how these heuristics are behaving.
  • Run experiments on two kinds of LSMs: say, add 50GB (a) starting from empty, and (b) starting with 300GB. For (b) we can copy the LSM state after it reaches 300GB and use that as the starting state for each subsequent experiment. That will help us understand the behavior of the heuristics on both a "young" LSM and an "old" LSM.
  • Iterate over the previous steps.

None of this requires production quality code, so we can optimize for experimentation speed. I'm happy to promptly look at new heuristics and the metrics that help us understand the experiment results (I was ooo for a couple of weeks), in case you want an extra pair of eyes through the process (which will also help in the eventual code review of the heuristics we settle on, since the why-this-heuristic question will already be answered).

btw, in your earlier experiments, how were you ensuring that compactions were not falling behind? Was the compaction concurrency set very high, or was the write concurrency low? We want to run this with a high amount of provisioned bandwidth and high enough compaction concurrency that none of the experiments have a compaction backlog when the writes end, so that we are getting write-amp numbers that are actually comparable.

Reviewable status: 0 of 15 files reviewed, all discussions resolved (waiting on @msbutler)

@sumeerbhola
Collaborator

sumeerbhola commented Jul 6, 2022

I spent some time working out some examples on paper with the toy model from #1603 (comment), with 3 levels that are initially populated with (0, 0, 10) bytes respectively. The end state is to get to (0, 0, x) where x is close to 100. With 2 level compactions we go through intermediate states like (1, 0, 10) => (2, 0, 10) => ... (10, 0, 10) => (0, 10, 10) => (0, 0, 20) ... => (0, 0, 100), which eventually incurs a rewrite cost of 45*9 for the second level (filling a level one byte at a time rewrites 0 + 1 + ... + 9 = 45 bytes per 10-byte fill, over 9 fills) and 45*10 for the third level (its existing 10, 20, ..., 90 bytes are rewritten across the 9 spills). Let us summarize this rewrite cost as a tuple, (405, 450) in this case, for a total of 855.
The scheme that does 3 level compactions when the second level is going to become full instead incurs a rewrite cost of 36*9 for the second level (0 + 1 + ... + 8 = 36 per fill, since the tenth byte goes straight down) and the same 45*10 for the third level, so (324, 450) for a total of 774.
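A tiny Go simulation of one reading of this toy model (my own sketch, not anything from the PR) reproduces both tuples:

```go
package main

import "fmt"

// simulate runs the toy model: bytes arrive one at a time into an upper
// level of capacity 10, which spills wholesale into a bottom level that
// starts at 10 bytes. Each arriving byte rewrites the upper level's current
// contents, and each spill rewrites the bottom level's current contents.
// With threeLevel=true, the tenth byte triggers a 3 level compaction
// straight to the bottom, so the upper level only ever rewrites 9 buffered
// bytes. Returns the rewrite costs (upper, bottom).
func simulate(threeLevel bool) (rewriteUpper, rewriteBottom int) {
	fillTo := 10
	if threeLevel {
		fillTo = 9 // the tenth byte goes straight down
	}
	upper, bottom := 0, 10
	for bottom < 100 {
		for upper < fillTo {
			rewriteUpper += upper // each new byte rewrites the level's contents
			upper++
		}
		rewriteBottom += bottom // merging into the bottom rewrites it
		bottom += 10            // 10 bytes move down per cycle in either scheme
		upper = 0
	}
	return rewriteUpper, rewriteBottom
}

func main() {
	fmt.Println(simulate(false)) // 405 450, i.e. the (405, 450) tuple
	fmt.Println(simulate(true))  // 324 450, i.e. the (324, 450) tuple
}
```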

So far there is nothing new here -- we are following the same line of thinking. But the reason for this exercise is that I am bothered by the notion that we would do 3 level compactions only when the middle level is full. In practice we have multiple files per level, and I think we should be deciding on a 3 level compaction based on a local property of the part of the key space this compaction is touching, and not the whole level. Also, in steady state each level is always close to full with our current compaction setup -- again because we have multiple files in each level and we move a small fraction of a level down to the next level in a compaction. So the fact that level L is now "exactly full" is a good signal for compaction scoring, i.e. for deciding when to initiate a compaction with level L as the top level, but doesn't seem a good signal for whether a compaction starting at L-1 that will write to L should also involve L+1 and write to L+1 instead. This latter decision should be based on the relative costs of compaction(L-1,L) and compaction(L-1,L,L+1), not the fullness of L.

Based on this, I tried to work out examples with the same toy model using the following (from the previous comment) as an independent criterion for deciding on a 3 level compaction, regardless of the fullness of the middle level.

Calculate w-amp-2-level = compaction-output-level-bytes / compaction-input-level-bytes and w-amp-3-level = compaction-output-level-plus-1-bytes / (compaction-output-level-bytes + compaction-input-level-bytes). If w-amp-3-level + delta < w-amp-2-level (I don't know what delta should be -- perhaps 0), then do a 3 level compaction.

What this does is change the rewrite cost tuple (which was (324, 450) when doing 3-level compactions only when the middle level is full): the first element of the tuple decreases and the second element increases, because we can start doing more 3-level compactions. I can't construct a closed form solution (and anyway this is a toy model), but with delta=0 we get to (0, 0, 97) with a rewrite cost tuple of (220, 518). Compared with (324, 450), note the lower first number and the higher second number. The total is 738 vs 774, though some of this improvement is because we got to 97 and not to 100. With delta=-1, we get a rewrite cost tuple of (239, 473) (note that the first element has increased compared to delta=0, since we are running fewer 3-level compactions, and the second element has decreased), again getting to (0, 0, 97). The total is 712, which is better than delta=0. Note that delta values don't need to be integers, but since I was working this out by hand I only tried those. delta=-2 was not necessarily better.

The exact numbers in this toy model don't matter; what I am trying to get at is the intuition that by fiddling with this delta we trade off between very aggressive 3-level compactions (that may increase overall write amp) and more reasonable 3-level compactions (that can decrease write amp), without involving level fullness as a criterion. Experiments with real data would inform what a good delta value is in practice.

@petermattis
Collaborator

I'm having a hard time understanding anything in https://github.com/petermattis/compacttoy/blob/master/multilevel.go -- I'm probably thinking about this with a different mental model. We should ask Peter what he recalls -- maybe he is puzzled by only some later bits of the math, so he can help get us to at least the same level of understanding.

I don't understand the math I wrote in that code either. I think it's better to start with the comment:

// The multi-level compaction strategy. A flush attempts to write data into the
// lowest non-empty level. Failing to find such a level, compactions are rooted
// at the highest level and include the subsequent levels such that the
// compaction would not cause the level being compacted into to become "too
// large". The observation behind this strategy is that if a compaction from
// Ln->Ln+1 causes Ln+1 to require a compaction, then we would have been better
// off to perform a compaction from Ln->Ln+2 (i.e. a compaction which
// encompasses Ln+1). A similar observation applies to flushes: if a flush to
// Ln triggers a compaction from Ln->Ln+1, then that compaction should be
// merged with the flush in order to reduce write amplification.

I believe the complexity in the math is in determining when a level is "too large". Rather than using the level sizing heuristics, the multi-level compaction strategy aims to minimize total write amp, which occurs when all levels experience the same write amp. I very well might have butchered something in the math there. It could be useful to add some debugging statements to that simulation to see how it behaves as data is written.
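As a reading aid, here is a hypothetical Go sketch of the extend-downward rule the quoted comment describes. The tooLarge predicate stands in for exactly the part the math tries to pin down; all names are mine, not code from compacttoy or Pebble.

```go
// pickOutputLevel sketches the strategy in the quoted comment: starting a
// compaction rooted at `start`, keep extending it downward while the
// candidate output level would become "too large" as a result, since that
// level would then immediately require its own compaction.
func pickOutputLevel(levelBytes []int64, start int, tooLarge func(level int, bytes int64) bool) int {
	out := start + 1
	carried := levelBytes[start] // bytes being pushed down so far
	for out < len(levelBytes)-1 && tooLarge(out, carried+levelBytes[out]) {
		// Fold the too-small target into the compaction's inputs and aim
		// one level deeper (Ln->Ln+2 instead of Ln->Ln+1).
		carried += levelBytes[out]
		out++
	}
	return out
}
```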

@msbutler
Contributor Author

msbutler commented Aug 4, 2023

closing in favor of #2787

@msbutler msbutler closed this Aug 4, 2023