compaction: introduce multi level compaction mechanics #1603
msbutler merged 1 commit into cockroachdb:master
Conversation
compaction.go (outdated):
// If there are no inputs from the output level (eg, a move
// compaction), add an empty LevelInfo to info.Input.
// why not check c.kind?
Small piece of code I didn't quite understand that I modified. Why not check c.kind here?
bananabrick left a comment
Just a drive by comment:
I believe that enabling this feature for manual compactions will be tricky. Currently, the key ranges for manual compactions are picked assuming that an Li -> Li+1 compaction will be performed.
In general, manual compactions also have the property that we compact one level at a time.
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)
bananabrick left a comment
In general, manual compactions also have the property that we compact one level at a time.
This property isn't inherent to manual compactions, and just the way we do things, so this can certainly be changed if beneficial.
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)
sumeerbhola left a comment
It would be nice to have an experimental evaluation of the prototype code first, so that we can consider the performance benefits of this versus the extra complexity. Or maybe I missed some discussion that has already happened.
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)
Force-push: 1da9476 to 2d1b081
@bananabrick I believe my changes to the compaction picker allow multiple input levels for manual compaction -- my test cases (
@sumeerbhola I'll run some benchmarks this week and post results on this PR.
bananabrick left a comment
Manual compactions are only really used to compact an entire level into the level below it. I'll have to take a more careful look at the code in this pr, but I doubt multi input level compactions will benefit manual compactions.
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)
bananabrick left a comment
Clarifying a bit regarding manual compactions.
We currently start with L0, and compact every single file in L0 into Lbase. Then we take the new Lbase, and compact each file into Lbase+1, and so on, until only L6 remains. It's mostly used as a mitigation once LSMs become inverted, or have some other odd shape.
They can also be used to compact a smaller key range, but I don't think we actually use it that way, currently. Cockroach has a CompactRange function, which uses manual compactions to compact some key range, but I don't believe it's actually used.
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)
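The level-at-a-time schedule described in the comment above can be sketched as follows. `manualSchedule` and its string output are hypothetical stand-ins for illustration, not part of pebble's API:

```go
package main

import "fmt"

// manualSchedule sketches the manual compaction schedule described
// above: every file in L0 is compacted into Lbase, then each level is
// compacted into the level below it, until only L6 holds data.
func manualSchedule(lbase int) []string {
	steps := []string{fmt.Sprintf("L0 -> L%d", lbase)}
	for l := lbase; l < 6; l++ {
		steps = append(steps, fmt.Sprintf("L%d -> L%d", l, l+1))
	}
	return steps
}

func main() {
	// With Lbase = L4 this yields three sequential compactions.
	for _, s := range manualSchedule(4) {
		fmt.Println(s)
	}
}
```

Note that the intermediate steps (here L4 -> L5) write data below L6 only for it to be read and rewritten again, which is the write amplification concern raised in the next comment.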
Doing manual compactions a level at a time is wasteful. We're writing data in a level != L6, only to subsequently read it and delete it. We'd be better off compacting all of the levels at once in order to minimize write amplification. The only additional challenge is to get sufficient compaction concurrency. This might require compacting L0->Lbase, and then compacting Lbase...L6 with several concurrent compactions.
bananabrick left a comment
Doing manual compactions a level at a time is wasteful. We're writing data in a level != L6, only to subsequently read it and delete it. We'd be better off compacting all of the levels at once in order to minimize write amplification. The only additional challenge is to get sufficient compaction concurrency. This might require compacting L0->Lbase, and then compacting Lbase...L6 with several concurrent compactions.
I had a plan to try and create a level iter over each level + sublevel and write that directly into L6. Worth trying something like that.
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)
bananabrick left a comment
Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @msbutler)
compaction.go, line 429 at r1 (raw file):
Previously, msbutler (Michael Butler) wrote…
small piece of code i didn't quite understand that I modified. Why not check c.kind here?
I believe c.kind was introduced later.
itsbilal left a comment
Doing a drive-by review while this is still being benchmarked. Exciting stuff; looking forward to seeing how the experiments turn out!
Reviewed 6 of 8 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: 7 of 8 files reviewed, 6 unresolved discussions (waiting on @msbutler)
compaction.go, line 429 at r1 (raw file):
Previously, bananabrick (Arjun Nair) wrote…
I believe c.kind was introduced later.
Yep, and kind was always meant to be a log thing, not really something we gate actual compaction logic on, though I'm not sure if that is still true.
compaction.go, line 461 at r2 (raw file):
c.startLevel = &c.inputs[0]
c.outputLevel = &c.inputs[1]
This could be c.inputs[len(c.inputs)-1] and that works for the extralevels case too (and also for future cases where we have more than one extra level).
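The suggestion can be illustrated with a minimal sketch; `levelInfo` and the three-entry slice are stand-ins for pebble's actual types, used here only to show why indexing the last element generalizes:

```go
package main

import "fmt"

// levelInfo is a hypothetical stand-in for pebble's per-level input struct.
type levelInfo struct{ level int }

func main() {
	// With one extra input level, inputs holds three entries (e.g. a
	// L4+L5 -> L6 compaction). The output level is always the last
	// entry, so len(inputs)-1 works for the two-level case, the
	// extra-levels case, and any future case with more input levels.
	inputs := []levelInfo{{4}, {5}, {6}}
	outputLevel := &inputs[len(inputs)-1]
	fmt.Println(outputLevel.level)
}
```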
compaction.go, line 965 at r2 (raw file):
if len(c.extraLevels) > 0 {
	if len(c.extraLevels) > 1 {
		return nil, errors.New("n>2 multi input level compaction not implemented yet")
nit: this could just be a panic
compaction_picker.go, line 228 at r2 (raw file):
// compactions: during the second call to setupInputs, the picked compaction's
// smallest and largest keys should not decrease the key span.
func (pc *pickedCompaction) maybeExpandKeySpace(smallest InternalKey, largest InternalKey) {
nit: maybeExpandBounds is better I think
compaction_picker.go, line 405 at r2 (raw file):
}
if pc.outputLevel.level < numLevels-1 &&
	(uint64(levelMaxBytes[pc.outputLevel.level]) < predOutputLevelSize) {
This makes a lot of sense, but I wonder if we can add another heuristic to catch cases where this won't trigger right now. I'm thinking of cases where, say, L0 is huge but we're doing an L4 -> L5 compaction here (let's say L4 is Lbase) that ends up spanning most of L4/L5. levelMaxBytes is likely influenced by the huge size contribution of L0 and all the levels combined, but we are already compacting into a majority of L5 so we may as well make it a multi-level compaction into L6. A simpler additional heuristic here could just be "what proportion of the old output level is participating in this compaction? If it's more than 50% or 75%, do a multilevel compaction". Thoughts?
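The proposed additional trigger could be sketched roughly as below. `shouldMultiLevel` and its parameters are hypothetical names for illustration, not pebble's API, and the threshold value is just the 50% figure floated in the comment:

```go
package main

import "fmt"

// shouldMultiLevel sketches the proposed extra heuristic: if the files
// already picked from the output level account for more than
// fracThreshold of that level's total bytes, extend the compaction one
// level further down instead of rewriting most of the level in place.
func shouldMultiLevel(pickedOutputBytes, outputLevelBytes uint64, fracThreshold float64) bool {
	if outputLevelBytes == 0 {
		return false
	}
	return float64(pickedOutputBytes)/float64(outputLevelBytes) > fracThreshold
}

func main() {
	// A compaction pulling in 80 MiB of a 100 MiB output level clears
	// a 50% threshold; 10 MiB of 100 MiB does not.
	fmt.Println(shouldMultiLevel(80<<20, 100<<20, 0.5))
	fmt.Println(shouldMultiLevel(10<<20, 100<<20, 0.5))
}
```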
compaction_picker.go, line 435 at r2 (raw file):
defer func() {
	// Move the original startLevel to its original field in pc struct
	pc.startLevel = origStartLevel
This feels pretty hacky and we could benefit by just passing in startLevel as an argument to setupInputs.
@itsbilal Thanks for the review! I'll get to your comments later this week, but I wanted to post some promising benchmarking results. I ran the following benchmark command:
With multi input compaction, write amp reduced by 18%, and ops/sec increased by around 23%. For the multiInput binary, around 10% of the ~610 compactions were multi input (details on multi input compaction probabilities, by trial, here). When I increase the number of ycsb workers from 64 to 128, the performance results look about the same. For the multiInput binary, about 25% of the ~460 compactions were multi input.
Other Notes:
My next steps:
Force-push: 2d1b081 to 678848a
Force-push: 678848a to 88bd990
msbutler left a comment
Reviewable status: 1 of 13 files reviewed, 4 unresolved discussions (waiting on @bananabrick, @itsbilal, and @msbutler)
compaction.go line 429 at r1 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Yep, and kind was always meant to be a log thing, not really something we gate actual compaction logic on, though I'm not sure if that is still true.
Cool. I'll remove my comment and keep the code as is.
compaction.go line 461 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
This could be c.inputs[len(c.inputs)-1] and that works for the extraLevels case too (and also for future cases where we have more than one extra level).
Done.
compaction_picker.go line 228 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
nit: maybeExpandBounds is better I think
Done
compaction_picker.go line 405 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
This makes a lot of sense, but I wonder if we can add another heuristic to catch cases where this won't trigger right now. I'm thinking of cases where, say, L0 is huge but we're doing an L4 -> L5 compaction here (let's say L4 is Lbase) that ends up spanning most of L4/L5. levelMaxBytes is likely influenced by the huge size contribution of L0 and all the levels combined, but we are already compacting into a majority of L5 so we may as well make it a multi-level compaction into L6. A simpler additional heuristic here could just be "what proportion of the old output level is participating in this compaction? If it's more than 50% or 75%, do a multilevel compaction". Thoughts?
I like the idea! In other words: when the original compaction becomes really wide for the output level (e.g. 50% of current output data), do a multi input level compaction to avoid rewriting all of that output level data to the same level.
compaction_picker.go line 435 at r2 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
This feels pretty hacky and we could benefit by just passing in startLevel as an argument to setupInputs.
Good Idea. Done.
Force-push: 88bd990 to 4806fa0
msbutler left a comment
Reviewable status: 1 of 16 files reviewed, 4 unresolved discussions (waiting on @bananabrick, @itsbilal, and @msbutler)
compaction_picker.go line 405 at r2 (raw file):
Previously, msbutler (Michael Butler) wrote…
I like the idea! In other words: when the original compaction becomes really wide for the output level (e.g. 50% of current output data), do a multi input level compaction to avoid rewriting all of that output level data to the same level.
When multiInput compactions are also initialized when over 50% of the original output level is involved, write amp is reduced by about the same percentage as without the additional trigger. Workload A did not record a multiInput compaction.
itsbilal left a comment
Those benchmark numbers look solid! I imagine write-heavy workloads are where we'll see the biggest improvement here. Unfortunate that ycsb/A nets no benefit, but maybe that doesn't quite get us many instances of small/out-of-shape output levels in compactions. Did you try changing value size in this workload? Maybe smaller value sizes (eg. 64) might be interesting.
Also code is looking like it's in great shape. @sumeerbhola what do you think about the threshold for merging this? It's behind an experimental flag anyway that defaults to false, and there are some clear wins here. Michael can probably keep benchmarking and/or prototyping heuristics to trigger multi-input compactions. Just wondering if we'd be cool with merging it soon or if we want to build a better understanding of when to trigger multi-input compactions first.
Reviewed 10 of 10 files at r3, 11 of 11 files at r4, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @bananabrick and @msbutler)
-- commits line 19 at r4:
You're going to squash this before the merge, right?
compaction_picker.go line 405 at r2 (raw file):
Previously, msbutler (Michael Butler) wrote…
when multiInput compactions are also initialized when over 50% of the original output level is involved, write amp is reduced by about the same percentage as w/o the additional trigger. Workload A did not record a multiInput compaction.
Looks good!
compaction_picker.go line 172 at r3 (raw file):
lcf *manifest.L0CompactionFiles
// L0 sublevel info is used for compactions out of L0. It is nil for all
nit: paragraph should start with l0SublevelInfo
options.go line 489 at r4 (raw file):
// MultiLevelCompaction allows the compaction of SSTs from more than two
// levels iff a conventional two level compaction will quickly trigger a
// compaction in the output level
Nit: period at the end, also you can mention the other heuristic that you just added.
|
Per convo with @itsbilal, I plan to run the following tests:
Regarding multi-level compactions and ideas like rent-or-buy, I was looking at it again in light of Tobi’s long running KV0 workload that is seeing write amp of 30 (https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1654028468337059?thread_ts=1653986683.746719&cid=CAC6K3SLU). Based on some noodling on paper, I think it is likely that a uniformly randomly distributed KV0 workload with almost no overwriting presents a hard case for new compaction heuristics — so would recommend trying ycsb F with a uniform distribution.
I think we also need to understand how this is behaving in a more granular manner. If this is improving performance for some workloads, we need to better understand why. And if it is worse, by understanding why, we could potentially improve it. Here are some simple analysis examples I was trying to create on paper.

I first started with a trivial unrealistic model where there is a single file per level, with 3 levels (say L1, L2, L3) with target size 1, 10 and 100. Consider 2 level compactions and empty levels to start with. We compact down to L2, 1 byte at a time, so after each compaction the size of L2 follows the sequence: 1, 2, ..., 10. And then L2 is compacted down. The rewrites incurred in L2 are 1 + 2 + ... + 9 = 45, for a total data volume of 10, so a write amp of 4.5. The same argument can be repeated for L2=>L3, which will add another 4.5 write amp.

But we don't actually have a single file per level. For this second analysis, ignore the target size for each level, and consider files of size 1, 2, 4 in L1, L2, L3 (since we double file sizes). 1 file in L1 overlaps with 5 files in L2 (since 1 byte in L1 overlaps with 10 bytes in L2), and 25 files in L3. Lower levels with larger numbers of files will tend to be near full, since when they fill up, we move only 1 file down: so unlike our unrealistic model where 1 in every 10 compactions was filling up the level, the frequency with which incoming compactions fill up a level (causing its compaction score to be 1.0) may be much higher. This by itself suggests even more potential for improvement. But maybe this is slightly mistaken: say we compact these 1 L1 + 5 L2 + 25 L3 files in 1 compaction down to L3, and say this represents a key span [f,g). Now we have removed 5*2=10 from L2. It will require 10 more files to compact down from L1 before L2 is full again, so the frequency is still 1 in 10 (in practice, our level ratios are slightly less than 10, so it may be better than this).

However, we now have a danger in how the keys are distributed: after this compaction, the key span [f,g) is now empty in L2. If the next 9 compactions from L1=>L2 don't add anything to [f,g) and the 10th compaction from L1 to L2 (which triggers the target size of L2) is adding [f,g), we have a problem. Since L2 is (near) empty in [f,g), instead of just moving (doing a cheap compaction) of the file down to L2, and then picking a different key span from L2 to compact to L3 (based on the wise

So to understand these results I think we need to gather lower-level metrics like:
There are probably better ideas among those who have been thinking more about this, and I do think we have something promising in this work, but I figured it may help to elaborate on the kinds of things I think we need to understand better (in order to productionize this with a high degree of confidence).
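The arithmetic in the single-file-per-level model above can be checked with a short program; it only reproduces the toy model's numbers (target size 10, one byte per incoming compaction), nothing about pebble itself:

```go
package main

import "fmt"

func main() {
	// Toy model: a level of target size 10 receives 1 byte per
	// compaction. Each compaction rewrites the level's current
	// contents, so the extra rewrites are 1 + 2 + ... + 9.
	const target = 10
	rewrites := 0
	for size := 1; size < target; size++ {
		rewrites += size
	}
	fmt.Println(rewrites)                            // 45
	fmt.Println(float64(rewrites) / float64(target)) // 4.5
}
```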
It might be useful to run benchmarks which match our nightly benchmark configurations: https://cockroachdb.github.io/pebble/. The nightly benchmarks run for 1.5 hours on more powerful machines. The 3k ops/sec I see in your benchmarks seems pretty low, compared to what we see in our nightly benchmarks. LSM performance is in general pretty hard to benchmark, because performance can drop as the LSM has more data in it.
I didn't read the code, but from the description this PR is implementing something like the multi-level compaction heuristic described in https://github.com/petermattis/compacttoy/blob/master/multilevel.go. Is that correct? If so, it might be worthwhile copying some of the math from those comments, or re-deriving it. (I'm frankly a bit puzzled by it while giving it a quick skim just now.) PS: I'd be good with separating out the multi-level compaction mechanics into a separate PR which should be relatively uncontroversial to merge. The heuristics around multi-level compactions are going to be much harder to get right and will require experimentation, benchmarking, and iteration.
@msbutler - given we're getting closer to landing this (modulo the suggested changes around how to split out the heuristics / mechanics), mind removing the WIP title and marking this as "Ready for review"? Thanks!
itsbilal left a comment
once the function with the compaction picking heuristic in initMultiLevelCompaction has been separated out into its own PR.
Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @bananabrick and @msbutler)
compaction.go line 659 at r4 (raw file):
if len(c.extraLevels) == 0 {
	return false
} else if c.extraLevels[0].files.Empty() {
Was this ever happening in practice? Maybe we could throw in a panic here to confirm it's not. If we're adding an extra level it should not be empty. That way we can also get rid of this function and replace it with just the length check.
compaction_picker.go line 264 at r4 (raw file):
// maybeExpandedBounds is a helper function for setupInputs which ensures the
// pickedcompaction's smallest and largest internal keys are updated iff
Nit: pickedCompaction
compaction_picker.go line 442 at r4 (raw file):
// immediately (i.e. the predicted output level size exceeds the level's max bytes)
// 2. Over 50% of the original output level is included in the compaction.
func (pc *pickedCompaction) initMultiInputLevelCompaction(
nit: initMultiLevelCompaction matches the option name and is also easier to search.
compaction_picker.go line 451 at r4 (raw file):
// which would prevent files with del ranges from hitting this.
// however this doesn't apply much to use cases (inverted LSM, manual compaction).
predOutputFileSize := uint64(math.Max(float64(pc.startLevel.files.SizeSum()),
Since the heuristic is mostly in this method, it'd be best to just have this return false for now, and create a new PR with just this code. We can continue the benchmarking/iteration there.
metrics.go line 390 at r4 (raw file):
redact.Safe(m.Compact.NumInProgress), redact.SafeString(""))
w.Printf(" ctype %9d %7d %7d %7d %7d %7d %7d (default, delete, elision, move, read, rewrite, multiInput)\n",
nit: multi-input is more consistent with the convention around here
@itsbilal Thanks for the review! To separate out the compaction picking heuristic as you suggest, would that imply that in this PR, we would never pick a multi-level compaction, even though this is all behind an experimental flag? I'd be in favor of keeping the current compaction heuristic in this PR, as we need to init some multi level compactions to pass the tests. Alternatively, I could put all the tests in the future PR as well.
Force-push: 4806fa0 to e2f1fdd
msbutler left a comment
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @itsbilal)
Previously, itsbilal (Bilal Akhtar) wrote…
You're going to squash this before the merge, right?
Done.
compaction.go line 659 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Was this ever happening in practice? Maybe we could throw in a panic here to confirm it's not. If we're adding an extra level it should not be empty. That way we can also get rid of this function and replace it with just the length check.
This is a valid case! See the comment I added:
// a multi level compaction without data in the intermediate input level;
// e.g. for a multi level compaction with levels 4, 5, and 6, this could
// occur if there are no files to compact in 5, or in 5 and 6 (i.e. a move).
compaction_picker.go line 172 at r3 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
nit: paragraph should start with l0SublevelInfo
Done.
compaction_picker.go line 264 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Nit: pickedCompaction
Done.
compaction_picker.go line 442 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
nit: initMultiLevelCompaction matches the option name and is also easier to search.
Done.
compaction_picker.go line 451 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Since the heuristic is mostly in this method, it'd be best to just have this return false for now, and create a new PR with just this code. We can continue the benchmarking/iteration there.
Done
options.go line 489 at r4 (raw file):
Previously, itsbilal (Bilal Akhtar) wrote…
Nit: period at the end, also you can mention the other heuristic that you just added.
Done.
itsbilal left a comment
Yep, based on offline discussion we agreed to move tests / heuristic code to #1746.
Reviewed 16 of 16 files at r5, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @msbutler)
compaction.go line 659 at r4 (raw file):
Previously, msbutler (Michael Butler) wrote…
This is a valid case! see the comment I added:
// a multi level compaction without data in the intermediate input level;
// e.g. for a multi level compaction with levels 4, 5, and 6, this could
// occur if there are no files to compact in 5, or in 5 and 6 (i.e. a move).
Interesting, so this seems like a "skip this middle level when compacting" compaction to me. Pretty good idea.
This actually reminds me of another instance where we could schedule multi-level compactions and see a benefit; compactions where we're merging into very few files/bytes in the original output level (but possibly still nonzero bytes) compared to the input level. It probably means we can make those output level files the middle level and compact into one level further down. Worth benchmarking in #1746.
metrics.go line 390 at r5 (raw file):
redact.Safe(m.Compact.NumInProgress), redact.SafeString(""))
w.Printf(" ctype %9d %7d %7d %7d %7d %7d %7d (default, delete, elision, move, read, rewrite, multiLevel)\n",
nit: multi-level
testdata/event_listener line 193 at r5 (raw file):
flush 3
compact 1 2.3 K 0 B 0 (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
ctype 1 0 0 0 0 0 0 (default, delete, elision, move, read, rewrite, multiLevel)
nit: multi-level
Previously, levelled compaction only compacted SSTs from at most 2 levels. This change introduces the mechanics to conduct a 3 level compaction as an experimental feature. No multi level compactions will actually occur, as the new initMultiLevelCompaction function always returns false. Future work includes:
- consider heuristics to init multi-level compactions
- add mechanics to init multi-level compactions with more than 3 levels

Informs cockroachdb#136
Force-push: e2f1fdd to 8b7a68b