
compaction: introduce multi level compaction mechanics #1603

Merged
msbutler merged 1 commit into cockroachdb:master from msbutler:butler-nlevel-compaction
Jun 9, 2022

Conversation

@msbutler
Contributor

@msbutler msbutler commented Mar 25, 2022

Previously, levelled compaction only compacted SSTs from at most 2 levels. This
change introduces the mechanics to conduct a 3 level compaction as an
experimental feature. No multi level compactions will actually occur, as the
new initMultiLevelCompaction function always returns false. Future work
includes:

  • consider heuristics to init multi level compactions
  • add mechanics to init multi level compactions with more than 3 levels

Informs #136

@msbutler msbutler requested a review from itsbilal March 25, 2022 23:14
@msbutler msbutler self-assigned this Mar 25, 2022
@cockroach-teamcity
Member

This change is Reviewable

compaction.go Outdated

// If there are no inputs from the output level (eg, a move
// compaction), add an empty LevelInfo to info.Input.
// why not check c.kind?
Contributor Author


Small piece of code I didn't quite understand that I modified. Why not check c.kind here?

Contributor

@bananabrick bananabrick left a comment


Just a drive by comment:

I believe that enabling this feature for manual compactions will be tricky. Currently, the key ranges for manual compactions are picked assuming that an Li -> Li+1 compaction will be performed.

In general, manual compactions also have the property that we compact one level at a time.

Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)

@bananabrick bananabrick requested a review from a team March 25, 2022 23:50
Contributor

@bananabrick bananabrick left a comment


In general, manual compactions also have the property that we compact one level at a time.

This property isn't inherent to manual compactions; it's just the way we do things, so this can certainly be changed if beneficial.

Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)

Collaborator

@sumeerbhola sumeerbhola left a comment


It would be nice to have an experimental evaluation of the prototype code first, so that we can consider the performance benefits of this versus the extra complexity. Or maybe I missed some discussion that has already happened.

Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)

@msbutler msbutler force-pushed the butler-nlevel-compaction branch from 1da9476 to 2d1b081 Compare March 27, 2022 22:20
@msbutler
Contributor Author

msbutler commented Mar 27, 2022

@bananabrick I believe my changes to the compaction picker allow multiple input levels for manual compaction -- my test cases (testdata/manual_compaction_multi_input) for TestManualCompaction seem to suggest this. But maybe I'm missing something?

@sumeerbhola I'll run some benchmarks this week and post results on this pr.

Contributor

@bananabrick bananabrick left a comment


Manual compactions are only really used to compact an entire level into the level below it. I'll have to take a more careful look at the code in this pr, but I doubt multi input level compactions will benefit manual compactions.

Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)

Contributor

@bananabrick bananabrick left a comment


Clarifying a bit regarding manual compactions.

We currently start with L0, and compact every single file in L0 into Lbase. Then we take the new Lbase, and compact each file into Lbase+1, and so on, until only L6 remains. It's mostly used as mitigation once LSMs become inverted, or have some other odd shape.

They can also be used to compact a smaller key range, but I don't think we actually use it that way, currently. Cockroach has a CompactRange function, which uses manual compactions to compact some key range, but I don't believe it's actually used.

Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)

@petermattis
Collaborator

We currently start with L0, and compact every single file in L0 into Lbase. Then we take the new Lbase, and compact each file into Lbase+1, and so on, until only L6 remains. It's mostly used as mitigation once LSMs become inverted, or have some other odd shape.

Doing manual compactions a level at a time is wasteful. We're writing data to a level != L6, only to subsequently read it and delete it. We'd be better off compacting all of the levels at once in order to minimize write amplification. The only additional challenge is to get sufficient compaction concurrency. This might require compacting L0->Lbase first, and then compacting Lbase...L6 with several concurrent compactions.

Contributor

@bananabrick bananabrick left a comment


Doing manual compactions a level at a time is wasteful. We're writing data in a level != L6, only to subsequently read it and delete it. We'd be better off compacting all of the levels at once in order to minimize write amplification. The only additional challenge is to get sufficient compaction concurrency. This might require compaction L0->Lbase, and then compacting Lbase...L6 with several concurrent compactions.

I had a plan to try and create a level iter over each level + sublevel and write that directly into L6. Worth trying something like that.

Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal)

@msbutler msbutler marked this pull request as draft April 1, 2022 20:05
@msbutler msbutler changed the title "compaction: introduce multi input level compaction" to "WIP compaction: introduce multi input level compaction" Apr 1, 2022
Contributor

@bananabrick bananabrick left a comment


Reviewable status: 0 of 8 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @msbutler)


compaction.go, line 429 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

small piece of code i didn't quite understand that I modified. Why not check c.kind here?

I believe c.kind was introduced later.

Contributor

@itsbilal itsbilal left a comment


Doing a drive-by review while this is still being benchmarked. Exciting stuff; looking forward to seeing how the experiments turn out!

Reviewed 6 of 8 files at r1, 1 of 2 files at r2, all commit messages.
Reviewable status: 7 of 8 files reviewed, 6 unresolved discussions (waiting on @msbutler)


compaction.go, line 429 at r1 (raw file):

Previously, bananabrick (Arjun Nair) wrote…

I believe c.kind was introduced later.

Yep, and kind was always meant to be a log thing, not really something we gate actual compaction logic on, though I'm not sure if that is still true.


compaction.go, line 461 at r2 (raw file):

	c.startLevel = &c.inputs[0]
	c.outputLevel = &c.inputs[1]

This could be c.inputs[len(c.inputs)-1] and that works for the extralevels case too (and also for future cases where we have more than one extra level).
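
A minimal sketch of this suggestion (the types and field names here are illustrative stand-ins for Pebble's actual compaction struct, not the real code):

```go
package main

import "fmt"

// levelInfo and compaction are hypothetical stand-ins for Pebble's
// compaction input bookkeeping, used only to illustrate the indexing.
type levelInfo struct{ level int }

type compaction struct {
	inputs      []levelInfo
	startLevel  *levelInfo
	outputLevel *levelInfo
}

// setLevels picks the first input as the start level and the last input
// as the output level. Indexing with len(c.inputs)-1 works for the plain
// two-level case, the extra-levels case, and any future case with more
// than one intermediate level.
func (c *compaction) setLevels() {
	c.startLevel = &c.inputs[0]
	c.outputLevel = &c.inputs[len(c.inputs)-1]
}

func main() {
	c := &compaction{inputs: []levelInfo{{4}, {5}, {6}}}
	c.setLevels()
	fmt.Println(c.startLevel.level, c.outputLevel.level) // 4 6
}
```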


compaction.go, line 965 at r2 (raw file):

	if len(c.extraLevels) > 0 {
		if len(c.extraLevels) > 1 {
			return nil, errors.New("n>2 multi input level compaction not implemented yet")

nit: this could just be a panic


compaction_picker.go, line 228 at r2 (raw file):

// compactions: during the second call to setupInputs, the picked compaction's
// smallest and largest keys should not decrease the key span.
func (pc *pickedCompaction) maybeExpandKeySpace(smallest InternalKey, largest InternalKey) {

nit: maybeExpandBounds is better I think


compaction_picker.go, line 405 at r2 (raw file):

	}
	if pc.outputLevel.level < numLevels-1 && (uint64(levelMaxBytes[pc.outputLevel.
		level]) < predOutputLevelSize) {

This makes a lot of sense, but I wonder if we can add another heuristic to catch cases where this won't trigger right now. I'm thinking of cases where, say, L0 is huge but we're doing an L4 -> L5 compaction here (let's say L4 is Lbase) that ends up spanning most of L4/L5. levelMaxBytes is likely influenced by the huge size contribution of L0 and all the levels combined, but we are already compacting into a majority of L5 so we may as well make it a multi-level compaction into L6. A simpler additional heuristic here could just be "what proportion of the old output level is participating in this compaction? If it's more than 50% or 75%, do a multilevel compaction". Thoughts?
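
A sketch of that additional trigger, under stated assumptions (the function and parameter names are hypothetical, not Pebble's actual identifiers):

```go
package main

import "fmt"

// shouldExpandToNextLevel is a hypothetical form of the heuristic
// described above: if the compaction already covers more than half of
// the current output level's bytes, extend it one level further down.
// The 0.5 threshold is the "50% or 75%" knob from the discussion.
func shouldExpandToNextLevel(bytesInCompaction, totalOutputLevelBytes uint64) bool {
	if totalOutputLevelBytes == 0 {
		return false
	}
	return float64(bytesInCompaction)/float64(totalOutputLevelBytes) > 0.5
}

func main() {
	fmt.Println(shouldExpandToNextLevel(75, 100)) // true: 75% of the output level participates
	fmt.Println(shouldExpandToNextLevel(20, 100)) // false: only 20% participates
}
```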


compaction_picker.go, line 435 at r2 (raw file):

	defer func() {
		// Move the original startLevel to its original field in pc struct
		pc.startLevel = origStartLevel

This feels pretty hacky and we could benefit by just passing in startLevel as an argument to setupInputs.

@msbutler
Contributor Author

msbutler commented Apr 25, 2022

@itsbilal Thanks for the review! I'll get to your comments later this week, but I wanted to post some promising benchmarking results.

I ran the following benchmark command, `bench ycsb --workload F --initial-keys 100000 -c 64 -d 5m` (n=10), on a binary built on this PR's branch and compared performance to a control binary built on the SHA my PR is currently rebased on. I conducted all trials on a gce worker created by cockroach/scripts/gceworker.sh.

With multi input compaction, write amp dropped by 18%, and ops/sec increased by around 23%. For the multiInput binary, around 10% of the ~610 compactions were multi input (details on multi input compaction probabilities, by trial, here).

name                old ops/sec  new ops/sec  delta
ycsb/F/values=1000   3.34k ± 2%   4.13k ± 6%  +23.70%  (p=0.000 n=8+9)

name                old read     new read     delta
ycsb/F/values=1000   14.0G ± 0%   13.7G ± 1%   -2.30%  (p=0.000 n=8+8)

name                old write    new write    delta
ycsb/F/values=1000   15.1G ± 0%   15.0G ± 0%   -0.65%  (p=0.000 n=9+9)

name                old r-amp    new r-amp    delta
ycsb/F/values=1000    0.00         0.00          ~     (all equal)

name                old w-amp    new w-amp    delta
ycsb/F/values=1000    13.0 ± 2%    10.6 ± 5%  -18.22%  (p=0.000 n=8+9)

When I increase the number of ycsb workers from 64 to 128, the performance results look about the same. For the multiInput binary, about 25% of the ~460 compactions were multi input.

name                old ops/sec  new ops/sec  delta
ycsb/F/values=1000   3.47k ± 2%   4.22k ± 5%  +21.71%  (p=0.000 n=8+9)

name                old read     new read     delta
ycsb/F/values=1000   14.0G ± 0%   13.7G ± 1%   -2.04%  (p=0.000 n=8+8)

name                old write    new write    delta
ycsb/F/values=1000   15.2G ± 0%   15.1G ± 0%   -0.42%  (p=0.007 n=8+8)

name                old r-amp    new r-amp    delta
ycsb/F/values=1000    0.00         0.00          ~     (all equal)

name                old w-amp    new w-amp    delta
ycsb/F/values=1000    12.6 ± 2%    10.5 ± 5%  -16.77%  (p=0.000 n=8+9)

Other Notes:

  • Running YCSB workload A using 64 or 128 workers triggered zero multiInput compactions, holding all else constant.
  • I created a script to automate the benchmarking process. Happy to clean it up and add it to some CRL repo for others to use!
  • The binary that the multiInput benchmarks were run on does not include Bilal's suggestion to trigger multiInput compactions when more than 50% of the original output level is involved in the compaction. When I include this trigger, the results don't change significantly.

My next steps:

  • respond to any questions on my benchmarking results
  • address Bilal's feedback and any other PR feedback
  • depending on feedback, I can either work on the general N multiInput level compaction case or on prototyping the balanced rent-or-buy compaction policy

@msbutler msbutler force-pushed the butler-nlevel-compaction branch from 2d1b081 to 678848a Compare April 25, 2022 19:00
@msbutler msbutler force-pushed the butler-nlevel-compaction branch from 678848a to 88bd990 Compare May 3, 2022 20:12
Contributor Author

@msbutler msbutler left a comment


Reviewable status: 1 of 13 files reviewed, 4 unresolved discussions (waiting on @bananabrick, @itsbilal, and @msbutler)


compaction.go line 429 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Yep, and kind was always meant to be a log thing, not really something we gate actual compaction logic on, though I'm not sure if that is still true.

Cool. I'll remove my comment and keep the code as is.


compaction.go line 461 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This could be c.inputs[len(c.inputs)-1] and that works for the extralevels case too (and also for future cases where we have more than one extra level).

Done.


compaction_picker.go line 228 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

nit: maybeExpandBounds is better I think

Done


compaction_picker.go line 405 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This makes a lot of sense, but I wonder if we can add another heuristic to catch cases where this won't trigger right now. I'm thinking of cases where, say, L0 is huge but we're doing an L4 -> L5 compaction here (let's say L4 is Lbase) that ends up spanning most of L4/L5. levelMaxBytes is likely influenced by the huge size contribution of L0 and all the levels combined, but we are already compacting into a majority of L5 so we may as well make it a multi-level compaction into L6. A simpler additional heuristic here could just be "what proportion of the old output level is participating in this compaction? If it's more than 50% or 75%, do a multilevel compaction". Thoughts?

I like the idea! In other words: when the original compaction becomes really wide for the output level (e.g. 50% of current output data), do a multi input level compaction to avoid rewriting all of that output level data to the same level.


compaction_picker.go line 435 at r2 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

This feels pretty hacky and we could benefit by just passing in startLevel as an argument to setupInputs.

Good Idea. Done.

@msbutler msbutler force-pushed the butler-nlevel-compaction branch from 88bd990 to 4806fa0 Compare May 4, 2022 14:58
Contributor Author

@msbutler msbutler left a comment


Reviewable status: 1 of 16 files reviewed, 4 unresolved discussions (waiting on @bananabrick, @itsbilal, and @msbutler)


compaction_picker.go line 405 at r2 (raw file):

Previously, msbutler (Michael Butler) wrote…

I like the idea! In other words: when the original compaction becomes really wide for the output level (e.g. 50% of current output data), do a multi input level compaction to avoid rewriting all of that ouptut level data to same level.

When multiInput compactions are also initialized when over 50% of the original output level is involved, write amp is reduced by about the same percentage as without the additional trigger. Workload A did not record a multiInput compaction.

Contributor

@itsbilal itsbilal left a comment


Those benchmark numbers look solid! I imagine write-heavy workloads are where we'll see the biggest improvement here. Unfortunate that ycsb/A nets no benefit, but maybe that doesn't quite get us many instances of small/out-of-shape output levels in compactions. Did you try changing value size in this workload? Maybe smaller value sizes (eg. 64) might be interesting.

Also code is looking like it's in great shape. @sumeerbhola what do you think about the threshold for merging this? It's behind an experimental flag anyway that defaults to false, and there are some clear wins here. Michael can probably keep benchmarking and/or prototyping heuristics to trigger multi-input compactions. Just wondering if we'd be cool with merging it soon or if we want to build a better understanding of when to trigger multi-input compactions first.

Reviewed 10 of 10 files at r3, 11 of 11 files at r4, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @bananabrick and @msbutler)


-- commits line 19 at r4:
You're going to squash this before the merge, right?


compaction_picker.go line 405 at r2 (raw file):

Previously, msbutler (Michael Butler) wrote…

when multiInput compactions are also initialized when over 50% of the original output level is involved, write amp is reduced by about the same percentage as w/o the additional trigger. Workload A did not record a multiInput compaction.

Looks good!


compaction_picker.go line 172 at r3 (raw file):

	lcf *manifest.L0CompactionFiles

	// L0 sublevel info is used for compactions out of L0. It is nil for all

nit: paragraph should start with l0SublevelInfo


options.go line 489 at r4 (raw file):

		// MultiLevelCompaction allows the compaction of SSTs from more than two
		// levels iff a conventional two level compaction will quickly trigger a
		// compaction in the output level

Nit: period at the end, also you can mention the other heuristic that you just added.

@msbutler
Contributor Author

msbutler commented Jun 1, 2022

Per convo with @itsbilal, I plan to run the following tests:

  • Run all the benchmarks for longer than 5 mins. Try 15 minutes for each trial
  • Tweak params to trigger multi input compactions in workload ycsb A, which only conducts updates to existing keys:
    • bump the initial key count to 1 or 10 million, so updates will cover a wider key space, leading to more levels in the LSM.
    • bump the value size, leading to more flushed SSTs, increasing probability of an inverted LSM
  • Investigate why write amp and ops/sec remained constant on workload F when the number of concurrent workers doubled.
    • Perhaps we're CPU constrained. Try benchmarking on a roachprod node (more CPUs) and constrain IOPS (use EBS instead of local SSD).

@sumeerbhola
Collaborator

Regarding multi-level compactions and ideas like rent-or-buy, I was looking at it again in light of Tobi’s long running KV0 workload that is seeing write amp of 30 (https://cockroachlabs.slack.com/archives/CAC6K3SLU/p1654028468337059?thread_ts=1653986683.746719&cid=CAC6K3SLU).

Based on some noodling on paper, I think it is likely that a uniformly randomly distributed KV0 workload with almost no overwriting presents a hard case for new compaction heuristics, so I would recommend trying ycsb F with a uniform distribution.
Also, I don't think we can conclude much without many levels in the LSM. How many were there at the end of the benchmarks here? I think we need an LSM that ends up with ~400GB, so at least 6 levels are populated.

@sumeerbhola
Collaborator

I think we also need to understand how this is behaving in a more granular manner. If this is improving performance for some workloads, we need to better understand why. And if it is worse, by understanding why, we could potentially improve it.

Here are some simple analysis examples I was trying to create on paper.

I first started with a trivial unrealistic model where there is a single file per level, with 3 levels (say L1, L2, L3) with target sizes 1, 10 and 100. Consider 2 level compactions and empty levels to start with. We compact down to L2, 1 byte at a time, so after each compaction the size of L2 follows the sequence: 1, 2, ..., 10. And then L2 is compacted down. The rewrites incurred in L2 are 1 + 2 + ... + 9 = 45, for a total data volume of 10, so a write amp of 4.5. The same argument can be repeated for L2=>L3, which will add another 4.5 write amp.
Now consider 3-level compactions such that the last compaction into L2 that caused it to grow to 10 also compacts down to L3. The rewrites incurred in L2 are 1 + 2 + ... + 8 = 36, so the write amp is only 3.6. So we've shaved off 20%, which is nice.
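
The toy model above can be checked with a short calculation (purely illustrative, using the single-file-per-level assumption):

```go
package main

import "fmt"

func main() {
	// Two-level compactions: L2 grows 1, 2, ..., 10 bytes; each compaction
	// into a non-empty L2 rewrites its existing contents (sizes 1 through 9).
	twoLevelRewrites := 0
	for size := 1; size <= 9; size++ {
		twoLevelRewrites += size
	}
	// Total data volume moved through the level is 10 bytes.
	fmt.Println(twoLevelRewrites, float64(twoLevelRewrites)/10) // 45 4.5

	// Three-level variant: the compaction that would grow L2 to 10 instead
	// continues down to L3, so the final rewrite of the 9 bytes already in
	// L2 is avoided.
	threeLevelRewrites := 0
	for size := 1; size <= 8; size++ {
		threeLevelRewrites += size
	}
	fmt.Println(threeLevelRewrites, float64(threeLevelRewrites)/10) // 36 3.6
}
```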

But we don't actually have a single file per level. For this second analysis, ignore the target size for each level, and consider files of size 1, 2, 4 in L1, L2, L3 (since we double file sizes). 1 file in L1 overlaps with 5 files in L2 (since 1 byte in L1 overlaps with 10 bytes in L2), and 25 files in L3. Lower levels with larger number of files will tend to be near full, since when they fill up, we move only 1 file down: so unlike our unrealistic model where 1 in every 10 compactions was filling up the level, the frequency with which incoming compactions fill up a level (causing its compaction score to be 1.0) may be much higher. This by itself suggests even more potential for improvement. But maybe this is slightly mistaken: say we compact these 1 L1 + 5 L2 + 25 L3 files in 1 compaction down to L3, and say this represents a key span [f,g). Now we have removed 5*2=10 from L2. It will require 10 more files to compact down from L1 before L2 is full again, so the frequency is still 1 in 10 (in practice, our level ratios are slightly less than 10, so it may be better than this).

However, we now have a danger in how the keys are distributed: after this compaction, the key span [f, g) is now empty in L2. If the next 9 compactions from L1=>L2 don't add anything to [f,g) and the 10th compaction from L1 to L2 (which triggers the target size of L2) is adding [f,g), we have a problem. Since L2 is (near) empty in [f, g), instead of just moving the file down to L2 (a cheap move compaction), and then picking a different key span from L2 to compact to L3 (based on the wise kMinOverlappingRatio heuristic), we would compact [f,g) down to L3, which may be among the worst choices in terms of overlapping ratio (since we've taken 1 byte in L1 and possibly amplified it by compacting it with a mean of 50 bytes in L3, so a ratio of 50). This is the kind of thing balanced rent-or-buy was also designed to prevent. My understanding is that this prototype prevents this with a different approach using highOutputInvolvement, which I suspect will not work when lower levels have a lot of data (though it may work for higher levels).

highOutputInvolvement := (float64(pc.outputLevel.files.SizeSum()) / float64(curOutputLevelSize)) > 0.5

So to understand these results I think we need to gather lower-level metrics like:

  • the frequency of incoming compactions that cause size to be exceeded, for each level.
  • for (Li,Li+1,Li+2) compactions, for each i, what is the mean byte ratio of the bytes in each level that are involved in the compaction.

There are probably better ideas among those who have been thinking more about this, and I do think we have something promising in this work, but I figured it may help to elaborate on the kinds of things I think we need to understand better (in order to productionize this with a high degree of confidence).

@bananabrick
Contributor

bananabrick commented Jun 2, 2022

@msbutler

It might be useful to run benchmarks which match our nightly benchmark configurations: https://cockroachdb.github.io/pebble/.

The nightly benchmarks run for 1.5 hours on more powerful machines. The 3k ops/sec I see in your benchmarks seems pretty low, compared to what we see in our nightly benchmarks.

LSM performance is in general pretty hard to benchmark, because performance can drop as the LSM has more data in it.

@petermattis
Collaborator

I didn't read the code, but from the description this PR is implementing something like the multi-level compaction heuristic described in https://github.com/petermattis/compacttoy/blob/master/multilevel.go. Is that correct? If so, it might be worthwhile copying some of the math from those comments, or re-deriving it. (I'm frankly a bit puzzled by it while giving it a quick skim just now).

PS I'd be good with separating out the multi-level compaction mechanics into a separate PR which should be relatively uncontroversial to merge. The heuristics around multi-level compactions are going to be much harder to get right and will require experimentation, benchmarking, and iteration.

@nicktrav
Contributor

nicktrav commented Jun 6, 2022

@msbutler - given we're getting closer to landing this (modulo the suggested changes around how to split out the heuristics / mechanics), mind removing the WIP title and marking this as "Ready for review"? Thanks!

@msbutler msbutler changed the title "WIP compaction: introduce multi input level compaction" to "compaction: introduce multi input level compaction" Jun 6, 2022
@msbutler msbutler marked this pull request as ready for review June 6, 2022 17:50
Contributor

@itsbilal itsbilal left a comment


:lgtm: once the function with the compaction picking heuristic in initMultiLevelCompaction has been separated out into its own PR.

Reviewable status: all files reviewed, 9 unresolved discussions (waiting on @bananabrick and @msbutler)


compaction.go line 659 at r4 (raw file):

	if len(c.extraLevels) == 0 {
		return false
	} else if c.extraLevels[0].files.Empty() {

Was this ever happening in practice? Maybe we could throw in a panic here to confirm it's not. If we're adding an extra level it should not be empty. That way we can also get rid of this function and replace it with just the length check.


compaction_picker.go line 264 at r4 (raw file):

// maybeExpandedBounds is a helper function for setupInputs which ensures the
// pickedcompaction's smallest and largest internal keys are updated iff

Nit: pickedCompaction


compaction_picker.go line 442 at r4 (raw file):

//    immediately (i.e. the predicted output level size exceeds the level's max bytes)
// 2. Over 50% of the original output level is included in the compaction.
func (pc *pickedCompaction) initMultiInputLevelCompaction(

nit: initMultiLevelCompaction matches the option name and is also easier to search.


compaction_picker.go line 451 at r4 (raw file):

	// which would prevent files with del ranges from hitting this.
	// however this doesn't apply much to use cases (inverted LSM, manual compaction).
	predOutputFileSize := uint64(math.Max(float64(pc.startLevel.files.SizeSum()),

Since the heuristic is mostly in this method, it'd be best to just have this return false for now, and create a new PR with just this code. We can continue the benchmarking/iteration there.


metrics.go line 390 at r4 (raw file):

		redact.Safe(m.Compact.NumInProgress),
		redact.SafeString(""))
	w.Printf("  ctype %9d %7d %7d %7d %7d %7d %7d  (default, delete, elision, move, read, rewrite, multiInput)\n",

nit: multi-input is more consistent with the convention around here

@msbutler
Contributor Author

msbutler commented Jun 8, 2022

@itsbilal Thanks for the review! To separate out the compaction picking heuristic as you suggest, would that imply that in this PR, we would never pick a multi-level compaction, even though this is all behind an experimental flag? I'd be in favor of keeping the current compaction heuristic in this PR, as we need to init some multi level compactions to pass the tests. Alternatively, I could put all the tests in the future PR as well.

Contributor Author

@msbutler msbutler left a comment


Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @itsbilal)


-- commits line 19 at r4:

Previously, itsbilal (Bilal Akhtar) wrote…

You're going to squash this before the merge, right?

Done.


compaction.go line 659 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Was this ever happening in practice? Maybe we could throw in a panic here to confirm it's not. If we're adding an extra level it should not be empty. That way we can also get rid of this function and replace it with just the length check.

This is a valid case! see the comment I added:

// a multi level compaction without data in the intermediate input level;
// e.g. for a multi level compaction with levels 4, 5, and 6, this could
// occur if there are no files to compact in 5, or in 5 and 6 (i.e. a move).

compaction_picker.go line 172 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

nit: paragraph should start with l0SublevelInfo

Done.


compaction_picker.go line 264 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: pickedCompaction

Done.


compaction_picker.go line 442 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

nit: initMultiLevelCompaction, matches the option name and is also easier to search..

Done.


compaction_picker.go line 451 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Since the heuristic is mostly in this method, it'd be best to just have this return false for now, and create a new PR with just this code. We can continue the benchmarking/iteration there.

Done


options.go line 489 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Nit: period at the end, also you can mention the other heuristic that you just added.

Done.

@msbutler msbutler changed the title "compaction: introduce multi input level compaction" to "compaction: introduce multi input level compaction mechanics" Jun 8, 2022
@msbutler
Contributor Author

msbutler commented Jun 8, 2022

@itsbilal addressed all your nits and moved heuristics to #1746 for further discussion.

Contributor

@itsbilal itsbilal left a comment


Yep, based on offline discussion we agreed to move tests / heuristic code to #1746.

:lgtm:

Reviewed 16 of 16 files at r5, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @msbutler)


compaction.go line 659 at r4 (raw file):

Previously, msbutler (Michael Butler) wrote…

This is a valid case! see the comment I added:

// a multi level compaction without data in the intermediate input level;
// e.g. for a multi level compaction with levels 4, 5, and 6, this could
// occur if there are no files to compact in 5, or in 5 and 6 (i.e. a move).

Interesting, so this seems like a "skip this middle level when compacting" compaction to me. Pretty good idea.

This actually reminds me of another instance where we could schedule multi-level compactions and see a benefit; compactions where we're merging into very few files/bytes in the original output level (but possibly still nonzero bytes) compared to the input level. It probably means we can make those output level files the middle level and compact into one level further down. Worth benchmarking in #1746.


metrics.go line 390 at r5 (raw file):

		redact.Safe(m.Compact.NumInProgress),
		redact.SafeString(""))
	w.Printf("  ctype %9d %7d %7d %7d %7d %7d %7d  (default, delete, elision, move, read, rewrite, multiLevel)\n",

nit: multi-level


testdata/event_listener line 193 at r5 (raw file):

  flush         3
compact         1   2.3 K     0 B       0          (size == estimated-debt, score = in-progress-bytes, in = num-in-progress)
  ctype         1       0       0       0       0       0       0  (default, delete, elision, move, read, rewrite, multiLevel)

nit: multi-level

Previously, levelled compaction only compacted SSTs from at most 2 levels. This
change introduces the mechanics to conduct a 3 level compaction as an
experimental feature. No multi level compactions will actually occur, as the
new `initMultiLevelCompaction` function always returns false. Future work
includes:
  - consider heuristics to init multi level compactions
  - add mechanics to init multi level compactions with more than 3 levels

Informs cockroachdb#136
@msbutler msbutler force-pushed the butler-nlevel-compaction branch from e2f1fdd to 8b7a68b Compare June 9, 2022 19:45
@msbutler msbutler merged commit 76a1fff into cockroachdb:master Jun 9, 2022
@msbutler msbutler deleted the butler-nlevel-compaction branch June 9, 2022 21:32
@msbutler msbutler changed the title "compaction: introduce multi input level compaction mechanics" to "compaction: introduce multi level compaction mechanics" Jun 24, 2022