Skip to content

compaction: introduce multilevel write amp hueristic#2787

Merged
msbutler merged 1 commit intocockroachdb:masterfrom
msbutler:butler-merge-ml
Sep 6, 2023
Merged

compaction: introduce multilevel write amp hueristic#2787
msbutler merged 1 commit intocockroachdb:masterfrom
msbutler:butler-merge-ml

Conversation

@msbutler
Copy link
Copy Markdown
Contributor

@msbutler msbutler commented Aug 4, 2023

This patch teaches pebble to pick a compaction involving 3 levels based on a
new WriteAmp heuristic, so long as the input level is below L0. Recall that
pebble#1603 introduced the mechanics to conduct a multilevel compaction by
teaching the compaction picker to call setupInputs() on the original output
level, finding files to include in a third level. The new WriteAmp heuristic
will choose this 3 level compaction if its predicted WriteAmp is less than the
original compaction's predicted WriteAmp. Multilevel compactions will be
considered for default and manual compactions, but not read triggered
compactions.

Predicted WriteAmp equals:
= sum(size of all input files)/sum(size of all files from higher levels)

According to several write intensive experiments, enabling the WriteAmp
heuristic has the potential to slightly reduce overall WriteAmplification,
increase throughput, and prevent inverted LSMs.

Future work could tweak the heuristic or enable compactions with more than 3 levels.

Informs #136

Release note: none

@msbutler msbutler self-assigned this Aug 4, 2023
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 17 files reviewed, 5 unresolved discussions (waiting on @msbutler)


event.go line 59 at r1 (raw file):

		redact.Safe(formatFileNums(i.Tables)),
		redact.Safe(humanize.Bytes.Uint64(tablesTotalSize(i.Tables))),
		redact.Safe(i.Score))

I'm not sure if you're intending to keep this in when merged (I'm not opposed), but any updates to the compaction log entries require corresponding updates in the compaction log-parsing tool: https://github.com/cockroachdb/pebble/blob/master/tool/logs/compaction.go

I filed #2796 for the testing gap here.


metrics.go line 75 at r1 (raw file):

	TablesMoved uint64

	// BytesInTopML are the total bytes in a multilevel compaction coming from the top level.

nit: how about grouping within an anonymous struct, eg

MultiLevel struct {
    // BytesInTop ...
    BytesInTop uint64
    // BytesInTop ...
    BytesIn    uint64
    // BytesRead ...
    BytesRead  uint64
}

metrics.go line 147 at r1 (raw file):

		ReadCount         int64
		RewriteCount      int64
		SingleLevelCount  int64

can the single-level count be computed from the other counts?


options.go line 599 at r1 (raw file):

		MultiLevelCompactionHueristic MultiLevelHeuristic

		MultiLevelCompactionAllowL0 bool

nit: What do you think of stashing this on the WriteAmpHeuristic/MultiLevelHeuristic?


testdata/singledel_manual_compaction_set_with_del line 178 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

@jbowens while cleaning up the ML PR, I noticed the need to change the results in this test case. Are ML compactions violating some invariant here? Note that this test still fails on some later line.

I don't think any invariants are being violated here, but this change from a L2->L3 compaction into a L2+L3->L4 multi-level compactions causes this test to no longer exercise the intended test case (a DEL and a SINGLEDEL meeting in a compaction). I think we should add a define flag that disables multi-level compactions and use it in this test (at line 147) keep the old picked compactions.

Copy link
Copy Markdown
Contributor Author

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 17 files reviewed, 5 unresolved discussions (waiting on @jbowens and @msbutler)


event.go line 59 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm not sure if you're intending to keep this in when merged (I'm not opposed), but any updates to the compaction log entries require corresponding updates in the compaction log-parsing tool: https://github.com/cockroachdb/pebble/blob/master/tool/logs/compaction.go

I filed #2796 for the testing gap here.

Ah good point. Before I go make those changes, do you have any suggestions for how I've modified the compaction log lines?

A default compaction now displays the compaction score of each input level and the single and ML overlapping ratio. As an example, here's how a single level compaction's logs line currently look in my PR:

[JOB 92] compacting(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99; OverlappingRatio: Single 8.03, Multi 25.05;
[JOB 92] compacted(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99 -> L5 [250871] (7.1MB), in 0.1s (0.1s total), output rate 111MB/s

And a multi level:

[JOB 11] compacting(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00; OverlappingRatio: Single 3.77, Multi 1.46;
[JOB 11] compacted(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00 -> L4 [250859 250860 250861 250862 250863] (46MB), in 0.2s (0.2s total), output rate 185MB/s

metrics.go line 75 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: how about grouping within an anonymous struct, eg

MultiLevel struct {
    // BytesInTop ...
    BytesInTop uint64
    // BytesInTop ...
    BytesIn    uint64
    // BytesRead ...
    BytesRead  uint64
}

good point. done.


metrics.go line 147 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

can the single-level count be computed from the other counts?

Deleted. Wasn't really necessary.


options.go line 599 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: What do you think of stashing this on the WriteAmpHeuristic/MultiLevelHeuristic?

Good nit. Fixed.


testdata/singledel_manual_compaction_set_with_del line 178 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I don't think any invariants are being violated here, but this change from a L2->L3 compaction into a L2+L3->L4 multi-level compactions causes this test to no longer exercise the intended test case (a DEL and a SINGLEDEL meeting in a compaction). I think we should add a define flag that disables multi-level compactions and use it in this test (at line 147) keep the old picked compactions.

Sounds good. Added that define flag.

@msbutler msbutler force-pushed the butler-merge-ml branch 3 times, most recently from 411bbf4 to 8418b20 Compare August 9, 2023 12:18
Copy link
Copy Markdown
Contributor Author

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @jbowens and @msbutler)


event.go line 59 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

Ah good point. Before I go make those changes, do you have any suggestions for how I've modified the compaction log lines?

A default compaction now displays the compaction score of each input level and the single and ML overlapping ratio. As an example, here's how a single level compaction's logs line currently look in my PR:

[JOB 92] compacting(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99; OverlappingRatio: Single 8.03, Multi 25.05;
[JOB 92] compacted(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99 -> L5 [250871] (7.1MB), in 0.1s (0.1s total), output rate 111MB/s

And a multi level:

[JOB 11] compacting(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00; OverlappingRatio: Single 3.77, Multi 1.46;
[JOB 11] compacted(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00 -> L4 [250859 250860 250861 250862 250863] (46MB), in 0.2s (0.2s total), output rate 185MB/s

Perhaps its also worth bikeshedding how we'd want to modify the window logger. I think the current impl looks like this: b5deee5

One nice quality of this format is that it cleanly partitions the kinds of compactions. While ML currently can only get triggered in a default compaction, we could enable it for move and read triggered compactions in the future if we wanted to. I propose adding an extra(B) column between the in(B) and out(B) columns and keep everything else the same.

Copy link
Copy Markdown
Contributor Author

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @jbowens and @msbutler)


event.go line 59 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

Perhaps its also worth bikeshedding how we'd want to modify the window logger. I think the current impl looks like this: b5deee5

One nice quality of this format is that it cleanly partitions the kinds of compactions. While ML currently can only get triggered in a default compaction, we could enable it for move and read triggered compactions in the future if we wanted to. I propose adding an extra(B) column between the in(B) and out(B) columns and keep everything else the same.

^well actually, with the given window format, the in(B) column represents the sum of bytes over all input files and out(B) represents the size of the new file created from the compaction. Given this, it wouldn't make much sense to just log the bytes contributed from the extra level.

We could split in(B) into 3 columns (top level, extra level, output level), but that seems out of scope in this PR.

For this PR, is it alright if I just update the tests and ensure ML compaction log lines get parsed correctly?

Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 20 files reviewed, 1 unresolved discussion (waiting on @msbutler)


event.go line 59 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

^well actually, with the given window format, the in(B) column represents the sum of bytes over all input files and out(B) represents the size of the new file created from the compaction. Given this, it wouldn't make much sense to just log the bytes contributed from the extra level.

We could split in(B) into 3 columns (top level, extra level, output level), but that seems out of scope in this PR.

For this PR, is it alright if I just update the tests and ensure ML compaction log lines get parsed correctly?

Yeah, I think that's alright. Mind creating an issue for the follow up though?

@msbutler msbutler force-pushed the butler-merge-ml branch 3 times, most recently from d54db2e to 12d6d95 Compare August 11, 2023 12:05
@msbutler msbutler marked this pull request as ready for review August 11, 2023 12:10
@msbutler msbutler requested a review from sumeerbhola August 11, 2023 13:55
@msbutler
Copy link
Copy Markdown
Contributor Author

@sumeerbhola @jbowens RFAL! there were 2 minor test failures bc I didn't update the test case log lines correctly. I will update these after y'all have a read.

@@ -120,7 +120,7 @@ reset

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will plan to update the logs in this test file if y'all are cool with my log line updates.

@jbowens
Copy link
Copy Markdown
Contributor

jbowens commented Aug 30, 2023

Sorry for the delay here; hoping to review again tomorrow

Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 17 files at r1, 5 of 13 files at r2, 9 of 12 files at r3.
Reviewable status: 19 of 25 files reviewed, 5 unresolved discussions (waiting on @msbutler and @sumeerbhola)


compaction_picker.go line 1407 at r3 (raw file):

func (p *compactionPickerByScore) updatePickerMetrics(
	env compactionEnv, pc pickedCompaction, candInfo [7]candidateLevelInfo,

pickedCompaction is a hefty struct. Can we take this by pointer?

And in that case, should we even bother returning the value out too? Call sites are modifying the picked compaction that they pass in.


compaction_picker.go line 1412 at r3 (raw file):

	// candInfo is sorted by score, not by compaction level.
	infoByLevel := [7]candidateLevelInfo{}

nit: s/7/numLevels/


compaction_picker.go line 1763 at r3 (raw file):

) *pickedCompaction {
	pcMulti := pcOrig.clone()
	if !pcMulti.setupMultiLevelCandidate(opts, diskAvailBytes) {

I'm mildly concerned about the amount of extra work involved in cloning and re-constructing inputs. Compaction-picking holds d.mu, which prevents WAL rotations, in-progress flushes and compactions from completing, etc. How about leaving a TODO to microbenchmark and possibly restructure to limit duplicate work here?


options.go line 1077 at r3 (raw file):

		o.Experimental.CPUWorkPermissionGranter = defaultCPUWorkGranter{}
	}
	if o.Experimental.MultiLevelCompactionHueristic == nil {

nit: I'm noticing now that heuristic is misspelled here.

Copy link
Copy Markdown
Contributor Author

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 25 files reviewed, 4 unresolved discussions (waiting on @jbowens and @sumeerbhola)


compaction_picker.go line 1407 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

pickedCompaction is a hefty struct. Can we take this by pointer?

And in that case, should we even bother returning the value out too? Call sites are modifying the picked compaction that they pass in.

Good idea. Done.


compaction_picker.go line 1412 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: s/7/numLevels/

Done.


compaction_picker.go line 1763 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm mildly concerned about the amount of extra work involved in cloning and re-constructing inputs. Compaction-picking holds d.mu, which prevents WAL rotations, in-progress flushes and compactions from completing, etc. How about leaving a TODO to microbenchmark and possibly restructure to limit duplicate work here?

Added the todo


event.go line 59 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Yeah, I think that's alright. Mind creating an issue for the follow up though?

Done. #2814


options.go line 1077 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: I'm noticing now that heuristic is misspelled here.

heh. fixed.


tool/logs/testdata/compactions line 120 at r3 (raw file):

Previously, msbutler (Michael Butler) wrote…

I will plan to update the logs in this test file if y'all are cool with my log line updates.

Just Updated.

Copy link
Copy Markdown
Contributor

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong:

Reviewed 1 of 7 files at r5, all commit messages.
Reviewable status: 10 of 25 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)

This patch teaches pebble to pick a compaction involving 3 levels based on a
new WriteAmp heuristic, so long as the input level is below L0. Recall that
pebble#1603 introduced the mechanics to conduct a multilevel compaction by
teaching the compaction picker to call setupInputs() on the original output
level, finding files to include in a third level. The new WriteAmp heuristic
will choose this 3 level compaction if its predicted WriteAmp is less than the
original compaction's predicted WriteAmp. Multilevel compactions will be
considered for default and manual compactions, but not read triggered
compactions.

Predicted WriteAmp equals:
= sum(size of all input files)/sum(size of all files from higher levels)

According to several write intensive experiments, enabling the WriteAmp
heuristic has the potential to slightly reduce overall WriteAmplification,
increase throughput, and prevent inverted LSMs.
- See results from running pebble ycsb F with 64, 96, and 124 workers:
  cockroachdb#2179 (comment)
- See write up from running the replay tool with kv0
  https://docs.google.com/document/d/1OmerofbtQfZUWQqLklqwmDJdVXI6qJzCLZQjsGirmFk/edit

Future work could tweak the heuristic or enable compactions with more than 3 levels.

Informs cockroachdb#136

Release note: none
@msbutler
Copy link
Copy Markdown
Contributor Author

msbutler commented Sep 6, 2023

rebased to address a few merge conflicts. nothing complicated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants