compaction: introduce multilevel write amp hueristic by msbutler · Pull Request #2787 · cockroachdb/pebble

msbutler · 2023-08-04T01:18:42Z

This patch teaches pebble to pick a compaction involving 3 levels based on a
new WriteAmp heuristic, so long as the input level is below L0. Recall that
pebble#1603 introduced the mechanics to conduct a multilevel compaction by
teaching the compaction picker to call setupInputs() on the original output
level, finding files to include in a third level. The new WriteAmp heuristic
will choose this 3 level compaction if its predicted WriteAmp is less than the
original compaction's predicted WriteAmp. Multilevel compactions will be
considered for default and manual compactions, but not read triggered
compactions.

Predicted WriteAmp equals:
= sum(size of all input files)/sum(size of all files from higher levels)

According to several write intensive experiments, enabling the WriteAmp
heuristic has the potential to slightly reduce overall WriteAmplification,
increase throughput, and prevent inverted LSMs.

See results from running pebble ycsb F with 64, 96, and 124 workers:
compaction: introduce multilevel write-amp heuristic #2179 (comment)
See write up from running the replay tool with kv0
https://docs.google.com/document/d/1OmerofbtQfZUWQqLklqwmDJdVXI6qJzCLZQjsGirmFk/edit

Future work could tweak the heuristic or enable compactions with more than 3 levels.

Informs #136

Release note: none

cockroach-teamcity · 2023-08-04T01:18:48Z

This change is

testdata/singledel_manual_compaction_set_with_del

jbowens

Reviewable status: 0 of 17 files reviewed, 5 unresolved discussions (waiting on @msbutler)

event.go line 59 at r1 (raw file):

		redact.Safe(formatFileNums(i.Tables)),
		redact.Safe(humanize.Bytes.Uint64(tablesTotalSize(i.Tables))),
		redact.Safe(i.Score))

I'm not sure if you're intending to keep this in when merged (I'm not opposed), but any updates to the compaction log entries require corresponding updates in the compaction log-parsing tool: https://github.com/cockroachdb/pebble/blob/master/tool/logs/compaction.go

I filed #2796 for the testing gap here.

metrics.go line 75 at r1 (raw file):

	TablesMoved uint64

	// BytesInTopML are the total bytes in a multilevel compaction coming from the top level.

nit: how about grouping within an anonymous struct, eg

MultiLevel struct {
    // BytesInTop ...
    BytesInTop uint64
    // BytesInTop ...
    BytesIn    uint64
    // BytesRead ...
    BytesRead  uint64
}

metrics.go line 147 at r1 (raw file):

		ReadCount         int64
		RewriteCount      int64
		SingleLevelCount  int64

can the single-level count be computed from the other counts?

options.go line 599 at r1 (raw file):

		MultiLevelCompactionHueristic MultiLevelHeuristic

		MultiLevelCompactionAllowL0 bool

nit: What do you think of stashing this on the WriteAmpHeuristic/MultiLevelHeuristic?

testdata/singledel_manual_compaction_set_with_del line 178 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

@jbowens while cleaning up the ML PR, I noticed the need to change the results in this test case. Are ML compactions violating some invariant here? Note that this test still fails on some later line.

I don't think any invariants are being violated here, but this change from a L2->L3 compaction into a L2+L3->L4 multi-level compactions causes this test to no longer exercise the intended test case (a DEL and a SINGLEDEL meeting in a compaction). I think we should add a define flag that disables multi-level compactions and use it in this test (at line 147) keep the old picked compactions.

msbutler

Reviewable status: 0 of 17 files reviewed, 5 unresolved discussions (waiting on @jbowens and @msbutler)

event.go line 59 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm not sure if you're intending to keep this in when merged (I'm not opposed), but any updates to the compaction log entries require corresponding updates in the compaction log-parsing tool: https://github.com/cockroachdb/pebble/blob/master/tool/logs/compaction.go

I filed #2796 for the testing gap here.

Ah good point. Before I go make those changes, do you have any suggestions for how I've modified the compaction log lines?

A default compaction now displays the compaction score of each input level and the single and ML overlapping ratio. As an example, here's how a single level compaction's logs line currently look in my PR:

[JOB 92] compacting(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99; OverlappingRatio: Single 8.03, Multi 25.05;
[JOB 92] compacted(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99 -> L5 [250871] (7.1MB), in 0.1s (0.1s total), output rate 111MB/s

And a multi level:

[JOB 11] compacting(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00; OverlappingRatio: Single 3.77, Multi 1.46;
[JOB 11] compacted(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00 -> L4 [250859 250860 250861 250862 250863] (46MB), in 0.2s (0.2s total), output rate 185MB/s

metrics.go line 75 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: how about grouping within an anonymous struct, eg

MultiLevel struct {
    // BytesInTop ...
    BytesInTop uint64
    // BytesInTop ...
    BytesIn    uint64
    // BytesRead ...
    BytesRead  uint64
}

good point. done.

metrics.go line 147 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

can the single-level count be computed from the other counts?

Deleted. Wasn't really necessary.

options.go line 599 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: What do you think of stashing this on the WriteAmpHeuristic/MultiLevelHeuristic?

Good nit. Fixed.

testdata/singledel_manual_compaction_set_with_del line 178 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I don't think any invariants are being violated here, but this change from a L2->L3 compaction into a L2+L3->L4 multi-level compactions causes this test to no longer exercise the intended test case (a DEL and a SINGLEDEL meeting in a compaction). I think we should add a define flag that disables multi-level compactions and use it in this test (at line 147) keep the old picked compactions.

Sounds good. Added that define flag.

msbutler

Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @jbowens and @msbutler)

event.go line 59 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

Ah good point. Before I go make those changes, do you have any suggestions for how I've modified the compaction log lines?

A default compaction now displays the compaction score of each input level and the single and ML overlapping ratio. As an example, here's how a single level compaction's logs line currently look in my PR:
[JOB 92] compacting(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99; OverlappingRatio: Single 8.03, Multi 25.05;
[JOB 92] compacted(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99 -> L5 [250871] (7.1MB), in 0.1s (0.1s total), output rate 111MB/s
And a multi level:
[JOB 11] compacting(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00; OverlappingRatio: Single 3.77, Multi 1.46;
[JOB 11] compacted(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00 -> L4 [250859 250860 250861 250862 250863] (46MB), in 0.2s (0.2s total), output rate 185MB/s

Perhaps its also worth bikeshedding how we'd want to modify the window logger. I think the current impl looks like this: b5deee5

One nice quality of this format is that it cleanly partitions the kinds of compactions. While ML currently can only get triggered in a default compaction, we could enable it for move and read triggered compactions in the future if we wanted to. I propose adding an extra(B) column between the in(B) and out(B) columns and keep everything else the same.

msbutler

Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @jbowens and @msbutler)

event.go line 59 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

Perhaps its also worth bikeshedding how we'd want to modify the window logger. I think the current impl looks like this: b5deee5

One nice quality of this format is that it cleanly partitions the kinds of compactions. While ML currently can only get triggered in a default compaction, we could enable it for move and read triggered compactions in the future if we wanted to. I propose adding an extra(B) column between the in(B) and out(B) columns and keep everything else the same.

^well actually, with the given window format, the in(B) column represents the sum of bytes over all input files and out(B) represents the size of the new file created from the compaction. Given this, it wouldn't make much sense to just log the bytes contributed from the extra level.

We could split in(B) into 3 columns (top level, extra level, output level), but that seems out of scope in this PR.

For this PR, is it alright if I just update the tests and ensure ML compaction log lines get parsed correctly?

jbowens

Reviewable status: 0 of 20 files reviewed, 1 unresolved discussion (waiting on @msbutler)

event.go line 59 at r1 (raw file):

Previously, msbutler (Michael Butler) wrote…

^well actually, with the given window format, the in(B) column represents the sum of bytes over all input files and out(B) represents the size of the new file created from the compaction. Given this, it wouldn't make much sense to just log the bytes contributed from the extra level.

We could split in(B) into 3 columns (top level, extra level, output level), but that seems out of scope in this PR.

For this PR, is it alright if I just update the tests and ensure ML compaction log lines get parsed correctly?

Yeah, I think that's alright. Mind creating an issue for the follow up though?

msbutler · 2023-08-11T13:57:26Z

@sumeerbhola @jbowens RFAL! there were 2 minor test failures bc I didn't update the test case log lines correctly. I will update these after y'all have a read.

msbutler · 2023-08-11T13:58:33Z

tool/logs/testdata/compactions

@@ -120,7 +120,7 @@ reset



I will plan to update the logs in this test file if y'all are cool with my log line updates.

jbowens · 2023-08-30T01:16:14Z

Sorry for the delay here; hoping to review again tomorrow

jbowens

Reviewed 5 of 17 files at r1, 5 of 13 files at r2, 9 of 12 files at r3.
Reviewable status: 19 of 25 files reviewed, 5 unresolved discussions (waiting on @msbutler and @sumeerbhola)

compaction_picker.go line 1407 at r3 (raw file):

func (p *compactionPickerByScore) updatePickerMetrics(
	env compactionEnv, pc pickedCompaction, candInfo [7]candidateLevelInfo,

pickedCompaction is a hefty struct. Can we take this by pointer?

And in that case, should we even bother returning the value out too? Call sites are modifying the picked compaction that they pass in.

compaction_picker.go line 1412 at r3 (raw file):

	// candInfo is sorted by score, not by compaction level.
	infoByLevel := [7]candidateLevelInfo{}

nit: s/7/numLevels/

compaction_picker.go line 1763 at r3 (raw file):

) *pickedCompaction {
	pcMulti := pcOrig.clone()
	if !pcMulti.setupMultiLevelCandidate(opts, diskAvailBytes) {

I'm mildly concerned about the amount of extra work involved in cloning and re-constructing inputs. Compaction-picking holds d.mu, which prevents WAL rotations, in-progress flushes and compactions from completing, etc. How about leaving a TODO to microbenchmark and possibly restructure to limit duplicate work here?

options.go line 1077 at r3 (raw file):

		o.Experimental.CPUWorkPermissionGranter = defaultCPUWorkGranter{}
	}
	if o.Experimental.MultiLevelCompactionHueristic == nil {

nit: I'm noticing now that heuristic is misspelled here.

msbutler

Reviewable status: 9 of 25 files reviewed, 4 unresolved discussions (waiting on @jbowens and @sumeerbhola)

compaction_picker.go line 1407 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

pickedCompaction is a hefty struct. Can we take this by pointer?

And in that case, should we even bother returning the value out too? Call sites are modifying the picked compaction that they pass in.

Good idea. Done.

compaction_picker.go line 1412 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: s/7/numLevels/

Done.

compaction_picker.go line 1763 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm mildly concerned about the amount of extra work involved in cloning and re-constructing inputs. Compaction-picking holds d.mu, which prevents WAL rotations, in-progress flushes and compactions from completing, etc. How about leaving a TODO to microbenchmark and possibly restructure to limit duplicate work here?

Added the todo

event.go line 59 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Yeah, I think that's alright. Mind creating an issue for the follow up though?

Done. #2814

options.go line 1077 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: I'm noticing now that heuristic is misspelled here.

heh. fixed.

tool/logs/testdata/compactions line 120 at r3 (raw file):

Previously, msbutler (Michael Butler) wrote…

I will plan to update the logs in this test file if y'all are cool with my log line updates.

Just Updated.

jbowens

Reviewed 1 of 7 files at r5, all commit messages.
Reviewable status: 10 of 25 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)

This patch teaches pebble to pick a compaction involving 3 levels based on a new WriteAmp heuristic, so long as the input level is below L0. Recall that pebble#1603 introduced the mechanics to conduct a multilevel compaction by teaching the compaction picker to call setupInputs() on the original output level, finding files to include in a third level. The new WriteAmp heuristic will choose this 3 level compaction if its predicted WriteAmp is less than the original compaction's predicted WriteAmp. Multilevel compactions will be considered for default and manual compactions, but not read triggered compactions. Predicted WriteAmp equals: = sum(size of all input files)/sum(size of all files from higher levels) According to several write intensive experiments, enabling the WriteAmp heuristic has the potential to slightly reduce overall WriteAmplification, increase throughput, and prevent inverted LSMs. - See results from running pebble ycsb F with 64, 96, and 124 workers: cockroachdb#2179 (comment) - See write up from running the replay tool with kv0 https://docs.google.com/document/d/1OmerofbtQfZUWQqLklqwmDJdVXI6qJzCLZQjsGirmFk/edit Future work could tweak the heuristic or enable compactions with more than 3 levels. Informs cockroachdb#136 Release note: none

msbutler · 2023-09-06T16:22:13Z

rebased to address a few merge conflicts. nothing complicated.

msbutler self-assigned this Aug 4, 2023

msbutler force-pushed the butler-merge-ml branch from aa1756e to 773965e Compare August 4, 2023 17:29

This was referenced Aug 4, 2023

compaction: introduce multilevel write-amp heuristic #2179

Closed

compaction: propose multi level compaction heursitics #1746

Closed

msbutler commented Aug 4, 2023

View reviewed changes

testdata/singledel_manual_compaction_set_with_del Outdated Show resolved Hide resolved

jbowens reviewed Aug 7, 2023

View reviewed changes

msbutler commented Aug 9, 2023

View reviewed changes

msbutler force-pushed the butler-merge-ml branch 3 times, most recently from 411bbf4 to 8418b20 Compare August 9, 2023 12:18

msbutler commented Aug 9, 2023

View reviewed changes

msbutler force-pushed the butler-merge-ml branch from 8418b20 to be40805 Compare August 9, 2023 22:16

jbowens reviewed Aug 10, 2023

View reviewed changes

msbutler mentioned this pull request Aug 10, 2023

tools/logs: split in(B) column into in_top(B), in_bottom(B) #2814

Closed

msbutler force-pushed the butler-merge-ml branch 3 times, most recently from d54db2e to 12d6d95 Compare August 11, 2023 12:05

msbutler marked this pull request as ready for review August 11, 2023 12:10

msbutler force-pushed the butler-merge-ml branch from 12d6d95 to deeb6ed Compare August 11, 2023 13:37

msbutler requested a review from sumeerbhola August 11, 2023 13:55

msbutler commented Aug 11, 2023

View reviewed changes

msbutler added the T-storage label Aug 11, 2023

jbowens reviewed Aug 31, 2023

View reviewed changes

msbutler force-pushed the butler-merge-ml branch from deeb6ed to 87be5a4 Compare September 1, 2023 17:49

msbutler commented Sep 1, 2023

View reviewed changes

msbutler force-pushed the butler-merge-ml branch from 87be5a4 to d5d073c Compare September 1, 2023 17:56

jbowens approved these changes Sep 6, 2023

View reviewed changes

msbutler force-pushed the butler-merge-ml branch from d5d073c to 8679e19 Compare September 6, 2023 16:21

msbutler merged commit 4e6810e into cockroachdb:master Sep 6, 2023

msbutler deleted the butler-merge-ml branch September 6, 2023 16:37

This was referenced Sep 6, 2023

compaction: explore predicted write ampflication heuristic for multilevel compaction #2053

Closed

compaction: write microbenchmark for compaction picking #2883

Closed

msbutler mentioned this pull request Dec 13, 2023

perf: investigate N-level compactions (N > 2) #136

Closed

Conversation

msbutler commented Aug 4, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cockroach-teamcity commented Aug 4, 2023

Uh oh!

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

msbutler left a comment

Choose a reason for hiding this comment

Uh oh!

msbutler left a comment

Choose a reason for hiding this comment

Uh oh!

msbutler left a comment

Choose a reason for hiding this comment

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

msbutler commented Aug 11, 2023

Uh oh!

msbutler Aug 11, 2023

Choose a reason for hiding this comment

Uh oh!

jbowens commented Aug 30, 2023

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

msbutler left a comment

Choose a reason for hiding this comment

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

msbutler commented Sep 6, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

msbutler commented Aug 4, 2023 •

edited

Loading