compaction: introduce multilevel write amp heuristic #2787
msbutler merged 1 commit into cockroachdb:master
Conversation
Force-pushed from aa1756e to 773965e
jbowens
left a comment
Reviewable status: 0 of 17 files reviewed, 5 unresolved discussions (waiting on @msbutler)
event.go line 59 at r1 (raw file):
redact.Safe(formatFileNums(i.Tables)), redact.Safe(humanize.Bytes.Uint64(tablesTotalSize(i.Tables))), redact.Safe(i.Score))
I'm not sure if you're intending to keep this in when merged (I'm not opposed), but any updates to the compaction log entries require corresponding updates in the compaction log-parsing tool: https://github.com/cockroachdb/pebble/blob/master/tool/logs/compaction.go
I filed #2796 for the testing gap here.
metrics.go line 75 at r1 (raw file):
TablesMoved uint64

// BytesInTopML are the total bytes in a multilevel compaction coming from the top level.
nit: how about grouping within an anonymous struct, eg
MultiLevel struct {
    // BytesInTop ...
    BytesInTop uint64
    // BytesIn ...
    BytesIn uint64
    // BytesRead ...
    BytesRead uint64
}
metrics.go line 147 at r1 (raw file):
ReadCount        int64
RewriteCount     int64
SingleLevelCount int64
can the single-level count be computed from the other counts?
options.go line 599 at r1 (raw file):
MultiLevelCompactionHueristic MultiLevelHeuristic
MultiLevelCompactionAllowL0   bool
nit: What do you think of stashing this on the WriteAmpHeuristic/MultiLevelHeuristic?
testdata/singledel_manual_compaction_set_with_del line 178 at r1 (raw file):
Previously, msbutler (Michael Butler) wrote…
@jbowens while cleaning up the ML PR, I noticed the need to change the results in this test case. Are ML compactions violating some invariant here? Note that this test still fails on some later line.
I don't think any invariants are being violated here, but this change from an L2->L3 compaction into an L2+L3->L4 multi-level compaction causes this test to no longer exercise the intended test case (a DEL and a SINGLEDEL meeting in a compaction). I think we should add a define flag that disables multi-level compactions and use it in this test (at line 147) to keep the old picked compactions.
msbutler
left a comment
Reviewable status: 0 of 17 files reviewed, 5 unresolved discussions (waiting on @jbowens and @msbutler)
event.go line 59 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I'm not sure if you're intending to keep this in when merged (I'm not opposed), but any updates to the compaction log entries require corresponding updates in the compaction log-parsing tool: https://github.com/cockroachdb/pebble/blob/master/tool/logs/compaction.go
I filed #2796 for the testing gap here.
Ah good point. Before I go make those changes, do you have any suggestions for how I've modified the compaction log lines?
A default compaction now displays the compaction score of each input level and the single and ML overlapping ratio. As an example, here's how a single-level compaction's log lines currently look in my PR:
[JOB 92] compacting(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99; OverlappingRatio: Single 8.03, Multi 25.05;
[JOB 92] compacted(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99 -> L5 [250871] (7.1MB), in 0.1s (0.1s total), output rate 111MB/s
And a multi-level compaction:
[JOB 11] compacting(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00; OverlappingRatio: Single 3.77, Multi 1.46;
[JOB 11] compacted(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00 -> L4 [250859 250860 250861 250862 250863] (46MB), in 0.2s (0.2s total), output rate 185MB/s
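Since the log-parsing tool will need to recognize the new suffix, here's a minimal sketch of matching the OverlappingRatio fields from a log line like the ones above. The pattern `overlapRE` is hypothetical, not the regex actually used in tool/logs/compaction.go:

```go
package main

import (
	"fmt"
	"regexp"
)

// overlapRE is a hypothetical pattern for the new log-line suffix; the real
// parser in tool/logs/compaction.go may use a different regex.
var overlapRE = regexp.MustCompile(`OverlappingRatio: Single (\d+(?:\.\d+)?), Multi (\d+(?:\.\d+)?);`)

func main() {
	line := `[JOB 92] compacting(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99; OverlappingRatio: Single 8.03, Multi 25.05;`
	// FindStringSubmatch returns the full match plus the two captured ratios.
	m := overlapRE.FindStringSubmatch(line)
	fmt.Println(m[1], m[2]) // prints "8.03 25.05"
}
```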
metrics.go line 75 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: how about grouping within an anonymous struct, eg
MultiLevel struct {
    // BytesInTop ...
    BytesInTop uint64
    // BytesIn ...
    BytesIn uint64
    // BytesRead ...
    BytesRead uint64
}
good point. done.
metrics.go line 147 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
can the single-level count be computed from the other counts?
Deleted. Wasn't really necessary.
options.go line 599 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: What do you think of stashing this on the
WriteAmpHeuristic/MultiLevelHeuristic?
Good nit. Fixed.
testdata/singledel_manual_compaction_set_with_del line 178 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I don't think any invariants are being violated here, but this change from an L2->L3 compaction into an L2+L3->L4 multi-level compaction causes this test to no longer exercise the intended test case (a DEL and a SINGLEDEL meeting in a compaction). I think we should add a define flag that disables multi-level compactions and use it in this test (at line 147) to keep the old picked compactions.
Sounds good. Added that define flag.
Force-pushed from 411bbf4 to 8418b20
msbutler
left a comment
Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @jbowens and @msbutler)
event.go line 59 at r1 (raw file):
Previously, msbutler (Michael Butler) wrote…
Ah good point. Before I go make those changes, do you have any suggestions for how I've modified the compaction log lines?
A default compaction now displays the compaction score of each input level and the single and ML overlapping ratio. As an example, here's how a single level compaction's logs line currently look in my PR:
[JOB 92] compacting(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99; OverlappingRatio: Single 8.03, Multi 25.05;
[JOB 92] compacted(default) L4 [250870] (869KB) Score=1.01 + L5 [250859] (6.8MB) Score=0.99 -> L5 [250871] (7.1MB), in 0.1s (0.1s total), output rate 111MB/s
And a multi level:
[JOB 11] compacting(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00; OverlappingRatio: Single 3.77, Multi 1.46;
[JOB 11] compacted(default) [multilevel] L2 [250858] (2.1MB) Score=1.09 + L3 [247985 247989 247848] (17MB) Score=0.99 + L4 [250817 250834 238396] (28MB) Score=1.00 -> L4 [250859 250860 250861 250862 250863] (46MB), in 0.2s (0.2s total), output rate 185MB/s
Perhaps it's also worth bikeshedding how we'd want to modify the window logger. I think the current impl looks like this: b5deee5
One nice quality of this format is that it cleanly partitions the kinds of compactions. While ML currently can only get triggered in a default compaction, we could enable it for move and read-triggered compactions in the future if we wanted to. I propose adding an extra(B) column between the in(B) and out(B) columns and keeping everything else the same.
msbutler
left a comment
Reviewable status: 0 of 19 files reviewed, 3 unresolved discussions (waiting on @jbowens and @msbutler)
event.go line 59 at r1 (raw file):
Previously, msbutler (Michael Butler) wrote…
Perhaps its also worth bikeshedding how we'd want to modify the window logger. I think the current impl looks like this: b5deee5
One nice quality of this format is that it cleanly partitions the kinds of compactions. While ML currently can only get triggered in a default compaction, we could enable it for move and read-triggered compactions in the future if we wanted to. I propose adding an extra(B) column between the in(B) and out(B) columns and keeping everything else the same.
^well actually, with the given window format, the in(B) column represents the sum of bytes over all input files and out(B) represents the size of the new file created from the compaction. Given this, it wouldn't make much sense to just log the bytes contributed from the extra level.
We could split in(B) into 3 columns (top level, extra level, output level), but that seems out of scope in this PR.
For this PR, is it alright if I just update the tests and ensure ML compaction log lines get parsed correctly?
Force-pushed from 8418b20 to be40805
jbowens
left a comment
Reviewable status: 0 of 20 files reviewed, 1 unresolved discussion (waiting on @msbutler)
event.go line 59 at r1 (raw file):
Previously, msbutler (Michael Butler) wrote…
^well actually, with the given window format, the in(B) column represents the sum of bytes over all input files and out(B) represents the size of the new file created from the compaction. Given this, it wouldn't make much sense to just log the bytes contributed from the extra level.
We could split in(B) into 3 columns (top level, extra level, output level), but that seems out of scope in this PR.
For this PR, is it alright if I just update the tests and ensure ML compaction log lines get parsed correctly?
Yeah, I think that's alright. Mind creating an issue for the follow up though?
Force-pushed from d54db2e to 12d6d95
Force-pushed from 12d6d95 to deeb6ed
@sumeerbhola @jbowens RFAL! There were 2 minor test failures because I didn't update the test case log lines correctly. I will update these after y'all have a read.
@@ -120,7 +120,7 @@ reset
I will plan to update the logs in this test file if y'all are cool with my log line updates.
Sorry for the delay here; hoping to review again tomorrow
jbowens
left a comment
Reviewed 5 of 17 files at r1, 5 of 13 files at r2, 9 of 12 files at r3.
Reviewable status: 19 of 25 files reviewed, 5 unresolved discussions (waiting on @msbutler and @sumeerbhola)
compaction_picker.go line 1407 at r3 (raw file):
func (p *compactionPickerByScore) updatePickerMetrics(
	env compactionEnv, pc pickedCompaction, candInfo [7]candidateLevelInfo,
pickedCompaction is a hefty struct. Can we take this by pointer?
And in that case, should we even bother returning the value out too? Call sites are modifying the picked compaction that they pass in.
compaction_picker.go line 1412 at r3 (raw file):
// candInfo is sorted by score, not by compaction level.
infoByLevel := [7]candidateLevelInfo{}
nit: s/7/numLevels/
compaction_picker.go line 1763 at r3 (raw file):
) *pickedCompaction {
	pcMulti := pcOrig.clone()
	if !pcMulti.setupMultiLevelCandidate(opts, diskAvailBytes) {
I'm mildly concerned about the amount of extra work involved in cloning and re-constructing inputs. Compaction-picking holds d.mu, which prevents WAL rotations, in-progress flushes and compactions from completing, etc. How about leaving a TODO to microbenchmark and possibly restructure to limit duplicate work here?
options.go line 1077 at r3 (raw file):
	o.Experimental.CPUWorkPermissionGranter = defaultCPUWorkGranter{}
}
if o.Experimental.MultiLevelCompactionHueristic == nil {
nit: I'm noticing now that heuristic is misspelled here.
Force-pushed from deeb6ed to 87be5a4
msbutler
left a comment
Reviewable status: 9 of 25 files reviewed, 4 unresolved discussions (waiting on @jbowens and @sumeerbhola)
compaction_picker.go line 1407 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
pickedCompaction is a hefty struct. Can we take this by pointer? And in that case, should we even bother returning the value out too? Call sites are modifying the picked compaction that they pass in.
Good idea. Done.
compaction_picker.go line 1412 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: s/7/numLevels/
Done.
compaction_picker.go line 1763 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
I'm mildly concerned about the amount of extra work involved in cloning and re-constructing inputs. Compaction-picking holds d.mu, which prevents WAL rotations, in-progress flushes and compactions from completing, etc. How about leaving a TODO to microbenchmark and possibly restructure to limit duplicate work here?
Added the todo
event.go line 59 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Yeah, I think that's alright. Mind creating an issue for the follow up though?
Done. #2814
options.go line 1077 at r3 (raw file):
Previously, jbowens (Jackson Owens) wrote…
nit: I'm noticing now that heuristic is misspelled here.
heh. fixed.
tool/logs/testdata/compactions line 120 at r3 (raw file):
Previously, msbutler (Michael Butler) wrote…
I will plan to update the logs in this test file if y'all are cool with my log line updates.
Just Updated.
Force-pushed from 87be5a4 to d5d073c
jbowens
left a comment
Reviewed 1 of 7 files at r5, all commit messages.
Reviewable status: 10 of 25 files reviewed, 1 unresolved discussion (waiting on @sumeerbhola)
Force-pushed from d5d073c to 8679e19
Rebased to address a few merge conflicts. Nothing complicated.
This patch teaches pebble to pick a compaction involving 3 levels based on a
new WriteAmp heuristic, so long as the input level is below L0. Recall that
pebble#1603 introduced the mechanics to conduct a multilevel compaction by
teaching the compaction picker to call setupInputs() on the original output
level, finding files to include in a third level. The new WriteAmp heuristic
will choose this 3 level compaction if its predicted WriteAmp is less than the
original compaction's predicted WriteAmp. Multilevel compactions will be
considered for default and manual compactions, but not read triggered
compactions.
Predicted WriteAmp equals:
sum(size of all input files) / sum(size of all files from higher levels)
According to several write intensive experiments, enabling the WriteAmp
heuristic has the potential to slightly reduce overall WriteAmplification,
increase throughput, and prevent inverted LSMs.
- See results from running pebble ycsb F with 64, 96, and 124 workers: #2179 (comment)
- See write up from running the replay tool with kv0: https://docs.google.com/document/d/1OmerofbtQfZUWQqLklqwmDJdVXI6qJzCLZQjsGirmFk/edit
Future work could tweak the heuristic or enable compactions with more than 3 levels.
Informs #136
Release note: none
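The heuristic described above can be sketched as follows. `predictedWriteAmp`, its input layout, and the byte figures are illustrative assumptions, not Pebble's actual implementation:

```go
package main

import "fmt"

// predictedWriteAmp sketches the commit message's formula: sum of all
// input-file sizes divided by the sum of sizes from the input levels above
// the output level. levelBytes lists input bytes per level, top to bottom,
// with the output level last. (Hypothetical helper, not Pebble's code.)
func predictedWriteAmp(levelBytes []uint64) float64 {
	var total, higher uint64
	for i, b := range levelBytes {
		total += b
		if i < len(levelBytes)-1 {
			higher += b
		}
	}
	return float64(total) / float64(higher)
}

func main() {
	// Single-level candidate: L4 (1MB) input, L5 (9MB) output.
	single := predictedWriteAmp([]uint64{1 << 20, 9 << 20})
	// Multilevel candidate: L4 (1MB) + L5 (9MB) input, L6 (20MB) output.
	multi := predictedWriteAmp([]uint64{1 << 20, 9 << 20, 20 << 20})
	// Pick the multilevel compaction iff its predicted write amp is lower.
	fmt.Printf("single=%.2f multi=%.2f pickMulti=%v\n", single, multi, multi < single)
}
```

With these illustrative sizes the multilevel candidate wins (3.00 vs 10.00), matching the intuition that folding in a third level pays off when the extra level's bytes would soon be rewritten anyway.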