
*: Enable flush splits in L0 #675

Merged
itsbilal merged 1 commit into cockroachdb:master from itsbilal:l0-sublevels-flush-splits
Jun 10, 2020

Conversation

@itsbilal
Contributor

@itsbilal itsbilal commented May 15, 2020

This change updates L0 flush logic to split SSTables when writing
to L0 along flushSplitKeys as calculated by L0SubLevels. This
should lead to fewer wide SSTables in practice. The required
relaxation of L0 invariants (to allow for SSTs that overlap in
sequence numbers) has happened in past changes, however some
additional invariants had to be relaxed as the metamorphic test
caught some cases around narrow flushes that would trigger
CheckOrdering.

A new option, FlushSplitBytes, sets the target number of bytes
to aim for in each flush split interval. It defaults to 0, which
disables flush splitting.
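The mechanism described above can be sketched as follows. This is a hypothetical simplification, not Pebble's code: `splitFlush` and its inputs only illustrate the idea of cutting flush outputs at precomputed split keys, with the limit exclusive in the left-hand sstable as this PR settles on.

```go
package main

import (
	"bytes"
	"fmt"
)

// splitFlush partitions the flushed keys into per-sstable groups, starting a
// new output whenever the next key reaches the current split key. In Pebble
// the split keys would come from L0SubLevels; here they are hand-picked.
func splitFlush(keys [][]byte, flushSplitKeys [][]byte) [][][]byte {
	var outputs [][][]byte
	var cur [][]byte
	i := 0 // index of the next unconsumed split key
	for _, k := range keys {
		for i < len(flushSplitKeys) && bytes.Compare(k, flushSplitKeys[i]) >= 0 {
			// Reached a split point: finish the current output. The limit is
			// exclusive in the left-hand sstable.
			if len(cur) > 0 {
				outputs = append(outputs, cur)
				cur = nil
			}
			i++
		}
		cur = append(cur, k)
	}
	if len(cur) > 0 {
		outputs = append(outputs, cur)
	}
	return outputs
}

func main() {
	keys := [][]byte{[]byte("a"), []byte("b"), []byte("d"), []byte("e"), []byte("g")}
	splits := [][]byte{[]byte("c"), []byte("f")}
	for _, out := range splitFlush(keys, splits) {
		fmt.Printf("%s\n", out)
	}
}
```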

@itsbilal itsbilal requested review from jbowens and petermattis May 15, 2020 22:00
@itsbilal itsbilal self-assigned this May 15, 2020
@petermattis
Collaborator

This change is Reviewable

@itsbilal
Contributor Author

This ended up being larger in scope than I imagined, and at this point I'm convinced we need to update findGrandparentLimits / findL0Limits to not split user keys across multiple SSTs to begin with. That, or L0SubLevels needs to learn about atomic compaction units.

The metamorphic test is failing as a result of that, and that's why this PR is a WIP. However that aside, most of the compaction*.go changes should be good to go.

@itsbilal
Contributor Author

Also: we can no longer distinguish between single-seqnum flushes and ingestions at all, so CheckOrdering had to be relaxed even more than I did earlier. This change might need a backport to 20.1.

Collaborator

@sumeerbhola sumeerbhola left a comment


This ended up being larger in scope than I imagined, and at this point I'm convinced we need to update findGrandparentLimits / findL0Limits to not split user keys across multiple SSTs to begin with. That, or L0SubLevels needs to learn about atomic compaction units.

Yes, we can't split user keys across multiple SSTs. My changes in https://reviewable.io/reviews/cockroachdb/pebble/563/compaction.go#- were also ensuring that.

But that is insufficient since we don't control the sub-level of the output (unlike normal compactions where all the output files are going to the same level) -- it will be determined organically based on just looking at file boundaries.
This was what tripped this code up in the metamorphic test, as we were discussing on slack (so just writing this for the other reviewers). For normal compactions we do not truncate the end of a range tombstone since the same user key can span multiple output ssts. Here we do need to truncate the end of a range tombstone so it behaves in a clean manner. We can do this clean truncation because the user keys are not spanning multiple ssts. The additional code in the prototype for this was Fragmenter.FlushToNoStraddling https://reviewable.io/reviews/cockroachdb/pebble/563/internal/rangedel/fragmenter.go#-
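The clean truncation described here can be sketched with a simplified model (hypothetical types and function; the real Fragmenter tracks seqnums and multiple pending fragments). Because user keys never span output sstables, the pending tombstone can be cut exactly at the split key:

```go
package main

import (
	"bytes"
	"fmt"
)

// tombstone is a simplified range deletion covering [Start, End).
type tombstone struct {
	Start, End []byte
}

// truncateAndFlushTo mimics the idea behind the prototype's
// Fragmenter.FlushToNoStraddling: the portion of the pending tombstone before
// the flush key is emitted to the current sstable, and the remainder stays
// pending for the next output.
func truncateAndFlushTo(pending tombstone, key []byte) (flushed, remaining *tombstone) {
	if bytes.Compare(pending.End, key) <= 0 {
		// Tombstone ends at or before the split point: flush it whole.
		return &pending, nil
	}
	if bytes.Compare(pending.Start, key) >= 0 {
		// Tombstone starts at or after the split point: nothing to flush yet.
		return nil, &pending
	}
	left := tombstone{Start: pending.Start, End: key}
	right := tombstone{Start: key, End: pending.End}
	return &left, &right
}

func main() {
	f, r := truncateAndFlushTo(tombstone{[]byte("a"), []byte("z")}, []byte("m"))
	fmt.Printf("flushed=[%s,%s) remaining=[%s,%s)\n", f.Start, f.End, r.Start, r.End)
}
```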

Reviewable status: 0 of 25 files reviewed, all discussions resolved (waiting on @jbowens and @petermattis)

@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch from b35da51 to edf60a8 Compare May 19, 2020 22:49
@itsbilal
Contributor Author

Thanks for the explanation @sumeerbhola! That makes sense. I added end-key rangedel truncation just like your prototype, and also re-enabled grandparent-based limits like I mentioned.

This is ready for a review now, save for additional tests around FlushToNoStraddling which I'll add soon.

@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch 4 times, most recently from 67fedb4 to 5d12e18 Compare May 20, 2020 16:26
@itsbilal
Contributor Author

This is ready for a review now; I've added tests. Thanks!

The first commit is from #670, so review the commits after it.

@itsbilal itsbilal changed the title from [WIP] *: Enable flush splits in L0 to *: Enable flush splits in L0 May 20, 2020
@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch 2 times, most recently from 19aa3cb to d143927 Compare May 21, 2020 14:42
@itsbilal
Contributor Author

Now that #670 is merged, the first commit is no longer from that PR.

@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch from d143927 to 2451327 Compare May 21, 2020 16:34
Collaborator

@petermattis petermattis left a comment


Reviewed 4 of 4 files at r1, 2 of 2 files at r2.
Reviewable status: 5 of 27 files reviewed, 8 unresolved discussions (waiting on @itsbilal, @jbowens, and @petermattis)


compaction.go, line 139 at r3 (raw file):

	// Boundaries at which flushes to L0 should be split. Determined by
	// L0SubLevels. If nil, flushes aren't split.
	l0Limits [][]byte

Can this be unified with grandparents in some way? It is messy to have both grandparents and l0Limits present in the compaction struct. Feels like a compaction should be parameterized on a "splitter".


compaction.go, line 520 at r3 (raw file):

}

// findL0Limit takes the start key for a table and return the user key to which

s/return/returns/g
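As a rough illustration of what such a lookup involves (hypothetical code, not Pebble's findL0Limit): given the sorted slice of flush split keys, return the first one strictly greater than the table's start key, or nil if none remains.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// findL0Limit sketches the core lookup: l0Limits is sorted, and the returned
// key is the user key at which the output starting at `start` should be
// split. A nil result means the output is unbounded on the right.
func findL0Limit(l0Limits [][]byte, start []byte) []byte {
	i := sort.Search(len(l0Limits), func(i int) bool {
		return bytes.Compare(l0Limits[i], start) > 0
	})
	if i == len(l0Limits) {
		return nil // no further split point
	}
	return l0Limits[i]
}

func main() {
	limits := [][]byte{[]byte("c"), []byte("f"), []byte("k")}
	fmt.Printf("%s\n", findL0Limit(limits, []byte("d"))) // first limit after "d" is "f"
}
```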


compaction.go, line 1536 at r3 (raw file):

			// nil. In either case, the inner loop will execute at least once to
			// process `key`, and the input iterator will be advanced.
			if splittingFlush {

This adds weight to my earlier comment. Rather than splittingFlush, I think we need to identify a splitter interface which is initialized when the compaction is constructed.


level_checker.go, line 370 at r2 (raw file):

	current := c.readState.current
	for i := len(current.Files[0]) - 1; i >= 0; i-- {

Can you expand on why this is necessary? I don't understand why this change was needed for correctness. My mental model is that it should be fine, though inefficient, to ignore the sublevels in L0. If we require knowledge of the sublevels here and on read paths, are we violating backwards compatibility with RocksDB?


options.go, line 309 at r3 (raw file):

	// FlushSplitBytes denotes the number of bytes to aim to have in each
	// flush split interval.
	FlushSplitBytes int64

Is there any guidance on what to set this value to? Should it be based on the target file size of L1?


internal/manifest/version.go, line 353 at r1 (raw file):

// InitL0Sublevels initializes the L0SubLevels
func (v *Version) InitL0Sublevels(cmp Compare, formatKey base.FormatKey, flushSplitBytes int64) error {

This flushSplitBytes stuff looks like it is in the wrong commit.


internal/manifest/version.go, line 571 at r1 (raw file):

		//   or flushed files. We cannot tell the difference between them based on FileMetadata,
		//   so our consistency checking here uses the weaker checks assuming it is a narrow
		//   flushed file.

Can you expand this comment to indicate why we can't perform the ingested sstable checks?


internal/rangedel/fragmenter.go, line 352 at r3 (raw file):

//            g----k#10
//
func (f *Fragmenter) FlushToNoStraddling(key []byte) {

I'm not fond of the NoStraddling terminology here. Given that the bulk of the work is done by truncateAndFlush, can this simply be the exported version, named TruncateAndFlush?

@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch 3 times, most recently from ddcdf77 to a07d68e Compare May 26, 2020 21:32
Contributor Author

@itsbilal itsbilal left a comment


Reviewable status: 2 of 29 files reviewed, 8 unresolved discussions (waiting on @jbowens and @petermattis)


compaction.go, line 139 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can this be unified with grandparents in some way? It is messy to have both grandparents and l0Limits present in the compaction struct. Feels like a compaction should be parameterized on a "splitter".

I think it's worthwhile to consider accounting for grandparents when calculating L0 limits to begin with - but the additional complexity of handling both types of splits is not significant enough to warrant a change.


compaction.go, line 520 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/return/returns/g

Done.


compaction.go, line 1536 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This adds weight to my earlier comment. Rather than splittingFlush, I think we need to identify a splitter interface which is initialized when the compaction is constructed.

See my latest change; we actually need to special-case splittingFlush even more than usual, due to subtle differences when always forcing clean user key splits across sstables, and how that affects this code.


level_checker.go, line 370 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can you expand on why this is necessary? I don't understand why this change was needed for correctness. My mental model is that it should be fine, though inefficient, to ignore the sublevels in L0. If we require knowledge of the sublevels here and on read paths, are we violating backwards compatibility with RocksDB?

It makes the level checker useful to detect any invariant violations in a world with sublevels and flush splits. It's not outright necessary, but the newer error messages are more accurate in pinpointing any issues with sublevel construction.

There's no backward compatibility issue as the old style flush / L0 structure is logically equivalent to having one sublevel for each flush, ordered by sequence numbers. The error messages will differ under the new invariant checker, but it won't catch any issues in a correct legacy L0 setup.


options.go, line 309 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is there any guidance on what to set this value to? Should it be based on the target file size of L1?

I was wondering if it could just be the target file size for L0 (or half the one for L1)? It's fairly large right now (10 MB) so that it only kicks in when L0 is under pressure.


internal/manifest/version.go, line 353 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This flushSplitBytes stuff looks like it is in the wrong commit.

Done.


internal/manifest/version.go, line 571 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Can you expand this comment to indicate why we can't perform the ingested sstable checks?

Done. Basically, the ingested file checks assumed an ingested file would have a global seqnum that isn't coincident with any of the other (flushed) files. Now that it's possible for an sstable to have just one truncated tombstone (and thus a seqnum that looks like a global one) that also shares its sequence number with keys in other files, we can no longer perform this check.


internal/rangedel/fragmenter.go, line 352 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm not fond of the NoStraddling terminology here. Given that the bulk of the work is done by truncateAndFlush, can this simply be the exported version, named TruncateAndFlush?

Done. Went with TruncateAndFlushTo to have the same suffix as FlushTo, which is a sister method.

@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch from a07d68e to ee13fa3 Compare May 26, 2020 22:23
Contributor Author

@itsbilal itsbilal left a comment


This should be ready for a review again. Had to fix bugs in cases involving L0 limits falling right at user keys that have range tombstones and point keys (which were originally ending up in different sstables if the target file size was just crossed). Also made the limit exclusive in the left-hand SST as opposed to inclusive. I'm happy to expand on the code comments if they aren't clear in explaining the rationale behind it all.

Reviewable status: 2 of 29 files reviewed, 8 unresolved discussions (waiting on @jbowens and @petermattis)

Collaborator

@petermattis petermattis left a comment


Reviewable status: 2 of 29 files reviewed, 10 unresolved discussions (waiting on @itsbilal, @jbowens, and @petermattis)


compaction.go, line 139 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I think it's worthwhile to consider accounting for grandparents when calculating L0 limits to begin with - but the additional complexity of handling both types of splits is not significant enough to warrant a change.

I didn't mean to suggest that we handle both grandparent and L0 splits. My suggestion was about unifying the two separate split mechanisms. Right now, a compaction either uses grandparents or l0Limits, but not both. We have the concrete implementations of both approaches realized in these two fields. Can we instead have some sort of splitter interface:

```go
type splitter interface {
  shouldSplit(key []byte) bool
}
```

And replace the grandparents and l0Limits fields with a single splitter field? (I doubt the splitter interface I just described is sufficient.)


compaction.go, line 1536 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

See my latest change; we actually need to special-case splittingFlush even more than usual, due to subtle differences when always forcing clean user key splits across sstables, and how that affects this code.

The special-cases don't negate the idea of a splitter interface. Now that you're combining grandparent splitting and L0 splitting, a composable splitter interface makes even more sense.
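A minimal sketch of what such a composable splitter could look like (hypothetical code; only the one-method `splitter` interface comes from this discussion, and the actual abstraction was left for later PRs):

```go
package main

import (
	"bytes"
	"fmt"
)

// splitter decides whether the current output should end before `key`.
type splitter interface {
	shouldSplit(key []byte) bool
}

// limitSplitter splits once the key reaches a fixed user-key limit, as the
// grandparent and L0 flush-split logic both ultimately do.
type limitSplitter struct{ limit []byte }

func (s *limitSplitter) shouldSplit(key []byte) bool {
	return s.limit != nil && bytes.Compare(key, s.limit) >= 0
}

// sizeSplitter splits once the output has reached a target file size.
type sizeSplitter struct{ size, target uint64 }

func (s *sizeSplitter) shouldSplit(key []byte) bool {
	return s.size >= s.target
}

// anySplitter composes splitters: split as soon as any member says so.
type anySplitter struct{ splitters []splitter }

func (s *anySplitter) shouldSplit(key []byte) bool {
	for _, sp := range s.splitters {
		if sp.shouldSplit(key) {
			return true
		}
	}
	return false
}

func main() {
	s := &anySplitter{splitters: []splitter{
		&limitSplitter{limit: []byte("m")},
		&sizeSplitter{size: 0, target: 10 << 20},
	}}
	fmt.Println(s.shouldSplit([]byte("c")), s.shouldSplit([]byte("n")))
}
```

The compaction struct would then hold a single `splitter` field, with the flush and non-flush cases choosing different compositions at construction time.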


compaction.go, line 525 at r6 (raw file):

// this function passes through to findGrandparentLimit.
func (c *compaction) findL0Limit(start []byte) []byte {
	if c.startLevel > -1 || c.outputLevel != 0 || len(c.l0Limits) == 0 {

I think you only call findL0Limit if c.startLevel < 0 && c.outputLevel == 0.


compaction.go, line 526 at r6 (raw file):

func (c *compaction) findL0Limit(start []byte) []byte {
	if c.startLevel > -1 || c.outputLevel != 0 || len(c.l0Limits) == 0 {
		return c.findGrandparentLimit(start)

Before this change, compaction.grandparents would not be populated for flushes. I think it is now, which means we'll split flushes on grandparent boundaries now. Is that change intentional?


compaction.go, line 1553 at r6 (raw file):

				// the next range tombstone to write does not start until after
				// the L0 split point.
				startKey := c.rangeDelFrag.Start()

Is this the same as ve.NewFiles[n-1].Meta.Largest.UserKey? I'm trying to understand why the flush splitting has different handling of range tombstones that the compaction splitting.


compaction.go, line 1658 at r6 (raw file):

					break
				}
				// We do not split user keys when splitting flushes. In that

We should explain why we don't split on user keys during flushes. Splitting user keys during compactions is actually not useful and we only do so because RocksDB does so, though the resulting sstables are considered an "atomic compaction unit" negating the benefits of the split. See this TODO about findGrandparentLimit:

// TODO(peter): Stopping compaction output in the middle of a user-key creates
// 2 sstables that need to be compacted together as an "atomic compaction
// unit". This is unfortunate as it removes the benefit of stopping output to
// an sstable in order to prevent a large compaction with the next level. Seems
// better to adjust findGrandparentLimit to not stop output in the middle of a
// user-key. Perhaps this isn't a problem if the compaction picking heuristics
// always pick the right (older) sibling for compaction first.

level_checker.go, line 370 at r2 (raw file):
Yes, I can see why this would make the level checker more useful, but the commit message for this change says:

Running tests with flush splitting enabled tripped up some
invariant checks in level_checker, because level_checker wasn't
exactly sublevel-aware. Update level_checker to build its
MergingIter just like how the rest of pebble does.

That makes me think the change is necessary.


options.go, line 309 at r3 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I was wondering if it could just be the target file size for L0 (or half the one for L1)? It's a fairly large one right now (10mb) to only kick in when L0 is under pressure.

Let's add a TODO to figure this out. I'm also not sure if this is an option that needs to be exposed long-term, so let's move it into the Experimental struct for now.


internal/manifest/version.go, line 571 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done. Basically, the ingested file checks assumed an ingested file would have a global seqnum that isn't coincident with any of the other (flushed) files.Now that it's possible for an sstable to just have one truncated tombstone (and so a seqnum that appears to be a global-like one) that also shares the sequence number with keys in other files, we can no longer do this check anymore.

LG, though the commas in this sentence are unnecessary.


internal/rangedel/testdata/fragmenter_truncate_and_flush_to, line 22 at r6 (raw file):

----

# Call out of orderout of order

Something looks garbled here.

@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch 2 times, most recently from 9666eed to 5ddca2f Compare May 27, 2020 16:54
Contributor Author

@itsbilal itsbilal left a comment


I've addressed all your comments except for the splitter one - going to do that next while I test/benchmark this in parallel. The rest is ready for review. PTAL and thanks for the review!

Reviewable status: 2 of 29 files reviewed, 10 unresolved discussions (waiting on @itsbilal, @jbowens, and @petermattis)


compaction.go, line 139 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I didn't mean to suggest that we handle both grandparent and L0 splits. My suggestion was about unifying the two separate split mechanisms. Right now, a compaction either uses grandparents or l0Limits, but not both. We have the concrete implementations of both approaches realized in these two fields. Can we instead have some sort of splitter interface:

```go
type splitter interface {
  shouldSplit(key []byte) bool
}
```

And replace the grandparents and l0Limits fields with a single splitter field? (I doubt the splitter interface I just described is sufficient.)

Oh I was saying that we already handle both types of splits (see how findL0Limits falls through to the grandparent method in its implementation).


compaction.go, line 525 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think you only call findL0Limit if c.startLevel < 0 && c.outputLevel == 0.

Yeah. It's more of a guard in case we didn't.


compaction.go, line 526 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Before this change, compaction.grandparents would not be populated for flushes. I think it is now, which means we'll split flushes on grandparent boundaries now. Is that change intentional?

Yes it's intentional. We want to avoid any excessive overlaps with the grandparent level. There was a TODO in newFlush about this.


compaction.go, line 1553 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is this the same as ve.NewFiles[n-1].Meta.Largest.UserKey? I'm trying to understand why the flush splitting has different handling of range tombstones that the compaction splitting.

It's not, in cases where there's a gap between the last written user key and the start of the next range tombstone. The comment above gives the motivating example of how we can get into an infinite loop in that case (finding the limit after ve.NewFiles[n-1].Meta.Largest.UserKey would always result in the limit being b). I'm surprised this never happens with the grandparent logic for non-flushes, or maybe it's just less reproducible there. But it is definitely an issue that pops up with flush splits.


compaction.go, line 1658 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We should explain why we don't split on user keys during flushes. Splitting user keys during compactions is actually not useful and we only do so because RocksDB does so, though the resulting sstables are considered an "atomic compaction unit" negating the benefits of the split. See this TODO about findGrandparentLimit:

// TODO(peter): Stopping compaction output in the middle of a user-key creates
// 2 sstables that need to be compacted together as an "atomic compaction
// unit". This is unfortunate as it removes the benefit of stopping output to
// an sstable in order to prevent a large compaction with the next level. Seems
// better to adjust findGrandparentLimit to not stop output in the middle of a
// user-key. Perhaps this isn't a problem if the compaction picking heuristics
// always pick the right (older) sibling for compaction first.

Done.


level_checker.go, line 370 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Yes, I can see why this would make the level checker more useful, but the commit message for this change says:

Running tests with flush splitting enabled tripped up some
invariant checks in level_checker, because level_checker wasn't
exactly sublevel-aware. Update level_checker to build its
MergingIter just like how the rest of pebble does.

That makes me think the change is necessary.

Ah right, my bad. I wrote the commit message before I thought this through. Fixed.


options.go, line 309 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Let's add a TODO to figure this out. I'm also not sure if this is an option that needs to be exposed long-term, so let's move it into the Experimental struct for now.

Done.


internal/rangedel/testdata/fragmenter_truncate_and_flush_to, line 22 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Something looks garbled here.

Done.

Collaborator

@petermattis petermattis left a comment


I think the first two commits and the fragmenter changes should be broken out into separate PRs. Those changes are good to merge, and I'd like to get them out of this PR to reduce the cognitive load. The flush splitting changes still need more work; DB.runCompaction was already too complex and burdened with subtle conditionals before this PR.

Reviewable status: 2 of 29 files reviewed, 16 unresolved discussions (waiting on @itsbilal, @jbowens, and @petermattis)


compaction.go, line 526 at r6 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Yes it's intentional. We want to avoid any excessive overlaps with the grandparent level. There was a TODO in newFlush about this.

Ok. I was aware of the TODO, but now FlushSplitBytes == 0 doesn't seem to disable flush splits as it says in the doc comment.


compaction.go, line 1539 at r9 (raw file):

				limit = c.findL0Limit(key.UserKey)
			} else {
				// Use the start key of the first pending tombstone to find the

Similar to my comment in fragmenter.go, all pending tombstones have the same start key.


compaction.go, line 1585 at r9 (raw file):

		// Each inner loop iteration processes one key from the input iterator.
		passedGrandparentLimit := false

This variable should be renamed to passedLimit as it applies to both the grandparent limit and the L0 limit now.


compaction.go, line 1622 at r9 (raw file):

						keyCopy := make([]byte, len(key.UserKey))
						copy(keyCopy, key.UserKey)
						c.rangeDelFrag.TruncateAndFlushTo(keyCopy)

This is equivalent to c.rangeDelFrag.TruncateAndFlushTo(key.Clone().UserKey).


compaction.go, line 1666 at r9 (raw file):

				// are an unhelpful side-effects of regular compactions; and the
				// only reason it's done for regular compactions is to maintain
				// the same behaviour as RocksDB.

Given that we're diverging from RocksDB with L0 sublevels, I think it is worth considering diverging from RocksDB with regards to atomic compaction units. Is there a strong reason for compactions to allow splitting within a user key? We certainly need to maintain support for the concept of atomic compaction units when generating Ln compactions, and perhaps need an internal option for allowing construction of such sstables for testing purposes.


compaction.go, line 1672 at r9 (raw file):

				// and switch outputs when it iterates past this user key.
				limit = make([]byte, len(key.UserKey))
				copy(limit, key.UserKey)

limit = key.Clone().UserKey.

Shouldn't we set passedGrandparentLimit = false here? Otherwise we won't be checking limit on each loop iteration. I might just be getting confused by all the conditionals in this code.

Rather than making a copy of the key, could we add an sstable.Writer.LastUserKey() accessor? See https://github.com/cockroachdb/pebble/pull/536/files. Not sure if that is actually better or not. We'd need additional booleans here to indicate that the limit is now whenever the next key's user-key differs from the last key's user-key.


internal/rangedel/fragmenter.go, line 352 at r9 (raw file):

//            g----k#10
//
// WARNING: The fragmenter could hold on to the specified end key. Ensure it's

I think the comment above Fragmenter.Add is slightly more accurate as not only must the slice outlive the sstable, it must also not be modified.


internal/rangedel/fragmenter.go, line 379 at r9 (raw file):

			return
		}
		f.truncateAndFlush(key)

I think this is the source of the unexpectedly succeeding test cases. I think we should call f.truncateAndFlush(key) unconditionally. This isn't necessary in Add because we're always adding a pending key.


internal/rangedel/fragmenter.go, line 383 at r9 (raw file):

}

// Start returns the start key of the first tombstone in the pending buffer,

This is the start key for all pending tombstones as all pending tombstones share the same start key. I think this should be clarified in the comment.


internal/rangedel/testdata/fragmenter_truncate_and_flush_to, line 13 at r9 (raw file):

build
truncate-and-flush-to c
1:  b--d

I'm surprised this isn't an error. We've flushed to c. We shouldn't be able to output a range tombstone that starts before c.


internal/rangedel/testdata/fragmenter_truncate_and_flush_to, line 19 at r9 (raw file):

build
truncate-and-flush-to c
truncate-and-flush-to b

I'm surprised this isn't an error. We've flushed to c. We shouldn't subsequently be able to flush to a key before c.

@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch from 5ddca2f to 462c466 Compare May 28, 2020 22:56
Contributor Author

@itsbilal itsbilal left a comment


Thanks for the review! I'll make a separate PR for the first two commits. Spent a solid chunk of today trying out a simplification of the outer/inner compaction loop with splitter interfaces, and learned some new cases where we should never be splitting (such as in a sequence of range deletion tombstone keys, which the current code handles in a subtle-but-correct way).

I'm still very much confident in the way the current code is written, and I've sprinkled some more Experimental.FlushSplitBytes > 0 checks to completely disable all new logic when that value is 0. But I do plan to restructure the current spaghetti-conditional; it's just a more nuanced process than I was hoping.

Reviewable status: 2 of 29 files reviewed, 16 unresolved discussions (waiting on @itsbilal, @jbowens, and @petermattis)


compaction.go, line 526 at r6 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ok. I was aware of the TODO, but now FlushSplitBytes == 0 doesn't seem to disable flush splits as it says in the doc comment.

Good point. I've added FlushSplitBytes > 0 to splittingFlush to help.


compaction.go, line 1539 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Similar to my comment in fragmenter.go, all pending tombstones have the same start key.

Done.


compaction.go, line 1585 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This variable should be renamed to passedLimit as it applies to both the grandparent limit and the L0 limit now.

It doesn't, since L0 cases will just hit the inner conditional and always break out of the loop.


compaction.go, line 1622 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is equivalent to c.rangeDelFrag.TruncateAndFlushTo(key.Clone().UserKey).

Done.


compaction.go, line 1666 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Given that we're diverging from RocksDB with L0 sublevels, I think it is worth considering diverging from RocksDB with regards to atomic compaction units. Is there a strong reason for compactions to allow splitting within a user key? We certainly need to maintain support for the concept of atomic compaction units when generating Ln compactions, and perhaps need an internal option for allowing construction of such sstables for testing purposes.

I can see where the need is coming from. I do like the idea of always splitting at user key boundaries only, and having all split cases (including file sizes) just throw up a flag to split on the next user key change. I'll prototype this idea this friday.


compaction.go, line 1672 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

limit = key.Clone().UserKey.

Shouldn't we set passedGrandparentLimit = false here? Otherwise we won't be checking limit on each loop iteration. I might just be getting confused by all the conditionals in this code.

Rather than making a copy of the key, could we add an sstable.Writer.LastUserKey() accessor? See https://github.com/cockroachdb/pebble/pull/536/files. Not sure if that is actually better or not. We'd need additional booleans here to indicate that the limit is now whenever the next key's user-key differs from the last key's user-key.

Yeah, we'd need that additional boolean if we had that accessor. I'm not a fan of the double-use of limit here at all (sometimes it is used for comparisons like here, other times it's set for being passed onto finishOutput as the "next key", which isn't the same thing always), and for a large part of today I tried to simplify its uses (unsuccessfully).

We don't need to set passedGrandparentLimit = false because splittingFlush = true implies that it will never be true.


internal/rangedel/fragmenter.go, line 352 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think the comment above Fragmenter.Add is slightly more accurate as not only must the slice outlive the sstable, it must also not be modified.

Done.


internal/rangedel/fragmenter.go, line 379 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think this is the source of the unexpectedly succeeding test cases. I think we should call f.truncateAndFlush(key) unconditionally. This isn't necessary in Add because we're always adding a pending key.

Done.


internal/rangedel/fragmenter.go, line 383 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is the start key for all pending tombstones as all pending tombstones share the same start key. I think this should be clarified in the comment.

Done.


internal/rangedel/testdata/fragmenter_truncate_and_flush_to, line 13 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm surprised this isn't an error. We've flushed to c. We shouldn't be able to output a range tombstone that starts before c.

Right. Fixing.


internal/rangedel/testdata/fragmenter_truncate_and_flush_to, line 19 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm surprised this isn't an error. We've flushed to c. We shouldn't be able to subsequently flush to a key before c.

Done.

@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch 2 times, most recently from 455ddc6 to 0b9ca2c Compare May 29, 2020 22:32
Collaborator

@petermattis petermattis left a comment


Nearly there! Per offline discussion, I'm good with this going in as-is but following up with additional PRs that adjust the compaction structure and abstractions to reduce the conditions in runCompaction.

Reviewable status: 0 of 24 files reviewed, 6 unresolved discussions (waiting on @itsbilal, @jbowens, and @petermattis)


compaction.go, line 1666 at r9 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I can see where the need is coming from. I do like the idea of always splitting at user key boundaries only, and having all split cases (including file sizes) just throw up a flag to split on the next user key change. I'll prototype this idea this friday.

Let's file an issue about this. I keep forgetting about it.


compaction.go, line 1672 at r9 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Yeah, we'd need that additional boolean if we had that accessor. I'm not a fan of the double-use of limit here at all (sometimes it is used for comparisons like here, other times it's set for being passed onto finishOutput as the "next key", which isn't the same thing always), and for a large part of today I tried to simplify its uses (unsuccessfully).

We don't need to set passedGrandparentLimit = false because splittingFlush = true implies that it will never be true.

These two lines can be replaced with limit = key.Clone().UserKey.


compaction.go, line 533 at r10 (raw file):

		return c.cmp(c.l0Limits[i], start) > 0
	})
	grandparentLimit := c.findGrandparentLimit(start)

Seeing two calls to findGrandparentLimit in this function is a bit confusing.


compaction_iter.go, line 583 at r10 (raw file):

		i.rangeDelFrag.Finish()
	} else {
		if exclude {

Nit: This could be else if. Even better, I prefer using a switch statement when there are several conditionals:

switch {
case key == nil:
case exclude:
default:
}

I think this method deserves a comment. To be fair, it deserved a comment before this change.


options.go, line 293 at r10 (raw file):

	Experimental struct {
		// FlushSplitBytes denotes the number of bytes to aim to have in each
		// flush split interval. If set to 0, flushes are not split.

I think it is worth rephrasing this to indicate that it is sstables generated by a flush that are split:

"FlushSplitBytes denotes the target size for sstables generated during a flush. If set to 0, only a single sstable will be generated by a flush. Splitting sstables during flush allows increased compaction flexibility and concurrency when those tables are compacted to lower levels."


internal/manifest/version.go, line 571 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

LG, though the commas in this sentence are unnecessary.

Ping.

This change updates L0 flush logic to split SSTables when writing
to L0 along flushSplitKeys as calculated by L0SubLevels. This
should lead to fewer wide SSTables in practice. The required
relaxation of L0 invariants (to allow for SSTs that overlap in
sequence numbers) has happened in past changes. Range tombstones
are truncated at these split keys, using FlushToNoStraddling.

A new option, FlushSplitBytes, sets the target number of bytes
to aim for in each flush split interval. Defaults to 0, or flush
splitting disabled.
@itsbilal itsbilal force-pushed the l0-sublevels-flush-splits branch from 0b9ca2c to 2c1e0f0 Compare June 10, 2020 16:59
Contributor Author

@itsbilal itsbilal left a comment


TFTRs!

Reviewable status: 0 of 24 files reviewed, 5 unresolved discussions (waiting on @jbowens and @petermattis)


compaction.go, line 1666 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Let's file an issue about this. I keep forgetting about it.

Filing.


compaction.go, line 1672 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

These two lines can be replaced with limit = key.Clone().UserKey.

Done.


compaction.go, line 533 at r10 (raw file):

Previously, petermattis (Peter Mattis) wrote…

See two calls to findGrandparentLimit in this function is a bit confusing.

Done.


compaction_iter.go, line 583 at r10 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: This could be else if. Even better, I prefer using a switch statement when there are several conditionals:

switch {
case key == nil:
case exclude:
default:
}

I think this method deserves a comment. To be fair, it deserved a comment before this change.

Done.


options.go, line 293 at r10 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think it is worth rephrasing this to indicate that it is sstables generated by a flush that are split:

"FlushSplitBytes denotes the target size for sstables generated during a flush. If set to 0, only a single sstable will be generated by a flush. Splitting sstables during flush allows increased compaction flexibility and concurrency when those tables are compacted to lower levels."

"target size for sstables generated during a flush" isn't exactly accurate, as we also account for other sstables from past flushes / intra-L0 compactions that fall in that interval. The flush target size is set in the LevelOptions, and matters when this value is >= 0. I did include some of the points from your example comment.


internal/manifest/version.go, line 571 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Ping.

Done.

Contributor Author

@itsbilal itsbilal left a comment


Reviewable status: 0 of 24 files reviewed, 5 unresolved discussions (waiting on @jbowens and @petermattis)


compaction.go, line 1666 at r9 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Filing.

Filed #734

Collaborator

@petermattis petermattis left a comment


:shipit:

Reviewable status: 0 of 24 files reviewed, all discussions resolved (waiting on @jbowens and @petermattis)

@itsbilal
Contributor Author

TFTR!

@itsbilal itsbilal merged commit fc886ec into cockroachdb:master Jun 10, 2020