*: write, compact, and read range keys from sstables by itsbilal · Pull Request #1738 · cockroachdb/pebble

itsbilal · 2022-06-07T14:22:30Z

Unfortunately a larger PR that touches a lot of files, but most of
the changes should be pretty straightforward. First commit
enables compactions/flushes of range keys, second
commit enables reading of range keys from sstables.

Commit messages:

This change implements compaction of sstables containing
range keys, as well as allows memtables with non-empty
rangeKeySkl skiplists to write correctly to sstable range
key blocks in flushes. Range key elision and snapshot
striping is implement in a keyspan.Transform passed to
keyspan.MergingIter inside the compaction iter. Otherwise,
behaviour is kept as close to that of range deletions
as possible, to improve confidence in correctness.

Fixes #1686.

The previous commit, which enabled flushing and compactions
of range keys, now makes the in-memory range key arena obsolete
for reading range keys. This change sets up iterators to read
range keys from sstable range key blocks, and makes many
(far too many) minor fixes in other range key code to
make that happen correctly.

cockroach-teamcity · 2022-06-07T14:22:39Z

This change is

jbowens

flushing some comments on the first commit..

this is looking 🔥

Reviewed 13 of 13 files at r1.
Reviewable status: 8 of 27 files reviewed, 9 unresolved discussions (waiting on @itsbilal and @jbowens)

compaction.go line 325 at r1 (raw file):

		dst.Start = s.Start
		dst.End = s.End
		dst.Keys = make([]keyspan.Key, 0, len(s.Keys))

dst.Keys is owned by MergingIter, and we're free to append directly to the existing slice so that it's reused between transformed spans. This requires the compaction loop to perform a deep clone before adding to the fragmenter, but it matches the current lifetime contract.

compaction.go line 326 at r1 (raw file):

		dst.End = s.End
		dst.Keys = make([]keyspan.Key, 0, len(s.Keys))
		partitions := make([][]keyspan.Key, 0, len(snapshots)+1)

can we create this slice once, in the body of compactionTransform before the closure, and reuse it?

this can be left for a TODO, but if we iterate the partitions in the opposite direction, I think we can avoid materializing the partitions and this slice goes away altogether.

compaction.go line 330 at r1 (raw file):

		for i < len(snapshots) {
			jEnd := j + 1
			for j >= 0 && s.Keys[j].SeqNum() <= snapshots[i] {

I believe keys should only be visible at strictly lower sequence numbers. eg, base.IsVisible:

// Visible returns true if a key with the provided sequence number is visible at
// the specified snapshot sequence number.
func Visible(seqNum uint64, snapshot uint64) bool {
	return seqNum < snapshot || (seqNum&InternalKeySeqNumBatch) != 0
}

compaction.go line 692 at r1 (raw file):

	}
	updateRangeKeyBounds := func(iter keyspan.FragmentIterator) {
		if s := iter.First(); s.Valid() {

mind adding a comment here that SmallestKey and LargestKey require !s.Empty(), and s.Valid() doesn't generally imply !s.Empty(), but in this case we're guaranteed because this function is used with the memtable's fragment iterator which never surfaces empty spans.

compaction.go line 957 at r1 (raw file):

// pairs at c.outputLevel.level+1 or higher that possibly overlap the specified
// span.
func (c *compaction) elideSpan(start, end []byte) bool {

given that we know we'll want to split out the inuse key ranges, mind keeping elideRangeTombstone and adding a separate elideRangeKey method (I think it can literally call elideRangeTombstone). Just so we don't have the code churn of combining the two methods and then splitting them back out soon after.

compaction.go line 2326 at r1 (raw file):

			// on sequence number at this stage is difficult, and necessitates
			// read-time logic to ignore range tombstones outside file bounds.
			if err := rangedel.Encode(v, tw.Add); err != nil {

yikes, good catch

compaction_iter.go line 293 at r1 (raw file):

	for i.iterKey != nil {
		if i.iterKey.Kind() == InternalKeyKindRangeDelete || i.iterKey.Kind() == InternalKeyKindRangeKeySet ||
			i.iterKey.Kind() == InternalKeyKindRangeKeyUnset || i.iterKey.Kind() == InternalKeyKindRangeKeyDelete {

nit: there's a rangekey.IsRangeKey helper

compaction_iter_test.go line 246 at r1 (raw file):

						if iter.Key().Kind() == InternalKeyKindRangeKeySet ||
							iter.Key().Kind() == InternalKeyKindRangeKeyUnset ||
							iter.Key().Kind() == InternalKeyKindRangeKeyDelete {

nit: rangekey.IsRangeKey

mem_table_test.go line 57 at r1 (raw file):

	}
	if key.Kind() == InternalKeyKindRangeKeyDelete ||
		key.Kind() == InternalKeyKindRangeKeySet || key.Kind() == InternalKeyKindRangeKeyUnset {

nit: rangekey.IsRangeKey

itsbilal

TFTR!

Reviewable status: 8 of 27 files reviewed, 6 unresolved discussions (waiting on @itsbilal and @jbowens)

compaction.go line 325 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

dst.Keys is owned by MergingIter, and we're free to append directly to the existing slice so that it's reused between transformed spans. This requires the compaction loop to perform a deep clone before adding to the fragmenter, but it matches the current lifetime contract.

Done. Good catch.

compaction.go line 326 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

can we create this slice once, in the body of compactionTransform before the closure, and reuse it?

this can be left for a TODO, but if we iterate the partitions in the opposite direction, I think we can avoid materializing the partitions and this slice goes away altogether.

Done; reusing the partitions slice. Added a TODO for the latter for now, but if I have some time later this week I'll try to optimize this as-is.

compaction.go line 330 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I believe keys should only be visible at strictly lower sequence numbers. eg, base.IsVisible:

// Visible returns true if a key with the provided sequence number is visible at
// the specified snapshot sequence number.
func Visible(seqNum uint64, snapshot uint64) bool {
	return seqNum < snapshot || (seqNum&InternalKeySeqNumBatch) != 0
}

Done. Good catch.

compaction.go line 692 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

mind adding a comment here that SmallestKey and LargestKey require !s.Empty(), and s.Valid() doesn't generally imply !s.Empty(), but in this case we're guaranteed because this function is used with the memtable's fragment iterator which never surfaces empty spans.

Done.

compaction.go line 957 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

given that we know we'll want to split out the inuse key ranges, mind keeping elideRangeTombstone and adding a separate elideRangeKey method (I think it can literally call elideRangeTombstone). Just so we don't have the code churn of combining the two methods and then splitting them back out soon after.

Done.

itsbilal · 2022-06-08T18:28:03Z

compaction.go line 326 at r1 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

Done; reusing the partitions slice. Added a TODO for the latter for now, but if I have some time later this week I'll try to optimize this as-is.

Ended up addressing the TODO.

jbowens

Reviewed 22 of 22 files at r3, 18 of 18 files at r4, 4 of 19 files at r5.
Reviewable status: 9 of 27 files reviewed, 7 unresolved discussions (waiting on @itsbilal and @jbowens)

compaction.go line 2312 at r4 (raw file):

	finishOutput := func(splitKey []byte) error {
		// If we haven't output any point records to the sstable (tw == nil)
		// then the sstable will only contain range tombstones. The smallest

update comment to "only contain range tombstones and/or range keys?"

compaction.go line 2314 at r4 (raw file):

		// then the sstable will only contain range tombstones. The smallest
		// key in the sstable will be the start key of the first range
		// tombstone added. We need to ensure that this start key is distinct

"first range tombstone or range key added"

compaction.go line 1193 at r6 (raw file):

					atomicUnit, _ := expandToAtomicUnit(c.cmp, level.files, true /* disableIsCompacting */)
					lowerBound, upperBound := manifest.KeyRange(c.cmp, atomicUnit.Iter())
					iter = keyspan.Truncate(

we can omit this, with a comment explaining its absence, because range keys have never been written before, and we never split user keys across sstables anymore. iter != nil implies the file's atomic compaction unit is just the file itself. also range keys have only ever been written properly truncated and wholly contained within the containing sstable's boundaries.

internal/keyspan/interleaving_iter.go line 559 at r6 (raw file):

	s := i.keyspanIter.SeekLT(key)
	if !s.Valid() {
		s = i.keyspanIter.SeekGE(key)

is this change necessary? I'd expect the Next in the !Valid() case to be equivalent and more performant

internal/keyspan/level_iter.go line 198 at r4 (raw file):

	f := l.findFileGE(key)
	if f != nil && l.keyType == manifest.KeyTypeRange && l.cmp(key, f.SmallestRangeKey.UserKey) < 0 {
		prevFile := l.files.Prev()

why is this prev necessary? isn't it okay to unconditionally return the straddle span for [key, f.SmallestRangeKey.UserKey)? and then if there's no previous file, a prev will subsequently exhaust the iterator?

internal/keyspan/level_iter.go line 203 at r4 (raw file):

			// Return a straddling key instead of loading the file.
			l.iterFile = f
			if err := l.Close(); err != nil {

i don't understand this change. is this closing the leveliter? maybe we're just closing l.iter, in which case I think it would read a little easier to inline that here.

internal/keyspan/level_iter.go line 262 at r6 (raw file):

func (l *LevelIter) First() Span {
	l.dir = +1
	l.straddle = Span{}

nit: wdyt about using straddleDir == 0 to signal no straddle, and zeroing just straddleDir at the beginning of each absolute positioning method

nicktrav

Also flushing some comments on the first commit. Exciting!

Reviewed 5 of 13 files at r1, 3 of 22 files at r3, 19 of 19 files at r5.
Reviewable status: 9 of 27 files reviewed, 14 unresolved discussions (waiting on @itsbilal, @jbowens, and @nicktrav)

compaction.go line 315 at r5 (raw file):

// compactionTransform is used to transform range key spans as part of the
// keyspan.MergingIter. As part of this transformation step, we can elide range
// keys in the last snapshot stripe, as well as coalesce range keys within

Curious - do we typically use "last" when referring to snapshots? Thoughts on "bottom-most" or "earliest", instead?

compaction.go line 317 at r5 (raw file):

// keys in the last snapshot stripe, as well as coalesce range keys within
// snapshot stripes.
func compactionTransform(

supernit: something more specific to range keys for this function name? rangeKeyCompactionTransform?

compaction.go line 690 at r5 (raw file):

		}
	}
	updateRangeKeyBounds := func(iter keyspan.FragmentIterator) {

At a glance this func seems identical to the one for range keys?

compaction.go line 1172 at r5 (raw file):

		// Check if this level has any range keys.
		hasRangeKeys := false
		level.files.Each(func(f *manifest.FileMetadata) {

Could you use level.files.Iter() here then do a break in the loop after seeing the first file with range keys?

testdata/compaction_iter line 1206 at r5 (raw file):

next
next
range-keys

Could we make this test a little less error prone / brittle by unconditionally writing out the range keys at the end, if present? e.g. this test could become

iter
first
next
next
next
next
----
a#72057594037927935,21:
a#2,0:
c#3,1:val
d#72057594037927935,21:
.
// START range-keys
a-b:{(#3,RANGEKEYSET,@2,foo)}
d-e:{(#3,RANGEKEYSET,@2,foo)}
.

I'm thinking of the case where our future selves writes a test that has range keys in the input, and forgets to call range-keys in the iter block, so we miss out on the assertions on the range keys. If for whatever reason we introduced a change that started output range keys, we'd not see them, because we forgot to call range-keys.

Wdyt?

testdata/compaction_transform line 46 at r5 (raw file):

a-c:{(#11,RANGEKEYDEL) (#8,RANGEKEYSET,@3,foo5)}

# Test RANGEKEYDELs are preserved over in-use key ranges in the last snapshot stripe.

nit: this comment applies to the test case after the vanilla transform one. Possible to move this comment down to cover the cases that use in-use-key-ranges, and have a dedicated comment for the cases that deal with a single snapshot, where we expect the DELs to be elided?

I also found myself wanting an in-line explanation of "in-use key range". Something like "... we can't elide DELs and UNSETs yet as they still have work to do at lower levels", or something like that.

testdata/compaction_transform line 48 at r5 (raw file):

# Test RANGEKEYDELs are preserved over in-use key ranges in the last snapshot stripe.

transform

Can we also add some test cases for compacting a DEL, SET and UNSET at the same sequence number in the last snapshot stripe (and maybe some in-use-range variants too)?

The following passes, as I expect based on my re-reading of this comment (i.e. DEL is elided as it's in the last snapshot, SET shadows UNSET).

transform
a-c:{(#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYUNSET,@4) (#11,RANGEKEYDEL)
----
a-c:{(#11,RANGEKEYSET,@3,foo5)}

While playing around with this, I also tried the following (moving the DEL up to be first):

transform
a-c:{(#11,RANGEKEYDEL) (#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYUNSET,@4)}

And that returns nothing. I think the latter test case is invalid based on the sorting requirements (SET sorts before UNSET which sorts before DEL), so maybe we can guard against invalid test input somehow to avoid this footgun?

jbowens

Mind correcting this so that we test equivalence between ingest and batch application for sstables containing range keys in the metamorphic tests now?

pebble/internal/metamorphic/ops.go

Lines 344 to 351 in ad44a62

    
           // Ingests currently discard range keys. Using apply as an alternative 
        
           // to ingestion would create a divergence, since batch applications do 
        
           // commit range keys. Only allow the ingest to be applied as a batch if 
        
           // it doesn't contain any range keys. 
        
           // TODO(jackson): When range keys are properly persisted, allow 
        
           // tables containing range keys to be applied as batches. 
        
           if rangeKeyIter != nil { 
        
           	closeIters(iter, rangeDelIter, rangeKeyIter)

If it turns out there's still work to do in the ingest path, we can defer it to a followup, but we should file an issue.

Reviewable status: 9 of 27 files reviewed, 14 unresolved discussions (waiting on @itsbilal, @jbowens, and @nicktrav)

nicktrav

Finished with the second commit.

I had some questions on the test case changes that I think are related to Jackson's questions on the level iterr changes.

Reviewed 18 of 18 files at r6, all commit messages.
Reviewable status: all files reviewed, 18 unresolved discussions (waiting on @itsbilal)

compaction_iter.go line 510 at r6 (raw file):

		return newStripe
	case InternalKeyKindRangeKeySet, InternalKeyKindRangeKeyUnset, InternalKeyKindRangeKeyDelete:
		panic("unreachable")

Worthy of an associated comment saying why.

iterator.go line 616 at r6 (raw file):

					if i.pos == iterPosNext || i.pos == iterPosCurForward ||
						i.pos == iterPosCurForwardPaused {
						containsKey = i.cmp(file.SmallestPointKey.UserKey, i.key) <= 0

Meta comment: we have a lot of existing call sites for {Smallest,Largest}.UserKey. How confident are we that we are updating all of the calls sites that need to disambiguate between points and ranges? Are you going through on a case by case basis?

internal/keyspan/testdata/level_iter line 204 at r6 (raw file):

next
----
d-e:{} (file = 000001.sst)

I remember getting confused by this on the original PR (see our conversation here). In the forward iteration case, seeking into an empty span returns a straddle span starting at the seeked key. But in the reverse case we don't seem to be doing that anymore?

internal/keyspan/testdata/level_iter line 249 at r6 (raw file):

----
c-d:{(#2,RANGEKEYSET,@3,foo) (#1,RANGEKEYSET,@1,bar)} (file = 000002.sst)
b-c:{} (file = 000002.sst)

Was this a latent bug - we were on c-d already, but calling prev returns it again?

itsbilal

TFTRs! And made the metamorphic test change.

I didn't address all the comments yet (some are a little more back-and-forth-y) but I should close them out on the next update. Also need to do a rebase now that #1748 is in.

Reviewable status: 7 of 28 files reviewed, 11 unresolved discussions (waiting on @itsbilal, @jbowens, and @nicktrav)

compaction.go line 2312 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

update comment to "only contain range tombstones and/or range keys?"

Done.

compaction.go line 2314 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

"first range tombstone or range key added"

Done.

compaction.go line 315 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Curious - do we typically use "last" when referring to snapshots? Thoughts on "bottom-most" or "earliest", instead?

We do use last snapshot stripe elsewhere in compaction_iter so I think that's the more consistent lingo.

compaction.go line 317 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

supernit: something more specific to range keys for this function name? rangeKeyCompactionTransform?

Done.

compaction.go line 690 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

At a glance this func seems identical to the one for range keys?

That is correct. Updated.

compaction.go line 1172 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Could you use level.files.Iter() here then do a break in the loop after seeing the first file with range keys?

Done.

compaction.go line 1193 at r6 (raw file):

Previously, jbowens (Jackson Owens) wrote…

we can omit this, with a comment explaining its absence, because range keys have never been written before, and we never split user keys across sstables anymore. iter != nil implies the file's atomic compaction unit is just the file itself. also range keys have only ever been written properly truncated and wholly contained within the containing sstable's boundaries.

Done.

compaction_iter.go line 510 at r6 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Worthy of an associated comment saying why.

Done.

iterator.go line 616 at r6 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Meta comment: we have a lot of existing call sites for {Smallest,Largest}.UserKey. How confident are we that we are updating all of the calls sites that need to disambiguate between points and ranges? Are you going through on a case by case basis?

Yes, I'm going through all uses of Smallest and Largest. Generally it's pretty clear cut based on the iterator stacks themselves, it's really only compaction picking where we need to continue using the overall file bounds.

internal/keyspan/level_iter.go line 198 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

why is this prev necessary? isn't it okay to unconditionally return the straddle span for [key, f.SmallestRangeKey.UserKey)? and then if there's no previous file, a prev will subsequently exhaust the iterator?

It's not necessary but it was helpful in making these cases more consistent and aiding a lot of the investigation. I figured we may keep it as this seems safer, but if there are performance concerns I could revert this.

internal/keyspan/level_iter.go line 203 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

i don't understand this change. is this closing the leveliter? maybe we're just closing l.iter, in which case I think it would read a little easier to inline that here.

Yes, it's closing the levelIter but the Close really only closes the underlying Iter if it exists. We already do the same thing in loadFile and it saves a few lines of repetitive code. We also do the same thing in the point levelIter.loadFile.

internal/keyspan/level_iter.go line 262 at r6 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: wdyt about using straddleDir == 0 to signal no straddle, and zeroing just straddleDir at the beginning of each absolute positioning method

That also works, though l.straddle.Valid() reads more easier. We can probably keep one of the two; currently I check for both in many places. Is the performance worse for setting the whole span? (Probably).

testdata/compaction_iter line 1206 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Could we make this test a little less error prone / brittle by unconditionally writing out the range keys at the end, if present? e.g. this test could become
iter
first
next
next
next
next
----
a#72057594037927935,21:
a#2,0:
c#3,1:val
d#72057594037927935,21:
.
// START range-keys
a-b:{(#3,RANGEKEYSET,@2,foo)}
d-e:{(#3,RANGEKEYSET,@2,foo)}
.
I'm thinking of the case where our future selves writes a test that has range keys in the input, and forgets to call range-keys in the iter block, so we miss out on the assertions on the range keys. If for whatever reason we introduced a change that started output range keys, we'd not see them, because we forgot to call range-keys.

Wdyt?

We could do that, that's a good idea. I mimicked the way we're currently handling tombstones just to keep it consistent, but more assertions are a good idea.

testdata/compaction_transform line 46 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

nit: this comment applies to the test case after the vanilla transform one. Possible to move this comment down to cover the cases that use in-use-key-ranges, and have a dedicated comment for the cases that deal with a single snapshot, where we expect the DELs to be elided?

I also found myself wanting an in-line explanation of "in-use key range". Something like "... we can't elide DELs and UNSETs yet as they still have work to do at lower levels", or something like that.

Done.

testdata/compaction_transform line 48 at r5 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Can we also add some test cases for compacting a DEL, SET and UNSET at the same sequence number in the last snapshot stripe (and maybe some in-use-range variants too)?

The following passes, as I expect based on my re-reading of this comment (i.e. DEL is elided as it's in the last snapshot, SET shadows UNSET).
transform
a-c:{(#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYUNSET,@4) (#11,RANGEKEYDEL)
----
a-c:{(#11,RANGEKEYSET,@3,foo5)}
While playing around with this, I also tried the following (moving the DEL up to be first):
transform
a-c:{(#11,RANGEKEYDEL) (#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYUNSET,@4)}
And that returns nothing. I think the latter test case is invalid based on the sorting requirements (SET sorts before UNSET which sorts before DEL), so maybe we can guard against invalid test input somehow to avoid this footgun?

In your example the UNSET was elided because it's in the last snapshot stripe, not because the SET shadows it. The SET and UNSET are on different prefixes so they would coexist if they weren't in the last snapshot stripe. I added both of those example cases, but not the invalid one. Added the invalid guardrail too.

internal/keyspan/testdata/level_iter line 204 at r6 (raw file):

Previously, nicktrav (Nick Travers) wrote…

I remember getting confused by this on the original PR (see our conversation here). In the forward iteration case, seeking into an empty span returns a straddle span starting at the seeked key. But in the reverse case we don't seem to be doing that anymore?

Good catch, I had intended to do it for SeekGE too but I was just missing one line.

internal/keyspan/testdata/level_iter line 249 at r6 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Was this a latent bug - we were on c-d already, but calling prev returns it again?

I believe so.

internal/keyspan/interleaving_iter.go line 559 at r6 (raw file):

Previously, jbowens (Jackson Owens) wrote…

is this change necessary? I'd expect the Next in the !Valid() case to be equivalent and more performant

Done. This is technically not in the FragmentIterator contract (and was something LevelIter was quietly violating) but I should have fixed up the violation now, making this easier.

nicktrav

I closed out all my comments, and left an additional test case. Adding my , but def wait for Jackson as the binding approver.

This is great!

Reviewed 2 of 20 files at r7, 19 of 19 files at r8, all commit messages.
Reviewable status: all files reviewed, 8 unresolved discussions (waiting on @itsbilal and @jbowens)

testdata/compaction_transform line 48 at r5 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

In your example the UNSET was elided because it's in the last snapshot stripe, not because the SET shadows it. The SET and UNSET are on different prefixes so they would coexist if they weren't in the last snapshot stripe. I added both of those example cases, but not the invalid one. Added the invalid guardrail too.

Whoops, my example was a bad one. I intended for them to be at the same suffix, say @3. So:

transform
a-c:{(#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYUNSET,@3) (#11,RANGEKEYDEL)
----
a-c:{(#11,RANGEKEYSET,@3,foo5)}

I think the point is moot, as the unset drops out anyway, as you mention.

testdata/compaction_transform line 90 at r8 (raw file):


transform
a-c:{(#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYUNSET,@4) (#11,RANGEKEYDEL)}

Per above comment, I intended for my example to have the UNSET at suffix @3 (the same as the SET). Maybe we can have both cases?

And mind adding one more test case for when elision is disabled? I think this will underscore the UNSETs being shadowed, rather than elided:

transform disable-elision
a-c:{(#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYUNSET,@3) (#11,RANGEKEYDEL)
----
a-c:{(#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYDEL)}

itsbilal

TFTRs! Did the rebase too, and closed out many conversations.

Dismissed @jbowens from 4 discussions.
Reviewable status: 3 of 28 files reviewed, 2 unresolved discussions (waiting on @jbowens and @nicktrav)

internal/keyspan/level_iter.go line 198 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

It's not necessary but it was helpful in making these cases more consistent and aiding a lot of the investigation. I figured we may keep it as this seems safer, but if there are performance concerns I could revert this.

I reverted this out of curiosity and then realized that some of the iters (DefragmentingIter, InterleavingIter etc) rely on some level of straddle key consistency, without which you see out-of-order keys. In particular, if all your range keys have starts >= b, and you SeekGE(a), in the old code you'd return an empty a-b but if you next and then prev you exhaust the iterator.

I could file an issue to bring back this optimization, but at the moment it just seems easier to do the simpler / more correct straddle case and move on.

internal/keyspan/level_iter.go line 262 at r6 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

That also works, though l.straddle.Valid() reads more easier. We can probably keep one of the two; currently I check for both in many places. Is the performance worse for setting the whole span? (Probably).

I ended up doing this, to make it easier to reconcile with the *Span change.

testdata/compaction_transform line 90 at r8 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Per above comment, I intended for my example to have the UNSET at suffix @3 (the same as the SET). Maybe we can have both cases?

And mind adding one more test case for when elision is disabled? I think this will underscore the UNSETs being shadowed, rather than elided:
transform disable-elision
a-c:{(#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYUNSET,@3) (#11,RANGEKEYDEL)
----
a-c:{(#11,RANGEKEYSET,@3,foo5) (#11,RANGEKEYDEL)}

Done.

jbowens

Looks great! Thanks for doing this!

Reviewed 15 of 25 files at r9, 16 of 18 files at r10, all commit messages.
Reviewable status: 26 of 28 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @nicktrav)

compaction.go line 687 at r10 (raw file):

	updateRangeBounds := func(iter keyspan.FragmentIterator) {
		// File bounds require s.Valid() && !s.Empty(). We only need to check for

nit: File bounds require s != nil && !s.Empty()

compaction.go line 688 at r10 (raw file):

	updateRangeBounds := func(iter keyspan.FragmentIterator) {
		// File bounds require s.Valid() && !s.Empty(). We only need to check for
		// validity here, as the memtable's FragmentIterator would never surface

nit: s/validity/nilness/

compaction.go line 2644 at r10 (raw file):

						Keys:  make([]keyspan.Key, len(s.Keys)),
					}
					copy(clone.Keys, s.Keys)

mind adding a comment that keys' Suffix and Value are not cloned, which is what necessitates the range key blocks being held in memory for the lifetime of the compaction?

internal/keyspan/level_iter.go line 198 at r4 (raw file):

Previously, itsbilal (Bilal Akhtar) wrote…

I reverted this out of curiosity and then realized that some of the iters (DefragmentingIter, InterleavingIter etc) rely on some level of straddle key consistency, without which you see out-of-order keys. In particular, if all your range keys have starts >= b, and you SeekGE(a), in the old code you'd return an empty a-b but if you next and then prev you exhaust the iterator.

I could file an issue to bring back this optimization, but at the moment it just seems easier to do the simpler / more correct straddle case and move on.

Ah, I see, thanks for the explanation. Mind adding that example in a comment here and in SeekLT?

There's a similar dynamic higher up the iterator stack that resulted in this wart.

This change implements compaction of sstables containing range keys, as well as allows memtables with non-empty rangeKeySkl skiplists to write correctly to sstable range key blocks in flushes. Range key elision and snapshot striping is implement in a keyspan.Transform passed to keyspan.MergingIter inside the compaction iter. Otherwise, behaviour is kept as close to that of range deletions as possible, to improve confidence in correctness. Fixes cockroachdb#1686.

The previous commit, which enabled flushing and compactions of range keys, now makes the in-memory range key arena obsolete for reading range keys. This change sets up iterators to read range keys from sstable range key blocks, and makes many (far too many) minor fixes in other range key code to make that happen correctly.

itsbilal

Thanks a ton! Addressed all the comments.

Reviewable status: 10 of 28 files reviewed, 2 unresolved discussions (waiting on @jbowens and @nicktrav)

compaction.go line 687 at r10 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: File bounds require s != nil && !s.Empty()

Done.

compaction.go line 688 at r10 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: s/validity/nilness/

Done.

compaction.go line 2644 at r10 (raw file):

Previously, jbowens (Jackson Owens) wrote…

mind adding a comment that keys' Suffix and Value are not cloned, which is what necessitates the range key blocks being held in memory for the lifetime of the compaction?

Done.

internal/keyspan/level_iter.go line 198 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Ah, I see, thanks for the explanation. Mind adding that example in a comment here and in SeekLT?

There's a similar dynamic higher up the iterator stack that resulted in this wart.

Done.

itsbilal requested review from a team and jbowens June 7, 2022 14:22

itsbilal self-assigned this Jun 7, 2022

itsbilal force-pushed the range-keys-compact branch 2 times, most recently from e6e7063 to bb516e1 Compare June 7, 2022 18:59

jbowens reviewed Jun 7, 2022

View reviewed changes

itsbilal force-pushed the range-keys-compact branch from bb516e1 to 07dd4a2 Compare June 8, 2022 15:12

itsbilal commented Jun 8, 2022

View reviewed changes

itsbilal force-pushed the range-keys-compact branch from 07dd4a2 to eaa27dd Compare June 8, 2022 18:27

jbowens reviewed Jun 8, 2022

View reviewed changes

nicktrav reviewed Jun 8, 2022

View reviewed changes

jbowens reviewed Jun 9, 2022

View reviewed changes

nicktrav reviewed Jun 9, 2022

View reviewed changes

itsbilal force-pushed the range-keys-compact branch from eaa27dd to 55dc153 Compare June 9, 2022 20:13

itsbilal commented Jun 9, 2022

View reviewed changes

nicktrav approved these changes Jun 9, 2022

View reviewed changes

itsbilal force-pushed the range-keys-compact branch from 55dc153 to 3ff4d5b Compare June 10, 2022 15:07

itsbilal commented Jun 10, 2022

View reviewed changes

jbowens approved these changes Jun 10, 2022

View reviewed changes

itsbilal added 2 commits June 10, 2022 12:48

itsbilal force-pushed the range-keys-compact branch from 3ff4d5b to 4460fa5 Compare June 10, 2022 16:48

itsbilal commented Jun 10, 2022

View reviewed changes

itsbilal merged commit ae99f4f into cockroachdb:master Jun 10, 2022

jbowens mentioned this pull request Jun 13, 2022

crl-release-22.1: db: fix newIters allocation regression #1763

Merged

	// Ingests currently discard range keys. Using apply as an alternative
	// to ingestion would create a divergence, since batch applications do
	// commit range keys. Only allow the ingest to be applied as a batch if
	// it doesn't contain any range keys.
	// TODO(jackson): When range keys are properly persisted, allow
	// tables containing range keys to be applied as batches.
	if rangeKeyIter != nil {
	closeIters(iter, rangeDelIter, rangeKeyIter)

Conversation

itsbilal commented Jun 7, 2022

Uh oh!

cockroach-teamcity commented Jun 7, 2022

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal commented Jun 8, 2022

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

nicktrav left a comment

Choose a reason for hiding this comment

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

nicktrav left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal left a comment

Choose a reason for hiding this comment

Uh oh!

nicktrav left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal left a comment

Choose a reason for hiding this comment

Uh oh!

jbowens left a comment

Choose a reason for hiding this comment

Uh oh!

itsbilal left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants