Skip to content

db: use InterleavingIter to interleave range deletions in compactions#3219

Merged
jbowens merged 1 commit intocockroachdb:masterfrom
jbowens:compaction-rangedelinterleaving
Jan 17, 2024
Merged

db: use InterleavingIter to interleave range deletions in compactions#3219
jbowens merged 1 commit intocockroachdb:masterfrom
jbowens:compaction-rangedelinterleaving

Conversation

@jbowens
Copy link
Copy Markdown
Contributor

@jbowens jbowens commented Jan 11, 2024

Remove the keyspan.InternalIteratorShim that was intended to be a temporary bridge to allow range deletions to be surfaced within the compaction iterator, replacing it with the keyspan.InterleavingIter. This mirrors the mechanics of range keys. Follow up work will be able to simplify the compaction iterator logic now that range deletions are always interleaved at the maximal sequence number.

Additionally, this commit removes the compaction iterator's concept of a key that lies within the same snapshot stripe but is non-skippable. These "non-skippable" keys lead to bizarre behavior, such as a sequence like:

a.MERGE.5 a.RANGEDEL.4 a.MERGE.4

yielding the separate unmerged a.MERGE.5 and a.MERGE.4 internal keys (Note a.RANGEDEL.4 does not delete a.MERGE.4 because they share the same sequence number.)

The sameStripeNonSkippable fell out of the fact that range deletions were previously interleaved by the input iterator at their start key with their sequence number. These sameStripeNonSkippable could interrupt any logic iterating across the keys of a stripe, complicating the compaction iterator logic.

In some instances INVALID keys were also returned with sameStripeNonSkippable. These keys are now treated similarly to other errors encountered during iteration: i.err is set and newStripeNewKey is returned.

Close #3082.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@jbowens jbowens requested review from a team and sumeerbhola January 11, 2024 16:44
Copy link
Copy Markdown
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 1 unresolved discussion (waiting on @jbowens and @sumeerbhola)


compaction.go line 1383 at r1 (raw file):

1

@jbowens jbowens force-pushed the compaction-rangedelinterleaving branch from 31ca00d to ab0e2b3 Compare January 11, 2024 18:46
Copy link
Copy Markdown
Contributor Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


compaction.go line 1383 at r1 (raw file):

Previously, RaduBerinde wrote…

1

Good catch; looks like a long latent bug


testdata/manual_compaction_set_with_del_sstable_Pebblev4 line 705 at r1 (raw file):

2:
  000011:[a#4,RANGEDEL-d#inf,RANGEDEL]
  000012:[k#5,RANGEDEL-q#inf,RANGEDEL]

this m -> q diff is a consequence of now using the defragmenting iterator. Previously, the compaction iterator would see the individual fragmented spans from each sstable in L1 and could elide the m-q rangedel. Now it sees one broader

@jbowens jbowens force-pushed the compaction-rangedelinterleaving branch from ab0e2b3 to 0a948bd Compare January 11, 2024 19:03
Copy link
Copy Markdown
Contributor Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


testdata/manual_compaction_set_with_del_sstable_Pebblev4 line 705 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this m -> q diff is a consequence of now using the defragmenting iterator. Previously, the compaction iterator would see the individual fragmented spans from each sstable in L1 and could elide the m-q rangedel. Now it sees one broader

To keep the diff minimal, I removed the defragmenting iterator, and we can consider defragmenting range deletions in a separate patch.

@jbowens jbowens force-pushed the compaction-rangedelinterleaving branch from 0a948bd to c20b86e Compare January 11, 2024 19:06
jbowens added a commit to jbowens/pebble that referenced this pull request Jan 12, 2024
This commit removes the compaction iterator's concept of a key that lies within
the same snapshot stripe but is non-skippable. These "non-skippable" keys lead
to bizarre behavior, such as a sequence like:

  a.MERGE.5 a.RANGEDEL.4 a.MERGE.4

yielding the separate unmerged a.MERGE.5 and a.MERGE.4 internal keys (Note
a.RANGEDEL.4 does not delete a.MERGE.4 because they share the same sequence
number.)

The sameStripeNonSkippable fell out of the fact that range deletions were
previously interleaved by the input iterator at their start key with their
sequence number. These sameStripeNonSkippable could interrupt any logic
iterating across the keys of a stripe, complicating the compaction iterator
logic. With cockroachdb#3219, range deletions began to be interleaved at their start key
with the maximal sequence number, ensuring they never interrupt the keys of a
snapshot stripe.

In some instances INVALID keys were also returned with sameStripeNonSkippable.
These keys are now treated similarly to other errors encountered during
iteration: i.err is set and newStripeNewKey is returned.

Close cockroachdb#3082.
Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 10 files at r1, 3 of 3 files at r3.
Reviewable status: all files reviewed (commit messages unreviewed), 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)


compaction.go line 630 at r3 (raw file):

	// The range deletion tombstone iterator, that merges and fragments
	// tombstones across levels. This iterator is included within the compaction
	// input iterator as a single level.

this comment seems stale now that the point keys are also interleaved by this.


compaction.go line 631 at r3 (raw file):

	// tombstones across levels. This iterator is included within the compaction
	// input iterator as a single level.
	rangeDelInterleaving keyspan.InterleavingIter

meta question about this change: how much additional risk are we incurring by using InterleavingIter, given that we cannot afford any bug in our compaction iteration code. I am assuming the answer is very low especially since we use only First() and Next(), but do we need some additional randomized testing of InterleavingIter to give us more confidence?


testdata/compaction_delete_only_hints line 224 at r1 (raw file):

----
6:
  000004:[a#20,SETWITHDEL-z#inf,RANGEDEL]

is this diff is because we don't get interrupted by sameStripeNonSkippable after seeing the SET?


testdata/compaction_delete_only_hints line 258 at r1 (raw file):

10
----
[JOB 100] compacted(elision-only) L6 [000004] (741B) Score=0.00 + L6 [] (0B) Score=0.00 -> L6 [000005] (662B), in 1.0s (2.0s total), output rate 662B/s

why did the byte count change?


testdata/range_del line 1196 at r3 (raw file):

----
2:
  000013:[a#11,SETWITHDEL-c#inf,RANGEDEL]

why did the file boundaries change?

@jbowens jbowens force-pushed the compaction-rangedelinterleaving branch 2 times, most recently from 0c5103b to e22c475 Compare January 16, 2024 18:04
Copy link
Copy Markdown
Contributor Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 7 of 11 files reviewed, 7 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


compaction.go line 631 at r3 (raw file):

Previously, sumeerbhola wrote…

meta question about this change: how much additional risk are we incurring by using InterleavingIter, given that we cannot afford any bug in our compaction iteration code. I am assuming the answer is very low especially since we use only First() and Next(), but do we need some additional randomized testing of InterleavingIter to give us more confidence?

Yeah, I think it's very low. We use the interleaving iterator today in both user iteration and compaction iterator for range keys. Focused randomized testing would be nice, but I would be shocked if it picked up something that the metamorphic tests haven't yet.


testdata/compaction_delete_only_hints line 224 at r1 (raw file):

Previously, sumeerbhola wrote…

is this diff is because we don't get interrupted by sameStripeNonSkippable after seeing the SET?

That's right


testdata/compaction_delete_only_hints line 258 at r1 (raw file):

Previously, sumeerbhola wrote…

why did the byte count change?

Thanks for the prod here. This was a result of the snapshot-pinned sstable properties being written when they weren't previously. They were improperly considering the point key as obsolete, because it had the same user key.

Relatedly, I think I should combine this commit with the one from #3221 so that the compaction iterator test cases reflect the reality of range deletions being interleaved first.


testdata/range_del line 1196 at r3 (raw file):

Previously, sumeerbhola wrote…

why did the file boundaries change?

This was an artifact of range deletions being interleaved first and the compaction splitter refusing to "split" user keys across files. The splitter is naive and does not know that it can split if the previous key was a range deletion (and is buffered in the fragmenter; not already written to the current output file). I updated the file size splitter to not special case range deletions, which allows this case to split here before the range deletion here. I think either behavior is reasonable here.

Copy link
Copy Markdown
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 4 files at r4, 4 of 4 files at r5, all commit messages.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @jbowens and @RaduBerinde)


compaction_iter.go line 330 at r5 (raw file):

	}

	// Prior to this call to `Next()` we are in one of four situations with

nit: three situations


compaction_iter.go line 375 at r5 (raw file):

			// Note that `skip` must already be false here, because range keys and range
			// deletions are interleaved at the maximal sequence numbers and neither will
			// set `skip`=true.

can we assert that skip is false?


compaction_iter.go line 645 at r5 (raw file):

		//    number ordering within a user key. If the previous key was one
		//    of these keys, we consider the new key a `newStripeNewKey` to
		//    reflect that it's the beginning of a new stream of point keys.

#2 is a bit subtle, but I guess it is unavoidable.

I glanced through all the callers of nextInStripe that care about the return value and it seems all of them call when they are already on a point key. Just curious if there is a test case failing without this i.key.IsExclusiveSentinel()?


testdata/compaction_iter_delete_sized line 308 at r5 (raw file):

tombstones
----
a#72057594037927935,15:

We are not printing the seqnum of the keys in the span, so it is not possible to verify the correctness here. Could you add that? The fragmented tombstones printed below have the seqnums, but it would be nice if we also printed the unfragmented ones.
Hmm, these are already fragmented, so maybe it doesn't matter.

Remove the keyspan.InternalIteratorShim that was intended to be a temporary
bridge to allow range deletions to be surfaced within the compaction iterator,
replacing it with the keyspan.InterleavingIter. This mirrors the mechanics of
range keys.

Additionally, this commit removes the compaction iterator's concept of a key
that lies within the same snapshot stripe but is non-skippable. These
"non-skippable" keys lead to bizarre behavior, such as a sequence like:

  a.MERGE.5 a.RANGEDEL.4 a.MERGE.4

yielding the separate unmerged a.MERGE.5 and a.MERGE.4 internal keys (Note
a.RANGEDEL.4 does not delete a.MERGE.4 because they share the same sequence
number.)

The sameStripeNonSkippable fell out of the fact that range deletions were
previously interleaved by the input iterator at their start key with their
sequence number. These sameStripeNonSkippable could interrupt any logic
iterating across the keys of a stripe, complicating the compaction iterator
logic.

In some instances INVALID keys were also returned with sameStripeNonSkippable.
These keys are now treated similarly to other errors encountered during
iteration: i.err is set and newStripeNewKey is returned.

Close cockroachdb#3082.
@jbowens jbowens force-pushed the compaction-rangedelinterleaving branch from a237293 to d7c70b3 Compare January 17, 2024 22:34
Copy link
Copy Markdown
Contributor Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: 9 of 14 files reviewed, 9 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


compaction_iter.go line 375 at r5 (raw file):

Previously, sumeerbhola wrote…

can we assert that skip is false?

Done.


compaction_iter.go line 645 at r5 (raw file):

Previously, sumeerbhola wrote…

#2 is a bit subtle, but I guess it is unavoidable.

I glanced through all the callers of nextInStripe that care about the return value and it seems all of them call when they are already on a point key. Just curious if there is a test case failing without this i.key.IsExclusiveSentinel()?

Yeah. without it, a compaction iterator that sees a.RANGEDEL.1:z and a.SET.4:a4 will think the a.SET.4 should be considered a "snapshot-pinned" key because it'll encounter a.SET.4 with newStripeSameKey


testdata/compaction_iter_delete_sized line 308 at r5 (raw file):

Previously, sumeerbhola wrote…

We are not printing the seqnum of the keys in the span, so it is not possible to verify the correctness here. Could you add that? The fragmented tombstones printed below have the seqnums, but it would be nice if we also printed the unfragmented ones.
Hmm, these are already fragmented, so maybe it doesn't matter.

Added them because it at least makes it clearer what's happening

@jbowens jbowens merged commit b162166 into cockroachdb:master Jan 17, 2024
@jbowens jbowens deleted the compaction-rangedelinterleaving branch January 17, 2024 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

db: refactor/cleanup compaction iterator

4 participants