db: use InterleavingIter to interleave range deletions in compactions#3219
Conversation
RaduBerinde
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 10 files reviewed, 1 unresolved discussion (waiting on @jbowens and @sumeerbhola)
compaction.go line 1383 at r1 (raw file):
1
31ca00d to
ab0e2b3
Compare
jbowens
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)
compaction.go line 1383 at r1 (raw file):
Previously, RaduBerinde wrote…
1
Good catch; looks like a long latent bug
testdata/manual_compaction_set_with_del_sstable_Pebblev4 line 705 at r1 (raw file):
2: 000011:[a#4,RANGEDEL-d#inf,RANGEDEL] 000012:[k#5,RANGEDEL-q#inf,RANGEDEL]
this m -> q diff is a consequence of now using the defragmenting iterator. Previously, the compaction iterator would see the individual fragmented spans from each sstable in L1 and could elide the m-q rangedel. Now it sees one broader
ab0e2b3 to
0a948bd
Compare
jbowens
left a comment
There was a problem hiding this comment.
Reviewable status: 0 of 10 files reviewed, 2 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)
testdata/manual_compaction_set_with_del_sstable_Pebblev4 line 705 at r1 (raw file):
Previously, jbowens (Jackson Owens) wrote…
this
m->qdiff is a consequence of now using the defragmenting iterator. Previously, the compaction iterator would see the individual fragmented spans from each sstable in L1 and could elide the m-q rangedel. Now it sees one broader
To keep the diff minimal, I removed the defragmenting iterator, and we can consider defragmenting range deletions in a separate patch.
0a948bd to
c20b86e
Compare
This commit removes the compaction iterator's concept of a key that lies within the same snapshot stripe but is non-skippable. These "non-skippable" keys lead to bizarre behavior, such as a sequence like: a.MERGE.5 a.RANGEDEL.4 a.MERGE.4 yielding the separate unmerged a.MERGE.5 and a.MERGE.4 internal keys (Note a.RANGEDEL.4 does not delete a.MERGE.4 because they share the same sequence number.) The sameStripeNonSkippable fell out of the fact that range deletions were previously interleaved by the input iterator at their start key with their sequence number. These sameStripeNonSkippable could interrupt any logic iterating across the keys of a stripe, complicating the compaction iterator logic. With cockroachdb#3219, range deletions began to be interleaved at their start key with the maximal sequence number, ensuring they never interrupt the keys of a snapshot stripe. In some instances INVALID keys were also returned with sameStripeNonSkippable. These keys are now treated similarly to other errors encountered during iteration: i.err is set and newStripeNewKey is returned. Close cockroachdb#3082.
sumeerbhola
left a comment
There was a problem hiding this comment.
Reviewed 7 of 10 files at r1, 3 of 3 files at r3.
Reviewable status: all files reviewed (commit messages unreviewed), 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)
compaction.go line 630 at r3 (raw file):
// The range deletion tombstone iterator, that merges and fragments // tombstones across levels. This iterator is included within the compaction // input iterator as a single level.
this comment seems stale now that the point keys are also interleaved by this.
compaction.go line 631 at r3 (raw file):
// tombstones across levels. This iterator is included within the compaction // input iterator as a single level. rangeDelInterleaving keyspan.InterleavingIter
meta question about this change: how much additional risk are we incurring by using InterleavingIter, given that we cannot afford any bug in our compaction iteration code. I am assuming the answer is very low especially since we use only First() and Next(), but do we need some additional randomized testing of InterleavingIter to give us more confidence?
testdata/compaction_delete_only_hints line 224 at r1 (raw file):
---- 6: 000004:[a#20,SETWITHDEL-z#inf,RANGEDEL]
is this diff is because we don't get interrupted by sameStripeNonSkippable after seeing the SET?
testdata/compaction_delete_only_hints line 258 at r1 (raw file):
10 ---- [JOB 100] compacted(elision-only) L6 [000004] (741B) Score=0.00 + L6 [] (0B) Score=0.00 -> L6 [000005] (662B), in 1.0s (2.0s total), output rate 662B/s
why did the byte count change?
testdata/range_del line 1196 at r3 (raw file):
---- 2: 000013:[a#11,SETWITHDEL-c#inf,RANGEDEL]
why did the file boundaries change?
0c5103b to
e22c475
Compare
jbowens
left a comment
There was a problem hiding this comment.
Reviewable status: 7 of 11 files reviewed, 7 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)
compaction.go line 631 at r3 (raw file):
Previously, sumeerbhola wrote…
meta question about this change: how much additional risk are we incurring by using
InterleavingIter, given that we cannot afford any bug in our compaction iteration code. I am assuming the answer is very low especially since we use onlyFirst()andNext(), but do we need some additional randomized testing ofInterleavingIterto give us more confidence?
Yeah, I think it's very low. We use the interleaving iterator today in both user iteration and compaction iterator for range keys. Focused randomized testing would be nice, but I would be shocked if it picked up something that the metamorphic tests haven't yet.
testdata/compaction_delete_only_hints line 224 at r1 (raw file):
Previously, sumeerbhola wrote…
is this diff is because we don't get interrupted by
sameStripeNonSkippableafter seeing theSET?
That's right
testdata/compaction_delete_only_hints line 258 at r1 (raw file):
Previously, sumeerbhola wrote…
why did the byte count change?
Thanks for the prod here. This was a result of the snapshot-pinned sstable properties being written when they weren't previously. They were improperly considering the point key as obsolete, because it had the same user key.
Relatedly, I think I should combine this commit with the one from #3221 so that the compaction iterator test cases reflect the reality of range deletions being interleaved first.
testdata/range_del line 1196 at r3 (raw file):
Previously, sumeerbhola wrote…
why did the file boundaries change?
This was an artifact of range deletions being interleaved first and the compaction splitter refusing to "split" user keys across files. The splitter is naive and does not know that it can split if the previous key was a range deletion (and is buffered in the fragmenter; not already written to the current output file). I updated the file size splitter to not special case range deletions, which allows this case to split here before the range deletion here. I think either behavior is reasonable here.
e22c475 to
a237293
Compare
sumeerbhola
left a comment
There was a problem hiding this comment.
Reviewed 3 of 4 files at r4, 4 of 4 files at r5, all commit messages.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @jbowens and @RaduBerinde)
compaction_iter.go line 330 at r5 (raw file):
} // Prior to this call to `Next()` we are in one of four situations with
nit: three situations
compaction_iter.go line 375 at r5 (raw file):
// Note that `skip` must already be false here, because range keys and range // deletions are interleaved at the maximal sequence numbers and neither will // set `skip`=true.
can we assert that skip is false?
compaction_iter.go line 645 at r5 (raw file):
// number ordering within a user key. If the previous key was one // of these keys, we consider the new key a `newStripeNewKey` to // reflect that it's the beginning of a new stream of point keys.
#2 is a bit subtle, but I guess it is unavoidable.
I glanced through all the callers of nextInStripe that care about the return value and it seems all of them call when they are already on a point key. Just curious if there is a test case failing without this i.key.IsExclusiveSentinel()?
testdata/compaction_iter_delete_sized line 308 at r5 (raw file):
tombstones ---- a#72057594037927935,15:
We are not printing the seqnum of the keys in the span, so it is not possible to verify the correctness here. Could you add that? The fragmented tombstones printed below have the seqnums, but it would be nice if we also printed the unfragmented ones.
Hmm, these are already fragmented, so maybe it doesn't matter.
Remove the keyspan.InternalIteratorShim that was intended to be a temporary bridge to allow range deletions to be surfaced within the compaction iterator, replacing it with the keyspan.InterleavingIter. This mirrors the mechanics of range keys. Additionally, this commit removes the compaction iterator's concept of a key that lies within the same snapshot stripe but is non-skippable. These "non-skippable" keys lead to bizarre behavior, such as a sequence like: a.MERGE.5 a.RANGEDEL.4 a.MERGE.4 yielding the separate unmerged a.MERGE.5 and a.MERGE.4 internal keys (Note a.RANGEDEL.4 does not delete a.MERGE.4 because they share the same sequence number.) The sameStripeNonSkippable fell out of the fact that range deletions were previously interleaved by the input iterator at their start key with their sequence number. These sameStripeNonSkippable could interrupt any logic iterating across the keys of a stripe, complicating the compaction iterator logic. In some instances INVALID keys were also returned with sameStripeNonSkippable. These keys are now treated similarly to other errors encountered during iteration: i.err is set and newStripeNewKey is returned. Close cockroachdb#3082.
a237293 to
d7c70b3
Compare
jbowens
left a comment
There was a problem hiding this comment.
TFTR!
Reviewable status: 9 of 14 files reviewed, 9 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)
compaction_iter.go line 375 at r5 (raw file):
Previously, sumeerbhola wrote…
can we assert that skip is false?
Done.
compaction_iter.go line 645 at r5 (raw file):
Previously, sumeerbhola wrote…
#2 is a bit subtle, but I guess it is unavoidable.
I glanced through all the callers of
nextInStripethat care about the return value and it seems all of them call when they are already on a point key. Just curious if there is a test case failing without thisi.key.IsExclusiveSentinel()?
Yeah. without it, a compaction iterator that sees a.RANGEDEL.1:z and a.SET.4:a4 will think the a.SET.4 should be considered a "snapshot-pinned" key because it'll encounter a.SET.4 with newStripeSameKey
testdata/compaction_iter_delete_sized line 308 at r5 (raw file):
Previously, sumeerbhola wrote…
We are not printing the seqnum of the keys in the span, so it is not possible to verify the correctness here. Could you add that? The fragmented tombstones printed below have the seqnums, but it would be nice if we also printed the unfragmented ones.
Hmm, these are already fragmented, so maybe it doesn't matter.
Added them because it at least makes it clearer what's happening
Remove the keyspan.InternalIteratorShim that was intended to be a temporary bridge to allow range deletions to be surfaced within the compaction iterator, replacing it with the keyspan.InterleavingIter. This mirrors the mechanics of range keys. Follow up work will be able to simplify the compaction iterator logic now that range deletions are always interleaved at the maximal sequence number.
Additionally, this commit removes the compaction iterator's concept of a key that lies within the same snapshot stripe but is non-skippable. These "non-skippable" keys lead to bizarre behavior, such as a sequence like:
a.MERGE.5 a.RANGEDEL.4 a.MERGE.4
yielding the separate unmerged a.MERGE.5 and a.MERGE.4 internal keys (Note a.RANGEDEL.4 does not delete a.MERGE.4 because they share the same sequence number.)
The sameStripeNonSkippable fell out of the fact that range deletions were previously interleaved by the input iterator at their start key with their sequence number. These sameStripeNonSkippable could interrupt any logic iterating across the keys of a stripe, complicating the compaction iterator logic.
In some instances INVALID keys were also returned with sameStripeNonSkippable. These keys are now treated similarly to other errors encountered during iteration: i.err is set and newStripeNewKey is returned.
Close #3082.