storage: add range key support for SST iterators and writers#82799
storage: add range key support for SST iterators and writers#82799craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
12a7f24 to
5984d9f
Compare
5984d9f to
532d334
Compare
|
Looks like This is unnecessary, and should be avoided. However, it will require exposing range keys via |
nicktrav
left a comment
There was a problem hiding this comment.
Still reading - flushing some comments.
Reviewed 4 of 11 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/storage/sst_writer.go line 172 at r2 (raw file):
return errors.Wrapf(err, "failed to encode MVCC value for range key %s", rangeKey) } fw.DataSize += int64(len(rangeKey.StartKey)) + int64(len(rangeKey.EndKey)) + int64(len(valueRaw))
We don't account for the timestamp in the size? I see there's no change from the existing functionality - just curious. Is it more like "user data", and doesn't care about the on-disk size?
pkg/storage/sst_writer.go line 183 at r2 (raw file):
func (fw *SSTWriter) ExperimentalClearMVCCRangeKey(rangeKey MVCCRangeKey) error { if !fw.supportsRangeKeys { return nil // noop
I assume this is to avoid having litter the client code with if supportsRangeKey checks? Instead we'll unconditionally call this method and handle the switching in the one place.
Given we're not panic-ing, is there any risk of subtle errors creeping in here - say some code was expecting the table to have range keys, but it didn't, so we just chugged along. Maybe that's ok.
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @nicktrav)
pkg/storage/sst_writer.go line 172 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
We don't account for the timestamp in the size? I see there's no change from the existing functionality - just curious. Is it more like "user data", and doesn't care about the on-disk size?
Doesn't seem like it, no -- that bugged me a bit too. But I didn't want to get sidetracked by potentially breaking API changes and stuff here, so decided to just go with the status quo. We can address that separately if need be.
pkg/storage/sst_writer.go line 183 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I assume this is to avoid having litter the client code with
if supportsRangeKeychecks? Instead we'll unconditionally call this method and handle the switching in the one place.Given we're not panic-ing, is there any risk of subtle errors creeping in here - say some code was expecting the table to have range keys, but it didn't, so we just chugged along. Maybe that's ok.
It's because the code often doesn't have access to e.g. settings to make this determination, and doesn't really care either.
Usually this is called via some other method, e.g. ClearRawRange(), which then calls through to ExperimentalClearMVCCRangeKey(). The original caller doesn't care whether we drop a tombstone or not, it only cares that we delete everything in the span -- if we don't support range keys, then there cannot be any range keys there to clear, so we can omit the tombstone and everyone's happy. However, we need ClearRawRange to work both before and after range keys are enabled.
Note that PutMVCCRangeKey is a different story -- in that case the original caller definitely needs to care about whether range keys are supported, and we'll error otherwise. But not clearing range keys when there cannot exist any range keys to clear gives the intended outcome anyway.
That said, this all makes sense in the Engine case (where this code was lifted from), since we have global synchronization via the local Pebble instance. The SSTWriter is subtly different, in that it may actually matter whether we drop a RangeKeyDelete into an SST that's then ingested on a different node (who may not have had its version gate flipped yet). This is controlled by EnablePebbleFormatVersionRangeKeys, which does a migration to make sure everyone supports these SSTs.
I've looked over the uses of MakeIngestionSSTWriter, and it seems like in the places that matter (e.g. when applying Raft snapshots and during range merges and such) we're always building and applying the SSTs locally under an exclusive lock. So I don't think there can be a race there where we start building the SST, then the version gate flips, someone writes a range key, and we fail to clear it. It's worth confirming this, but I don't think we need to block this PR on it, and it's unclear how it'd affect the SSTWriter in any case -- it'd have to rely on external synchronization, I think.
Am I missing anything?
532d334 to
bb295ac
Compare
nicktrav
left a comment
There was a problem hiding this comment.
Flushing again - I'm picking through the test cases slowly.
Reviewed 3 of 11 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/storage/sst_writer.go line 172 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Doesn't seem like it, no -- that bugged me a bit too. But I didn't want to get sidetracked by potentially breaking API changes and stuff here, so decided to just go with the status quo. We can address that separately if need be.
sgtm
pkg/storage/sst_writer.go line 183 at r2 (raw file):
Thanks for the detail.
Am I missing anything?
I don't think so. I think this is fine.
pkg/storage/testdata/mvcc_histories/sst_iter line 5 at r2 (raw file):
# separately. # # TODO(erikgrinaker): Add overlapping range keys when it doesn't panic.
I assume this is a Cockroach-level limitation? The pebble writer should already support this.
pkg/storage/testdata/mvcc_histories/sst_iter line 12 at r2 (raw file):
sst_put k=b ts=1 v=b1 sst_put k=c ts=2 sst_put_rangekey k=b end=f ts=4
Worth interleaving the rangekeys and point keys a bit to highlight the fact that they don't necessarily need to be added after all the point keys have been added? i.e. this range key could be the first key added and it should still produce the same output.
pkg/storage/testdata/mvcc_histories/sst_iter_multi line 16 at r2 (raw file):
# 4. Write a new range key that partially replaces one in 3 with different localTs. # 5. Clear the [h-j) span. #
Worth mentioning that the output of this test command has the tables in reverse order?
pkg/storage/testdata/mvcc_histories/sst_iter_multi line 41 at r2 (raw file):
---- >> at end: >> sst-0:
Should we have some way of representing a range key del / range del?
pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):
# Writing the same key multiple times uses the first key. # # TODO(erikgrinaker): Is this expected?
This surprised me too - but looking at the raw sstable that's written, I see the two keys.
nicktrav
left a comment
There was a problem hiding this comment.
Nice! Nothing major other than the few questions / observations about the test cases.
Reviewed 4 of 11 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/storage/testdata/mvcc_histories/sst_writer line 66 at r2 (raw file):
# Clearing a span does not affect SST contents. # # TODO(erikgrinaker): Is this correct?
I left a comment elsewhere - it would be nice to see the range dels / range key dels in the output, given there are physical keys in the sstable. Maybe something like a "raw iteration mode", that uses just a boring old sstable.Reader for these sst tests?
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @jbowens, and @nicktrav)
pkg/storage/testdata/mvcc_histories/sst_iter line 5 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I assume this is a Cockroach-level limitation? The pebble writer should already support this.
No, it's cockroachdb/pebble#1760
pkg/storage/testdata/mvcc_histories/sst_iter line 12 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Worth interleaving the rangekeys and point keys a bit to highlight the fact that they don't necessarily need to be added after all the point keys have been added? i.e. this range key could be the first key added and it should still produce the same output.
Yeah, may as well.
pkg/storage/testdata/mvcc_histories/sst_iter_multi line 16 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Worth mentioning that the output of this test command has the tables in reverse order?
Yeah, maybe it's better to keep them oldest-to-newest, as they're written. This shows them in the order they're fed to NewExternalIter, and earlier SSTs take precedence over latter SSTs, so we have to reverse them to get the expected order where later writes replace earlier writes -- but I think we should optimize for test case understandability here, even though the API may use them differently.
pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
This surprised me too - but looking at the raw sstable that's written, I see the two keys.
I think it's just because it uses the first key it finds while iterating.
pkg/storage/testdata/mvcc_histories/sst_writer line 66 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
I left a comment elsewhere - it would be nice to see the range dels / range key dels in the output, given there are physical keys in the sstable. Maybe something like a "raw iteration mode", that uses just a boring old
sstable.Readerfor these sst tests?
Probably a good idea. I figured that'd be an implementation detail at this level, but would be useful to dump them.
bb295ac to
5658912
Compare
nicktrav
left a comment
There was a problem hiding this comment.
Reviewed 3 of 3 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @nicktrav)
5658912 to
66bbb6e
Compare
63ecb99 to
760d847
Compare
erikgrinaker
left a comment
There was a problem hiding this comment.
Looks like TestStoreRangeMergeRaftSnapshot is failing here, because ClearRangeWithHeuristic unconditionally drops a Pebble range tombstones across the range key span, thus generating additional SSTs when clearing range data.
This is unnecessary, and should be avoided. However, it will require exposing range keys via EngineIterator first, see #82935.
Had to pull in #83031 for this, sorry about the noise. For now, I've made a quick hack in ClearRangeWithHeuristic to get tests to pass, as a separate commit. Wrote up #83032 to revisit the range clearing APIs and heuristics.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @nicktrav)
pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I think it's just because it uses the first key it finds while iterating.
Can you confirm that this behavior makes sense? I think this might be what makes a rangedel ineffectual as well, since it sees the SET before the RANGEDEL, or some such. Have a look at this case:
pkg/storage/testdata/mvcc_histories/sst_writer line 66 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Probably a good idea. I figured that'd be an implementation detail at this level, but would be useful to dump them.
Done. This is much better, good idea.
760d847 to
c970e69
Compare
nicktrav
left a comment
There was a problem hiding this comment.
Did another pass, and addressed the open comments.
Reviewed 2 of 8 files at r6, 3 of 6 files at r7, 1 of 1 files at r8, 11 of 11 files at r9, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aliher1911, @erikgrinaker, and @jbowens)
pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Can you confirm that this behavior makes sense? I think this might be what makes a rangedel ineffectual as well, since it sees the SET before the RANGEDEL, or some such. Have a look at this case:
Range dels with sequence number m only apply to keys with seqnum n if n < m (see here). As all the keys in the table are written out with the same sequence number, the range del wont apply.
That's also why we see the rangekeyset and the rangekeydel next to each other over the span c-e (see here).
In the case you write out a new table with just the sst_clear_range and have that shadow the first table, it should take effect. I think you already have coverage for this in sst_iter_multi, but I guess you could just put another sst_finish after the sst_clear_range to put that in a standalone table?
pkg/storage/testdata/mvcc_histories/sst_writer line 50 at r9 (raw file):
iter_scan: "a"/1.000000000,0=/BYTES/a1 {a-c}/[3.000000000,0={localTs=2.000000000,0}/<empty>] iter_scan: . >> at end:
Super nitpicky, but I was expecting to see the raw sstable output first, then followed by the scan output, given we do sst_finish, followed by sst_iter_new. Maybe there's something subtle in the test harness that prevents that, so not a big issue.
This patch adds a quick hack to `ClearRangeWithHeuristic` to avoid dropping a Pebble range tombstone across the range key span if there are no range keys. Similar heuristics exist in `Pebble.ExperimentalClearAllRangeKeys`, but we may be writing to an `SSTWriter` here which can't have these heuristics. This scheme needs to be revamped, but for now we add this quick hack to pass tests and unblock `SSTWriter` range key support. Release note: None
This patch adds `NewPebbleSSTIterator()`, which constructs an MVCC iterator for SSTs. Unlike the existing `sstIterator`, it supports range keys and merged iteration across multiple SSTs. It is based on a new `pebble.NewExternalIter()` API for SST iteration, and reuses `pebbleIterator` for the CRDB logic. Value verification has been split out to a separate, general `VerifyingMVCCIterator`. This iterator currently has significantly worse performance than the existing SST iterator (as much as 100%), so it is not used yet outside of tests. This will be optimized later. The patch also adds support for writing range keys in `SSTWriter`. This is only enabled when the SST format is `TableFormatPebblev2` or newer, which is only the case once the cluster version reaches `EnablePebbleFormatVersionRangeKeys`. Release note: None
c970e69 to
cf162f9
Compare
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aliher1911, @jbowens, and @nicktrav)
pkg/storage/testdata/mvcc_histories/sst_writer line 37 at r2 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Range dels with sequence number
monly apply to keys with seqnumnifn < m(see here). As all the keys in the table are written out with the same sequence number, the range del wont apply.That's also why we see the
rangekeysetand therangekeydelnext to each other over the spanc-e(see here).In the case you write out a new table with just the
sst_clear_rangeand have that shadow the first table, it should take effect. I think you already have coverage for this insst_iter_multi, but I guess you could just put anothersst_finishafter thesst_clear_rangeto put that in a standalone table?
Thanks, that makes sense. Added test cases both for same SST and different SST.
pkg/storage/testdata/mvcc_histories/sst_writer line 50 at r9 (raw file):
Previously, nicktrav (Nick Travers) wrote…
Super nitpicky, but I was expecting to see the raw sstable output first, then followed by the scan output, given we do
sst_finish, followed bysst_iter_new. Maybe there's something subtle in the test harness that prevents that, so not a big issue.
This is an artifact of the test harness -- it outputs the persisted state at the end of each test, after commands have been run. It can output the intermediate states after each mutation by using run trace, but that quickly gets noisy.
Split the test into two instead: one which sets up the data, and one which iterates.
|
TFTR! bors r=nicktrav |
|
Build succeeded: |
This patch adds
NewPebbleSSTIterator(), which constructs an MVCCiterator for SSTs. Unlike the existing
sstIterator, it supports rangekeys and merged iteration across multiple SSTs. It is based on a new
pebble.NewExternalIter()API for SST iteration, and reusespebbleIteratorfor the CRDB logic. Value verification has been splitout to a separate, general
VerifyingMVCCIterator.This iterator currently has significantly worse performance than the
existing SST iterator (as much as 100%), so it is not used yet outside
of tests. This will be optimized later.
The patch also adds support for writing range keys in
SSTWriter. Thisis only enabled when the SST format is
TableFormatPebblev2or newer,which is only the case once the cluster version reaches
EnablePebbleFormatVersionRangeKeys.Resolves #82586.
Release note: None