rangefeed: use time-bound iterator for some catchup scans #66312
craig[bot] merged 3 commits into cockroachdb:master
Conversation
Force-pushed from ae6ece1 to b3be444
This looks right to me. I wonder if there's any reason to have hesitation. @nvanbenschoten, any reason for hesitation here that you can think of?
nvb
left a comment
This seems safe in theory, but I don't understand the current state of MVCCIncrementalIterator enough to vouch for the safety of the use of MinTimestampHint without a corresponding non-time bound iterator. I remember some concern from years ago about an intent's metadata key and its provisional value key being split between SSTs and then iteration considering the provisional value to be a committed value. I don't know whether this is still possible.
Perhaps @adityamaru or @pbardea could weigh-in, as they have touched MVCCIncrementalIterator fairly recently.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @miretskiy)
That sounds very bad indeed. It looks like there was an attempt to address just that here: but I'll need to dig into it to see if that fix also covers the case of time-bound iterators.
I wrote a version of
I took to mean that in the case of separated intents, we wouldn't see the intent and also wouldn't see the provisional value. But now I worry it means we won't see the intents but will see the provisional values. #43799 also has a lot of conversation I should probably peruse as well. I can see if NewMVCCIncrementalIterator provides a similar performance improvement as using the TBI alone.
From what I can tell, this is going to be wrong for separated intents, and it makes me pretty nervous that we don't have a test that failed here. For example, if I create a database with a few keys and 1 pending transaction and a TBI on that DB with the settings here, and if separated intents are off, then we still end up getting the intent and the uncommitted key ("c"). But, with separated intents on, we see key "c" with no indication of the intent.

In retrospect, this is exactly what that comment told me (and is somewhat clear now that I understand a bit more about these APIs); I just let optimism get the better of me. So, as-is, this definitely isn't usable here unless I'm missing something.

I can look into making something like MVCCIncrementalIterator but with the semantics needed for the catch-up scan in rangefeeds (still experimenting here to see if anything needs to change; I think we just need to be careful with the inclusive vs exclusive time bounds and potentially provide some different behaviour around intent handling).
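The failure mode described above can be sketched with a self-contained toy model (all types and names below are invented for illustration; this is not CockroachDB's storage API): a time-bound scan that only visits the main MVCC keyspace surfaces an interleaved intent record, but with separated intents the lock-table record is never visited, so the provisional value at "c" looks committed.

```go
package main

import "fmt"

// version is a toy stand-in for a versioned point key. An interleaved
// intent is modeled as a metadata record stored next to its value;
// separated intents are simply absent from the main keyspace.
type version struct {
	key    string
	ts     int
	intent bool // interleaved intent metadata record
}

// timeBoundScan mimics a TBI over only the main keyspace within
// (start, end]; it never reads a separated lock table.
func timeBoundScan(main []version, start, end int) []string {
	var out []string
	for _, v := range main {
		if v.intent || (v.ts > start && v.ts <= end) {
			label := v.key
			if v.intent {
				label += " (intent)"
			}
			out = append(out, label)
		}
	}
	return out
}

func main() {
	// Keys a, b are committed; c has a pending txn's provisional value at ts=5.
	interleaved := []version{
		{key: "a", ts: 1},
		{key: "b", ts: 2},
		{key: "c", intent: true}, // metadata record lives next to the value
		{key: "c", ts: 5},
	}
	// With separated intents, the metadata record moves to the lock
	// table, which this time-bound scan never visits.
	separated := []version{
		{key: "a", ts: 1},
		{key: "b", ts: 2},
		{key: "c", ts: 5},
	}
	fmt.Println("interleaved:", timeBoundScan(interleaved, 0, 10))
	fmt.Println("separated:  ", timeBoundScan(separated, 0, 10))
}
```

In the separated case the scan returns "c" with no intent marker, matching the experiment described in the comment.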
Force-pushed from 9d5f89b to d2e33f7
I've redone this based on MVCCIncrementalIterator. Unfortunately, we can only turn this optimization on in some cases, since the withDiff option to rangefeeds requires that we see values from arbitrarily far in the past.
What's unfortunate is that the
Yup. I've started in on this but wanted to at least get this first slice of work out.
Force-pushed from 9fe409e to 5928791
Force-pushed from 5928791 to 9a8248c
I think this should be ready for a second look. I added an additional iterator to handle the previous values and added some testing for the intent and inline policies.
Force-pushed from 9a8248c to 60ae339
ajwerner
left a comment
Are we trying to get this in for 21.2?
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @nvanbenschoten)
@ajwerner I would like that. I'm working on some other emergent bugs at the moment, but the TODO list is
Force-pushed from 60ae339 to f0d5f9d
miretskiy
left a comment
Reviewed 5 of 7 files at r12.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
Force-pushed from 2fd4842 to d8be2a6
Force-pushed from d8be2a6 to 0c2d082
Would love it if we could get this PR in for the next release. Given that slow catch-up scans have caused a number of customer-facing issues, and given that this is off by default, I feel like we could even make an argument for it going in during stability. Also, I've opened a PR for the withDiff support: #69191
sumeerbhola
left a comment
Reviewed 3 of 7 files at r10, 4 of 7 files at r12, 1 of 1 files at r13, 3 of 3 files at r14.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @stevendanna)
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 119 at r14 (raw file):
```
if ok, err := i.Valid(); err != nil {
	return err
} else if !ok || !i.UnsafeKey().Less(endKey) {
```
Iterators are required to enforce the upper bound, so this additional comparison is unnecessary.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 149 at r14 (raw file):
```
// If write is inline, it doesn't have a timestamp so we don't
// filter on the registration's starting timestamp. Instead, we
// return all inline writes.
```
We may not see inline meta values that have been updated since the last time this rangefeed was active, as the catchup scan using TBI could miss them. Is that acceptable?
Also, do we really want to return all inline meta, even if they have not changed?
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 169 at r14 (raw file):
```
// or before the registration's (exclusive) starting timestamp.
ignore := !(ts.IsEmpty() || catchupTimestamp.Less(ts))
if ignore && !withDiff {
```
Can we move this if-block before we do the copying due to sameKey being false? If most of the keys we are encountering only have old versions, we would save many unnecessary key copies.
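A minimal sketch of the reordering being suggested, with hypothetical names echoing the discussion (this is not the actual catch-up scan code): by evaluating the skip condition before the key copy, keys whose versions all predate the catch-up timestamp never pay for a copy.

```go
package main

import "fmt"

// kv is a toy versioned key (timestamps as plain ints for illustration).
type kv struct {
	key string
	ts  int
}

// scan emits versions newer than catchupTS and records which keys were
// "copied". Checking `ignore` before the key-change branch means a key
// whose only versions are old never triggers a copy.
func scan(versions []kv, catchupTS int, withDiff bool) (emitted, copied []string) {
	var lastKey string
	for _, v := range versions {
		ignore := !(catchupTS < v.ts)
		if ignore && !withDiff {
			continue // skip before copying: old version, no diff needed
		}
		if v.key != lastKey {
			lastKey = v.key
			copied = append(copied, v.key) // stand-in for the key copy
		}
		if !ignore {
			emitted = append(emitted, fmt.Sprintf("%s@%d", v.key, v.ts))
		}
	}
	return emitted, copied
}

func main() {
	// "b" only has an old version, so it is skipped before any copy.
	versions := []kv{{"a", 9}, {"a", 1}, {"b", 2}, {"c", 8}}
	emitted, copied := scan(versions, 5, false)
	fmt.Println("emitted:", emitted, "copied:", copied)
}
```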
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 178 at r14 (raw file):
```
var val []byte
a, val = a.Copy(unsafeVal, 0)
if withDiff {
```
if there is no update to a key within the time interval of the catchup scan, there is no diff, yes? So we are unnecessarily copying the val just to pass it to addPrevToLastEvent which will ignore it because reorderBuf is empty?
pkg/storage/mvcc_incremental_iterator.go, line 120 at r14 (raw file):
```
// MVCCIncrementalIterIntentPolicyEmit will return intents to
// the caller if they are inside the time range. Intents
// outside of the time range will be filtered without error.
```
In this emit case, which is what we are using in the catchup scan, it seems we will parse MVCCMetadata twice, in here and in the caller. Is there a reason the MVCCIncrementalIterator implementation needs to promise to filter in this case -- it could just return the intent to the caller and let it deal with it?
It is ok to leave this optimization as a TODO for the future.
pkg/storage/mvcc_incremental_iterator.go, line 527 at r14 (raw file):
```
}
// We have a valid KV.
```
or an intent to emit.
pkg/util/constants.go, line 119 at r14 (raw file):
```
func ConstantWithMetamorphicTestBool(name string, defaultValue bool) bool {
	if metamorphicBuild {
		if rng.Float64() < metamorphicValueProbability {
```
metamorphicValueProbability is 0.75, so it will usually return !defaultValue. I don't quite see why the comment says "occasionally".
And can't we just use ConstantWithMetamorphicTestValue(name, defaultValue, !defaultValue) instead of defining this function?
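For reference, a standalone sketch of what such a bool-flavored metamorphic helper could look like, assuming a dedicated 0.5 flip probability (names, globals, and structure here are illustrative; the real helper lives in pkg/util and shares state with the other metamorphic constants):

```go
package main

import (
	"fmt"
	"math/rand"
)

// flip roughly half the time, rather than the 0.75 used for
// value-valued metamorphic constants (illustrative choice).
const metamorphicBoolProbability = 0.5

var (
	metamorphicBuild = true
	rng              = rand.New(rand.NewSource(42)) // seeded for reproducibility
)

// constantWithMetamorphicTestBool returns defaultValue in normal
// builds; in metamorphic builds it returns the negation with
// probability metamorphicBoolProbability.
func constantWithMetamorphicTestBool(name string, defaultValue bool) bool {
	if metamorphicBuild {
		if rng.Float64() < metamorphicBoolProbability {
			v := !defaultValue
			fmt.Printf("initialized metamorphic constant %q: %v\n", name, v)
			return v
		}
	}
	return defaultValue
}

func main() {
	useTBI := constantWithMetamorphicTestBool("use-time-bound-iterator", false)
	fmt.Println("useTBI:", useTBI)
}
```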
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @stevendanna, and @sumeerbhola)
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 149 at r14 (raw file):
We may not see inline meta values that have been updated since the last time this rangefeed was active, as the catchup scan using TBI could miss them. Is that acceptable?
This is a good point. In the non-TBI case, you can miss intermediate values (if the value was updated more than once), but the range feed would at least send you the most recent version.
From the perspective of changefeeds I think it is acceptable to miss inline values because we only handle user data which should all be MVCC and so we shouldn't be seeing inline values at all.
One short-term option would be to have a per-rangefeed setting that the caller uses to opt-into TBI. Then we could turn it on for changefeeds without turning it on for other users who might care about inline values (although it isn't clear to me that there are any, see below). I had this in an earlier iteration of this and could dig it back up. I think this would be my preference.
Also, do we really want to return all inline meta, even if they have not changed?
This was the existing behaviour which I've left. Outside of changefeeds I believe the other current users of rangefeeds are: Settings Watcher (following the settings table), Lease Manager (following the descriptor table), and the Stats Cache (following the statistics table). From my reading of the code, those spans should not have inline values in them.
sumeerbhola
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @stevendanna)
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 149 at r14 (raw file):
Previously, stevendanna (Steven Danna) wrote…
We may not see inline meta values that have been updated since the last time this rangefeed was active, as the catchup scan using TBI could miss them. Is that acceptable?
This is a good point. In the non-TBI case, you can miss intermediate values (if the value was updated more than once), but the range feed would at least send you the most recent version.
From the perspective of changefeeds I think it is acceptable to miss inline values because we only handle user data which should all be MVCC and so we shouldn't be seeing inline values at all.
One short-term option would be to have a per-rangefeed setting that the caller uses to opt-into TBI. Then we could turn it on for changefeeds without turning it on for other users who might care about inline values (although it isn't clear to me that there are any, see below). I had this in an earlier iteration of this and could dig it back up. I think this would be my preference.
Also, do we really want to return all inline meta, even if they have not changed?
This was the existing behaviour which I've left. Outside of changefeeds I believe the other current users of rangefeeds are: Settings Watcher (following the settings table), Lease Manager (following the descriptor table), and the Stats Cache (following the statistics table). From my reading of the code, those spans should not have inline values in them.
Thanks for the details. I don't have much understanding of the various rangefeed use cases, so as long as folks who do understand agree with this reasoning, fine with me.
I do think it is worth specifying the difference in semantics of what the rangefeed provides in these 2 cases in a code comment somewhere.
ajwerner
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @stevendanna, and @sumeerbhola)
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 149 at r14 (raw file):
Previously, sumeerbhola wrote…
Thanks for the details. I don't have much understanding of the various rangefeed use cases, so as long as folks who do understand agree with this reasoning, fine with me.
I do think it is worth specifying the difference in semantics of what the rangefeed provides in these 2 cases in a code comment somewhere.
Today all spans watched are real, live SQL tables and never see inline values.
Force-pushed from 0c2d082 to 63584c7
stevendanna
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @sumeerbhola)
a discussion (no related file):
TFTR!
I've left the handling of inline values as-is for right now. I've opened an issue (#69357) to look into whether we can remove support for them.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 119 at r14 (raw file):
Previously, sumeerbhola wrote…
Iterators are required to enforce the upper bound, so this additional comparison is unnecessary.
Done. I'm not sure why this was added originally, but currently it looks like the main use for this is that we have a test iterator implementation that doesn't take an upper bound and there is an existing test to make sure that keys are still not emitted in this case. I've updated the testIterator to respect an upperBound if set and removed this comparison. It appears to be a nice win.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 169 at r14 (raw file):
Previously, sumeerbhola wrote…
Can we move this if-block before we do the copying due to sameKey being false? If most of the keys we are encountering only have old versions, we would save many unnecessary key copies.
Done. See the second commit message for some benchmark results of this and the other avoided copy.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 178 at r14 (raw file):
Previously, sumeerbhola wrote…
if there is no update to a key within the time interval of the catchup scan, there is no diff, yes? So we are unnecessarily copying the `val` just to pass it to `addPrevToLastEvent`, which will ignore it because `reorderBuf` is empty?
Done. See the second commit message for some benchmark results of this and the other avoided copy.
pkg/storage/mvcc_incremental_iterator.go, line 120 at r14 (raw file):
Previously, sumeerbhola wrote…
In this emit case, which is what we are using in the catchup scan, it seems we will parse `MVCCMetadata` twice, in here and in the caller. Is there a reason the `MVCCIncrementalIterator` implementation needs to promise to filter in this case -- it could just return the intent to the caller and let it deal with it?
It is ok to leave this optimization as a TODO for the future.
I've left this as a TODO. One thing I would want to double-check and test here is that we always return the following key if the intent was returned.
pkg/storage/mvcc_incremental_iterator.go, line 527 at r14 (raw file):
Previously, sumeerbhola wrote…
or an intent to emit.
Done.
pkg/util/constants.go, line 119 at r14 (raw file):
metamorphicValueProbability is 0.75, so it will usually return !defaultValue. I don't quite see why the comment says "occasionally".
I've updated the comment and changed the probabilities for bools to 0.50, which seems more appropriate.
And can't we just use ConstantWithMetamorphicTestValue(name, defaultValue, !defaultValue) instead of defining this function?
I don't think we can without changing the argument and return types of that function which I think would be annoying for most other users? There is no implicit conversion/cast between bool and int. Or have I misunderstood something here?
Anything else we would like to see here? I'm happy to keep plugging away at it.
sumeerbhola
left a comment
Reviewed 1 of 2 files at r11, 1 of 7 files at r12, 3 of 3 files at r15, 5 of 5 files at r16.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy, @stevendanna, and @sumeerbhola)
a discussion (no related file):
Previously, stevendanna (Steven Danna) wrote…
TFTR!
I've left the handling of inline values as-is for right now. I've opened an issue (#69357) to look into whether we can remove support for them.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 155 at r16 (raw file):
```
// miss inline values completely and normal
// iterators may result in the rangefeed not
// seeing some intermeidate values.
```
intermediate
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 182 at r16 (raw file):
```
lastKey.Timestamp = unsafeKey.Timestamp
```
Would help to have something like
```
// INVARIANT: !ignore || withDiff
// Cases
// - !ignore: we need to copy the unsafeVal to add to the reorderBuf,
//   regardless of the value of withDiff
// - withDiff && ignore: we need to copy the unsafeVal only if there is
//   already something in the reorderBuf for which we need to set the
//   previous value. The common case will be an empty reorderBuf.
```
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 188 at r16 (raw file):
```
}
if withDiff {
```
Can you hoist this if-block into the block where we create the copy of unsafeVal? And same for the else below, which uses val. Then we can move the declaration of val into the if-block itself, and it will be obvious to the reader that the subsequent uses of val are working with an initialized val. Currently one needs to look at the various predicates to be sure of correctness.
pkg/kv/kvserver/rangefeed/catchup_scan_bench_test.go, line 43 at r16 (raw file):
```
	EndKey: endKey,
}
// fmt.Println(eng.GetMetrics().String())
```
stale debugging code
pkg/kv/kvserver/rangefeed/catchup_scan_bench_test.go, line 99 at r16 (raw file):
```
// low and also return a read only engine to prevent
// read-based compactions after the initial data
// generation.
```
can you add something like the following to this comment:
// This case is trying to simulate a larger store, but with fewer bytes. If we did not reduce LBaseMaxBytes, almost all data would be in Lbase or L6, and TBI would be ineffective. By reducing LBaseMaxBytes, the data should spread out over more levels, like in a real store. The LSM state depicted below shows that this was only partially successful.
pkg/kv/kvserver/rangefeed/catchup_scan_bench_test.go, line 241 at r16 (raw file):
```
if opts.randomKeyOrder {
	// Randomize the order in which the keys are written.
```
can you use rng.Shuffle?
pkg/util/constants.go, line 119 at r14 (raw file):
Or have I misunderstood something here?
never mind. I am not sure what I was thinking.
Force-pushed from 63584c7 to ac2365f
stevendanna
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @miretskiy, @stevendanna, and @sumeerbhola)
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 155 at r16 (raw file):
Previously, sumeerbhola wrote…
intermediate
Done.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 182 at r16 (raw file):
Previously, sumeerbhola wrote…
Would help to have something like
```
// INVARIANT: !ignore || withDiff
// Cases
// - !ignore: we need to copy the unsafeVal to add to the reorderBuf,
//   regardless of the value of withDiff
// - withDiff && ignore: we need to copy the unsafeVal only if there is
//   already something in the reorderBuf for which we need to set the
//   previous value. The common case will be an empty reorderBuf.
```
Done.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 188 at r16 (raw file):
Previously, sumeerbhola wrote…
Can you hoist this if-block into the block where we create the copy of `unsafeVal`? And same for the else below, which uses `val`. Then we can move the declaration of `val` into the if-block itself and it will be obvious to the reader that the subsequent uses of `val` are working with an initialized `val`. Currently one needs to look at the various predicates to be sure of correctness.
Done.
pkg/kv/kvserver/rangefeed/catchup_scan_bench_test.go, line 43 at r16 (raw file):
Previously, sumeerbhola wrote…
stale debugging code
Done.
pkg/kv/kvserver/rangefeed/catchup_scan_bench_test.go, line 99 at r16 (raw file):
Previously, sumeerbhola wrote…
can you add something like the following to this comment:
// This case is trying to simulate a larger store, but with fewer bytes. If we did not reduce LBaseMaxBytes, almost all data would be in Lbase or L6, and TBI would be ineffective. By reducing LBaseMaxBytes, the data should spread out over more levels, like in a real store. The LSM state depicted below shows that this was only partially successful.
Done.
pkg/kv/kvserver/rangefeed/catchup_scan_bench_test.go, line 241 at r16 (raw file):
Previously, sumeerbhola wrote…
can you use `rng.Shuffle`?
Looks like it. Done.
sumeerbhola
left a comment
Reviewed 4 of 4 files at r17.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy, @stevendanna, and @sumeerbhola)
nvb
left a comment
This seems like a great change to me. I didn't scrutinize the logic in (*CatchupIterator).CatchupScan, but I trust that it has had plenty of eyes on it.
Reviewed 1 of 7 files at r10, 3 of 7 files at r12, 1 of 1 files at r13, 3 of 3 files at r15, 4 of 5 files at r16, 4 of 4 files at r17.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy, @stevendanna, and @sumeerbhola)
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 73 at r15 (raw file):
```
}

type outputEventFn func(e *roachpb.RangeFeedEvent) error
```
Could we give this a comment that mentions ownership of the provided *RangeFeedEvent? CatchupScan assumes that it retains ownership of the *RangeFeedEvent after the call, but it's not clear what the contract is for interior memory.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 47 at r17 (raw file):
```
if useTBI && !args.WithDiff {
	ret.SimpleMVCCIterator = storage.NewMVCCIncrementalIterator(reader, storage.MVCCIncrementalIterOptions{
		EnableTimeBoundIteratorOptimization: true,
```
Out of curiosity, is the branching logic in this function equivalent to passing EnableTimeBoundIteratorOptimization: useTBI && !args.WithDiff?
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 49 at r17 (raw file):
```
		EnableTimeBoundIteratorOptimization: true,
		EndKey:                              args.Span.EndKey,
		// StartTime is exclusive
```
nit: Say a bit more here. StartTime is exclusive but args.Timestamp is inclusive.
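The off-by-one being documented here can be illustrated with a toy model using plain integers in place of HLC timestamps (`inWindow` is a hypothetical stand-in for the iterator's time-bound filter, not the actual API):

```go
package main

import "fmt"

// inWindow models an incremental iterator's time window: StartTime is
// exclusive, EndTime is inclusive.
func inWindow(ts, startExclusive, endInclusive int) bool {
	return ts > startExclusive && ts <= endInclusive
}

func main() {
	argsTimestamp := 10 // the rangefeed wants events at ts >= 10 (inclusive)
	const maxTS = 1 << 30

	// Passing argsTimestamp directly as the exclusive StartTime would
	// silently drop events at exactly ts=10:
	fmt.Println(inWindow(10, argsTimestamp, maxTS)) // false

	// Decrementing first (the role Prev() plays) keeps ts=10 in the window:
	fmt.Println(inWindow(10, argsTimestamp-1, maxTS)) // true
}
```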
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 52 at r17 (raw file):
```
		StartTime:    args.Timestamp.Prev(),
		EndTime:      hlc.MaxTimestamp,
		IntentPolicy: storage.MVCCIncrementalIterIntentPolicyEmit,
```
I think it would be valuable to justify the use of both of these policies and explain why CatchupScan wants to see both intents and inline values.
pkg/storage/mvcc_incremental_iterator.go, line 343 at r17 (raw file):
```
// The method sets i.err with the error for future processing.
func (i *MVCCIncrementalIterator) initMetaAndCheckForIntentOrInlineError() error {
	i.meta.Reset()
```
Any reason for this movement? protoutil.Unmarshal already calls Reset.
pkg/storage/mvcc_incremental_iterator.go, line 397 at r17 (raw file):
```
	return nil
case MVCCIncrementalIterIntentPolicyEmit:
	// We will emit this intent to the caller
```
nit: punctuation.
stevendanna
left a comment
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @miretskiy, @nvanbenschoten, @stevendanna, and @sumeerbhola)
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 73 at r15 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could we give this a comment that mentions ownership of the provided `*RangeFeedEvent`? `CatchupScan` assumes that it retains ownership of the `*RangeFeedEvent` after the call, but it's not clear what the contract is for interior memory.
I've added a few words, but as a TODO. I think we likely want to think more carefully about the buffer management we do in this function: #69357. Ideally the contract would be that outputEventFn assumes the caller owns the memory and that it may become invalid after outputEventFn returns. That way we would have a clear point at which we could reset our buffer so that it doesn't grow unbounded throughout the catchup scan.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 47 at r17 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Out of curiosity, is the branching logic in this function equivalent to passing `EnableTimeBoundIteratorOptimization: useTBI && !args.WithDiff`?
Yes, I believe so, but my original goal (before the subsequent improvements to the CatchupScan code) was to make it clear that the old code path was relatively unchanged when this option was not enabled.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 49 at r17 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: Say a bit more here. `StartTime` is exclusive but `args.Timestamp` is inclusive.
Done.
pkg/kv/kvserver/rangefeed/catchup_scan.go, line 52 at r17 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think it would be valuable to justify the use of both of these policies and explain why `CatchupScan` wants to see both intents and inline values.
Done.
pkg/storage/mvcc_incremental_iterator.go, line 343 at r17 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Any reason for this movement? `protoutil.Unmarshal` already calls `Reset`.
Ah, did not know Unmarshal called Reset. Moved back.
pkg/storage/mvcc_incremental_iterator.go, line 397 at r17 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: punctuation.
Done.
Force-pushed from c93d769 to d892624
We've been seeing periodic failures near the end of cdc/ledger tests. During these failures, our changefeed's high-watermark falls too far behind realtime. We believe that this is the result of a range-split that happens when the `entry` table reaches > 512 MiB. The range split results in a catch-up scan which takes considerable time. We've seen similar behaviour in customer installations where we've observed rangefeed catch-up scans taking many minutes.

Using the IncrementalIterator, our cdc/ledger test survives the range-split with no substantial increase in our changefeed latency.

The existing IncrementalIterator assumed that any intent or inline values inside the range were errors. To fully support rangefeeds, we want to see both of these. To facilitate this, we add two new configuration options to the IncrementalIterator, IntentPolicy and InlinePolicy.

The code for the catch-up scan has been extracted into its own method to make it easier to benchmark. Below are some benchmark results. Here, we see that in our best case, TBI improves catch-up scan performance substantially; however, there are some worst cases where it doesn't. We've opted to make this a cluster option that is disabled by default for now -- allowing us to recommend enabling it for customers where we think it will have a positive impact.
```
goos: darwin
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkCatchupScan/linear-keys/useTBI=true/withDiff=false/perc=0.00-16      2  683491479 ns/op
BenchmarkCatchupScan/linear-keys/useTBI=true/withDiff=false/perc=50.00-16     4  328828152 ns/op
BenchmarkCatchupScan/linear-keys/useTBI=true/withDiff=false/perc=75.00-16     7  160052262 ns/op
BenchmarkCatchupScan/linear-keys/useTBI=true/withDiff=false/perc=95.00-16    39   28729009 ns/op
BenchmarkCatchupScan/linear-keys/useTBI=true/withDiff=false/perc=99.00-16    57   19237201 ns/op
BenchmarkCatchupScan/linear-keys/useTBI=false/withDiff=false/perc=0.00-16     3  408734474 ns/op
BenchmarkCatchupScan/linear-keys/useTBI=false/withDiff=false/perc=50.00-16    4  329713989 ns/op
BenchmarkCatchupScan/linear-keys/useTBI=false/withDiff=false/perc=75.00-16    4  286746010 ns/op
BenchmarkCatchupScan/linear-keys/useTBI=false/withDiff=false/perc=95.00-16    4  255228312 ns/op
BenchmarkCatchupScan/linear-keys/useTBI=false/withDiff=false/perc=99.00-16    4  259558576 ns/op
BenchmarkCatchupScan/random-keys/useTBI=true/withDiff=false/perc=0.00-16      2  697607634 ns/op
BenchmarkCatchupScan/random-keys/useTBI=true/withDiff=false/perc=50.00-16     2  577198422 ns/op
BenchmarkCatchupScan/random-keys/useTBI=true/withDiff=false/perc=75.00-16     2  543513832 ns/op
BenchmarkCatchupScan/random-keys/useTBI=true/withDiff=false/perc=95.00-16     3  483985331 ns/op
BenchmarkCatchupScan/random-keys/useTBI=true/withDiff=false/perc=99.00-16     3  456677720 ns/op
BenchmarkCatchupScan/random-keys/useTBI=false/withDiff=false/perc=0.00-16     3  417899254 ns/op
BenchmarkCatchupScan/random-keys/useTBI=false/withDiff=false/perc=50.00-16    3  347611040 ns/op
BenchmarkCatchupScan/random-keys/useTBI=false/withDiff=false/perc=75.00-16    4  310572985 ns/op
BenchmarkCatchupScan/random-keys/useTBI=false/withDiff=false/perc=95.00-16    4  264969646 ns/op
BenchmarkCatchupScan/random-keys/useTBI=false/withDiff=false/perc=99.00-16    4  300928994 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=true/withDiff=false/perc=0.00-16       2  668092698 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=true/withDiff=false/perc=50.00-16      2  556421222 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=true/withDiff=false/perc=75.00-16      2  501923714 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=true/withDiff=false/perc=95.00-16      6  187828123 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=true/withDiff=false/perc=99.00-16      6  172837738 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=false/withDiff=false/perc=0.00-16      3  414819631 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=false/withDiff=false/perc=50.00-16     3  336780253 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=false/withDiff=false/perc=75.00-16     3  344123744 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=false/withDiff=false/perc=95.00-16     4  255228776 ns/op
BenchmarkCatchupScan/mixed-case/useTBI=false/withDiff=false/perc=99.00-16     4  491268277 ns/op
PASS
ok  github.com/cockroachdb/cockroach/pkg/kv/kvserver/rangefeed  75.496s
```

When using the incremental iterator, the previous value of a key might be outside the time window of the iterator. A future PR will extend the incremental iterator to handle this case; but here, we've disabled the optimization if withDiff is specified for the rangefeed.

Release note (enterprise change): Improve the performance for CHANGEFEEDs during some range-split operations.

Release justification: Performance improvement for catch-up scans helps address the cause of recent high-priority customer incidents.
This avoids two copies in the catch-up scan code. In the best case, this provides a nice win even without TBI enabled:

```
name                                                               old time/op    new time/op    delta
CatchupScan/linear-keys/useTBI=false/withDiff=true/perc=99.00-16    316ms ± 1%     261ms ± 2%   -17.17%  (p=0.000 n=23+24)
CatchupScan/linear-keys/useTBI=false/withDiff=false/perc=99.00-16   257ms ± 2%     226ms ± 3%   -12.05%  (p=0.000 n=24+23)

name                                                               old alloc/op   new alloc/op   delta
CatchupScan/linear-keys/useTBI=false/withDiff=true/perc=99.00-16   79.0MB ± 0%    10.4MB ± 0%   -86.89%  (p=0.000 n=24+25)
CatchupScan/linear-keys/useTBI=false/withDiff=false/perc=99.00-16  10.4MB ± 0%     2.5MB ± 0%   -76.03%  (p=0.000 n=20+25)

name                                                               old allocs/op  new allocs/op  delta
CatchupScan/linear-keys/useTBI=false/withDiff=true/perc=99.00-16    35.8k ± 0%     31.4k ± 0%   -12.20%  (p=0.000 n=22+25)
CatchupScan/linear-keys/useTBI=false/withDiff=false/perc=99.00-16   31.4k ± 0%     30.8k ± 0%    -2.04%  (p=0.000 n=25+25)
```

Release justification: Performance improvement for catch-up scans helps address the cause of recent high-priority customer incidents.

Release note: None
Iterators already respect the upper bound we set at construction time, making this comparison redundant. Removing this comparison provides a nice performance win:

```
name                                                               old time/op  new time/op  delta
CatchupScan/linear-keys/useTBI=false/withDiff=true/perc=99.00-16    261ms ± 2%   249ms ± 4%  -4.90%  (p=0.000 n=24+25)
CatchupScan/linear-keys/useTBI=false/withDiff=false/perc=99.00-16   226ms ± 3%   214ms ± 4%  -5.55%  (p=0.000 n=23+25)
```

Release note: None

Release justification: Performance improvement for catch-up scans helps address the cause of recent high-priority customer incidents.
Force-pushed from d892624 to 0417041
nvb
left a comment
Reviewed 8 of 8 files at r18.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @miretskiy, @stevendanna, and @sumeerbhola)
bors r=[nvanbenschoten,sumeerbhola]
Build succeeded: