kvserver: emit AddSSTable events via rangefeeds #73487

craig[bot] merged 3 commits into cockroachdb:master

Conversation
miretskiy
left a comment
Reviewed 10 of 18 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @erikgrinaker)
-- commits, line 7 at r1:
I don't see anywhere in this commit that checks for the MVCCAddSSTable setting.
What's the intended use case? Ought callers (i.e. rangefeed library users) to check
for this setting before enabling the OnSST event? Just trying to clarify correct usage.
pkg/kv/kvclient/rangefeed/config.go, line 115 at r1 (raw file):
}

// OnSST is called when an SSTable is ingested.
I was going to suggest that perhaps we should drop this option and SST handling, since SST handling is not "trivial"...
But maybe just expand the comment to indicate how subtle OnSST needs to be -- in particular, the fact that it may
include keys outside span boundaries.
pkg/kv/kvclient/rangefeed/rangefeed.go, line 367 at r1 (raw file):
// FIXME(erikgrinaker): Callers who do not set onSSTable will be oblivious
// to these events -- we may want to send such callers an error and force
// them to do a catchup scan.
I like the idea of returning an error to cause a catchup scan very much.
Alternatively, we could provide a default implementation that iterates the SST while ignoring keys outside span boundaries. But
forcing a catchup scan sounds a lot better: the caller decides, if they know what they are doing, to ask for SSTs;
otherwise things continue to work.
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @miretskiy)
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I don't see anywhere in this commit that checks for the MVCCAddSSTable setting.
What's the intended use case? Ought callers (i.e. rangefeed library users) to check
for this setting before enabling the OnSST event? Just trying to clarify correct usage.
It is up to AddSSTable RPC callers to check MVCCAddSSTable before using WriteAtRequestTimestamp. The server-side logic needs to respect WriteAtRequestTimestamp regardless of the version gate in order to handle the non-atomic rollout of version gates across the cluster, as described here:
cockroach/pkg/clusterversion/cockroach_versions.go
Lines 50 to 84 in d27f579
Once that gate starts being activated, all cluster nodes (including rangefeed consumers) will already be running 22.1, so even though they haven't observed the MVCCAddSSTable flip they should be able to consume these events correctly.
pkg/kv/kvclient/rangefeed/config.go, line 115 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I was going to suggest that perhaps we should drop this option and SST handling, since SST handling is not "trivial"...
But maybe just expand the comment to indicate how subtle OnSST needs to be -- in particular, the fact that it may
include keys outside span boundaries.
We need to pass SSTs across rangefeeds in 22.1 to prepare for streaming cluster-to-cluster replication. Aborting these and forcing catchup-scans on every AddSSTable would likely be too expensive and cause the replication stream to lag significantly (consider that e.g. index builds will be firing off these requests continually for several hours). So I think it'll have to be up to callers to decide if and how they want to handle them.
pkg/kv/kvclient/rangefeed/rangefeed.go, line 367 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I like the idea of returning an error to cause a catchup scan very much.
Alternatively, we could provide a default implementation that iterates the SST while ignoring keys outside span boundaries. But
forcing a catchup scan sounds a lot better: the caller decides, if they know what they are doing, to ask for SSTs;
otherwise things continue to work.
Yeah, it's probably better to error rather than silently ignore them -- I'll just have to audit all of the current internal callers (i.e. non-changefeed users) to make sure they'll actually run catchup-scans when they receive errors.
miretskiy
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @erikgrinaker, and @miretskiy)
pkg/kv/kvclient/rangefeed/config.go, line 115 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
We need to pass SSTs across rangefeeds in 22.1 to prepare for streaming cluster-to-cluster replication. Aborting these and forcing catchup-scans on every AddSSTable would likely be too expensive and cause the replication stream to lag significantly (consider that e.g. index builds will be firing off these requests continually for several hours). So I think it'll have to be up to callers to decide if and how they want to handle them.
Well, I think the assumption here is that the replication stream will use this library; it may very well do so, but that's not a given.
So I just want to make sure that the existing users of this library behave correctly. That's why I sort of wanted to expand the comment. It's not a simple "kv" event like before.
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @miretskiy)
pkg/kv/kvclient/rangefeed/config.go, line 115 at r1 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
Well, I think the assumption here is that the replication stream will use this library; it may very well do so, but that's not a given.
So I just want to make sure that the existing users of this library behave correctly. That's why I sort of wanted to expand the comment. It's not a simple "kv" event like before.
Yep, totally. But I think that if rangefeeds are to include SSTs, then those events should be accessible via this library as well. But yeah, we should look into how to best integrate with existing callers (probably error by default, as discussed elsewhere), and I'll make sure to expand the comment.
Data:          sst,
CRC32:         util.CRC32(sst),
Span:          roachpb.Span{Key: start.Key, EndKey: end.Key},
EmitRangefeed: args.WriteAtRequestTimestamp,
I wonder if it'd make sense to persist this bool as "AtWriteTimestamp" or something like that, and then below, instead of skipping non-MVCC AddSSTables, go ahead and send them to the feed. Then, when we get to ev.SST != nil, if the SST was not AtWriteTimestamp, fail the feed, because we got an unexpected violation of the MVCC semantics that the feed is relying on?
I think it might make sense to call this AtRequestTimestamp and defer the rangefeed policy elsewhere, such that the data describes the SST rather than prescribing downstream actions. Even though it may be a bit unfortunate to bury this policy deep in the bowels of Raft state machine application.
However, given the size of these SSTs, I'd rather not ship all of them through the rangefeed machinery unnecessarily. Also, we only propagate the write timestamp into the rangefeed, with the assumption that all MVCC timestamps in the SST are equal to it:
cockroach/pkg/kv/kvserver/replica_rangefeed.go
Lines 617 to 619 in b3a84a2
It seems unnecessary to add a boolean parameter here, and then assert that the passed parameter is always true. Shouldn't it be the caller's responsibility to only call the method when appropriate? Downstream consumers can (and probably should) also check the actual MVCC timestamps against the claimed write timestamp to assert that MVCC invariants are upheld, if it's cheap enough.
Then again, maybe we could move the policy into Processor.ConsumeSSTable such that it ignores !AtWriteTimestamp. Seems more appropriate there than in the Raft core.
My initial suggestion was to move it further beyond Raft, all the way to *RangeFeed.processEvents, where it would be an error if an SST event arrived that was not marked as AtRequestTimestamp. If we were just ignoring them at that point, I'd be more concerned about the cost of shipping them all the way through just to discard them. But I was imagining that if we kick out an assertion failure error or something when the very first one arrives, and that tears down the feed, then that cost isn't a big deal, since it quickly, and importantly loudly, comes to a stop?
Ok, but I think I'd prefer to assert that on the server-side at least, e.g. inside the rangefeed processor at Processor.publishSSTable, rather than waiting for it to make its way out to the client and assert it there. In the latter case, we'd have to implement this in all clients (not just the Factory-constructed ones which aren't always used), along with the necessary plumbing.
Ah, yeah, that makes sense to me. The main thing I was interested in is that an unexpected non-MVCC AddSSTable be an error that shuts the feed down, rather than silently ignored.
Should we maybe do this for all other non-MVCC operations such as ClearRange and RevertRange too? I.e. signal to the rangefeed that we applied a non-MVCC operation, and disconnect it?
That sounds about right to me, but I'd want @miretskiy to confirm, and I don't know if I'd hold up this change for it? (Whenever we do it, it might be good to have an escape-hatch cluster setting to instead ignore the event, as we would today, in case I'm wrong.)
Done, see separate commit which disconnects rangefeeds with a permanent MVCCHistoryMutationError for operations that mutate MVCC history -- specifically AddSSTable when not writing at the request timestamp, ClearRange, and RevertRange. This does not apply to MVCC-compliant AddSSTable requests that write at the request timestamp, which are emitted across rangefeeds.
Note that for rangefeed.Factory-based clients, this is surfaced via OnInternalError due to the current API design -- unfortunately, most callers do not set this callback, and will silently ignore the error even though the rangefeed stops. The owning teams should add appropriate error handling here.
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, and @miretskiy)
pkg/kv/kvclient/rangefeed/config.go, line 115 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yep, totally. But I think that if rangefeeds are to include SSTs, then those events should be accessible via this library as well. But yeah, we should look into how to best integrate with existing callers (probably error by default, as discussed elsewhere), and I'll make sure to expand the comment.
Expanded the comment with relevant details.
pkg/kv/kvclient/rangefeed/rangefeed.go, line 367 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yeah, it's probably better to error rather than silently ignore them -- I'll just have to audit all of the current internal callers (i.e. non-changefeed users) to make sure they'll actually run catchup-scans when they receive errors.
Done, we run catchup scans unless callers set OnSSTable. This only applies to MVCC-compliant AddSSTable, i.e. with WriteAtRequestTimestamp set. For legacy AddSSTable, we emit a permanent MVCCHistoryMutationError that this client library exposes via OnInternalError (which unfortunately most callers do not currently handle).
3e053bb to
37341d9
Compare
I think this should be ready for review now, barring any test failures.
miretskiy
left a comment
This is very nice. I think @dt and @ajwerner should also do a pass.
Reviewed 3 of 23 files at r3, 9 of 20 files at r5, 1 of 2 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @erikgrinaker, and @miretskiy)
pkg/kv/kvclient/rangefeed/config.go, line 149 at r6 (raw file):
// Also note that AddSSTable requests that do not set the
// WriteAtRequestTimestamp flag, possibly writing below the closed timestamp,
// will cause affected rangefeeds to be disconnected with MVCCHistoryMutationError
nit: perhaps add a terminal before MVCCHistoryMutationError?
pkg/kv/kvclient/rangefeed/rangefeed.go, line 283 at r6 (raw file):
start := timeutil.Now()

// Note that the below channel send will not block forever because
I'm not even sure which channel this refers to anymore. I suspect it's talking about producer/consumer blocking.
pkg/kv/kvclient/rangefeed/rangefeed.go, line 288 at r6 (raw file):
// returning. Because of this, only processEvents may use the cancel
// function generated below, since the caller must also consume the errCh
// error.
This logic could probably be simplified somewhat by simply creating errCh with a buffer of 1 error.
Probably something to do for another time.
pkg/kv/kvclient/rangefeed/rangefeed.go, line 293 at r6 (raw file):
	errCh <- f.client.RangeFeed(ctx, f.spans, ts, f.withDiff, eventCh)
}); err != nil {
	rfCancel()
This actually makes me a bit sad. I have to jump through the code to verify that we're not leaking a context here.
(Usually you always have to call defer rfCancel() immediately after WithCancel -- and actually, I think we might fail to call rfCancel in some code paths.)
And even if you can convince me that processEvents will always call rfCancel() eventually, I'd say that it's still brittle. To be clear: it's not your code per se, I think it's because of how this errCh communication was set up.
I wonder why we have to use the stopper here at all. We already run this goroutine under the stopper (with cancel on quiesce). I wonder if another goroutine could help us out with more straightforward code.
Something like:
rangefeedTask := func(ctx context.Context) error {
	return f.client.RangeFeed(ctx, f.spans, ts, f.withDiff, eventCh)
}
processEventsTask := func(ctx context.Context) error {
	return f.processEvents(ctx, frontier, eventCh)
}
err := ctxgroup.GoAndWait(ctx, rangefeedTask, processEventsTask)
I don't think we'd need errCh at all in this case. Just return the error from either goroutine. That cancels the other goroutine, and you get the original error.
pkg/kv/kvclient/rangefeed/rangefeed.go, line 365 at r6 (raw file):
rfCancel()
<-errCh
return errors.New("received unexpected rangefeed SST event with no OnSSTable handler")
do you think errors.AssertionFailedf might be a better match?
pkg/kv/kvserver/replica_application_state_machine.go, line 623 at r6 (raw file):
// expected to ensure that no rangefeeds are currently active across such
// spans, but as a safeguard we disconnect the overlapping rangefeeds
// with a non-retriable error anyway.
❤️ the comment. Thank you.
pkg/kv/kvserver/replica_application_state_machine.go, line 660 at r6 (raw file):
	b.r.writeStats.recordCount(float64(added), 0)
}

if res.AddSSTable.AtWriteTimestamp {
either here or at handleSSTRaftMuLocked, do we need a TODO comment to add memory accounting/limits?
pkg/kv/kvclient/rangefeed/rangefeed_external_test.go, line 504 at r6 (raw file):
select {
case <-checkpointC:
case <-time.After(time.Second):
Isn't it a bit aggressive under stress?
erikgrinaker
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @dt, and @miretskiy)
pkg/kv/kvclient/rangefeed/rangefeed.go, line 293 at r6 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
This actually makes me a bit sad. I have to jump through the code to verify that we're not leaking a context here.
(Usually you always have to call defer rfCancel() immediately after WithCancel -- and actually, I think we might fail to call rfCancel in some code paths.)
And even if you can convince me that processEvents will always call rfCancel() eventually, I'd say that it's still brittle. To be clear: it's not your code per se, I think it's because of how this errCh communication was set up. I wonder why we have to use the stopper here at all. We already run this goroutine under the stopper (with cancel on quiesce). I wonder if another goroutine could help us out with more straightforward code.
Something like:

rangefeedTask := func(ctx context.Context) error {
	return f.client.RangeFeed(ctx, f.spans, ts, f.withDiff, eventCh)
}
processEventsTask := func(ctx context.Context) error {
	return f.processEvents(ctx, frontier, eventCh)
}
err := ctxgroup.GoAndWait(ctx, rangefeedTask, processEventsTask)

I don't think we'd need errCh at all in this case. Just return the error from either goroutine. That cancels the other goroutine, and you get the original error.
Nice, I quite like this -- updated the code. I definitely share your distaste for the initial attempt, and you're right that it would leak a context.
However, I believe RunAsyncTask does a lot of other stuff too, such as setting up tracing spans. But maybe that doesn't matter as much here. Wdyt @andreimatei?
pkg/kv/kvclient/rangefeed/rangefeed.go, line 365 at r6 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
do you think errors.AssertionFailedf might be a better match?
Yeah, I agree.
pkg/kv/kvserver/replica_application_state_machine.go, line 660 at r6 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
either here or at handleSSTRaftMuLocked, do we need a TODO comment to add memory accounting/limits?
I think it's probably a bit overkill, but added a comment on handleSSTableRaftMuLocked. I'd like to implement this in the near future anyway.
pkg/kv/kvclient/rangefeed/rangefeed_external_test.go, line 504 at r6 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
isn't it a bit aggressive? under stress?
Possibly. Bumped them to 3s.
TFTR! For posterity, @ajwerner and @dt deferred to @miretskiy.

bors r=miretskiy

Build succeeded:
testutils: add sstutil package

This patch adds a `testutils/sstutil` package with SST-related test utilities.

Release note: None

kvserver: disconnect rangefeeds on MVCC history mutations

This patch adds a field `MVCCHistoryMutation` to `ReplicatedEvalResult`, which disconnects any rangefeeds that overlap with the command span. Callers are expected to ensure there are no rangefeeds over such spans, but if they fail to do so it is better to error out rather than silently pretend like nothing happened. The field is set for `AddSSTable`, `ClearRange`, and `RevertRange` requests.

This error is exposed via the `OnInternalError` callback for `rangefeed.Factory`-based rangefeed clients. However, if this callback is not set, these clients will silently stop processing events. This is unfortunate, but follows from the API design. Most callers do not appear to set such a callback, and it is left for the owning teams to update the usage of the library with appropriate error handling. Furthermore, this error detection requires knowledge about the new `MVCCHistoryMutationError` error type, so older nodes in mixed-version clusters will simply treat this as a retryable error.

Touches #70434.

Release note: None

kvserver: emit AddSSTable events via rangefeeds

This patch emits `AddSSTable` events across rangefeeds when the ingestion was done with `WriteAtRequestTimestamp` enabled (since this ensures they respect the closed timestamp). This setting is introduced with 22.1, and callers must check the `MVCCAddSSTable` version gate before using it, so by extension this event is only emitted once the entire cluster runs 22.1.

Clients built via `rangefeed.Factory` have a new `WithOnSSTable` option that can be used to register a callback for these events. If such a callback is not set, the rangefeed will run a catchup scan that includes the values written by the `AddSSTable` request.

The entire SST is emitted in binary form, regardless of how the registration span overlaps it -- it is up to callers to prune the SST contents as appropriate. Previous values are not included for keys replaced by the SST.

Resolves #70434.

Release note: None
Initial draft. A few outstanding questions:
- Changefeeds will error on these SST events. We currently do not expect to ingest SSTs into online tables, and thus we do not expect changefeeds to see these events. Now that we have MVCC-compliant `AddSSTable` we could begin to use `AddSSTable` into online tables, and thus the changefeeds would need to handle this (note here that the SST events do not contain previous value diffs), but I figure we can cross that bridge when we get there. Does this make sense?
- Internal consumers that subscribe via `rangefeed.Factory` must register an `OnSSTable` handler, otherwise they will not be notified about such events. I'm not sure if we ever expect to ingest SSTs into e.g. online system spans and such, but we may want to consider sending such subscribers an error if they haven't registered `OnSSTable` rather than silently ignoring them.
- These events are likely to be much larger than regular key/value events, and thus there is a greater risk of excessive memory usage since the buffer size per replica is 4096 events. The SSTs are already in memory due to Raft application, but buffering them here can cause them to stick around and pile up. We should consider adding memory accounting here, and use a store-wide memory limit rather than a per-replica fixed-size queue. See: rangefeed: memory accounting and budgeting #73616.