-
Notifications
You must be signed in to change notification settings - Fork 4.1k
kv: record non-MVCC operations to support ExportRequests and RangeFeed #62585
Description
Is your feature request related to a problem? Please describe.
The KV layer has two non-MVCC operations which can affect range data: AddSSTable and ClearRange. The SQL layer, in general, goes through great pains to ensure that readers do not read under these operations. These operations, however, have a bad interaction with physical incremental backups and newer streaming asynchronous replication.
The problem for incremental backups, in particular, is AddSSTable. The problem with AddSSTable is that it can rewrite history. In particular, it is possible for an AddSSTable to write keys at timestamps which fall before the end time of the previous incremental backup. These keys will not be picked up by the next incremental. This turns out to be a big correctness problems for indexes which are backed up while they are being built (#62564). To mitigate this correctness problem in the short term, we are going to stop including indexes which are non-public in incremental backups. This will be problematic for use cases where the incremental window in use is short relative to the duration of index backfills.
Streaming replication, based on RangeFeed, has problems with both of these requests. In particular, when we do an AddSSTable, we end up restarting the RangeFeed but the catch-up scan will be from the last resolved timestamp. Any writes in the SST from before that timestamp will not be seen. ClearRange is effectively invisible to the RangeFeed.
Given the importance both of streaming replication and of incremental backups, we need to do something here.
Describe the solution you'd like
The thrust of this proposal is that we should do some bookkeeping in the replicated range state to note when these non-MVCC operations have occurred. We'll use this information to drive correct behavior during ExportRequest with time bounds and during RangeFeed catch up scans.
This issue proposes that we add a new range-local, replicated prefix under which we record, keyed on the timestamp at which non-MVCC operations occurred, the relevant spans they covered, the operation, and other metadata.
The sketch of the keyspace layout will be something like:
// LocalRangeNonMVCCEvents is a prefix in the replicated, range-local keyspace
// for a timestamp-oriented store of the history of non-MVCC operations
LocalRangeNonMVCCEvents = roachpb.Key(makeKey(LocalPrefix, roachpb.RKey("nmvcc")))
The keys which will be used in this keyspace will have the above prefix and a timestamp. This timestamp represents the timestamp at which these operations occurred. I believe this is safe because the server assigns the timestamp and these operations require a single request in the batch [1].
Inside the values we'll store metadata. For a ClearRange we just need the span. For an AddSSTable, we'll want both the span and, at least, the minimum timestamp of any key in the SST. We'll use this information to inform what information to make visible to a time bound iteration. Namely, in addition to the normal keys we'd expose to a time-bound iteration, we should expose all keys in time bounds and spans that may have been covered by AddSSTable events which occurred in that timebound. I'm not totally clear on the interaction of ClearRange on incremental backups but I do think that we'll need to at least know about them for streaming replication and they naturally fit in the same model. To account for the streaming replication, we will probably want to add some new RangeFeed message.
We can GC these records by their timestamp. Today we require for both catch-up scans and export requests that the entire time interval fall into the GC threshold. A related note is that when reading the values in some span and time interval due to an AddSSTable, if the oldest key in the SST is older than the GC Threshold, it should be safe to perform the read at the GC threshold. In fact, limiting the lower bound on the timestamp is, in a sense, an optimization.
All of this, however, does sort of redefine what these lower bounds mean on these various requests. In order to clarify the behavior, all of this interaction with non-MVCC operations should be opt-in. In fact, normal RangeFeeds which do not opt into this behavior should instead fail with an error if such a non-MVCC operation is encountered.
Describe alternatives you've considered
There was a proposal to try to force the SST timestamps to fall before the closed timestamp but that's a tough sell for a variety of reasons.
Additional context
This tracking may also enable mitigation for #31563.
Update: [1] We used to never assign a timestamp to the batch header for AddSSTable but we've very recently (#64023) started to set that to a timestamp at which the relevant read occurred. This is important to ensure that a delete tombstone is not lost to GC, erroneously revealing deleted data.
Epic: CRDB-2624