importer/kv: add ImportEpoch field to MVCCValueHeader and write to it during IMPORT INTO#85138
Conversation
5d48309 to
dd1843b
Compare
|
This is ready for a look!
|
|
Looks good at a high level. Wonder if any of the @cockroachdb/storage peeps or @nvanbenschoten have opinions on Also, we need to coordinate this with @gh-casper's work on tenant streaming. I believe the tenant streaming will also strip out the
Might make sense with a couple of commits: one to plumb through
Only for non-empty headers though, right? So in the common case this won't add on anything at all? |
dd1843b to
f941a9e
Compare
|
@erikgrinaker Responding to a few points you made:
Done.
Done.
This prints out: Did I miss something in how I defined the proto message? Does it have something to do with this ? |
|
Ah, I was thinking more about this bit here, which just omits the entire header if it doesn't contain anything: So that seems fine.
I think you're missing the encoding of the embedded messages. Both Let's have a look at the encoding here. The MVCCValueHeader{
LocalTimestamp: hlc.ClockTimestamp{WallTime: 9},
}For reference, the Protobuf schemas are: message MVCCValueHeader {
util.hlc.Timestamp local_timestamp = 1;
ClientMeta client_meta = 2;
}
message ClientMeta {
int64 import_job_id = 1;
}
message Timestamp {
int64 wall_time = 1;
int32 logical = 2;
bool synthetic = 3;
}The encoded value before we added the
Ok, so far so good. We see that it omitted fields 2 and 3 of Now let's add an empty Ok, so the first 4 bytes are the same as before. The 2 added bytes are:
Ok, so that makes sense -- it's saying that there's an embedded But in any case, this isn't terrible, because it's still omitting the actual message contents. So even if we're paying a fixed cost of 2 bytes for the message, at least we can keep adding on fields to FWIW, we can also see this with the |
|
wow, thanks for really digging into this! I learned a bunch from this write up. My takeaways are:
|
f941a9e to
38ee242
Compare
38ee242 to
a391869
Compare
msbutler
left a comment
There was a problem hiding this comment.
I think I addressed everything your latest round of feedback except for reporting benchmark results. Should I report those in the commit message or in the docstring? As expected, the latency percent increase for Encode/DecodeMVCC value varies depending on the amount of data already in the MVCC value. When I benchmarked EncodedMVCCValue for example, a header with a JobId was 22% slower than an empty header with a short encoded MVCCValue.value, and 4% slower with a long encoded MVCCValue.value.
a391869 to
15147b2
Compare
When making changes like these, it's usually nice to explain the cost in the commit message. What the reader/reviewer really wants to know is how expensive this is going to be. How much additional data are we storing? Are imports now 1% slower? 10% slower? Not really slower at all? Numbers like these help paint a picture of the cost. 22% slower encoding might matter, but how much does the encoding cost make up of the per-key processing cost? The absolute value per encoding might be helpful to understand the which order of magnitude we're talking (i.e. nanoseconds vs milliseconds), although I'm guessing we're talking O(100 ns) per key here. |
15147b2 to
ac85d19
Compare
|
Given yesterday's meeting, here's the updated design:
|
cd3754f to
dbb0003
Compare
|
Reopened and rebased. Let's see what CI says. |
7434fef to
f7e396a
Compare
|
@erikgrinaker You approved an old version of this, but you may want to take another look. |
|
NB: I am going to add this new option to the MVCCPut, et al functions in the PR that adds a new delete range predicate. But I'm happy to make one big PR if that is easier for y'all. |
eh, I say merge this one as-is now and add follow-ups to follow-ups. I easier to review that way and we’ll start getting some nightly runs. |
msbutler
left a comment
There was a problem hiding this comment.
Looks great! could you run this through the mixed version roachtest before merging?
| "jobID": {ImportEpoch: 3}, | ||
| "local walltime": {LocalTimestamp: hlc.ClockTimestamp{WallTime: 1643550788737652545}}, | ||
| "local walltime+logical": {LocalTimestamp: hlc.ClockTimestamp{WallTime: 1643550788737652545, Logical: 4096}}, | ||
| "omit in rangefeeds": {OmitInRangefeeds: true}, |
There was a problem hiding this comment.
could you add the bench results to the commit msg?
f7e396a to
b8ae5f0
Compare
msbutler
left a comment
There was a problem hiding this comment.
LGTM! I can't approve because of github shenanigans.
Sorry, still making my way through the queue, will get to this in the morning. |
b8ae5f0 to
1ec8a18
Compare
erikgrinaker
left a comment
There was a problem hiding this comment.
LGTM once the minor comments below are addressed. I've only superficially reviewed this, particularly the DR parts.
Reviewed 4 of 19 files at r6, 7 of 8 files at r7, 2 of 11 files at r8, 15 of 15 files at r17, 10 of 10 files at r18, 3 of 3 files at r19, 7 of 7 files at r20, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @msbutler)
pkg/ccl/backupccl/backup_processor.go line 485 at r20 (raw file):
for _, span := range spans { for len(span.span.Key) != 0 { includeMVCCValueHeader := clusterSettings.Version.IsActive(ctx, clusterversion.V24_1)
Is it ok to flip this in the middle of a backup? Can we get into weird states where part of the backup has value headers but the other part doesn't? Can we race with an import starting to write import epochs, but the backup processor not seeing them for part of the exports?
Actually, why do we even need to version gate this? A 23.2 node will ignore the field, but also won't ever export any non-empty value headers. A 24.1 node may have non-empty value headers after the 24.1 gate has switched and an import has started writing them, but at that point all binaries are prepared to handle them anyway. So I think we can just set this unconditionally, and avoid any races altogether?
pkg/ccl/backupccl/restore_data_processor.go line 591 at r19 (raw file):
// InitChecksum don't copy/reallocate the slice they // were given. if err := batcher.AddMVCCKey(ctx, key, valueScratch); err != nil {
It's not obvious at all here that the checksum rewriting above modifies valueScratch, it's worth spelling this out.
pkg/kv/kvserver/kvserverbase/bulk_adder.go line 77 at r18 (raw file):
// Adder's SSTBatcher will write the import epoch to each versioned value's // metadata. ImportEpoch uint32
Is it worth calling out here that this must only be used with 24.1 clusters, since it's the caller's responsibility to enforce the version gate?
pkg/storage/mvcc_value.go line 144 at r17 (raw file):
// Consider a fast path, where only the roachpb.Value gets exported. // Currently, this only occurs if the value was not imported. if mvccValue.ImportEpoch == 0 {
This seems a bit brittle, in that we could accidentally omit fields that are added later. Should we invert the fast path such that we first strip fields (i.e. the LocalTimestamp) and then compare the stripped value header with MVCCValueHeader{} to omit the encode?
| func EncodeMVCCValueForExport(mvccValue MVCCValue) ([]byte, error) { | ||
| // Consider a fast path, where only the roachpb.Value gets exported. | ||
| // Currently, this only occurs if the value was not imported. | ||
| if mvccValue.ImportEpoch == 0 { |
There was a problem hiding this comment.
It is weird to me that we get the whole header or none of it just depending on this one field?
If we're going to condition on this field, it sorta feels like this should be the only field you get, or, more generally, you should only get fields that are in the condition?
There was a problem hiding this comment.
Thanks for pointing this out. I think when this was written ImportEpoch and Timestamp did represent the entire value header. It looks like EncodeMVCCValue already has a special case for the empty case. So I think we can just have this function zero out fields not appropriate for export and then pass it directly to EncodeMVCCValue.
Edit: Although, perhaps for now the blast radius is smaller if we really do just give you back something with the ImportEpoch so we don't find out what happens if we start importing keys with OmitInRangefeeds set.
There was a problem hiding this comment.
We now only export fields in the condition. I've left the condition in place because the benchmark shows it really does help in the most common case.
pkg/sql/importer/import_processor.go
Outdated
| true /* isPKAdder */) | ||
|
|
||
| var bulkAdderImportEpoch uint32 | ||
| if flowCtx.Cfg.Settings.Version.IsActive(ctx, clusterversion.V24_1) && len(spec.Tables) == 1 { |
There was a problem hiding this comment.
The len()==1 makes me nervous.
What about in the loop body:
if bulkAdderImportEpoch == 0 { bulkAdderImportEpoch = v.Desc.ImportEpoch } else if bulkAdderImportEpoch != v.Desc.InportEpoc { return errors.Assertion....
There was a problem hiding this comment.
I like it. Done.
I also moved the version check. Now, we only bump the import epoch on 24.1 and then can here just rely on the default value being zero.
stevendanna
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, @michae2, and @msbutler)
pkg/ccl/backupccl/backup_processor.go line 485 at r20 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Is it ok to flip this in the middle of a backup? Can we get into weird states where part of the backup has value headers but the other part doesn't? Can we race with an import starting to write import epochs, but the backup processor not seeing them for part of the exports?
Actually, why do we even need to version gate this? A 23.2 node will ignore the field, but also won't ever export any non-empty value headers. A 24.1 node may have non-empty value headers after the 24.1 gate has switched and an import has started writing them, but at that point all binaries are prepared to handle them anyway. So I think we can just set this unconditionally, and avoid any races altogether?
I think we do need the version check here but you are right that we don't want it changing through the life of the backport. I'll add some state to our spec.
We need the version because if our version is 23.2 we expect to be able to restore the artifact on 23.2 nodes, but 23.2 nodes won't be able to read our SSTs.
stevendanna
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @erikgrinaker, @michae2, and @msbutler)
pkg/ccl/backupccl/restore_data_processor.go line 591 at r19 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
It's not obvious at all here that the checksum rewriting above modifies
valueScratch, it's worth spelling this out.
Added a comment.
pkg/storage/mvcc_value.go line 144 at r17 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This seems a bit brittle, in that we could accidentally omit fields that are added later. Should we invert the fast path such that we first strip fields (i.e. the
LocalTimestamp) and then compare the stripped value header withMVCCValueHeader{}to omit the encode?
I keep going back and forth on this. I've changed the code so that you only get this single field exported. But I do see the potential value in exporting everything.
erikgrinaker
left a comment
There was a problem hiding this comment.
Reviewed 8 of 9 files at r21, 19 of 19 files at r23, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, @miraradeva, and @msbutler)
pkg/ccl/backupccl/backup_processor.go line 485 at r20 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I think we do need the version check here but you are right that we don't want it changing through the life of the backport. I'll add some state to our spec.
We need the version because if our version is 23.2 we expect to be able to restore the artifact on 23.2 nodes, but 23.2 nodes won't be able to read our SSTs.
Right, makes sense.
pkg/storage/mvcc_value.go line 144 at r17 (raw file):
Previously, stevendanna (Steven Danna) wrote…
I keep going back and forth on this. I've changed the code so that you only get this single field exported. But I do see the potential value in exporting everything.
Ok. What about omit_in_rangefeeds, should those not be preserved across a backup/restore? I haven't been involved with that work, so I don't know what semantics we want.
cc @miraradeva
David and I discussed this is chat and for now I think we'll exclude it. We can perhaps revisit in a follow-up later this week. |
|
bors r=dt,erikgrinaker |
|
bors cancel |
|
Canceled. |
This patch adds an ImportEpoch field to an MVCCValue's MVCCValueHeader, which allows KV clients (namely the sst_batcher in an IMPORT INTO) to write the importing table's ImportEpoch to the metadata of each ingesting MVCCValue. Unlike the MVCCValueHeader.LocalTimestamp field, the ImportEpoch field should be exported to other clusters (e.g. via ExportRequests from BACKUP/RESTORE and streaming). Consequently, this PR relaxes the invariant that the MVCCValueHeader field must be stripped in an ExportRequest and must be empty in an AddSSTable Request. Now, ExportRequest emits MVCCValueHeaders with ImportEpoch set if it was set in the original value and AddSSTable only requires the LocalTimestamp to be empty. Epic: none Release note: none Co-authored-by: Steven Danna <danna@cockroachlabs.com>
This patch makes IMPORT INTO on a non-empty table write the table's
ImportEpoch to each ingested MVCC Value, via the SSTBatcher. In a
future PR, the ImportEpoch will be used rollback an IMPORT INTO in
some cases.
This additional information will allow IMPORTing tables to be backed
up and restored.
As part of this change we now also assume we might see an MVCCValue
during restore.
* Version Gating
Previously, callers could (and did) assume that the values present in
the SSTs returned by export request could be interpreted directly as
roachpb.Value objects using code like:
roachpb.Value{RawBytes: valBytes}
For MVCCValueHeaders to be exported by ExportRequest all callers need
to be updated:
1. ExportRequest on system.descriptors in sql/catalog/lease
2. ExportRequest on system.descriptors in ccl/changefeedccl/schemafeed
3. ExportRequest used by `FINGERPRINT`
4. ExportRequest used by old binaries in a mixed-version cluster.
(1) and (2) will be easy to update and likely don't matter in practice
moment as those tables do not include values with exportable value
headers at the moment.
(3) will be easy to update, but we still need an option to exclude
value headers (a) until value headers are included in rangefeeds
and (b) so long as we want to compare fingerprints with 23.2 versions.
(4) is impossible to update so if we want BACKUP/RESTORE to round-trip
in mixed version cluster we must version gate including them in
backups until the cluster is on a single version.
To account for this we only increment ImportEpoch during IMPORTs that
start on 24.1 or greater and we only request MVCCValueHeaders on
BACKUPs that start on 24.1 or greater.
The first condition is important to ensure that we can later detect a
table that can be fully rolled back using the new rollback method.
Note that this also marks a hard backward incompatibility for backup
artifacts. Backups for 24.1 cannot be restored on 23.2 or older. This
was already the case by policy. 23.2 backups should still work fine on
24.1 since all roachpb.Value's should properly decode as MVCCValue's.
Informs cockroachdb#76722
Release note: None
Co-authored-by: Steven Danna <danna@cockroachlabs.com>
|
bors r=dt,erikgrinaker |
|
Build succeeded: |
storage: add ImportEpoch field to MVCCValueHeader
This patch adds the ImportEpoch field to an MVCCValue's MVCCValueHeader,
which allows kv clients (namely the sst_batcher in an IMPORT INTO) to write
the importing table's ImportEpoch to the metadata of each ingesting MVCCValue.
Unlike the MVCCValueHeader.LocalTimestamp field, the ImportEpoch field should
be exported to other clusters (e.g. via ExportRequests from BACKUP/RESTORE and
streaming). Consequently, this PR relaxes the invariant that the
MVCCValueHeader field must be stripped in an Export Request and must be empty
in an AddSSTable Request. Now, Export Request only strips the
MVCCValueHeader.LocalTimestamp field and AddSSTable will only require the
LocalTimestamp to be empty.
Release note: none
bulk/kv write the table's ImportEpoch to each MVCCValue during IMPORT
This patch makes IMPORT INTO on a non-empty table write the table's ImportEpoch
to each ingested MVCC Value, via the SSTBatcher. In a future PR, the
ImportEpoch will be used to track and rollback an IMPORT INTO. This additional
information will allow IMPORTing tables to be backed up and restored, as
described in this RFC.
Informs #76722
Release note: None