backupccl: elide data from offline tables during RESTORE #85920

craig[bot] merged 1 commit into cockroachdb:master
Conversation
RFAL! All test failures relate to the first commit, which relates to #85979, and not this PR.
```go
// Elide Importing Keys
if importTime := desc.GetInProgressImportStartTime(); wallTime > 0 && importTime > 0 && wallTime >= importTime {
	return nil, false, ErrImportingKeyError
```
do you need a non-nil err? isn't false enough?
In the public RewriteKey() function, when rewriteTableKey returns !ok and a nil err, the codepath below occurs, which seems unnecessary when eliding importing keys; but perhaps whenever !ok is returned, this codepath should be skipped?
(cockroach/pkg/ccl/backupccl/key_rewriter.go, lines 239 to 259 at e4fa154)
RFAL. I've limited the scope of this PR to handling offline tables in restore. Currently, on the backup side, all tables that come back online get fully re-backed up. In a future PR, I will skip this re-backup step for tables that only see MVCC ops during the incremental backup interval.
adityamaru left a comment:
Flushing out one round of comments. I need to look more carefully at the tests, but can we maybe write a unit test in key_rewriter_test.go that isolates the new KR logic? The restore/import tests are super valuable, but a unit test will give us more focused control over testing the timestamp comparisons.
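A table-driven unit test along the suggested lines might look like the sketch below. It models the timestamp comparison with a hypothetical standalone predicate rather than the real KeyRewriter API, so the function and case names here are illustrative only.

```go
package main

import "fmt"

// shouldElide models the timestamp comparison a key_rewriter_test.go unit
// test would exercise: a key is dropped iff both timestamps are set and the
// key's wallTime is at or after the in-progress import's start time.
func shouldElide(wallTime, importStartTime int64) bool {
	return wallTime > 0 && importStartTime > 0 && wallTime >= importStartTime
}

func main() {
	cases := []struct {
		name            string
		wallTime        int64
		importStartTime int64
		want            bool
	}{
		{"key after import start", 100, 50, true},
		{"key exactly at import start", 50, 50, true},
		{"key before import start", 40, 50, false},
		{"zero wallTime disables filtering", 0, 50, false},
		{"no import in progress", 100, 0, false},
	}
	for _, c := range cases {
		got := shouldElide(c.wallTime, c.importStartTime)
		fmt.Printf("%s: got=%t want=%t\n", c.name, got, c.want)
	}
}
```

The boundary case (wallTime exactly equal to the import start time) is the one most worth pinning down in a real test, since the quoted check uses `>=`.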
pkg/ccl/streamingccl/streamingest/stream_ingestion_processor.go (outdated; resolved)
pkg/ccl/streamingccl/streamingest/stream_ingestion_processor_test.go (outdated; resolved)
msbutler left a comment:
Thanks for the first pass! I think I got all your comments and added a KeyRewriter test as well.
pkg/ccl/streamingccl/streamingest/stream_ingestion_processor.go (outdated; resolved)
adityamaru left a comment:
Nice test coverage, and thanks for the comments!
```diff
 func (sip *streamIngestionProcessor) rekey(key roachpb.Key) ([]byte, error) {
-	rekey, ok, err := sip.rekeyer.RewriteKey(key)
+	rekey, ok, err := sip.rekeyer.RewriteKey(key, 0 /*wallTime*/)
```
If we are not going the route of passing in storage.MVCCKey, is there any reason to pass a 0 wallTime here? Can we always pass the key's wallTime in non-test code and get rid of the special meaning of the zero timestamp?
You had mentioned performance being an issue, but I think we had settled on: the SIP is not production-ready, so when we make it performant, if we see this show up in benchmarks and profiles, we will special-case it or duplicate the method for better readability.
Currently, passing a 0 walltime signals to RewriteKey to never filter on walltime. A SIP may encounter a descriptor with an in-progress import, and I don't want the SIP to do key filtering; passing a 0 walltime signals this. We could easily rename the var walltimeFilter to make this slightly more readable.
Alternatively, the caller could always pass a non-zero walltime and also pass a filterbyWalltime bool to makeKeyRewriter that binds to the KeyRewriter struct and signals whether walltimes should be used for filtering in RewriteKey.
Again, I prefer what I've initially drafted, as it requires managing fewer internal vars and has no perf impact on the SIP. It's up to the caller of this low-level, performance-optimized function to know whether they want to filter on import start time.
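For readers following along, the alternative design mentioned above can be sketched as follows (hypothetical names, not the actual cockroach types): bind a filter flag at construction time so that a zero wallTime loses its special meaning and callers can always pass the key's real timestamp.

```go
package main

import "fmt"

// keyRewriter sketches the alternative discussed above: a filterByWallTime
// flag bound at construction, instead of a zero wallTime acting as a
// "don't filter" sentinel on every RewriteKey call.
type keyRewriter struct {
	filterByWallTime bool
	importStartTime  int64 // 0 means no import in progress on the descriptor
}

// shouldElide reports whether a key at wallTime falls inside an in-progress
// import and should be dropped. With filterByWallTime unset (the SIP case),
// keys are never filtered, so callers can always pass the key's real
// timestamp without triggering elision.
func (kr *keyRewriter) shouldElide(wallTime int64) bool {
	if !kr.filterByWallTime || kr.importStartTime == 0 {
		return false
	}
	return wallTime >= kr.importStartTime
}

func main() {
	restoreKR := keyRewriter{filterByWallTime: true, importStartTime: 50}
	sipKR := keyRewriter{filterByWallTime: false, importStartTime: 50}
	fmt.Println(restoreKR.shouldElide(100)) // true: restore elides importing keys
	fmt.Println(sipKR.shouldElide(100))     // false: the SIP never filters
}
```

The trade-off debated in this thread: the sentinel keeps one fewer field on the rewriter, while the bound flag keeps the wallTime parameter's meaning uniform across callers.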
> I don't want the SIP to do key filtering

Have we thought this through, though, as to why we don't want this behaviour to be uniform with the SIP? I am not very familiar with the implications of this, so @stevendanna can probably help make a decision here.

> It's up to the caller to know whether they want to filter on import start time

I agree with this sentiment, but as you know, I don't think this code is a reflection of that. The filtering should not be in this shared rewrite method if it's not to be used by anyone but restore; but given our offline discussions, I think a happy compromise would be to unconditionally pass in the ts of the key, if the semantics are well understood for the SIP.
Renaming the variable to something other than timestamp makes it even more likely that we pass in a timestamp that is not the key's timestamp, which we should never do, iiuc?
Anyways, if we think we need to think about the SIP a little more outside the scope of this patch, I think things are good the way they are. I just want to make sure we think about it at some point.
```
@@ -0,0 +1,469 @@
# This test ensures that database and cluster backups properly
```
meta note: we're starting 6 servers in this test, which could make it quite a slow test. I don't know if there's anything we can do about it given the lack of a "wipe everything" tool for cluster restore, but we should run this under stress a few times.
```
d foofoo table 3 incremental
```

```
# A restore on mixed version cluster should always fail on a backup that contains import times
```
The backup to database_upgrade won't contain descriptors with import time set, right, even though it's post-finalization, because the import was run in a pre-finalized state?
msbutler left a comment:
you're welcome for all the new test coverage :D
```
exec-sql
CREATE DATABASE b;
USE b;
CREATE TABLE baz (i INT PRIMARY KEY, s STRING);
```
Done. Let me know if this change is what you had in mind!
```
# Ensure Database, and cluster full backups capture restoring rows.
```
Done. Added to the bottom of the test to make all the test results easier to reason about.
```
# ImportStartTime is not bound to the table's descriptor, therefore during
# RESTORE AOST in-progress IMPORT, these tables should get thrown out.

# TODO (msbutler): Currently, we still incrementally back up data from these
```
I removed this comment; it's more relevant to the upcoming reintroduced-spans PR.
@adityamaru RFAL!
adityamaru left a comment:
LGTM with nits. Thanks for working with me through this. I think the test coverage is extensive, so we should get this in and let it bake. Can you ping SQL Schema once bors merges this diff, so that they are well informed when they're fixing their bug?
```
job paused at pausepoint
```

```
# ensure user cannot override offline schema created by table restore into data database d
```
is this really checking this? isn't this collision because of the successful cluster restore you did at L139 after you started server 3? In fact, a paused table restore should imo never interfere with the creation of a schema in the target database, since we've not published the descriptors in the paused restore yet.
Ah, good catch -- though I'm not sure d in the backed-up cluster from L139 actually has the schema. Anyway, I added the following to clarify things, and this test still holds:

```sql
USE d;
DROP SCHEMA IF EXISTS me CASCADE;
```

Note that during the beginning of restore execution, the schema is published to an offline state. I'm not sure how to implement the UX you want -- how is the user supposed to override an already published schema? Seems like an issue for another day.
TFTRS!! bors r=adityamaru

Build succeeded.
A previous PR (#83915) enabled BACKUP to target offline descriptors
(including IMPORTing and RESTOREing tables), but didn't fully handle their
rollback on RESTORE. This PR ensures RESTORE properly handles these offline
tables:
- An offline table created by RESTORE or IMPORT PGDUMP is fully discarded. The
  table does not exist in the restoring cluster.
- An offline table created by an IMPORT INTO has all importing data elided in
  the restore processor and is restored online to its pre-import state. If the
  RESTORE occurs AOST after the table returns online, the table is RESTOREd to
  its post-offline state.

Currently, all tables that return online are fully re-backed up. For many
situations above, this is unnecessary. Future work should ensure that these
tables are not re-backed up when we know all operations within the
incremental backup interval on the table were MVCC.

This patch only applies to BACKUP and RESTORE flavors that implicitly target an
offline importing/restoring table. A future PR should allow a user to
explicitly target an existing offline table undergoing an IMPORT.

Fixes #86054

Release justification: fixes bug where offline tables were not properly handled
in restore.