backupccl: add tenant backup nemesis tests #92327
craig[bot] merged 2 commits into cockroachdb:master

Conversation
Force-pushed 2ade0ee to d08ac44
Force-pushed d08ac44 to 7b4795e
@msbutler Friendly ping here. I think this should be RFAL now that we've fixed some of the underlying issues that were preventing it from working before.

Force-pushed 7b4795e to 1f70a1d
    })
    require.NoError(t, g.Wait())
    restoreQuery := fmt.Sprintf("RESTORE TENANT 10 FROM LATEST IN '%s' AS OF SYSTEM TIME %s WITH tenant = '11'", backupLoc, aost)
seems like we only run validation on the latest aost set in the above goroutine. What if we randomly chose an aost so that we restore at some random point in the chain?
I'm going to leave this as a follow up. To do this, we need to track which tables are valid to test at the point of the restore.
Perhaps we could get around that by just fingerprinting the whole tenant.
Sounds good. We could use Aditya's fingerprint job here and remove the TablesToCheck stuff.
The reason table-by-table is nice for now is that the table names are directly related to the nemesis that runs which gives you a foothold into what failed.
ah, good point. This also makes me wonder: should we print or log the seed returned by the NewPseudoRand() call to make it easier to repro a future test failure?
    validateTableAcrossTenantRestore(t, tenant10SQLDB, tenant11SQLDB, fmt.Sprintf(`bank."%s"`, name), aost)
    }
    }
these helper functions are quite nice, I wonder if it's worth putting them in the backup utils_test.go package. Happy to leave this as a todo.
    nemeses: []nemesis{
    	{name: "CANCELED IMPORT INTO",
    		impl: func(ctx context.Context, n *randomBackupNemesis, db *gosql.DB) error {
the number of incremental backups the import job spanned had some effect on the corruption cases we saw. I wonder if we could randomly vary the speed of the import job, or any of the jobs below.
This seems like it would be a fun thing to add. The other thing I was thinking of adding to the "listener" stuff was something where the operations could let you know when an "interesting" point in the operations was occurring and then allow you to block the operation, take a backup, and continue.
either would be great. whatever is easier to implement!
Well, this would require a good deal more infrastructure such as that which is already causing confusion. So I'll leave this to whoever is motivated to touch this test next.
Force-pushed 6b681ef to f0a00d9
@msbutler I left a few new comments. All of the "oneTimeListener" stuff is there because I was playing around with whether forcing the nemesis operations to be interleaved with the incrementals was important. In the end I decided to just bump the incremental count to 30 and let randomness take its course, but since it is just a few lines of code I've left it to keep playing with as we add more operations.

Force-pushed f0a00d9 to 75d348f

Okie Dokie @msbutler I went ahead and removed the RequireStart stuff and confirmed that this still catches the import bug with range tombstones turned off.
msbutler left a comment

LGTM! Thanks for simplifying the test a bit!

And good call to revert the extra infra in a separate commit -- it will make it easy to come back to.
This adds a bit of randomized testing to stress incremental tenant backups. Tenant backups were previously untenable because of non-MVCC operations. Now that we support MVCC-compatible bulk operations, they should in principle be possible. Release note: None
Epic: none Release note: None
Force-pushed 75d348f to 4cc6007
bors r=msbutler

Build failed (retrying...):

Build succeeded:
This adds a bit of randomized testing to stress incremental tenant backups.
Tenant backups were previously untenable because of non-MVCC operations. Now that we support MVCC-compatible bulk operations, they should in principle be possible.
Epic: CRDB-20264
Release note: None