Skip to content

backupccl: add tenant backup nemesis tests#92327

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
stevendanna:ssd/tenant-backup-nemesis
Jan 3, 2023
Merged

backupccl: add tenant backup nemesis tests#92327
craig[bot] merged 2 commits intocockroachdb:masterfrom
stevendanna:ssd/tenant-backup-nemesis

Conversation

@stevendanna
Copy link
Copy Markdown
Collaborator

@stevendanna stevendanna commented Nov 22, 2022

This adds a bit of randomized testing to stress incremental tenant backups.

Tenant backups were previously untenable because of non-MVCC operations. Now that we support MVCC-compatible bulk operations, they should in principal be possible.

Epic: CRDB-20264

Release note: None

@stevendanna stevendanna requested review from a team as code owners November 22, 2022 12:46
@stevendanna stevendanna requested review from herkolategan, rhu713 and smg260 and removed request for a team November 22, 2022 12:46
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@stevendanna stevendanna requested review from dt and msbutler and removed request for herkolategan, rhu713 and smg260 November 23, 2022 14:32
Comment thread pkg/ccl/backupccl/tenant_backup_nemesis_test.go Outdated
@stevendanna stevendanna force-pushed the ssd/tenant-backup-nemesis branch from 2ade0ee to d08ac44 Compare November 30, 2022 15:04
@stevendanna stevendanna force-pushed the ssd/tenant-backup-nemesis branch from d08ac44 to 7b4795e Compare December 21, 2022 12:38
@stevendanna
Copy link
Copy Markdown
Collaborator Author

@msbutler Friendly ping here. I think this should be RFAL now that we've fixed some of the underlying issues that were preventing it from working before.

@stevendanna stevendanna force-pushed the ssd/tenant-backup-nemesis branch from 7b4795e to 1f70a1d Compare December 21, 2022 14:35
Copy link
Copy Markdown
Collaborator

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! I'm having a hard time understanding/reviewing the new nemesis infra. I'll give it another pass once you add some comments, or we can setup a time to chat if that's better.

Comment thread pkg/ccl/backupccl/tenant_backup_nemesis_test.go Outdated
Comment thread pkg/ccl/backupccl/tenant_backup_nemesis_test.go Outdated
Comment thread pkg/ccl/backupccl/tenant_backup_nemesis_test.go Outdated
Comment thread pkg/ccl/backupccl/tenant_backup_nemesis_test.go Outdated
})

require.NoError(t, g.Wait())
restoreQuery := fmt.Sprintf("RESTORE TENANT 10 FROM LATEST IN '%s' AS OF SYSTEM TIME %s WITH tenant = '11'", backupLoc, aost)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like we only run validation on the latest aost set in the above goroutine. What if we randomly chose an aost so that we restore at some random point in the chain?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to leave this as a follow up. To do this, we need to track which tables are valid to test at the point of the restore.

Perhaps we could get around that by just fingerprinting the whole tenant.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. We could use Aditya's fingerprint job here and remove the TablesToCheck stuff.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason table-by-table is nice for now is that the table names are directly related to the nemesis that runs which gives you a foothold into what failed.

Copy link
Copy Markdown
Collaborator

@msbutler msbutler Dec 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, good point. This also makes me wonder: should we print or log the seed returned by the NewPseudoRand() call to make it easier to repro a future test failure?

validateTableAcrossTenantRestore(t, tenant10SQLDB, tenant11SQLDB, fmt.Sprintf(`bank."%s"`, name), aost)
}
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these helper functions are quite nice, I wonder if it's worth putting them in the backup utils_test.go package. Happy to leave this as a todo.

Comment thread pkg/ccl/backupccl/tenant_backup_nemesis_test.go Outdated
Comment thread pkg/ccl/backupccl/tenant_backup_nemesis_test.go Outdated
Comment thread pkg/ccl/backupccl/tenant_backup_nemesis_test.go Outdated

nemeses: []nemesis{
{name: "CANCELED IMPORT INTO",
impl: func(ctx context.Context, n *randomBackupNemesis, db *gosql.DB) error {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the number of incremental backups the import job spanned had some affect on the corruption cases we saw. I wonder if we could randomly vary the speed of the import job, or any of the jobs below.

Copy link
Copy Markdown
Collaborator Author

@stevendanna stevendanna Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it would be a fun thing to add. The other thing I was thinking of adding to the "listener" stuff was something where the operations could let you know when an "interesting" point in the operations was occurring and then allow you to block the operation, take a backup, and continue.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either would be great. whatever is easier to implement!

Copy link
Copy Markdown
Collaborator Author

@stevendanna stevendanna Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this would require a good deal more infrastructure such as that which is already causing confusion. So I'll leave this to whoever is motivated to touch this test next.

@stevendanna stevendanna force-pushed the ssd/tenant-backup-nemesis branch 2 times, most recently from 6b681ef to f0a00d9 Compare December 22, 2022 18:20
@stevendanna
Copy link
Copy Markdown
Collaborator Author

@msbutler I left a few new comments. All of the "oneTimeListener" stuff is there because I was playing around with whether forcing the nemesis operations to be interleaved with the incrementals was important. In the end I decided to just bump the incremental count to 30 and let randomness take its course, but since it is just a few lines of code I've left it to keep playing with as we add more operations.

@stevendanna stevendanna force-pushed the ssd/tenant-backup-nemesis branch from f0a00d9 to 75d348f Compare December 22, 2022 22:01
@stevendanna
Copy link
Copy Markdown
Collaborator Author

Okie Dokie @msbutler I went ahead and removed the RequireStart stuff and confirmed that this still catches the import bug with range tombstones turned off.

Copy link
Copy Markdown
Collaborator

@msbutler msbutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for simplifying the test a bit!

@msbutler
Copy link
Copy Markdown
Collaborator

And good call to revert the extra infra in a separate commit -- it will make it easy to come back to.

This adds a bit of randomized testing to stress incremental tenant
backups.

Tenant backups were previously untenable because of non-MVCC
operations. Now that we support MVCC-compatible bulk operations, they
should in principal be possible.

Release note: None
@stevendanna stevendanna force-pushed the ssd/tenant-backup-nemesis branch from 75d348f to 4cc6007 Compare January 3, 2023 10:00
@stevendanna
Copy link
Copy Markdown
Collaborator Author

bors r=msbutler

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 3, 2023

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jan 3, 2023

Build succeeded:

@craig craig bot merged commit be9e8fb into cockroachdb:master Jan 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants