backupccl: incrementally backup in progress imports on existing tables and elide importing data in RESTORE

Currently, a table with an in-progress IMPORT cannot be backed up, let alone restored to its pre-import state. Backing up an in-progress import also has the benefit of distributing the work of backing up the import over a series of incremental backups, as opposed to what currently occurs: the first incremental backup to begin after the import finishes has to back up everything.

The main challenge in addressing this is rolling back the imported data on the restored cluster. To understand why this is a challenge, consider how rollbacks occur today:

If an IMPORT writing data into an existing, non-empty table fails or is cancelled mid-IMPORT, it is rolled back by scanning the table for rows with a timestamp greater than the time at which the IMPORT started and deleting them. This works because the table is offline to other writes while it is importing, but it relies on the timestamps on rows not changing -- which may not hold if the table were backed up and then restored, after which all keys, both existing and imported, would have timestamps based on when the table was restored.
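A minimal sketch of the timestamp-based rollback described above, and of why it breaks once a restore rewrites timestamps. The `Row` type, the integer timestamps, and both function names are hypothetical simplifications for illustration; they are not the actual CockroachDB types or APIs.

```go
package main

import "fmt"

// Row is a hypothetical stand-in for an MVCC key: a key plus its write timestamp.
type Row struct {
	Key       string
	Timestamp int64
}

// rollbackImport models today's rollback: every row written after the
// IMPORT started is dropped, and everything else is kept.
func rollbackImport(rows []Row, importStart int64) []Row {
	kept := make([]Row, 0, len(rows))
	for _, r := range rows {
		if r.Timestamp <= importStart {
			kept = append(kept, r)
		}
	}
	return kept
}

// restoreRewritingTimestamps models a restore that stamps every key,
// pre-existing or imported, with the restore time.
func restoreRewritingTimestamps(rows []Row, restoreTime int64) []Row {
	out := make([]Row, len(rows))
	for i, r := range rows {
		out[i] = Row{Key: r.Key, Timestamp: restoreTime}
	}
	return out
}

func main() {
	const importStart = 100
	rows := []Row{
		{Key: "a", Timestamp: 10},  // existed before the IMPORT
		{Key: "b", Timestamp: 20},  // existed before the IMPORT
		{Key: "c", Timestamp: 110}, // written by the IMPORT
		{Key: "d", Timestamp: 120}, // written by the IMPORT
	}

	// On the original cluster the rollback keeps exactly the pre-import rows.
	fmt.Println(len(rollbackImport(rows, importStart))) // 2

	// After a timestamp-rewriting restore, every row looks newer than the
	// import start, so the same rollback would delete the pre-import rows too.
	restored := restoreRewritingTimestamps(rows, 500)
	fmt.Println(len(rollbackImport(restored, importStart))) // 0
}
```

In this toy model the rollback cannot tell existing rows from imported ones after the restore, which is the problem the rest of this issue addresses.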

The second paragraph in #76722 outlined one strategy, which involved writing additional metadata to each imported key, and indeed several PRs began implementing this approach (#85338, #85692, #85138). However, we realized that binding the ImportStartTime in the backed up table descriptor is sufficient. Specifically, when the `restore_data_processor` rewrites backed up keys to the restore cluster, it can use the ImportStartTime in the restored table descriptor to filter out the keys written by the backed up, in-progress import, _before_ AddSSTable rewrites the timestamps of all the keys.
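The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the actual `restore_data_processor` code: the `KV` and `TableDescriptor` types, the integer timestamps, the convention that a zero `ImportStartTime` means "no import in progress", and the function names `elideImportedKeys` and `ingest` are all hypothetical. The boundary (keys strictly newer than the import start are dropped) mirrors the rollback rule described earlier.

```go
package main

import "fmt"

// KV is a hypothetical backed up key with the timestamp it had at backup time.
type KV struct {
	Key       string
	Timestamp int64
}

// TableDescriptor is a hypothetical, pared-down descriptor. ImportStartTime is
// assumed to be zero when the table had no import in progress at backup time.
type TableDescriptor struct {
	Name            string
	ImportStartTime int64
}

// elideImportedKeys drops keys written by the in-progress import, using the
// original backup timestamps, which are still intact at this point.
func elideImportedKeys(kvs []KV, desc TableDescriptor) []KV {
	if desc.ImportStartTime == 0 {
		return kvs
	}
	kept := make([]KV, 0, len(kvs))
	for _, kv := range kvs {
		if kv.Timestamp <= desc.ImportStartTime {
			kept = append(kept, kv)
		}
	}
	return kept
}

// ingest models the later step in which every key is rewritten to the restore
// time; after this, imported and pre-existing keys are indistinguishable.
func ingest(kvs []KV, restoreTime int64) []KV {
	out := make([]KV, len(kvs))
	for i, kv := range kvs {
		out[i] = KV{Key: kv.Key, Timestamp: restoreTime}
	}
	return out
}

func main() {
	desc := TableDescriptor{Name: "t", ImportStartTime: 100}
	backedUp := []KV{
		{Key: "a", Timestamp: 10},
		{Key: "b", Timestamp: 20},
		{Key: "c", Timestamp: 110}, // written by the in-progress import
	}

	// Filter first, then rewrite timestamps.
	restored := ingest(elideImportedKeys(backedUp, desc), 500)
	for _, kv := range restored {
		fmt.Println(kv.Key, kv.Timestamp)
	}
}
```

The order of the two calls is the point of the sketch: the filter has to run while the keys still carry their backup timestamps.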

Note: the more complicated approach outlined in #76722 would have been necessary if RESTOREs of _whole_ tenants implemented MVCC AddSSTable -- i.e. rewrote timestamps in RESTORE -- because during the restore, the host tenant cannot access tenant table descriptors and thus cannot filter keys in the restore processor. Indeed, we previously thought it was necessary to make _whole_ tenant RESTOREs MVCC compatible. We no longer think that whole tenant operations (like tenant streaming) need to be MVCC, since it's relatively easy to ensure that all downstream operations understand that whole tenant operations are non-MVCC. Given that whole tenant restores will continue to preserve timestamps from the backup, the restored tenant can roll back its import using the normal timestamp-based process described above.


Jira issue: CRDB-18546

backupccl: incrementally backup in progress imports on existing tables and elide importing data in RESTORE #86054
