backupccl: offline table data in revision history backups can leak into restored cluster #88042

@msbutler

Before 22.2, backups were assumed to exclude data from offline (importing) tables. However, backups with revision history do contain offline table data, because getRelevantDescChanges includes offline table descriptors contained in the target database(s). Note that it makes sense to include the offline table in the backup: this ensures the user can conduct a RESTORE AOST=beforeImportStartTime, which restores the importing table to its pre-import state. However, the inclusion of offline table data in the backup can also lead to corrupt data on restore.

Consider the following sequences:

Sequence 1: with a rolled back import
t0: begin IMPORT on foo
t1: backup foo with revision history - captures foo's pre-import state and some importing data
t2: rollback import foo via non-mvcc clear range
t3: incremental backup foo with revision history

  • fails to reintroduce foo

t4: restore foo to latest time

  • because of the non-mvcc clear range, the incremental backup is completely unaware of the rollback, so the importing data will get restored.
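
Sequence 1 can be reproduced in a toy model. The sketch below is not CockroachDB code: the `store`, `version`, `backup`, and `restore` names are all hypothetical, and timestamps are plain integers. It only illustrates why a clear range that leaves no MVCC tombstone is invisible to an incremental backup, while an MVCC delete would have been captured.

```go
package main

import "fmt"

// version is one MVCC version of a key (hypothetical, simplified).
type version struct {
	ts      int
	value   string
	deleted bool // true for an MVCC tombstone
}

// store maps a key to all of its versions.
type store map[string][]version

func (s store) put(key, value string, ts int) {
	s[key] = append(s[key], version{ts: ts, value: value})
}

// mvccDelete records the deletion as a new version (a tombstone).
func (s store) mvccDelete(key string, ts int) {
	s[key] = append(s[key], version{ts: ts, deleted: true})
}

// clearRange drops every version of the key and leaves no tombstone,
// mimicking a non-MVCC clear range.
func (s store) clearRange(key string) { delete(s, key) }

// backup copies every version with start < ts <= end, the way a
// revision-history backup captures changes in its time interval.
func (s store) backup(start, end int) store {
	out := store{}
	for k, vs := range s {
		for _, v := range vs {
			if v.ts > start && v.ts <= end {
				out[k] = append(out[k], v)
			}
		}
	}
	return out
}

// restore layers the backups and returns the latest live value per key.
func restore(layers ...store) map[string]string {
	latest := map[string]version{}
	for _, l := range layers {
		for k, vs := range l {
			for _, v := range vs {
				if cur, ok := latest[k]; !ok || v.ts >= cur.ts {
					latest[k] = v
				}
			}
		}
	}
	out := map[string]string{}
	for k, v := range latest {
		if !v.deleted {
			out[k] = v.value
		}
	}
	return out
}

func main() {
	s := store{}
	s.put("foo/1", "pre-import row", 1)
	s.put("foo/2", "importing row", 2) // t0: IMPORT starts writing
	full := s.backup(0, 3)             // t1: full backup
	s.clearRange("foo/2")              // t2: non-MVCC rollback
	inc := s.backup(3, 5)              // t3: incremental sees no change

	_, leaked := restore(full, inc)["foo/2"] // t4: restore to latest
	fmt.Println("importing row restored:", leaked)
}
```

Replacing `clearRange` with `mvccDelete("foo/2", 4)` in this model makes the incremental layer carry a tombstone, and the importing row no longer survives the restore.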

Sequence 2: with a completed import
t0: begin IMPORT on foo
t1: backup foo with revision history - captures foo's pre-import state and some importing data
t2: complete IMPORT on foo
t3: incremental backup foo with revision history

  • fails to reintroduce foo. If the IMPORT used non-mvcc AddSSTable, then this incremental backup could also have missed
    spans from the completed import, leading to data loss.

t4: restore foo to latest time

  • foo could get restored with some of the imported data

Important note: in either scenario, if another incremental backup completed between t0 and t2, the backup/restore would work just fine. I.e. if an incremental backup observed the table offline at the start and end of its interval, there's no bug.

This bug is closely related to #87305, which is also apparent in 22.2, except that this one also manifests on earlier releases and only for backups with revision history. Further, this bug is actually worse than #87305: here, the incremental backup at t3 does not reintroduce foo's spans, so the backup chain cannot be restored to a valid state, and the resulting data corruption is currently undetectable.

Here's the root cause:

  • foo only appears in the revs input to getReintroducedSpans, not in tables, because tables is created from prevBackup.Descriptors and, as described above, foo is excluded from this field. This matters because when we look through revs to add to tablesToReinclude and reintroducedTables, we only add a rev if its table was already in the offlineInLastBackup map, which is constructed from the tables variable.
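
The filtering described above can be sketched as follows. This is a simplified illustration, not the actual getReintroducedSpans implementation: the `tableDesc` type and the `reintroducedTables` function are hypothetical stand-ins, and only the `offlineInLastBackup` lookup mirrors the behavior described in this issue.

```go
package main

import "fmt"

// tableDesc is a hypothetical, simplified stand-in for a table descriptor.
type tableDesc struct {
	ID      int
	Name    string
	Offline bool
}

// reintroducedTables returns the IDs of tables whose spans would be
// reintroduced. prevTables plays the role of the tables built from
// prevBackup.Descriptors; revs plays the role of the descriptor revisions
// seen by the current incremental backup.
func reintroducedTables(prevTables, revs []tableDesc) map[int]struct{} {
	// Only tables listed as offline in the previous backup are candidates.
	offlineInLastBackup := make(map[int]struct{})
	for _, t := range prevTables {
		if t.Offline {
			offlineInLastBackup[t.ID] = struct{}{}
		}
	}

	out := make(map[int]struct{})
	for _, r := range revs {
		// A revision that shows the table back online triggers a
		// reintroduction only if the table is in offlineInLastBackup.
		if _, ok := offlineInLastBackup[r.ID]; ok && !r.Offline {
			out[r.ID] = struct{}{}
		}
	}
	return out
}

func main() {
	revs := []tableDesc{
		{ID: 52, Name: "foo", Offline: true},  // revision during the IMPORT
		{ID: 52, Name: "foo", Offline: false}, // revision after the IMPORT ends
	}

	// foo is missing from the previous backup's descriptors, so it is
	// never a candidate, even though revs shows it coming back online.
	fmt.Println("reintroduced without foo in prev:", len(reintroducedTables(nil, revs)))

	// If the previous backup had listed foo as offline, it would be picked up.
	prev := []tableDesc{{ID: 52, Name: "foo", Offline: true}}
	fmt.Println("reintroduced with foo in prev:", len(reintroducedTables(prev, revs)))
}
```

In this model the first call reintroduces nothing, which corresponds to the incremental backup at t3 skipping foo's spans.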

The implications:

  • This backup chain cannot get restored to a valid state because we did not reintroduce foo.
  • The next full backup that runs will be fine.
  • This bug should not affect a backup chain that has been taken fully on 22.2, where we back up all offline spans, but could affect a chain with backups taken on earlier versions.

Jira issue: CRDB-19657

Labels: A-disaster-recovery, C-bug, T-disaster-recovery