backupccl: offline table data in revision history backups can leak into restored cluster #88042
Description
Before 22.2, backups were assumed to exclude data from offline (importing) tables. However, backups with revision history will contain offline table data, because getRelevantDescChanges includes offline table descriptors contained in the target database(s). Note that it makes sense to include the offline table in the backup so that the user can run a RESTORE AOST=beforeImportStartTime, which would restore the importing table to its pre-import state. However, the inclusion of offline table data in the backup can also lead to corrupt data on restore.
Consider the following sequences:
Sequence 1: with a rolled back import
t0: begin IMPORT on foo
t1: backup foo with revision history - captures foo's pre-import state and some importing data
t2: rollback import foo via non-mvcc clear range
t3: incremental backup foo with revision history
- fails to reintroduce foo
t4: restore foo to latest time
- because the clear range is non-mvcc, the incremental backup is completely unaware of the rollback, so the importing data gets restored.
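Sequence 1 can be illustrated with a toy versioned key-value store. This is only a sketch: the `store`, `export`, `restore`, and `simulate` names, the keys, and the integer timestamps are all hypothetical and do not correspond to CockroachDB code. The MVCC-tombstone branch is included purely as a contrast, to show why a rollback that leaves no versioned trace is invisible to an incremental backup that only reads changes within its interval.

```go
package main

import "fmt"

// version is one versioned write to a key; Deleted marks a tombstone.
type version struct {
	TS      int
	Value   string
	Deleted bool
}

// store is a hypothetical toy model: key -> versions of that key.
type store map[string][]version

// export copies the versions with start < TS <= end, the way an incremental
// backup only captures changes inside its time interval.
func export(s store, start, end int) store {
	out := store{}
	for k, vs := range s {
		for _, v := range vs {
			if v.TS > start && v.TS <= end {
				out[k] = append(out[k], v)
			}
		}
	}
	return out
}

// restore merges backup layers and returns the live keys at the latest time.
func restore(layers ...store) map[string]string {
	latest := map[string]version{}
	for _, layer := range layers {
		for k, vs := range layer {
			for _, v := range vs {
				if cur, ok := latest[k]; !ok || v.TS >= cur.TS {
					latest[k] = v
				}
			}
		}
	}
	live := map[string]string{}
	for k, v := range latest {
		if !v.Deleted {
			live[k] = v.Value
		}
	}
	return live
}

// simulate replays Sequence 1 with t0=10, t1=20, t2=30, t3=40.
func simulate(mvccRollback bool) map[string]string {
	s := store{
		"foo/1": {{TS: 5, Value: "pre-import row"}},
		"foo/2": {{TS: 15, Value: "importing row"}}, // written after t0
	}
	full := export(s, 0, 20) // t1: backup captures some importing data
	if mvccRollback {
		// Contrast case: a tombstone leaves a versioned trace at t2.
		s["foo/2"] = append(s["foo/2"], version{TS: 30, Deleted: true})
	} else {
		// Non-MVCC clear range: the key vanishes with no trace.
		delete(s, "foo/2")
	}
	inc := export(s, 20, 40) // t3: incremental backup
	return restore(full, inc) // t4: restore to latest time
}

func main() {
	fmt.Println("non-mvcc rollback:", simulate(false))
	fmt.Println("mvcc rollback:    ", simulate(true))
}
```

In this model, the non-mvcc rollback leaves `foo/2` in the restored data because the incremental layer contains nothing that supersedes the row captured at t1.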
Sequence 2: with a completed import
t0: begin IMPORT on foo
t1: backup foo with revision history - captures foo's pre-import state and some importing data
t2: complete IMPORT on foo
t3: incremental backup foo with revision history
- fails to reintroduce foo. If the IMPORT used non-mvcc AddSSTable, then this incremental backup could have missed spans from the completed import, leading to data loss.
t4: restore foo to latest time
- foo could get restored with some of the imported data
Important note: in either scenario, if another incremental backup completed between t0 and t2, the backup/restore would work just fine. I.e. if an incremental backup observed the table offline at the start and end of its interval, there's no bug.
This bug closely relates to #87305, which is also apparent in 22.2, except that this one also manifests on earlier releases and only for backups with revision history. Further, this bug is actually worse than #87305: here, the incremental backup at t3 does not reintroduce foo's spans, so the backup chain cannot be restored to a valid state and, currently, the resulting data corruption is undetectable.
Here's the root cause:
- foo only appears in the revs input to getReintroducedSpans, not in tables, because tables is created from prevBackup.Descriptors but, as described above, foo is excluded from this field. This matters because when we look through revs to add to tablesToReinclude and reintroducedTables, we only add a rev if it was already in the offlineInLastBackup map, which is constructed with the tables variable.
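The filtering described above can be sketched in a few lines of Go. This is a simplified model, not the actual getReintroducedSpans implementation: the `tableDesc` type, the `reintroducedTables` function, and the table ID 52 are hypothetical, and the only behavior modeled is the one stated in the root cause, namely that a revision is considered only if its table is already in a map built from the previous backup's descriptors.

```go
package main

import "fmt"

// tableDesc is a hypothetical, stripped-down stand-in for a table descriptor.
type tableDesc struct {
	ID      int
	Name    string
	Offline bool
}

// reintroducedTables returns the IDs of tables that would be reintroduced.
// offlineInLastBackup is built only from prevDescs, so a table that shows up
// in revs but not in prevDescs can never qualify.
func reintroducedTables(prevDescs, revs []tableDesc) []int {
	offlineInLastBackup := map[int]struct{}{}
	for _, t := range prevDescs {
		if t.Offline {
			offlineInLastBackup[t.ID] = struct{}{}
		}
	}

	added := map[int]struct{}{}
	var out []int
	for _, rev := range revs {
		if _, ok := offlineInLastBackup[rev.ID]; !ok {
			continue // never seen offline in the previous backup's descriptors
		}
		if _, dup := added[rev.ID]; dup {
			continue // already recorded from an earlier revision
		}
		added[rev.ID] = struct{}{}
		out = append(out, rev.ID)
	}
	return out
}

func main() {
	foo := tableDesc{ID: 52, Name: "foo", Offline: true}
	// Revisions of foo seen by the incremental backup: offline, then online.
	revs := []tableDesc{foo, {ID: 52, Name: "foo", Offline: false}}

	// foo is absent from the previous backup's descriptor list: nothing
	// is reintroduced.
	fmt.Println(reintroducedTables(nil, revs)) // prints []

	// If the offline foo had been listed, it would be reintroduced.
	fmt.Println(reintroducedTables([]tableDesc{foo}, revs)) // prints [52]
}
```

The first call corresponds to the buggy case in this issue; the second shows what the same filter produces when the offline descriptor is present in the previous backup's descriptor list.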
The implications:
- This backup chain cannot get restored to a valid state because we did not reintroduce foo.
- The next full backup that runs will be fine.
- This bug should not affect a backup chain that has been taken entirely on 22.2, where we back up all offline spans, but could affect a chain with backups taken on earlier versions.
Jira issue: CRDB-19657