
backupccl: write a slimmer manifest without alloc heavy repeated fields #93272

@adityamaru


On backups with many incremental layers, we see significant memory spikes when unmarshalling backup manifests. This is not a new issue, but it is exacerbated by long incremental chains on clusters with many ranges and/or descriptor changes.

[Image: profile]

While a large part of the manifest handling logic is memory monitored, we only have an approximate accounting for the unmarshalling of the manifest. This is because, prior to unmarshalling, we do not know how many entries the repeated fields in the manifest contain, and so cannot accurately account for the space needed by all the slices in the manifest protobuf.

In a future release, we will be moving all repeated fields in the backup manifest to SST files that can then be iterated over when they need to be read. This will slim down the manifest significantly and reduce the amount of memory used by the code resolving the manifests of a backup chain. While this will be available in a future release, this issue tracks the work with an eye toward backporting it to existing releases of Cockroach to further reduce the chance of OOMs.

Proposal:

  1. Older releases of Cockroach will write a new NEW_BACKUP_MANIFEST side-by-side with the existing BACKUP_MANIFEST, with the Files (and potentially Descs) repeated field of the manifest nil'ed out. This repeated field will instead be written as a files.sst at the time of the backup. We continue writing BACKUP_MANIFEST to maintain mixed-version compatibility.

  2. All read paths on older releases of Cockroach will first check for the existence of a NEW_BACKUP_MANIFEST. If found, we will unmarshal the slimmed-down manifest and read the Files from the files.sst. This unmarshalling is expected to allocate far less memory because the repeated fields are nil'ed out.

  3. If NEW_BACKUP_MANIFEST does not exist, we will default to our current behaviour of reading BACKUP_MANIFEST.
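
The three steps above can be sketched roughly as follows. This is only an illustration of the write-both / read-slim-first flow, not the backupccl implementation: `Manifest`, `memStore`, `writeManifests`, and `readManifest` are hypothetical names, the in-memory maps stand in for external storage, and a plain string slice stands in for both the repeated Files field and the contents of files.sst.

```go
package main

import (
	"errors"
	"fmt"
)

// Object names taken from the proposal.
const (
	slimManifestName = "NEW_BACKUP_MANIFEST"
	fullManifestName = "BACKUP_MANIFEST"
	filesSSTName     = "files.sst"
)

var errNotFound = errors.New("no backup manifest found")

// Manifest is a hypothetical stand-in for the backup manifest protobuf.
type Manifest struct {
	ID    string   // stand-in for the non-repeated fields
	Files []string // stand-in for the repeated Files field
}

// memStore is a hypothetical in-memory stand-in for external storage.
type memStore struct {
	manifests map[string]Manifest
	ssts      map[string][]string
}

func newMemStore() *memStore {
	return &memStore{
		manifests: map[string]Manifest{},
		ssts:      map[string][]string{},
	}
}

// writeManifests mirrors step 1: keep writing the full manifest for
// mixed-version compatibility, and also write a slim copy with Files
// nil'ed out plus a separate files.sst holding the file entries.
func writeManifests(s *memStore, m Manifest) {
	s.manifests[fullManifestName] = m

	slim := m
	slim.Files = nil
	s.manifests[slimManifestName] = slim

	s.ssts[filesSSTName] = append([]string(nil), m.Files...)
}

// readManifest mirrors steps 2 and 3: prefer the slim manifest and read
// Files from files.sst; otherwise fall back to the full manifest.
func readManifest(s *memStore) (Manifest, error) {
	if slim, ok := s.manifests[slimManifestName]; ok {
		files, ok := s.ssts[filesSSTName]
		if !ok {
			return Manifest{}, fmt.Errorf("%s exists but %s is missing",
				slimManifestName, filesSSTName)
		}
		slim.Files = files
		return slim, nil
	}
	if full, ok := s.manifests[fullManifestName]; ok {
		return full, nil
	}
	return Manifest{}, errNotFound
}
```

Note that this sketch loads every entry of files.sst back into a slice for brevity; a read path that iterates over the SST when the entries are needed would avoid holding them all in memory at once.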

Note, this work will be superseded by work to switch all repeated fields in the manifest to SSTs, but in the meantime this should allow customers to run long(er) incremental chains without OOMing.

Jira issue: CRDB-22254

Epic: CRDB-19061

gz#15223
