backupccl: write a slimmer manifest without alloc heavy repeated fields #93272
Description
On backups with many incremental layers, we see significant memory spikes when unmarshalling backup manifests. This is not a new issue, but it is exacerbated by long incremental chains on clusters with many ranges and/or descriptor changes.
While a large part of the manifest handling logic is memory monitored, we only have an approximate accounting for the unmarshalling of the manifest. This is because prior to unmarshalling we do not know how many repeated fields there might be in the manifest and so cannot accurately allocate space for all the slices in the manifest protobuf.
In a future release, we will move all repeated fields in the backup manifest to SST files that can be iterated over when they need to be read. This will slim down the manifest significantly and reduce the memory used by the code that resolves the manifests of a backup chain. Since that change will only land in a future release, this issue tracks interim work, with an eye toward backporting to existing releases of Cockroach, to further reduce the chance of OOMs.
Proposal:
- Older releases of Cockroach will write a new `NEW_BACKUP_MANIFEST` side-by-side with the existing `BACKUP_MANIFEST`, with the `Files` (and potentially `Descs`) repeated field of the manifest nil'ed out. This repeated field will instead be written as a `files.sst` at the time of the backup. We continue writing `BACKUP_MANIFEST` to maintain mixed-version compatibility.
- All read paths on older releases of Cockroach will first check for the existence of a `NEW_BACKUP_MANIFEST`. If found, we will unmarshal the slimmed-down manifest and read the `Files` from the `files.sst`. This unmarshalling is expected to allocate far less because of the nil'ed out repeated fields.
- If `NEW_BACKUP_MANIFEST` does not exist, we will default to our current behaviour of reading `BACKUP_MANIFEST`.
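The read-path fallback described in the proposal could look roughly like the sketch below. It is illustrative only and not the backupccl implementation: `memStore`, `loadedManifest`, `loadManifest`, and `errNotFound` are hypothetical names, the in-memory map stands in for the real backup destination, and the manifest bytes are left opaque rather than unmarshalled. Treating a `NEW_BACKUP_MANIFEST` without an accompanying `files.sst` as an error is an assumption of this sketch, not something the proposal specifies.

```go
package main

import (
	"errors"
	"fmt"
)

// File names taken from the proposal text.
const (
	slimManifestName   = "NEW_BACKUP_MANIFEST"
	legacyManifestName = "BACKUP_MANIFEST"
	filesSSTName       = "files.sst"
)

// errNotFound is a hypothetical sentinel meaning "no such object".
var errNotFound = errors.New("object not found")

// memStore is a hypothetical in-memory stand-in for a backup destination.
type memStore map[string][]byte

// read returns the named object, or an error wrapping errNotFound.
func (s memStore) read(name string) ([]byte, error) {
	b, ok := s[name]
	if !ok {
		return nil, fmt.Errorf("%s: %w", name, errNotFound)
	}
	return b, nil
}

// loadedManifest records which manifest was chosen and whether the
// Files entries must be iterated from files.sst instead of the manifest.
type loadedManifest struct {
	source       string
	raw          []byte
	filesFromSST bool
}

// loadManifest prefers the slim manifest and falls back to the legacy one.
func loadManifest(s memStore) (loadedManifest, error) {
	raw, err := s.read(slimManifestName)
	if err == nil {
		// Assumption of this sketch: a slim manifest is only usable
		// when its files.sst is present alongside it.
		if _, err := s.read(filesSSTName); err != nil {
			return loadedManifest{}, fmt.Errorf("slim manifest without file list: %w", err)
		}
		return loadedManifest{source: slimManifestName, raw: raw, filesFromSST: true}, nil
	}
	if !errors.Is(err, errNotFound) {
		// A failure other than "missing" should not be masked by the fallback.
		return loadedManifest{}, err
	}
	// No slim manifest: keep the current behaviour.
	raw, err = s.read(legacyManifestName)
	if err != nil {
		return loadedManifest{}, err
	}
	return loadedManifest{source: legacyManifestName, raw: raw}, nil
}
```

A caller would invoke `loadManifest` once per layer of the incremental chain and, when `filesFromSST` is set, iterate the file entries from `files.sst` instead of expecting them in the unmarshalled manifest.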
Note, this work will be superseded by work to switch all repeated fields in the manifest to SSTs, but in the meantime this should allow customers to run long(er) incremental chains without OOMing.
Jira issue: CRDB-22254
Epic: CRDB-19061
gz#15223
