backupccl: scheduled backups should stop adding to an incremental chain after a certain number of running jobs #96110

@adityamaru

In https://github.com/cockroachlabs/support/issues/2030 we saw a schedule running incremental backups every hour with on_previous_running set to start. The hourly incrementals were not completing, which led to a buildup of 30+ running incremental jobs. This in turn caused nodes to OOM and general cluster instability. Strictly speaking, this is working as expected, but it is an easy footgun and one we should safeguard against. If a backup schedule observes more than x incremental backup jobs running on its behalf, we should take action. One option is to skip scheduling new incrementals until the running job count falls below x, with adequate logs/warnings.
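The proposed guard could look roughly like the sketch below. It is illustrative Python, not CockroachDB code: `decide_incremental`, `ScheduleDecision`, and the `max_running` parameter (the "x" above) are hypothetical names, and the exact boundary (skip only when the count is strictly greater than x) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScheduleDecision:
    """Outcome of the guard: whether to start a new incremental, and why."""
    start_incremental: bool
    reason: str


def decide_incremental(running_incrementals: int, max_running: int) -> ScheduleDecision:
    """Decide whether a schedule should start another incremental backup.

    running_incrementals: incremental jobs currently running on behalf of
        this schedule (hypothetical input; how it is counted is not shown).
    max_running: the threshold "x" from the issue (hypothetical parameter).
    """
    if running_incrementals < 0 or max_running < 0:
        raise ValueError("counts must be non-negative")

    if running_incrementals > max_running:
        # Skip this run instead of piling another job onto the chain.
        # A real implementation would also log or surface a warning here.
        return ScheduleDecision(
            start_incremental=False,
            reason=(
                f"skipping incremental: {running_incrementals} jobs running, "
                f"limit is {max_running}"
            ),
        )

    return ScheduleDecision(start_incremental=True, reason="within limit")
```

With a limit of 5, a schedule that already has 31 running incrementals would skip its next run, while one with 5 or fewer would proceed as usual.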

Jira issue: CRDB-23951

Epic CRDB-21944
