Handle retention of failed and partial snapshots in SLM #46988
Description
Problem
Right now, SLM treats PARTIAL and FAILED snapshots the same, and both are kept around forever. This is unlikely to be the behavior users will expect from SLM, so SLM should handle retention of partial and failed snapshots as well.
Proposed solution
FAILED snapshots will be kept until the configured expire_after period has passed, if present, and then be deleted. If the retention policy has no expire_after configured, they will be deleted once there is at least one more recent successful snapshot from this policy; until then they are kept, as they may be useful for troubleshooting purposes. Failed snapshots are not counted towards either min_count or max_count. (This has been implemented in #47617)
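The rule for FAILED snapshots can be sketched as a small predicate. This is only an illustration of the wording above, not the code from #47617; the `Snapshot` type, the state strings, and the function name `should_delete_failed` are hypothetical, and "expired" is assumed to mean that the snapshot's age is at least expire_after.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for a snapshot taken by one SLM policy."""
    name: str
    state: str  # assumed values: "SUCCESS", "PARTIAL", "FAILED"
    start_time: datetime


def should_delete_failed(
    snap: Snapshot,
    policy_snaps: Sequence[Snapshot],
    now: datetime,
    expire_after: Optional[timedelta],
) -> bool:
    """Decide whether a FAILED snapshot is eligible for deletion.

    min_count and max_count are deliberately absent: failed snapshots
    do not count towards either of them.
    """
    if snap.state != "FAILED":
        raise ValueError("this sketch only covers FAILED snapshots")
    if expire_after is not None:
        # expire_after is configured: keep until the period has passed.
        return now - snap.start_time >= expire_after
    # No expire_after: delete only if a more recent successful snapshot
    # from the same policy exists.
    return any(
        s.state == "SUCCESS" and s.start_time > snap.start_time
        for s in policy_snaps
    )
```

Passing `expire_after=None` models a retention policy without that setting, which is the case where the failed snapshot is kept around for troubleshooting until a newer success appears.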
PARTIAL snapshots are more likely to be useful, so they need to be handled a bit differently. For this case, there are two potential routes: one that is simple, and one that attempts to be intuitive.
Simple
Partial snapshots are retained unless there is at least one more recent successful snapshot from the same policy, at which point they are deleted after the expire_after period has passed, if present. If expire_after is not present and there is a more recent successful snapshot, they are deleted in the next retention run. In this case, partial snapshots are not counted toward either min_count or max_count, which count successful snapshots only. (This has been implemented in #47833)
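A minimal sketch of the simple route, written only from the description above and not taken from #47833. The function name and parameters are hypothetical; snapshots are reduced to their start times, and "expired" is assumed to mean an age of at least expire_after.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional


def should_delete_partial_simple(
    partial_start: datetime,
    successful_starts: Iterable[datetime],
    now: datetime,
    expire_after: Optional[timedelta] = None,
) -> bool:
    """Simple route: a partial snapshot is only ever deleted once a more
    recent successful snapshot from the same policy exists."""
    has_newer_success = any(t > partial_start for t in successful_starts)
    if not has_newer_success:
        # Retained regardless of age.
        return False
    if expire_after is None:
        # No expire_after: eligible in the next retention run.
        return True
    # expire_after present: wait until the period has passed.
    return now - partial_start >= expire_after
```

As with failed snapshots, min_count and max_count do not appear here, because in this route they count successful snapshots only.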
Complex
- If min_count is the only condition: No snapshots for this policy are ever deleted, so partial snapshots have no special handling.
- If expire_after is the only condition: At least one successful snapshot will be kept, regardless of expire_after. Partial snapshots are deleted after the expire_after period has passed, regardless of whether or not there is a more recent successful snapshot.
- If max_count is the only condition: At least one successful snapshot will be kept. Partial snapshots are deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count.
- If min_count and expire_after are configured: At least min_count successful snapshots will be retained. Partial snapshots are deleted after the expire_after period has passed, regardless of whether or not there is a more recent successful snapshot.
- If min_count and max_count are configured: At least min_count successful snapshots will be retained. All other snapshots, whether successful or partial, will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count.
- If expire_after and max_count are configured: At least one successful snapshot will be kept, regardless of expire_after. Partial snapshots will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count, as well as after the expire_after period, regardless of whether there is a more recent successful snapshot.
- If all three conditions are configured: At least min_count successful snapshots will be retained. All other snapshots, whether successful or partial, will be deleted, oldest first, to keep successful_snaps + partial_snaps equal to or less than max_count, as well as after the expire_after period has passed, regardless of whether there is a more recent successful snapshot.
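One possible reading of the complex route folds all seven cases into a single procedure: protect the newest min_count successful snapshots (or one, if min_count is not set), then apply expire_after and max_count to everything else. The sketch below is an interpretation, not a proposed implementation. The `Snap` type and `snapshots_to_delete` are hypothetical names, snapshot names are assumed to be unique, failed snapshots are assumed to be excluded from the input, and successful snapshots outside the protected set are assumed to be subject to expire_after and max_count in every case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence, Set


@dataclass(frozen=True)
class Snap:
    """Hypothetical snapshot record; state is "SUCCESS" or "PARTIAL"."""
    name: str
    state: str
    start_time: datetime


def snapshots_to_delete(
    snaps: Sequence[Snap],
    now: datetime,
    min_count: Optional[int] = None,
    max_count: Optional[int] = None,
    expire_after: Optional[timedelta] = None,
) -> Set[str]:
    newest_first = sorted(snaps, key=lambda s: s.start_time, reverse=True)

    # Always keep the newest min_count successful snapshots, or at least
    # one successful snapshot when min_count is not configured.
    keep_n = min_count if min_count is not None else 1
    successes = [s for s in newest_first if s.state == "SUCCESS"]
    protected = {s.name for s in successes[:keep_n]}

    doomed: Set[str] = set()

    # expire_after: anything unprotected past the period is deleted,
    # whether or not a more recent successful snapshot exists.
    if expire_after is not None:
        for s in snaps:
            if s.name not in protected and now - s.start_time >= expire_after:
                doomed.add(s.name)

    # max_count: delete unprotected snapshots, oldest first, until
    # successful_snaps + partial_snaps <= max_count.
    if max_count is not None:
        remaining = len(snaps) - len(doomed)
        for s in reversed(newest_first):
            if remaining <= max_count:
                break
            if s.name in protected or s.name in doomed:
                continue
            doomed.add(s.name)
            remaining -= 1

    return doomed
```

With only min_count set, neither deletion step runs and the result is empty, which matches the first case in the list. If min_count exceeds max_count, the protected set wins and the total may stay above max_count; the list above does not say how that conflict should be resolved.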
Relates to #43663