SLM as a standalone snapshot-taking tool is taking shape as described in #38461. However, to fully utilize SLM, we should implement retention for the snapshots that SLM takes.
Policy definition would change to something like:
```js
PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}
```
Snapshot retention would kick in on a schedule (supporting cron expressions) configured via the newly introduced `slm.retention_schedule` cluster setting. This would allow administrators to control when snapshots are deleted (so as not to interfere with other cluster operations).
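For example, an administrator could set the retention schedule through the standard cluster settings API (the cron value here is illustrative):

```js
PUT /_cluster/settings
{
  "persistent": {
    "slm.retention_schedule": "0 30 1 * * ?"
  }
}
```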
Potentially, SLM retention would need to cap the amount of time spent deleting snapshots (probably with another cluster setting) so long-running deletes don't cause issues with other cluster operations.
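Such a time cap could look like the following sketch: a deadline is computed up front, and deletions stop once it passes, leaving the remaining snapshots for the next retention run. The function and parameter names (`run_retention`, `max_duration_s`) are hypothetical, not the actual implementation.

```python
import time

def run_retention(eligible, delete_fn, max_duration_s=3600.0, clock=time.monotonic):
    """Delete snapshots until done or the time budget is exhausted.

    `eligible` is an oldest-first list of snapshot names; `delete_fn`
    performs one deletion. Returns the names actually deleted. Any
    snapshots left over simply wait for the next scheduled run.
    """
    deadline = clock() + max_duration_s
    deleted = []
    for name in eligible:
        if clock() >= deadline:
            break  # budget spent; stop deleting for this run
        delete_fn(name)
        deleted.append(name)
    return deleted
```

An injectable `clock` keeps the time budget testable without real waiting.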
Potential list of snapshot retention conditions:
- age-based retention (delete snapshots after N days)
- minimum number of snapshots to keep
- maximum number of snapshots to allow (delete oldest if there are too many)
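The three conditions above interact: `min_count` acts as a floor that overrides both the age and count limits. A minimal sketch of how they might combine (the helper name and the use of `timedelta` in place of the policy's `"14d"`-style strings are illustrative assumptions):

```python
from datetime import datetime, timedelta

def snapshots_to_delete(snapshots, expire_after=None, max_count=None,
                        min_count=0, now=None):
    """Pick snapshots eligible for deletion, oldest first.

    `snapshots` is a list of (name, start_time) tuples. A snapshot is
    deleted if it is older than `expire_after` or if keeping it would
    exceed `max_count`, but never if that would leave fewer than
    `min_count` snapshots.
    """
    now = now or datetime.utcnow()
    ordered = sorted(snapshots, key=lambda s: s[1])  # oldest first
    to_delete = []
    for name, started in ordered:
        remaining = len(ordered) - len(to_delete)
        if remaining <= min_count:
            break  # never drop below the configured minimum
        expired = expire_after is not None and now - started > expire_after
        over_cap = max_count is not None and remaining > max_count
        if expired or over_cap:
            to_delete.append(name)
        else:
            break  # snapshots are ordered, so the rest are kept too
    return to_delete
```

For instance, with five daily snapshots, `expire_after` of three days, and `min_count` of two, only the oldest three would be deleted; the two most recent survive even once they expire.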
Some things to work out:
- What should we do with FAILED/PARTIAL snapshots? Should they be subject to the same retention, or separate retention?
  - For the first release, PARTIAL snapshots will be treated as failed and will not be eligible for retention.
- Are there retry policies for deletion, or should we wait for the next invocation of the retention task?
- Does the order of old-snapshot deletion matter?
  - Oldest snapshots will be deleted first.
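The first-release decision above amounts to a simple pre-filter before retention rules are applied. A sketch (the state names mirror snapshot states, but the helper itself is hypothetical):

```python
# Only successful snapshots are candidates for retention-driven
# deletion in the first release; PARTIAL is treated like FAILED
# and left untouched.
RETENTION_ELIGIBLE_STATES = {"SUCCESS"}

def eligible_for_retention(snapshots):
    """Filter (name, state) pairs down to retention candidates."""
    return [name for name, state in snapshots
            if state in RETENTION_ELIGIBLE_STATES]
```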
Task Checklist
- `_meta` in `CreateSnapshotRequest` (@gwbrown): Add custom metadata to snapshots #41281
- `_meta` associating each snapshot with the policy that created it (@gwbrown): Include SLM policy name in Snapshot metadata #43132
- Base framework for retention (behind a feature flag, `slm-retention`) (@dakrone): Add base framework for snapshot retention #43605
- `SnapshotLifecyclePolicy` to support retention configuration (@dakrone): Add SnapshotRetentionConfiguration for retention configuration #43777
- `SnapshotRetentionTask` to implement snapshot deletion (@dakrone): Implement SnapshotRetentionTask's snapshot filtering and deletion #44764
- `SnapshotRetentionConfiguration` predicates (@dakrone): Add min_count and max_count as SLM retention predicates #44926
- `OperationMode` (@dakrone): Skip SLM retention if ILM is STOPPING or STOPPED #45869
- Investigate retention of data in snapshots based on document/data age (put into snap meta?) instead of snapshot age; see: Implement retention of snapshots based on the document's timestamp date #45252
- `FAILURE` and `PARTIAL` snapshots: Handle retention of failed and partial snapshots in SLM #46988 (@gwbrown); Manage retention of failed snapshots in SLM #47617
- Add cooldown period in between SLM operations: Add a configurable cooldown period between SLM operations #47520 (@dakrone)