SLM as a standalone snapshot-taking tool is taking shape as described in #38461. However, to fully utilize SLM, we should implement retention for the snapshots that SLM takes.
Policy definition would change to something like:
```js
PUT /_slm/policy/snapshot-every-day
{
  "schedule": "0 30 2 * * ?",
  "name": "<production-snap-{now/d}>",
  "repository": "my-s3-repository",
  "config": {
    "indices": ["foo-*", "important"]
  },
  // Newly configured retention options
  "retention": {
    // Snapshots should be deleted after 14 days
    "expire_after": "14d",
    // Keep a maximum of thirty snapshots
    "max_count": 30,
    // Keep a minimum of the four most recent snapshots
    "min_count": 4
  }
}
```
Snapshot retention would kick in on a schedule (supporting cron expressions) configured via the newly introduced `slm.retention_schedule` cluster setting. This would allow administrators to control when snapshots are deleted (so as not to interfere with other cluster operations).
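For example, an administrator could set the retention schedule through the standard cluster settings API (the cron value here is illustrative):

```js
PUT /_cluster/settings
{
  "persistent": {
    "slm.retention_schedule": "0 30 1 * * ?"
  }
}
```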
Potentially, SLM retention would need to cap the amount of time spent deleting snapshots (probably with another cluster setting) so long-running deletes don't cause issues with other cluster operations.
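Such a time cap could look like the following sketch: a deadline is computed up front, and deletions stop once it passes, leaving the remaining snapshots for the next retention run. The function and parameter names (`run_retention`, `max_duration_s`) are hypothetical, not the actual implementation.

```python
import time

def run_retention(eligible, delete_fn, max_duration_s=3600.0, clock=time.monotonic):
    """Delete snapshots until done or the time budget is exhausted.

    `eligible` is an oldest-first list of snapshot names; `delete_fn`
    performs one deletion. Returns the names actually deleted. Any
    snapshots left over simply wait for the next scheduled run.
    """
    deadline = clock() + max_duration_s
    deleted = []
    for name in eligible:
        if clock() >= deadline:
            break  # budget spent; stop deleting for this run
        delete_fn(name)
        deleted.append(name)
    return deleted
```

An injectable `clock` keeps the time budget testable without real waiting.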
Potential list of snapshot retention conditions:
- age-based retention (delete snapshots after N days)
- minimum number of snapshots to keep
- maximum number of snapshots to allow (delete oldest if there are too many)
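The three conditions above interact: `min_count` acts as a floor that overrides both the age and count limits. A minimal sketch of how they might combine (the helper name and the use of `timedelta` in place of the policy's `"14d"`-style strings are illustrative assumptions):

```python
from datetime import datetime, timedelta

def snapshots_to_delete(snapshots, expire_after=None, max_count=None,
                        min_count=0, now=None):
    """Pick snapshots eligible for deletion, oldest first.

    `snapshots` is a list of (name, start_time) tuples. A snapshot is
    deleted if it is older than `expire_after` or if keeping it would
    exceed `max_count`, but never if that would leave fewer than
    `min_count` snapshots.
    """
    now = now or datetime.utcnow()
    ordered = sorted(snapshots, key=lambda s: s[1])  # oldest first
    to_delete = []
    for name, started in ordered:
        remaining = len(ordered) - len(to_delete)
        if remaining <= min_count:
            break  # never drop below the configured minimum
        expired = expire_after is not None and now - started > expire_after
        over_cap = max_count is not None and remaining > max_count
        if expired or over_cap:
            to_delete.append(name)
        else:
            break  # snapshots are ordered, so the rest are kept too
    return to_delete
```

For instance, with five daily snapshots, `expire_after` of three days, and `min_count` of two, only the oldest three would be deleted; the two most recent survive even once they expire.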
Some things to work out:
- What should we do with FAILED/PARTIAL snapshots? Should they be subject to the same retention, or separate retention?
  - For the first release, PARTIAL snapshots will be treated as failed and will not be eligible for retention.
- Are there retry policies for deletion, or should we wait for the next invocation of the retention task?
- Does the order of old-snapshot deletion matter?
  - Oldest snapshots will be deleted first.
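The first-release decision above amounts to a simple pre-filter before retention rules are applied. A sketch (the state names mirror snapshot states, but the helper itself is hypothetical):

```python
# Only successful snapshots are candidates for retention-driven
# deletion in the first release; PARTIAL is treated like FAILED
# and left untouched.
RETENTION_ELIGIBLE_STATES = {"SUCCESS"}

def eligible_for_retention(snapshots):
    """Filter (name, state) pairs down to retention candidates."""
    return [name for name, state in snapshots
            if state in RETENTION_ELIGIBLE_STATES]
```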
Task Checklist
- `_meta` in `CreateSnapshotRequest` (@gwbrown): Add custom metadata to snapshots #41281
- `_meta` associating each snapshot with the policy that created it (@gwbrown): Include SLM policy name in Snapshot metadata #43132
- Base framework for retention (behind a feature flag, `slm-retention`) (@dakrone): Add base framework for snapshot retention #43605
- `SnapshotLifecyclePolicy` to support retention configuration (@dakrone): Add SnapshotRetentionConfiguration for retention configuration #43777
- `SnapshotRetentionTask` to implement snapshot deletion (@dakrone): Implement SnapshotRetentionTask's snapshot filtering and deletion #44764
- `SnapshotRetentionConfiguration` predicates (@dakrone): Add min_count and max_count as SLM retention predicates #44926
- `OperationMode` (@dakrone): Skip SLM retention if ILM is STOPPING or STOPPED #45869
- Investigate retention of data in snapshots based on document/data age (put into snap meta?) instead of snapshot age; see: Implement retention of snapshots based on the document's timestamp date #45252
- `FAILURE` and `PARTIAL` snapshots: Handle retention of failed and partial snapshots in SLM #46988 (@gwbrown); Manage retention of failed snapshots in SLM #47617
- Add cooldown period in between SLM operations: Add a configurable cooldown period between SLM operations #47520 (@dakrone)