Enable More Flexible SLM Retention Policies? #65826

@original-brownbear

Description

There has been a recent request for longer snapshot retention via SLM in ECE/ECS.
This is understandable, since the default of 100 retained snapshots taken at 30-minute intervals gives the user only a ~2 day window (potentially less than a weekend ;)) to notice a problem before the last snapshot containing a healthy cluster state ages out.

The currently available ways to increase the retention time are obvious but sub-optimal:

  • lengthen the interval between two snapshots -> increases the risk of data loss for recently written data
  • retain more snapshots -> increases the size of the repository which means more storage + heavier load (especially memory) on the master node on every snapshot operation

A possible solution would be to make the retention interval dynamic, such that older snapshots are retained with a larger interval between them before being phased out completely.
Concretely, we could for example keep the first 10 snapshots at intervals of 30 min, the next 10 at intervals of 1h, the next 10 at intervals of 8h, and the remaining 70 at intervals of 1d or so (i.e. retention would delete snapshots in between existing ones and not just delete the oldest snapshot first).
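
To make the tiered scheme concrete, here is a minimal sketch in Python of how such a thinning rule could pick the snapshots to keep. It is not how SLM works today and not a proposed implementation. `TIERS`, `retention_window`, `bucket_bounds` and `select_retained` are hypothetical names, and the "keep the oldest snapshot in each age bucket" rule is just one assumed way to do the thinning.

```python
from datetime import datetime, timedelta

# (count, spacing) per tier, newest tier first. Example values from the text above.
TIERS = [
    (10, timedelta(minutes=30)),
    (10, timedelta(hours=1)),
    (10, timedelta(hours=8)),
    (70, timedelta(days=1)),
]


def retention_window(tiers):
    """Total time span covered when every tier is full."""
    return sum((count * spacing for count, spacing in tiers), timedelta())


def bucket_bounds(tiers):
    """Upper age bound of each retention bucket, newest bucket first."""
    bounds, age = [], timedelta()
    for count, spacing in tiers:
        for _ in range(count):
            age += spacing
            bounds.append(age)
    return bounds


def select_retained(snapshot_times, now, tiers=TIERS):
    """Return the snapshot timestamps to keep, newest first.

    Assumed rule: split the timeline into age buckets [lower, upper) and keep
    the oldest snapshot in each non-empty bucket. Everything else is eligible
    for deletion, including snapshots that sit between two retained ones.
    """
    keep = set()
    lower = timedelta()
    for upper in bucket_bounds(tiers):
        in_bucket = [t for t in snapshot_times if lower <= now - t < upper]
        if in_bucket:
            keep.add(min(in_bucket))  # min timestamp == oldest snapshot
        lower = upper
    return sorted(keep, reverse=True)
```

With these example tiers, `retention_window(TIERS)` comes out at 73 days and 23 hours, versus `100 * timedelta(minutes=30)`, i.e. 50 hours, for the flat default. Feeding `select_retained` a snapshot every 30 minutes keeps at most 100 of them, one per bucket.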

Under the assumption that time-wise resolution loses value the older a snapshot gets (which seems very reasonable to me), this would allow keeping roughly 2.5 months of history in 100 snapshots (about 74 days with the example intervals above), compared to ~2 days with the current model. This keeps the resource cost of managing the repository constant relative to the current approach. It does, however, increase storage use due to the incremental nature of snapshots.
Take the concrete numbers suggested for the intervals with a grain of salt, obviously. There are many options here, though I believe we should keep it simple.
Technically speaking, I believe one could already achieve this kind of retention period by running multiple SLM policies in parallel, but that's fairly cumbersome.
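
For illustration, the parallel-policies workaround could look roughly like the sketch below, which only builds the request bodies (one per tier) that would be sent to `PUT _slm/policy/<policy_id>`. The policy ids, snapshot name patterns, repository name and `expire_after` values are assumptions chosen to mirror the example tiers; `build_policies` is a hypothetical helper, not part of any client library.

```python
# (policy id suffix, Elasticsearch cron schedule, assumed expire_after) per tier.
# expire_after is the cumulative age at which the tier ends in the example:
# 5h, 5h + 10h, 15h + 80h, 95h + 70d.
TIER_POLICIES = [
    ("30m", "0 0/30 * * * ?", "5h"),
    ("1h", "0 0 * * * ?", "15h"),
    ("8h", "0 0 0/8 * * ?", "95h"),
    ("1d", "0 0 0 * * ?", "1775h"),
]


def build_policies(repository):
    """Build one SLM policy body per tier, keyed by a hypothetical policy id."""
    policies = {}
    for suffix, schedule, expire_after in TIER_POLICIES:
        policies[f"tiered-{suffix}"] = {
            "schedule": schedule,
            # Date-math snapshot name, e.g. <tiered-30m-{now/d}>
            "name": f"<tiered-{suffix}-{{now/d}}>",
            "repository": repository,
            "config": {"include_global_state": True},
            "retention": {"expire_after": expire_after},
        }
    return policies
```

Note that this only approximates the tiered scheme: each policy retains its own snapshots independently, so the slower policies also hold snapshots inside the windows already covered by the faster ones, and the total count ends up somewhat above 100. That duplication is part of what makes the workaround cumbersome.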

-> WDYT about adding functionality for a more dynamic snapshot retention interval like this?
