Manage retention of failed snapshots in SLM#47617

Merged
AthenaEryma merged 9 commits into elastic:master from AthenaEryma:slm/retain-failed-snaps on Oct 8, 2019
Conversation

@AthenaEryma
Contributor

@AthenaEryma AthenaEryma commented Oct 4, 2019

Failed snapshots will eventually build up unless they are deleted. While
failures may not take up much space, they add noise to the list of
snapshots and it's desirable to remove them when they are no longer
useful.

With this change, failed snapshots are deleted using the following
strategy: `FAILED` snapshots will be kept until the configured
`expire_after` period has passed, if present, and then be deleted. If
there is no configured `expire_after` in the retention policy, then they
will be deleted if there is at least one more recent successful snapshot
from this policy (as they may otherwise be useful for troubleshooting
purposes). Failed snapshots are not counted towards either `min_count`
or `max_count`.

Implements part of #46988

Labelled non-issue because this feature hasn't yet shipped.
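The two-branch rule above can be sketched as a small predicate. This is an illustrative reconstruction of the described behavior, not the actual Elasticsearch retention code; the `SnapInfo` record and method names here are hypothetical.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the failed-snapshot retention rule described in this PR.
// SnapInfo and shouldDeleteFailed are illustrative names, not Elasticsearch classes.
public class FailedSnapshotRetentionSketch {
    record SnapInfo(String name, boolean failed, long startTimeMillis) {}

    /** Returns true if a FAILED snapshot is eligible for deletion. */
    static boolean shouldDeleteFailed(SnapInfo failedSnap,
                                      Long expireAfterMillis,   // null when expire_after is not configured
                                      long nowMillis,
                                      List<SnapInfo> allForPolicy) {
        if (expireAfterMillis != null) {
            // With expire_after configured: delete once the period has elapsed.
            return nowMillis - failedSnap.startTimeMillis() > expireAfterMillis;
        }
        // Without expire_after: delete only if a more recent successful snapshot
        // from the same policy exists; otherwise keep it for troubleshooting.
        return allForPolicy.stream()
            .anyMatch(s -> !s.failed() && s.startTimeMillis() > failedSnap.startTimeMillis());
    }

    public static void main(String[] args) {
        long day = TimeUnit.DAYS.toMillis(1);
        SnapInfo failed = new SnapInfo("snap-1", true, 0L);
        SnapInfo success = new SnapInfo("snap-2", false, day);

        // Expired under a configured expire_after of 7 days:
        System.out.println(shouldDeleteFailed(failed, 7 * day, 10 * day, List.of(failed)));       // true
        // No expire_after, but a newer successful snapshot exists:
        System.out.println(shouldDeleteFailed(failed, null, 10 * day, List.of(failed, success))); // true
        // No expire_after and no newer success: keep for troubleshooting.
        System.out.println(shouldDeleteFailed(failed, null, 10 * day, List.of(failed)));          // false
    }
}
```

Note that neither branch consults `min_count` or `max_count`, matching the description: failed snapshots never count toward those limits.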
@AthenaEryma AthenaEryma added the >non-issue, :Data Management/ILM+SLM, v8.0.0, and v7.5.0 labels Oct 4, 2019
@AthenaEryma AthenaEryma requested a review from dakrone October 4, 2019 23:33
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM)

@AthenaEryma
Contributor Author

I tried to write an integration test for this, but had a heck of a time getting a failed snapshot that actually ended up in the repository. I'll look a bit more to see if there's a way I'm missing.

Member

@dakrone dakrone left a comment

Thanks for working on this Gordon, I left some comments. In terms of an integration test, I think you might be able to use `MockRepository` to "block" or cause errors during the snapshot the way that `SLMSnapshotBlockingIntegTests` does?

@AthenaEryma
Contributor Author

I've pushed most of the changes but would still like to get an integration test in - I believe I've found a way - so you can hold off on re-reviewing until I get that in.

@AthenaEryma AthenaEryma requested a review from dakrone October 8, 2019 00:09
Member

@dakrone dakrone left a comment

LGTM, thanks for splitting some of the logic, it was easier to follow this time around.


logger.info("--> start snapshot");
ActionFuture<ExecuteSnapshotLifecycleAction.Response> snapshotFuture = client()
.execute(ExecuteSnapshotLifecycleAction.INSTANCE, new ExecuteSnapshotLifecycleAction.Request(policyId));
Member

There's an `executePolicy` helper that returns the snapshot name as a `String` (for future tests)

@AthenaEryma
Contributor Author

org.gradle.internal.remote.internal.ConnectException: Could not connect to server [8c6184eb-af0f-4e53-8f8b-d35a4142e201 port:42333, addresses:[/0:0:0:0:0:0:0:1, /127.0.0.1]]. Tried addresses: [/0:0:0:0:0:0:0:1, /127.0.0.1]. 🤷‍♂

@elasticmachine run elasticsearch-ci/1

@AthenaEryma AthenaEryma merged commit e221f86 into elastic:master Oct 8, 2019
AthenaEryma added a commit to AthenaEryma/elasticsearch that referenced this pull request Oct 8, 2019
AthenaEryma added a commit that referenced this pull request Oct 8, 2019

Labels

:Data Management/ILM+SLM, >non-issue, v7.5.0, v8.0.0-alpha1


4 participants