
Fail Snapshot Delete if Metadata can't be Read #57786

Closed
original-brownbear wants to merge 3 commits into elastic:main from
original-brownbear:mark-repo-corrupted-shard-gen-failure

Conversation

@original-brownbear
Contributor

@original-brownbear original-brownbear commented Jun 7, 2020

When we use shard generation UUIDs, there is no good reason
whatsoever to be resilient/lenient here when deleting a snapshot.
All we do is put writes on top of a corrupted repository state.
In line with the behavior elsewhere in snapshotting, we should
fail fast here to stop further writes to the repository in a situation
where something clearly went wrong.

Also, by failing fast here we prevent garbage from accumulating in the repository, because it allows the user to rerun the delete if it failed due to some transient IO issue (we couldn't/can't do this in the old metadata format because we update the root-level RepositoryData first).

Relates #57785 (the impact of that bug would've been much lower with this change;
it's not as easy to make a similar change to snapshot creation, but I'm looking into that as well).
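The lenient-vs-fail-fast distinction above can be sketched as follows. This is a hypothetical illustration, not the actual Elasticsearch repository code: the class name `SnapshotDeleteSketch` and the use of a nullable `byte[]` to stand in for unreadable shard-level metadata are invented for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of the two behaviors discussed in this PR; not the
// actual Elasticsearch code. A null byte[] stands in for shard-level
// metadata that could not be read from the repository.
class SnapshotDeleteSketch {

    // Lenient (pre-PR) behavior: swallow the unreadable metadata and keep
    // deleting, i.e. keep writing on top of a possibly corrupted repository.
    static String deleteLenient(byte[] shardMetadata) {
        if (shardMetadata == null) {
            return "deleted despite unreadable metadata";
        }
        return "deleted";
    }

    // Fail-fast (proposed) behavior: abort before any further writes hit the
    // repository; a delete that failed on a transient IO issue can simply be
    // rerun, so no garbage accumulates.
    static String deleteFailFast(byte[] shardMetadata) {
        if (shardMetadata == null) {
            throw new UncheckedIOException(
                    new IOException("shard-level metadata unreadable; failing snapshot delete"));
        }
        return "deleted";
    }
}
```

The key property of the fail-fast variant is that it throws before any delete-side writes happen, so rerunning the operation after a transient failure is safe.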

@original-brownbear original-brownbear added >non-issue :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.9.0 labels Jun 7, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Jun 7, 2020
Member

@tlrx tlrx left a comment


LGTM

Contributor

@ywelsch ywelsch left a comment


I'm undecided if we should do this. Let's discuss it in the coming days.

@original-brownbear
Contributor Author

@ywelsch it's been a while on this one. Given the recent discussions on hardening the repo, I'm a bigger fan of this change than ever. I really don't see how anything good could come out of deleting from a repo when a shard folder isn't readable properly. With this change you can still get rid of the snapshots by deleting all snapshots referencing the shard, but you can't just quietly keep going in this broken state. I think it's a neat safeguard against concurrent writing, especially (but not limited to) on S3.

@ywelsch ywelsch removed their request for review August 26, 2021 13:34
@arteam arteam added v8.1.0 and removed v8.0.0 labels Jan 12, 2022
@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:13
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear
Contributor Author

Closing this now; it's irrelevant since we decided on doing #89163, which makes this scenario impossible.


Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Meta label for distributed team. team-discuss v7.9.0 v8.5.0


9 participants