Fail Snapshot Delete if Metadata can't be Read#57786
Fail Snapshot Delete if Metadata can't be Read#57786original-brownbear wants to merge 3 commits intoelastic:mainfrom original-brownbear:mark-repo-corrupted-shard-gen-failure
Conversation
In case we use shard generation uuids, there is no good reason whatsoever to be resilient/lenient here when deleting a snapshot. All we do is put writes on top of a corrupted repository state. In line with the behavior elsewhere in snapshotting, we should fail-fast here to stop further writes to the repository.
|
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
ywelsch
left a comment
There was a problem hiding this comment.
I'm undecided if we should do this. Let's discuss it in the coming days.
|
@ywelsch it's been a while on this one. Given the recent discussions on hardening the repo I'm a bigger fan than ever of this one. I really don't see how anything good could come out of deleting from a repo when a shard folder isn't readable properly. You can still get rid of the snapshots with this change by deleting all snapshots referencing the shard, but you can't just quietly keep going in this broken state => I think it's a neat safe-guard against concurrent writing, especially (but not limited to) on S3. |
|
Pinging @elastic/es-distributed (Team:Distributed) |
|
closing this now, this is irrelevant since we decided on doing #89163 which makes this impossible |
In case we use shard generation UUIDS, there is no good reason
whatsoever to be resilient/lenient here when deleting a snapshot.
All we do is put writes on top of a corrupted repository state.
In line with the behavior elsewhere in snapshotting, we should
fail-fast here to stop further writes to the repository in a situation
where something clearly went wrong.
Also, by failing fast here we prevent garbage from accumulating in the repository because it allows the user to rerun the delete if it failed because of some transient IO issue (we couldn't/can't do this in the old metadata format because we update the root level
RepositoryDatafirst).Relates #57785 (the impact of this bug would've been much lower with this change,
it's not as easy to make a similar change to snapshot creation but I'm looking into it as well).