Improve Snapshot Finalization Ex. Handling #49995
original-brownbear merged 4 commits into elastic:master from original-brownbear:49989
Conversation
Like in #49989 we can get into a situation where the setting of the repository generation (during snapshot finalization) in the cluster state fails due to master failing over. In this case we should not try to execute the next cluster state update that will remove the snapshot from the cluster state. Closes #49989
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
Snapshot snapshot = entry.snapshot();
logger.warn(() -> new ParameterizedMessage("[{}] failed to finalize snapshot", snapshot), e);
removeSnapshotFromClusterState(snapshot, null, e);
if (ExceptionsHelper.unwrap(e, NotMasterException.class) != null) {
This feels kind of dirty but IMO it's an ok best-effort workaround for now. We simply aren't exposing the internals of the cluster state updates from the BlobStoreRepository and I don't see a straightforward way of doing so with the current Repository interface. Also, this is just to help (as in improve the user experience; there's no corruption to be fixed here :)) with the incredibly unlikely corner case showing in the linked test failure, where a new master is elected, immediately fails over again, and is then finally elected for good.
You should also handle FailedToCommitClusterStateException here (see also TransportMasterNodeAction). Note that I'm not sure how the wrapping comes into play here / whether we need to unwrap.
Note that I'm not sure how the wrapping comes into play here / whether we need to unwrap.
Not sure either, so unwrapping seems safer to me
Right, added that exception here :)
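The combined check discussed above can be sketched in a self-contained way. The classes below (NotMasterException, FailedToCommitClusterStateException, RepositoryException) and the varargs unwrap helper are hypothetical stand-ins mirroring the shape of the real Elasticsearch code, not the actual implementation:

```java
public class FinalizationGuardSketch {
    // Hypothetical stand-ins for the real Elasticsearch exception classes.
    static class NotMasterException extends RuntimeException {}
    static class FailedToCommitClusterStateException extends RuntimeException {}
    static class RepositoryException extends RuntimeException {
        RepositoryException(String msg, Throwable cause) { super(msg, cause); }
    }

    /** Stand-in for the varargs ExceptionsHelper#unwrap: walk the cause chain
     *  and return the first throwable matching any of the given types, else null. */
    static Throwable unwrap(Throwable t, Class<?>... clazzes) {
        while (t != null) {
            for (Class<?> clazz : clazzes) {
                if (clazz.isInstance(t)) {
                    return t;
                }
            }
            t = t.getCause();
        }
        return null;
    }

    /** True if finalization failed because this node lost master (or failed to
     *  commit a cluster state update as master), even through a wrapper. */
    static boolean masterFailover(Exception e) {
        return unwrap(e, NotMasterException.class, FailedToCommitClusterStateException.class) != null;
    }

    public static void main(String[] args) {
        // The failure arrives wrapped in a RepositoryException by the CS update callback:
        Exception wrapped = new RepositoryException(
            "Failed to execute cluster state update [finalize]", new NotMasterException());
        if (!masterFailover(wrapped)) throw new AssertionError("should detect wrapped NotMasterException");
        // An ordinary failure is not a failover and should still remove the snapshot:
        if (masterFailover(new RuntimeException("disk full"))) throw new AssertionError("plain failure is not a failover");
        System.out.println("ok");
    }
}
```

The point of checking both exception types is that either one means a new master may still finalize the snapshot, so this node should not remove it from the cluster state.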
We gotta unwrap, we're wrapping everything in the CS update callbacks in a RepositoryException in BlobStoreRepository via:
public void onFailure(String source, Exception e) {
    listener.onFailure(
        new RepositoryException(metadata.name(), "Failed to execute cluster state update [" + source + "]", e));
}
RepositoryException is not a WrapperException, so the unwrapping won't work. I'm confused how this fixes the test.
I might be missing something, but this fix uses unwrap() which acts differently from unwrapCause() and does not rely on ElasticsearchWrapperException.
org.elasticsearch.ExceptionsHelper#unwrap just loops through the getCause returns, it doesn't need WrapperException like org.elasticsearch.ExceptionsHelper#unwrapCause?
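The distinction can be shown with a small self-contained sketch. WrapperMarker, PlainWrapper, unwrapFinds, and unwrapCause below are hypothetical stand-ins for ElasticsearchWrapperException, RepositoryException, and the two ExceptionsHelper methods, illustrating only the behavioral difference described above:

```java
public class UnwrapVsUnwrapCause {
    /** Marker mirroring ElasticsearchWrapperException: only exceptions
     *  implementing it are stepped over by the unwrapCause-style helper. */
    interface WrapperMarker {}

    /** Like RepositoryException: wraps a cause but does NOT carry the marker. */
    static class PlainWrapper extends RuntimeException {
        PlainWrapper(Throwable cause) { super(cause); }
    }

    /** unwrap-style: follow getCause() unconditionally until the target type appears. */
    static boolean unwrapFinds(Throwable t, Class<?> target) {
        while (t != null) {
            if (target.isInstance(t)) return true;
            t = t.getCause();
        }
        return false;
    }

    /** unwrapCause-style: only step past wrappers that carry the marker interface. */
    static Throwable unwrapCause(Throwable t) {
        while (t instanceof WrapperMarker && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    public static void main(String[] args) {
        Throwable wrapped = new PlainWrapper(new IllegalStateException("not master"));
        // The plain cause-chain walk finds the nested exception regardless of wrapper type:
        if (!unwrapFinds(wrapped, IllegalStateException.class)) throw new AssertionError();
        // The marker-based helper stops at the wrapper because it lacks the marker:
        if (!(unwrapCause(wrapped) instanceof PlainWrapper)) throw new AssertionError();
        System.out.println("ok");
    }
}
```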
lol, thanks for the clarification. What a confusing bunch of methods :D
Thanks Tanguy!
Where's that comment though :) ?
In my browser cache of course :)
I think we should only log this if we effectively remove the snapshot from the cluster state, and not warn anything if we let the next master finalize the snapshot.
Makes sense :) I pushed 48a6eba
Thanks Tanguy & Yannick!
Like in #49989 we can get into a situation where the setting of
the repository generation (during snapshot finalization) in the cluster
state fails due to master failing over.
In this case we should not try to execute the next cluster state update
that will remove the snapshot from the cluster state.
Otherwise we may needlessly drop an otherwise fine snapshot from the repository entirely on a master failover. This happens in the rare case where removing the snapshot from the cluster state succeeds because the node that failed the generation update from the blob store repository becomes master again just in time to remove the snapshot from the repository.
Note: this won't corrupt the repository; it simply needlessly fails snapshots that should work out fine on a master failover, like the case in #49989.
Closes #49989