Improve Snapshot Finalization Ex. Handling #49995
original-brownbear merged 4 commits into elastic:master from original-brownbear:49989
Conversation
Like in #49989 we can get into a situation where the setting of the repository generation (during snapshot finalization) in the cluster state fails due to master failing over. In this case we should not try to execute the next cluster state update that will remove the snapshot from the cluster state. Closes #49989
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
Snapshot snapshot = entry.snapshot();
logger.warn(() -> new ParameterizedMessage("[{}] failed to finalize snapshot", snapshot), e);
removeSnapshotFromClusterState(snapshot, null, e);
if (ExceptionsHelper.unwrap(e, NotMasterException.class) != null) {
This feels kind of dirty but IMO it's an ok best-effort workaround for now. We simply aren't exposing the internals of the cluster state updates from the BlobStoreRepository and I don't see a straightforward way of doing so with the current Repository interface. Also, this is just to help (as in improve the user experience; there's no corruption to be fixed here :)) with the incredibly unlikely corner case showing in the linked test failure, where a new master is elected, immediately fails over again, and is then finally elected for good.
You should also handle FailedToCommitClusterStateException here (see also TransportMasterNodeAction). Note that I'm not sure how the wrapping comes into play here / whether we need to unwrap.
Note that I'm not sure how the wrapping comes into play here / whether we need to unwrap.
Not sure either, so unwrapping seems safer to me
Right, added that exception here :)
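The combined check discussed above can be sketched in a self-contained way. The classes below (NotMasterException, FailedToCommitClusterStateException, RepositoryException) and the varargs unwrap helper are hypothetical stand-ins mirroring the shape of the real Elasticsearch code, not the actual implementation:

```java
public class FinalizationGuardSketch {
    // Hypothetical stand-ins for the real Elasticsearch exception classes.
    static class NotMasterException extends RuntimeException {}
    static class FailedToCommitClusterStateException extends RuntimeException {}
    static class RepositoryException extends RuntimeException {
        RepositoryException(String msg, Throwable cause) { super(msg, cause); }
    }

    /** Stand-in for the varargs ExceptionsHelper#unwrap: walk the cause chain
     *  and return the first throwable matching any of the given types, else null. */
    static Throwable unwrap(Throwable t, Class<?>... clazzes) {
        while (t != null) {
            for (Class<?> clazz : clazzes) {
                if (clazz.isInstance(t)) {
                    return t;
                }
            }
            t = t.getCause();
        }
        return null;
    }

    /** True if finalization failed because this node lost master (or failed to
     *  commit a cluster state update as master), even through a wrapper. */
    static boolean masterFailover(Exception e) {
        return unwrap(e, NotMasterException.class, FailedToCommitClusterStateException.class) != null;
    }

    public static void main(String[] args) {
        // The failure arrives wrapped in a RepositoryException by the CS update callback:
        Exception wrapped = new RepositoryException(
            "Failed to execute cluster state update [finalize]", new NotMasterException());
        if (!masterFailover(wrapped)) throw new AssertionError("should detect wrapped NotMasterException");
        // An ordinary failure is not a failover and should still remove the snapshot:
        if (masterFailover(new RuntimeException("disk full"))) throw new AssertionError("plain failure is not a failover");
        System.out.println("ok");
    }
}
```

The point of checking both exception types is that either one means a new master may still finalize the snapshot, so this node should not remove it from the cluster state.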
We gotta unwrap, we're wrapping everything in the CS update callbacks in a RepositoryException in BlobStoreRepository via:
public void onFailure(String source, Exception e) {
    listener.onFailure(
        new RepositoryException(metadata.name(), "Failed to execute cluster state update [" + source + "]", e));
}
RepositoryException is not a WrapperException, so the unwrapping won't work. I'm confused how this fixes the test.
I might be missing something, but this fix uses unwrap() which acts differently from unwrapCause() and does not rely on ElasticsearchWrapperException.
org.elasticsearch.ExceptionsHelper#unwrap just loops through the getCause returns, it doesn't need WrapperException like org.elasticsearch.ExceptionsHelper#unwrapCause?
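The distinction can be shown with a small self-contained sketch. WrapperMarker, PlainWrapper, unwrapFinds, and unwrapCause below are hypothetical stand-ins for ElasticsearchWrapperException, RepositoryException, and the two ExceptionsHelper methods, illustrating only the behavioral difference described above:

```java
public class UnwrapVsUnwrapCause {
    /** Marker mirroring ElasticsearchWrapperException: only exceptions
     *  implementing it are stepped over by the unwrapCause-style helper. */
    interface WrapperMarker {}

    /** Like RepositoryException: wraps a cause but does NOT carry the marker. */
    static class PlainWrapper extends RuntimeException {
        PlainWrapper(Throwable cause) { super(cause); }
    }

    /** unwrap-style: follow getCause() unconditionally until the target type appears. */
    static boolean unwrapFinds(Throwable t, Class<?> target) {
        while (t != null) {
            if (target.isInstance(t)) return true;
            t = t.getCause();
        }
        return false;
    }

    /** unwrapCause-style: only step past wrappers that carry the marker interface. */
    static Throwable unwrapCause(Throwable t) {
        while (t instanceof WrapperMarker && t.getCause() != null) {
            t = t.getCause();
        }
        return t;
    }

    public static void main(String[] args) {
        Throwable wrapped = new PlainWrapper(new IllegalStateException("not master"));
        // The plain cause-chain walk finds the nested exception regardless of wrapper type:
        if (!unwrapFinds(wrapped, IllegalStateException.class)) throw new AssertionError();
        // The marker-based helper stops at the wrapper because it lacks the marker:
        if (!(unwrapCause(wrapped) instanceof PlainWrapper)) throw new AssertionError();
        System.out.println("ok");
    }
}
```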
lol, thanks for the clarification. What a confusing bunch of methods :D
Thanks Tanguy!
Where's that comment though :) ?
In my browser cache of course :)
I think we should only log this if we effectively remove the snapshot from the cluster state, and not warn anything if we let the next master finalize the snapshot.
Makes sense :) I pushed 48a6eba
Thanks Tanguy & Yannick!
Like in #49989 we can get into a situation where the setting of
the repository generation (during snapshot finalization) in the cluster
state fails due to master failing over.
In this case we should not try to execute the next cluster state update
that will remove the snapshot from the cluster state.
Otherwise we may needlessly drop an otherwise fine snapshot from the repository entirely on a master failover. This happens in the rare case where removing the snapshot from the cluster state succeeds because the node that failed the generation update from the blob store repository becomes master again just in time to remove the snapshot from the repository.
Note: this won't corrupt the repository; it simply needlessly fails snapshots that should work out fine on a master failover, like the case in #49989.
Closes #49989