Improve Snapshot Finalization Ex. Handling#49995

Merged
original-brownbear merged 4 commits into elastic:master from original-brownbear:49989
Dec 10, 2019

Conversation

@original-brownbear
Contributor

Like in #49989 we can get into a situation where the setting of
the repository generation (during snapshot finalization) in the cluster
state fails due to master failing over.
In this case we should not try to execute the next cluster state update
that will remove the snapshot from the cluster state.
Otherwise, on a master failover, we may needlessly drop an otherwise fine snapshot from the repository entirely, in the rare case where removing the snapshot from the cluster state works out because the node whose generation update to the blob store repository failed becomes master again just in time to remove the snapshot from the repository.

Note: this won't corrupt the repository; it simply needlessly fails snapshots that would otherwise work out fine on a master failover, as in the case in #49989.

Closes #49989

@original-brownbear original-brownbear added >non-issue :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.6.0 labels Dec 9, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

Snapshot snapshot = entry.snapshot();
logger.warn(() -> new ParameterizedMessage("[{}] failed to finalize snapshot", snapshot), e);
removeSnapshotFromClusterState(snapshot, null, e);
if (ExceptionsHelper.unwrap(e, NotMasterException.class) != null) {
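The handling discussed in this thread can be sketched in isolation. The sketch below uses hypothetical minimal stand-ins (`NotMasterException`, a simplified `unwrap`, `shouldRemoveFromClusterState`) rather than the actual Elasticsearch classes; it only illustrates the decision the excerpt above makes:

```java
// Minimal sketch of the finalization failure handling. NotMasterException,
// unwrap, and shouldRemoveFromClusterState are hypothetical stand-ins for
// the Elasticsearch types and helpers, not the real implementation.
public class FinalizationFailureSketch {
    static class NotMasterException extends RuntimeException {
        NotMasterException(String msg) { super(msg); }
    }

    // Simplified version of ExceptionsHelper.unwrap: walk the cause chain
    // looking for any of the given exception types.
    static Throwable unwrap(Throwable t, Class<?>... clazzes) {
        while (t != null) {
            for (Class<?> clazz : clazzes) {
                if (clazz.isInstance(t)) {
                    return t;
                }
            }
            t = t.getCause();
        }
        return null;
    }

    // If the failure stems from losing mastership, skip removing the snapshot
    // from the cluster state and let the next master retry finalization.
    static boolean shouldRemoveFromClusterState(Exception e) {
        return unwrap(e, NotMasterException.class) == null;
    }

    public static void main(String[] args) {
        Exception wrapped = new RuntimeException("repo update failed",
            new NotMasterException("stepped down"));
        System.out.println(shouldRemoveFromClusterState(wrapped));                          // false
        System.out.println(shouldRemoveFromClusterState(new RuntimeException("io error"))); // true
    }
}
```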
Contributor Author

This feels kind of dirty, but IMO it's an OK best-effort workaround for now. We simply aren't exposing the internals of the cluster state updates from the BlobStoreRepository, and I don't see a straightforward way of doing so with the current Repository interface. Also, this is just to help (as in improve the user experience; there's no corruption to be fixed here :)) with the incredibly unlikely corner case showing in the linked test failure, where a new master is elected, immediately fails over again, and is then finally elected for good.

Contributor

You should also handle FailedToCommitClusterStateException here (see also TransportMasterNodeAction). Note that I'm not sure how the wrapping comes into play here / whether we need to unwrap.

Member

Note that I'm not sure how the wrapping comes into play here / whether we need to unwrap.

Not sure either, so unwrapping seems safer to me

Contributor Author

@original-brownbear original-brownbear Dec 10, 2019

Right, added that exception here :)

We gotta unwrap: we're wrapping everything in the CS update callbacks in a RepositoryException in BlobStoreRepository via:

                public void onFailure(String source, Exception e) {
                    listener.onFailure(
                        new RepositoryException(metadata.name(), "Failed to execute cluster state update [" + source + "]", e));
                }
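Why this wrapping forces the unwrap can be reproduced in a self-contained sketch. The `RepositoryException` and `NotMasterException` below are hypothetical minimal stand-ins for the Elasticsearch classes, kept only to show that the original failure stays reachable through the cause chain:

```java
// Sketch: a RepositoryException wrapping the real failure, as in the quoted
// onFailure callback. The exception classes are hypothetical stand-ins.
public class WrapUnwrapSketch {
    static class RepositoryException extends RuntimeException {
        RepositoryException(String repo, String msg, Throwable cause) {
            super("[" + repo + "] " + msg, cause);
        }
    }
    static class NotMasterException extends RuntimeException {}

    // Walk the cause chain for a matching type (simplified unwrap).
    static Throwable unwrap(Throwable t, Class<?> clazz) {
        while (t != null) {
            if (clazz.isInstance(t)) return t;
            t = t.getCause();
        }
        return null;
    }

    public static void main(String[] args) {
        Exception e = new RepositoryException("my-repo",
            "Failed to execute cluster state update [some source]",
            new NotMasterException());
        // The NotMasterException is still reachable through the cause chain,
        // so a cause-chain walk finds it despite the wrapping.
        System.out.println(unwrap(e, NotMasterException.class) != null); // true
    }
}
```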

Contributor

RepositoryException is not a WrapperException, so the unwrapping won't work. I'm confused how this fixes the test.

Member

I might be missing something, but this fix uses unwrap() which acts differently from unwrapCause() and does not rely on ElasticsearchWrapperException.

Contributor Author

org.elasticsearch.ExceptionsHelper#unwrap just loops through the getCause() chain; it doesn't need WrapperException like org.elasticsearch.ExceptionsHelper#unwrapCause does?
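The distinction being debated can be sketched side by side. Both helpers below are hypothetical simplified stand-ins for the ExceptionsHelper methods, written only to contrast their behavior: a cause-chain walk versus a one-step unwrap gated on a marker interface:

```java
// Sketch contrasting the two unwrapping styles discussed above. The types
// and helpers are simplified stand-ins, not the Elasticsearch originals.
public class UnwrapVsUnwrapCause {
    interface WrapperException {}  // stand-in for ElasticsearchWrapperException
    static class RepositoryException extends RuntimeException {
        RepositoryException(Throwable cause) { super(cause); }
    }
    static class NotMasterException extends RuntimeException {}

    // unwrap-style: walk the whole cause chain for a matching type.
    static Throwable unwrap(Throwable t, Class<?> clazz) {
        while (t != null) {
            if (clazz.isInstance(t)) return t;
            t = t.getCause();
        }
        return null;
    }

    // unwrapCause-style: only step past the top-level exception when it is
    // a designated wrapper type.
    static Throwable unwrapCause(Throwable t) {
        return (t instanceof WrapperException) ? t.getCause() : t;
    }

    public static void main(String[] args) {
        Exception e = new RepositoryException(new NotMasterException());
        // The cause-chain walk finds the NotMasterException...
        System.out.println(unwrap(e, NotMasterException.class) != null);  // true
        // ...but the marker-gated unwrap stops at the RepositoryException,
        // since it does not implement the wrapper interface.
        System.out.println(unwrapCause(e) instanceof NotMasterException); // false
    }
}
```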

Contributor

lol, thanks for the clarification. What a confusing bunch of methods :D

Member

@tlrx tlrx left a comment

LGTM, left a minor comment

@original-brownbear
Contributor Author

Thanks Tanguy!

LGTM, left a minor comment

Where's that comment though :) ?

@tlrx
Member

tlrx commented Dec 10, 2019

Where's that comment though :) ?

In my browser cache of course :)

logger.warn(() -> new ParameterizedMessage("[{}] failed to finalize snapshot", snapshot), e);

I think we should only log this if we effectively remove the snapshot from the cluster state, and not warn at all if we let the next master finalize the snapshot.

@original-brownbear
Contributor Author

I think we should only log this if we effectively remove the snapshot from the cluster state, and not warn at all if we let the next master finalize the snapshot.

Makes sense :) I pushed 48a6eba
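The logging reorder agreed on here can be sketched as follows. All names (`handleFinalizationFailure`, the stand-in log list, the exception class) are hypothetical illustrations of the review suggestion, not the code in commit 48a6eba:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the review suggestion: only emit the warning when the snapshot
// is actually dropped from the cluster state; a master failover that the
// next master will retry gets a quieter message. All types are stand-ins.
public class LogOnRemoveSketch {
    static class NotMasterException extends RuntimeException {}
    static final List<String> LOG = new ArrayList<>();

    // Simplified cause-chain walk, as in ExceptionsHelper.unwrap.
    static Throwable unwrap(Throwable t, Class<?> clazz) {
        while (t != null) {
            if (clazz.isInstance(t)) return t;
            t = t.getCause();
        }
        return null;
    }

    static void handleFinalizationFailure(String snapshot, Exception e) {
        if (unwrap(e, NotMasterException.class) != null) {
            // The next master will retry finalization, so nothing alarming
            // happened: log quietly instead of warning.
            LOG.add("DEBUG [" + snapshot + "] master failover during finalization");
        } else {
            LOG.add("WARN [" + snapshot + "] failed to finalize snapshot");
            // removeSnapshotFromClusterState(snapshot, null, e) would run here.
        }
    }

    public static void main(String[] args) {
        handleFinalizationFailure("snap-1", new RuntimeException(new NotMasterException()));
        handleFinalizationFailure("snap-2", new RuntimeException("io error"));
        LOG.forEach(System.out::println);
    }
}
```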

Contributor

@ywelsch ywelsch left a comment

LGTM

@original-brownbear
Contributor Author

Thanks Tanguy & Yannick!

@original-brownbear original-brownbear merged commit 2605c7c into elastic:master Dec 10, 2019
@original-brownbear original-brownbear deleted the 49989 branch December 10, 2019 10:04
original-brownbear added a commit that referenced this pull request Dec 10, 2019
* Improve Snapshot Finalization Ex. Handling

Like in #49989 we can get into a situation where the setting of
the repository generation (during snapshot finalization) in the cluster
state fails due to master failing over.
In this case we should not try to execute the next cluster state update
that will remove the snapshot from the cluster state.

Closes #49989
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
* Improve Snapshot Finalization Ex. Handling

Like in elastic#49989 we can get into a situation where the setting of
the repository generation (during snapshot finalization) in the cluster
state fails due to master failing over.
In this case we should not try to execute the next cluster state update
that will remove the snapshot from the cluster state.

Closes elastic#49989

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue v7.6.0 v8.0.0-alpha1


Development

Successfully merging this pull request may close these issues.

[CI] Failure in DedicatedClusterSnapshotRestoreIT.testMasterAndDataShutdownDuringSnapshot

5 participants