Add Check for Metadata Existence in BlobStoreRepository #59141

Merged
original-brownbear merged 3 commits into elastic:master from original-brownbear:multi-cluster-corruption-safety-tests
Jul 8, 2020

Conversation

@original-brownbear
Contributor

@original-brownbear original-brownbear commented Jul 7, 2020

In order to ensure that we do not write a broken piece of RepositoryData
because the physical repository generation was moved ahead by more than one step
through erroneous concurrent writes to the repository, we must check whether
the currently assumed repository generation physically exists in the repository.
Without this check we run the risk of writing on top of stale cached repository data.

The exists checks are the ones we employed until removing them in 4b8fd4e, so this PR is a partial revert of that change.

Relates #56911
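The safety check described above can be sketched as follows. This is a hypothetical, simplified model (a HashMap stands in for the blob container, and the class and method names are invented for illustration), not the actual BlobStoreRepository code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the generation-existence check: before writing the
// next RepositoryData blob, verify that the blob for the generation we assume
// to be current actually exists. If it does not, another writer has moved the
// repository ahead and we must not write on top of stale cached state.
public class RepoGenerationCheck {

    // stand-in for the repository's blob container
    private final Map<String, byte[]> blobs = new HashMap<>();

    public void putIndexBlob(long generation, byte[] data) {
        blobs.put("index-" + generation, data);
    }

    // Writes the next generation, or throws if the expected current
    // generation is not physically present in the repository.
    public long writeNextGeneration(long expectedGeneration, byte[] data) {
        if (expectedGeneration >= 0 && blobs.containsKey("index-" + expectedGeneration) == false) {
            throw new IllegalStateException("concurrent modification of the index-N file, expected generation ["
                + expectedGeneration + "] was not found in the repository");
        }
        long newGeneration = expectedGeneration + 1;
        blobs.put("index-" + newGeneration, data);
        return newGeneration;
    }
}
```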

WIP for a sec, I'd like CI to run this in full first before requesting reviews.

@original-brownbear original-brownbear added >non-issue :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.9.0 labels Jul 7, 2020

@Override
public boolean blobExists(String blobName) throws IOException {
    return store.execute(fileContext -> fileContext.util().exists(new Path(path, blobName)));
}
Contributor Author

This one we used to do differently in https://github.com/original-brownbear/elasticsearch/commit/4b8fd4e76f1e344d8994486f28c96d950303cf1a#diff-a6d76133025d0cd3d4c12918d42be05bL66 ... but just catching any IOException and returning false seemed broken to me, so I didn't bring that back. Especially since this may now trigger marking a repository as corrupted, which we don't want to happen on e.g. a network issue with HDFS.
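To illustrate the distinction this comment draws, here is a hypothetical sketch (not the actual repository-hdfs plugin code) contrasting the old swallow-the-exception style with the style kept in this PR:

```java
import java.io.IOException;

// Hypothetical sketch of why blobExists should propagate IOException
// rather than swallow it. FileSystemLike is an invented stand-in for the
// underlying HDFS file context.
public class BlobExistsSketch {

    public interface FileSystemLike {
        boolean exists(String path) throws IOException;
    }

    // Old style: any I/O failure is reported as "blob does not exist".
    // A transient network problem with HDFS would then look like a missing
    // generation and could wrongly mark the repository as corrupted.
    public static boolean blobExistsLenient(FileSystemLike fs, String path) {
        try {
            return fs.exists(path);
        } catch (IOException e) {
            return false; // dangerous: hides the real failure
        }
    }

    // Style kept in this PR: let the IOException propagate so callers can
    // distinguish "blob absent" from "could not check".
    public static boolean blobExistsStrict(FileSystemLike fs, String path) throws IOException {
        return fs.exists(path);
    }
}
```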

private Path repoPath;

@Before
public void startSecondCluster() throws IOException, InterruptedException {
Contributor Author

This may be a little over the top since I'm not making heavy use of the second cluster right now (I could achieve the same test by just moving the index-N blob ahead by two generations or mounting a second repository at the same path), but I figured it's safer and easier to maintain to do the real thing here, and we can extend this to cover more spots when we tighten the safety measures further in #57786.
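The scenario this test exercises could be modeled roughly like this. A minimal, hypothetical sketch with invented names, where one shared map stands in for the repository path that both "clusters" mount:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the test scenario: two "clusters" mount the same
// repository path. Each remembers the generation it last saw; once the
// second cluster moves the repository ahead and old generations disappear,
// the first cluster's next write from its stale cached generation must fail
// instead of silently corrupting the repository.
public class SharedRepoScenario {

    private final Map<String, byte[]> sharedRepoPath = new HashMap<>();

    // Writes the generation after cachedGeneration, failing if that cached
    // generation no longer physically exists in the shared repository.
    public long writeGeneration(long cachedGeneration) {
        if (cachedGeneration >= 0 && sharedRepoPath.containsKey("index-" + cachedGeneration) == false) {
            throw new IllegalStateException("stale cached generation [" + cachedGeneration
                + "]: concurrent repository access suspected");
        }
        long next = cachedGeneration + 1;
        sharedRepoPath.put("index-" + next, new byte[0]);
        return next;
    }

    public void deleteGeneration(long generation) {
        sharedRepoPath.remove("index-" + generation);
    }
}
```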

@original-brownbear original-brownbear marked this pull request as ready for review July 7, 2020 13:23
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Jul 7, 2020
Contributor

@ywelsch ywelsch left a comment


I've left 3 small comments, o.w. looking good.

secondCluster.client().admin().cluster().prepareDeleteSnapshot(repoNameOnSecondCluster, "snap-1").get();
secondCluster.client().admin().cluster().prepareDeleteSnapshot(repoNameOnSecondCluster, "snap-2").get();

expectThrows(SnapshotException.class, () ->
Contributor

can we assert that the message here is something about concurrent repo access?

Contributor Author

Sure, done :)
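The suggested assertion could look roughly like this. expectThrows below is a simplified stand-in for the ESTestCase helper, and the exception message is illustrative, not the exact one produced by BlobStoreRepository:

```java
// Hypothetical sketch of asserting not just that an exception is thrown,
// but that its message points at concurrent repository access.
public class ExpectThrowsSketch {

    // Simplified stand-in for the ESTestCase expectThrows helper: runs the
    // code, returns the thrown exception if it has the expected type, and
    // fails the test otherwise.
    public static <T extends Throwable> T expectThrows(Class<T> expectedType, Runnable runnable) {
        try {
            runnable.run();
        } catch (Throwable t) {
            if (expectedType.isInstance(t)) {
                return expectedType.cast(t);
            }
            throw new AssertionError("unexpected exception type: " + t.getClass().getName(), t);
        }
        throw new AssertionError("expected " + expectedType.getName() + " but nothing was thrown");
    }
}
```

With a helper like this, the review comment amounts to capturing the returned exception and asserting on its message, e.g. with a containsString-style matcher in the real test.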

Contributor

@ywelsch ywelsch left a comment


LGTM

@original-brownbear
Contributor Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit 5da804b into elastic:master Jul 8, 2020
@original-brownbear original-brownbear deleted the multi-cluster-corruption-safety-tests branch July 8, 2020 11:17
original-brownbear added a commit that referenced this pull request Jul 8, 2020

Labels

:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Meta label for distributed team. v7.9.0 v8.0.0-alpha1


4 participants