Skip to content

Fix Bug With RepositoryData Caching#57785

Merged
original-brownbear merged 3 commits intoelastic:masterfrom
original-brownbear:fix-broken-repo-data-cache
Jun 8, 2020
Merged

Fix Bug With RepositoryData Caching#57785
original-brownbear merged 3 commits intoelastic:masterfrom
original-brownbear:fix-broken-repo-data-cache

Conversation

@original-brownbear
Copy link
Copy Markdown
Contributor

@original-brownbear original-brownbear commented Jun 7, 2020

This fixes a really subtle bug with caching RepositoryData (introduced in #52341) that can corrupt a repository.
We were caching RepositoryData serialized in the newest
metadata format. This lead to a confusing situation where
numeric shard generations would be cached in ShardGenerations
that were not written to the repository because the repository
or cluster did not yet support ShardGenerations.
In the case where shard generations are not actually supported yet,
these cached numeric generations are not safe and there's multiple
scenarios where they would be incorrect, leading to the repository
trying to read shard level metadata from index-N that don't exist.
This commit makes it so that cached metadata is always in the same
format as the metadata in the repository.

Example broken scenario:

  1. Successful snapshot for metadata version 7.5 caches RepositoryData with some numeric shard generations
  2. Attempt a new snapshot but somehow fail to ever finalize it (IO exception on finalization for example) while data nodes moved the shard generation forward and deleted the old generation (that is now referenced by the cached repository metadata)

-> at this point the shards that were moved to a new generation by the data node are already not writable any more though this can be fixed by re-adding the repository or restarting the master node still at this point.

The situation gets worse though:

  1. Delete all snapshots such that the repository moves to the new 7.6+ metadata format. Now, there is a chance that the cached broken shard generations value actually makes it to the repository because the new RepositoryData that is written is based off of the cached version. Corrupting the repository since there is no fail-safe for this in the snapshot delete logic (will open a separate PR to add one).

This fixes a really subtle bug with caching `RepositoryData`
that can corrupt a repository.
We were caching `RepositoryData` serialized in the newest
metadata format. This lead to a confusing situation where
numeric shard generations would be cached in `ShardGenerations`
that were not written to the repository because the repository
or cluster did not yet support `ShardGenerations`.
In the case where shard generations are not actually supported yet,
these cached numeric generations are not safe and there's multiple
scenarios where they would be incorrect, leading to the repository
trying to read shard level metadata from index-N that don't exist.
This commit makes it so that cached metadata is always in the same
format as the metadata in the repository.
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Jun 7, 2020
@original-brownbear
Copy link
Copy Markdown
Contributor Author

The workaround to this bug would be to set the repository setting cache_repository_data on any affected v7.7.0 or 7.7.1 clusters. As long as a delete hasn't yet written the broken shard generation to the repository this will fix all issues and prevent any corruption.
If broken shard generations were already written this will not resolve the issue but prevent further corruption going forward at least.

* @param version version of the repository metadata that was cached
*/
private void cacheRepositoryData(RepositoryData updated) {
private void cacheRepositoryData(RepositoryData updated, Version version) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an obvious and much cleaner fix here by simply caching the compressed form of the already serialized RepositoryData when writing/reading instead of re-serializing that would also improve efficiency. I just went with this approach now to have the smallest possible fix for a safe back-port.

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some test you could add that ensures we don't have regressions on this?

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Is there some test you could add that ensures we don't have regressions on this?

@ywelsch yes I think I have an idea on how to do that => on it

@original-brownbear
Copy link
Copy Markdown
Contributor Author

@ywelsch figured out a way of reproducing the issue in a test in f229288

Copy link
Copy Markdown
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@original-brownbear
Copy link
Copy Markdown
Contributor Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit 17fe54d into elastic:master Jun 8, 2020
@original-brownbear original-brownbear deleted the fix-broken-repo-data-cache branch June 8, 2020 10:02
original-brownbear added a commit that referenced this pull request Jun 8, 2020
* Fix Bug With RepositoryData Caching

This fixes a really subtle bug with caching `RepositoryData`
that can corrupt a repository.
We were caching `RepositoryData` serialized in the newest
metadata format. This lead to a confusing situation where
numeric shard generations would be cached in `ShardGenerations`
that were not written to the repository because the repository
or cluster did not yet support `ShardGenerations`.
In the case where shard generations are not actually supported yet,
these cached numeric generations are not safe and there's multiple
scenarios where they would be incorrect, leading to the repository
trying to read shard level metadata from index-N that don't exist.
This commit makes it so that cached metadata is always in the same
format as the metadata in the repository.

Relates #57798
original-brownbear added a commit that referenced this pull request Jun 8, 2020
* Fix Bug With RepositoryData Caching

This fixes a really subtle bug with caching `RepositoryData`
that can corrupt a repository.
We were caching `RepositoryData` serialized in the newest
metadata format. This lead to a confusing situation where
numeric shard generations would be cached in `ShardGenerations`
that were not written to the repository because the repository
or cluster did not yet support `ShardGenerations`.
In the case where shard generations are not actually supported yet,
these cached numeric generations are not safe and there's multiple
scenarios where they would be incorrect, leading to the repository
trying to read shard level metadata from index-N that don't exist.
This commit makes it so that cached metadata is always in the same
format as the metadata in the repository.

Relates #57798
original-brownbear added a commit that referenced this pull request Jun 8, 2020
This fixes a really subtle bug with caching `RepositoryData`
that can corrupt a repository.
We were caching `RepositoryData` serialized in the newest
metadata format. This lead to a confusing situation where
numeric shard generations would be cached in `ShardGenerations`
that were not written to the repository because the repository
or cluster did not yet support `ShardGenerations`.
In the case where shard generations are not actually supported yet,
these cached numeric generations are not safe and there's multiple
scenarios where they would be incorrect, leading to the repository
trying to read shard level metadata from index-N that don't exist.
This commit makes it so that cached metadata is always in the same
format as the metadata in the repository.

Relates #57798
@original-brownbear original-brownbear restored the fix-broken-repo-data-cache branch August 6, 2020 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

blocker >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team. v7.7.2 v7.8.0 v7.9.0 v8.0.0-alpha1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants