Fix Bug With RepositoryData Caching#57785
Fix Bug With RepositoryData Caching#57785original-brownbear merged 3 commits intoelastic:masterfrom original-brownbear:fix-broken-repo-data-cache
Conversation
This fixes a really subtle bug with caching `RepositoryData` that can corrupt a repository. We were caching `RepositoryData` serialized in the newest metadata format. This lead to a confusing situation where numeric shard generations would be cached in `ShardGenerations` that were not written to the repository because the repository or cluster did not yet support `ShardGenerations`. In the case where shard generations are not actually supported yet, these cached numeric generations are not safe and there's multiple scenarios where they would be incorrect, leading to the repository trying to read shard level metadata from index-N that don't exist. This commit makes it so that cached metadata is always in the same format as the metadata in the repository.
|
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
|
The workaround to this bug would be to set the repository setting |
| * @param version version of the repository metadata that was cached | ||
| */ | ||
| private void cacheRepositoryData(RepositoryData updated) { | ||
| private void cacheRepositoryData(RepositoryData updated, Version version) { |
There was a problem hiding this comment.
There is an obvious and much cleaner fix here by simply caching the compressed form of the already serialized RepositoryData when writing/reading instead of re-serializing that would also improve efficiency. I just went with this approach now to have the smallest possible fix for a safe back-port.
ywelsch
left a comment
There was a problem hiding this comment.
Is there some test you could add that ensures we don't have regressions on this?
@ywelsch yes I think I have an idea on how to do that => on it |
|
Thanks Yannick! |
* Fix Bug With RepositoryData Caching This fixes a really subtle bug with caching `RepositoryData` that can corrupt a repository. We were caching `RepositoryData` serialized in the newest metadata format. This lead to a confusing situation where numeric shard generations would be cached in `ShardGenerations` that were not written to the repository because the repository or cluster did not yet support `ShardGenerations`. In the case where shard generations are not actually supported yet, these cached numeric generations are not safe and there's multiple scenarios where they would be incorrect, leading to the repository trying to read shard level metadata from index-N that don't exist. This commit makes it so that cached metadata is always in the same format as the metadata in the repository. Relates #57798
* Fix Bug With RepositoryData Caching This fixes a really subtle bug with caching `RepositoryData` that can corrupt a repository. We were caching `RepositoryData` serialized in the newest metadata format. This lead to a confusing situation where numeric shard generations would be cached in `ShardGenerations` that were not written to the repository because the repository or cluster did not yet support `ShardGenerations`. In the case where shard generations are not actually supported yet, these cached numeric generations are not safe and there's multiple scenarios where they would be incorrect, leading to the repository trying to read shard level metadata from index-N that don't exist. This commit makes it so that cached metadata is always in the same format as the metadata in the repository. Relates #57798
This fixes a really subtle bug with caching `RepositoryData` that can corrupt a repository. We were caching `RepositoryData` serialized in the newest metadata format. This lead to a confusing situation where numeric shard generations would be cached in `ShardGenerations` that were not written to the repository because the repository or cluster did not yet support `ShardGenerations`. In the case where shard generations are not actually supported yet, these cached numeric generations are not safe and there's multiple scenarios where they would be incorrect, leading to the repository trying to read shard level metadata from index-N that don't exist. This commit makes it so that cached metadata is always in the same format as the metadata in the repository. Relates #57798
This fixes a really subtle bug with caching
RepositoryData(introduced in #52341) that can corrupt a repository.We were caching
RepositoryDataserialized in the newestmetadata format. This lead to a confusing situation where
numeric shard generations would be cached in
ShardGenerationsthat were not written to the repository because the repository
or cluster did not yet support
ShardGenerations.In the case where shard generations are not actually supported yet,
these cached numeric generations are not safe and there's multiple
scenarios where they would be incorrect, leading to the repository
trying to read shard level metadata from index-N that don't exist.
This commit makes it so that cached metadata is always in the same
format as the metadata in the repository.
Example broken scenario:
RepositoryDatawith some numeric shard generations-> at this point the shards that were moved to a new generation by the data node are already not writable any more though this can be fixed by re-adding the repository or restarting the master node still at this point.
The situation gets worse though:
RepositoryDatathat is written is based off of the cached version. Corrupting the repository since there is no fail-safe for this in the snapshot delete logic (will open a separate PR to add one).