Snapshot Repositories Containing a Mix of pre and post v7.6 Snapshots Can Become Corrupted

Repositories that contain both pre-7.6 and post-7.6 snapshots can become dysfunctional, and in some cases corrupted, when used by ES v7.7 clusters. The cause is a bug in how `RepositoryData` is cached.
The cached `RepositoryData` includes `ShardGenerations` whose numeric generation values might not be reliable: any failed snapshot finalization that included at least one individual shard snapshot causes an incorrect shard generation to be tracked.

This leads to two stages of broken behavior:

1. As long as there is still at least one pre-7.6 snapshot in the repository, new `ShardGenerations` will not be physically written to the repository. The issue shows up as errors like the one below when creating new snapshots of affected shards. The result is `PARTIAL` snapshots, because the affected shards will never snapshot successfully.

```
 [2020-06-04T00:00:00.206Z][WARN][org.elasticsearch.snapshots.SnapshotShardsService] [instance-0000000012] [[xxx][0]][xxx:xxx/Lnmw3145RGubRA7oiWqUsg] failed to snapshot shard
java.nio.file.NoSuchFileException: Blob [snapshots/585b4ab8ad5e44d8a8144df80846222b/indices/IzG2oACDQbiD_1479qE4IA/0/index-7866] does not exist
```

Snapshot deletes will log the same error but will otherwise succeed. This is what leads to the second stage of the issue, described below.
At this stage of the problem, the repository can be fixed and further corruption prevented by setting the repository setting `cache_repository_data` to `false`.
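As a rough illustration of the workaround, the sketch below builds the request that re-registers a repository with `cache_repository_data` set to `false`. The repository name `my_backup`, the `fs` type, the location, and the cluster URL are placeholders, not values from this issue. Re-registering through `PUT _snapshot/<name>` replaces the stored settings, so the sketch copies the existing settings and only adds the one flag.

```python
import json
import urllib.request


def with_cache_disabled(repo_definition):
    """Return a copy of a repository definition with RepositoryData caching turned off."""
    settings = dict(repo_definition.get("settings", {}))
    settings["cache_repository_data"] = False
    return {"type": repo_definition["type"], "settings": settings}


def build_update_request(base_url, repo_name, repo_definition):
    """Build (but do not send) the PUT request that re-registers the repository."""
    body = json.dumps(with_cache_disabled(repo_definition)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_snapshot/{repo_name}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical existing definition, e.g. as returned by GET _snapshot/my_backup.
existing = {"type": "fs", "settings": {"location": "/mount/backups"}}
request = build_update_request("http://localhost:9200", "my_backup", existing)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To apply it against a real cluster: urllib.request.urlopen(request)
```

The request is only constructed here, so the payload can be reviewed before anything is sent to a cluster.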

2. Once all the pre-7.6 snapshots have been deleted from a repository, the broken `ShardGenerations` that were incorrectly cached will be physically written to the repository.
Once this has happened, the repository is physically corrupted, and the only way to fix it at this point is to delete all snapshots referencing the broken shards.
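To get a list of affected shards in either stage, the `NoSuchFileException` lines shown above can be scanned for the missing blob path. This is a minimal sketch and not part of any Elasticsearch tooling. It assumes the path always ends in `indices/<index id>/<shard number>/index-<generation>`, as in the sample log line, and that the component after `indices/` is the repository's identifier for the index rather than its name.

```python
import re

# Matches the blob path reported by the NoSuchFileException in the sample log.
# Assumed layout: <base path>/indices/<index id>/<shard number>/index-<generation>
MISSING_BLOB = re.compile(
    r"Blob \[(?:.*/)?indices/(?P<index_id>[^/\]]+)/(?P<shard>\d+)"
    r"/index-(?P<generation>\d+)\] does not exist"
)


def affected_shards(log_lines):
    """Collect unique (index id, shard number, generation) triples from matching lines."""
    found = set()
    for line in log_lines:
        match = MISSING_BLOB.search(line)
        if match:
            found.add(
                (match["index_id"], int(match["shard"]), int(match["generation"]))
            )
    return sorted(found)


sample = [
    "[WARN][org.elasticsearch.snapshots.SnapshotShardsService] failed to snapshot shard",
    "java.nio.file.NoSuchFileException: Blob [snapshots/585b4ab8ad5e44d8a8144df80846222b"
    "/indices/IzG2oACDQbiD_1479qE4IA/0/index-7866] does not exist",
]
print(affected_shards(sample))
# prints [('IzG2oACDQbiD_1479qE4IA', 0, 7866)]
```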

We will fix this in two steps:

- [x] Fix the `RepositoryData` caching #57785 
- [x] Add logic to automatically fix affected `RepositoryData` so that upgrading restores the affected repositories to full functionality

cc @ywelsch , @paulcoghlan 

Snapshot Repositories Containing a Mix of pre and post v7.6 Snapshots Can Become Corrupted #57798
