Repositories that contain snapshots from both before and after version 7.6 can become dysfunctional, and in some cases corrupted, by ES v7.7 clusters because of a bug in how RepositoryData is cached.
RepositoryData is cached together with its ShardGenerations, which can contain numeric generation values that are not reliable: any failed snapshot finalization that had at least one individual shard snapshot would cause an incorrect shard generation to be tracked.
This leads to two stages of broken behavior:
- As long as there is still at least one pre-7.6 snapshot in the repository, new ShardGenerations will not be physically written to the repository. The issue shows up as errors like the following while creating new snapshots of affected shards, leading to PARTIAL snapshots because the affected shards will never snapshot successfully:
[2020-06-04T00:00:00.206Z][WARN][org.elasticsearch.snapshots.SnapshotShardsService] [instance-0000000012] [[xxx][0]][xxx:xxx/Lnmw3145RGubRA7oiWqUsg] failed to snapshot shard
java.nio.file.NoSuchFileException: Blob [snapshots/585b4ab8ad5e44d8a8144df80846222b/indices/IzG2oACDQbiD_1479qE4IA/0/index-7866] does not exist
Also, snapshot deletes will log the same error, but will work otherwise. This leads to the second stage of the issue described below.
At this stage of the problem, the repository can be fixed and further corruption prevented by setting the repository setting cache_repository_data to false.
- Once all the pre-7.6 snapshots have been deleted from a repository, the broken RepositoryGenerations that were incorrectly cached will be physically written to the repository.
Once this has happened, the repository is physically corrupted, and the only way to fix it at this point is to delete all snapshots referencing the broken shards.
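For the first stage, the cache_repository_data workaround could be applied by re-registering the repository through the snapshot repository API. The following is only a minimal sketch: the helper names, the base URL, and the repository name are hypothetical, no authentication or TLS handling is included, and it assumes that re-registering with PUT replaces the stored settings, which is why the existing definition is fetched and merged first.

```python
import json
import urllib.request


def with_cache_disabled(repo_definition):
    """Return a copy of a repository definition with cache_repository_data off.

    `repo_definition` is assumed to have the shape returned by
    GET /_snapshot/<name>: {"type": ..., "settings": {...}}.
    """
    settings = dict(repo_definition.get("settings", {}))
    settings["cache_repository_data"] = False
    return {"type": repo_definition["type"], "settings": settings}


def disable_repository_data_cache(base_url, repo_name):
    """Fetch the current definition and re-register it with caching disabled.

    Hypothetical helper: assumes an unauthenticated cluster at `base_url`.
    """
    url = f"{base_url}/_snapshot/{repo_name}"
    with urllib.request.urlopen(url) as resp:
        current = json.load(resp)[repo_name]
    body = json.dumps(with_cache_disabled(current)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

A call such as `disable_repository_data_cache("http://localhost:9200", "my_repo")` would then re-register the hypothetical repository `my_repo` with its existing settings plus cache_repository_data set to false. This only helps while the repository is still in the first stage; it does not repair a repository that has already been physically corrupted.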
We will fix this in two steps:
- [...] RepositoryData caching: Fix Bug With RepositoryData Caching #57785
- [...] RepositoryData so that upgrading restores the affected repositories to full functionality
cc @ywelsch , @paulcoghlan