
Concurrent Snapshots Finalizing out of Order May Corrupt a Repository #75336

@original-brownbear

Description


The following test fails 100% of the time:

    public void testOutOfOrderFinalization() throws Exception {
        internalCluster().startMasterOnlyNode();
        final List<String> dataNodes = internalCluster().startDataOnlyNodes(2);
        final String index1 = "index-1";
        final String index2 = "index-2";
        createIndexWithContent(index1, dataNodes.get(0), dataNodes.get(1));
        createIndexWithContent(index2, dataNodes.get(1), dataNodes.get(0));

        final String repository = "test-repo";
        createRepository(repository, "mock");

        blockNodeWithIndex(repository, index2);

        final ActionFuture<CreateSnapshotResponse> snapshot1 = clusterAdmin()
                .prepareCreateSnapshot(repository, "snapshot-1")
                .setIndices(index1, index2)
                .setWaitForCompletion(true)
                .execute();
        awaitNumberOfSnapshotsInProgress(1);
        final ActionFuture<CreateSnapshotResponse> snapshot2 = clusterAdmin()
                .prepareCreateSnapshot(repository, "snapshot-2")
                .setIndices(index1)
                .setWaitForCompletion(true)
                .execute();
        assertSuccessful(snapshot2);
        unblockAllDataNodes(repository);
        final SnapshotInfo sn1 = assertSuccessful(snapshot1);
        assertAcked(startDeleteSnapshot(repository, sn1.snapshot().getSnapshotId().getName()).get());

        assertThat(
                clusterAdmin().prepareSnapshotStatus().setSnapshots("snapshot-2").setRepository(repository).get().getSnapshots(),
                hasSize(1)
        );
    }

=> If a shard snapshot in an earlier snapshot succeeds, but a later snapshot containing that shard finalizes before the earlier snapshot does, the shard-level metadata gets corrupted in a very subtle way: the shard points at an incorrect generation, while all the snap- blobs in the shard remain correct until the next delete (so _status API calls still work), but data blobs may still be deleted incorrectly for the shard.
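The failure mode above can be sketched with a toy model. This is NOT Elasticsearch code; all names (`shardGeneration`, `gen-1`, `data-blob-2`, etc.) are hypothetical, standing in for the repository's per-shard generation pointer and the shard generation blobs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: each snapshot finalization records the shard
// generation that snapshot wrote. If finalizations land out of order,
// the last writer wins with a stale generation.
public class OutOfOrderFinalizationSketch {
    public static void main(String[] args) {
        // repository-level pointer: shard -> current generation blob
        Map<String, String> shardGeneration = new HashMap<>();
        // contents of each shard generation blob (which data blobs it tracks)
        Map<String, Set<String>> generationContents = new HashMap<>();

        // snapshot-1's shard snapshot of index-1 completes first, writing gen-1
        generationContents.put("gen-1", Set.of("data-blob-1"));
        // snapshot-2 then snapshots the same shard, writing gen-2 on top,
        // which tracks both snapshots' data blobs
        generationContents.put("gen-2", Set.of("data-blob-1", "data-blob-2"));

        // snapshot-2 finalizes first: pointer correctly moves to gen-2
        shardGeneration.put("index-1/0", "gen-2");
        // snapshot-1 finalizes later and records the generation *it* wrote,
        // pointing the shard back at the stale gen-1
        shardGeneration.put("index-1/0", "gen-1");

        // snap- blobs are all intact, so _status-style reads still work, but
        // the next delete computes live blobs from gen-1 and would incorrectly
        // drop snapshot-2's data blob
        Set<String> live = generationContents.get(shardGeneration.get("index-1/0"));
        System.out.println("tracks data-blob-2: " + live.contains("data-blob-2"));
    }
}
```

Running the sketch prints `tracks data-blob-2: false`, i.e. the shard's current generation no longer knows about snapshot-2's data blob even though both snapshots completed successfully.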

=> I'm on it, working on a fix for this.
