
Concurrent Snapshots Finalizing out of Order May Corrupt a Repository #75336

@original-brownbear

Description


The following test fails 100% of the time:

    public void testOutOfOrderFinalization() throws Exception {
        internalCluster().startMasterOnlyNode();
        final List<String> dataNodes = internalCluster().startDataOnlyNodes(2);
        final String index1 = "index-1";
        final String index2 = "index-2";
        createIndexWithContent(index1, dataNodes.get(0), dataNodes.get(1));
        createIndexWithContent(index2, dataNodes.get(1), dataNodes.get(0));

        final String repository = "test-repo";
        createRepository(repository, "mock");

        blockNodeWithIndex(repository, index2);

        final ActionFuture<CreateSnapshotResponse> snapshot1 = clusterAdmin()
                .prepareCreateSnapshot(repository, "snapshot-1")
                .setIndices(index1, index2)
                .setWaitForCompletion(true)
                .execute();
        awaitNumberOfSnapshotsInProgress(1);
        final ActionFuture<CreateSnapshotResponse> snapshot2 = clusterAdmin()
                .prepareCreateSnapshot(repository, "snapshot-2")
                .setIndices(index1)
                .setWaitForCompletion(true)
                .execute();
        assertSuccessful(snapshot2);
        unblockAllDataNodes(repository);
        final SnapshotInfo sn1 = assertSuccessful(snapshot1);
        assertAcked(startDeleteSnapshot(repository, sn1.snapshot().getSnapshotId().getName()).get());

        assertThat(
                clusterAdmin().prepareSnapshotStatus().setSnapshots("snapshot-2").setRepository(repository).get().getSnapshots(),
                hasSize(1)
        );
    }

=> If a shard snapshot in an earlier snapshot succeeds, but a later snapshot containing that shard finalizes before the earlier snapshot does, the shard-level metadata gets corrupted in a very subtle way: the shard points at an incorrect generation, while all the snap- blobs in the shard remain correct until the next delete (so _status API calls still work), but data blobs may still be deleted incorrectly for the shard.
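The failure mode above can be sketched with a toy model. This is NOT Elasticsearch code; all names (`shardGeneration`, `gen-1`, `data-blob-2`, etc.) are hypothetical, standing in for the repository's per-shard generation pointer and the shard generation blobs:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: each snapshot finalization records the shard
// generation that snapshot wrote. If finalizations land out of order,
// the last writer wins with a stale generation.
public class OutOfOrderFinalizationSketch {
    public static void main(String[] args) {
        // repository-level pointer: shard -> current generation blob
        Map<String, String> shardGeneration = new HashMap<>();
        // contents of each shard generation blob (which data blobs it tracks)
        Map<String, Set<String>> generationContents = new HashMap<>();

        // snapshot-1's shard snapshot of index-1 completes first, writing gen-1
        generationContents.put("gen-1", Set.of("data-blob-1"));
        // snapshot-2 then snapshots the same shard, writing gen-2 on top,
        // which tracks both snapshots' data blobs
        generationContents.put("gen-2", Set.of("data-blob-1", "data-blob-2"));

        // snapshot-2 finalizes first: pointer correctly moves to gen-2
        shardGeneration.put("index-1/0", "gen-2");
        // snapshot-1 finalizes later and records the generation *it* wrote,
        // pointing the shard back at the stale gen-1
        shardGeneration.put("index-1/0", "gen-1");

        // snap- blobs are all intact, so _status-style reads still work, but
        // the next delete computes live blobs from gen-1 and would incorrectly
        // drop snapshot-2's data blob
        Set<String> live = generationContents.get(shardGeneration.get("index-1/0"));
        System.out.println("tracks data-blob-2: " + live.contains("data-blob-2"));
    }
}
```

Running the sketch prints `tracks data-blob-2: false`, i.e. the shard's current generation no longer knows about snapshot-2's data blob even though both snapshots completed successfully.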

=> I'm on it, working on a fix for this.
