Refactor SnapshotsInProgress to Use RepositoryId for Concurrency Logic (#75501) by original-brownbear · Pull Request #76539 · elastic/elasticsearch

original-brownbear · 2021-08-15T19:09:14Z

This refactors the snapshots-in-progress logic to work from `RepositoryShardId` when working out which parts of the repository are in use by writes, for snapshot concurrency safety. This change does not go all the way on this topic yet, and there are a number of possible follow-up improvements to simplify the logic that I'd work through over time.
But for now this allows fixing the remaining known issues that snapshot stress testing surfaced, when combined with the fix in #75530.

These issues all come from the fact that `ShardId` is not a stable key across multiple snapshots if snapshots are partial. The scenarios that are broken are all roughly this:

* snapshot-1 for index-A with uuid-A runs and is partial
* index-A is deleted and re-created and now has uuid-B
* snapshot-2 for index-A is started and is now queued up behind snapshot-1 for the index
* snapshot-1 finishes and the logic tries to start the next snapshot for the same shard-id
  * this fails because the shard-id is not the same: we can't compare index uuids, just index name + shard id
  * this change fixes all these spots by always taking the round trip via `RepositoryShardId`
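
The failure mode in the scenario above can be sketched in a few lines of Java. The types below are simplified, hypothetical stand-ins and not the real Elasticsearch `ShardId` and `RepositoryShardId` classes; the repository-level key is assumed here to be just index name plus shard number. The point is only that a key containing the index uuid stops matching once the index is re-created, while a key without it still matches.

```java
import java.util.HashMap;
import java.util.Map;

public class ShardKeySketch {
    // Hypothetical stand-in for a cluster-level shard key: includes the index uuid.
    record ClusterShardKey(String indexName, String indexUuid, int shard) {}

    // Hypothetical stand-in for a repository-level shard key: index name + shard number only.
    record RepoShardKey(String indexName, int shard) {}

    // The "round trip": drop the uuid before comparing or looking up queued work.
    static RepoShardKey toRepoKey(ClusterShardKey key) {
        return new RepoShardKey(key.indexName(), key.shard());
    }

    public static void main(String[] args) {
        // snapshot-1 ran against index-A with uuid-A; snapshot-2 was queued for the re-created index.
        ClusterShardKey finished = new ClusterShardKey("index-A", "uuid-A", 0);
        ClusterShardKey queued = new ClusterShardKey("index-A", "uuid-B", 0);

        Map<ClusterShardKey, String> queuedByClusterKey = new HashMap<>();
        queuedByClusterKey.put(queued, "snapshot-2");

        Map<RepoShardKey, String> queuedByRepoKey = new HashMap<>();
        queuedByRepoKey.put(toRepoKey(queued), "snapshot-2");

        // Lookup by the uuid-carrying key misses the queued snapshot.
        System.out.println(queuedByClusterKey.get(finished)); // prints "null"
        // Lookup via the repository-level key finds it.
        System.out.println(queuedByRepoKey.get(toRepoKey(finished))); // prints "snapshot-2"
    }
}
```

In this sketch the two cluster-level keys differ only in the uuid, which is exactly the case where the queued snapshot would otherwise never be picked up.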

Planned follow-ups here are:

* dry up logic across cloning and snapshotting more, as both now essentially run the same code in many state-machine steps
* serialize snapshots-in-progress efficiently instead of re-computing the index and by-repository-shard-id lookups in the constructor every time
  * to this end, refactor the logic in snapshots-in-progress away from maps keyed by shard-id in almost all spots, and just keep an index name to `Index` map to work out what exactly is being snapshotted
* refactor snapshots-in-progress to be a map of lists of operations keyed by repository shard id, instead of a list of maps as it currently is, to make the concurrency simpler and more obviously correct
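
To illustrate the last follow-up, here is a minimal sketch of what "a map of lists of operations keyed by repository shard id" could look like. Everything in it (`PerShardQueueSketch`, `RepoShardKey`, `enqueue`, `finishRunning`, plain strings as operations) is hypothetical and chosen for illustration; it is not how `SnapshotsInProgress` is implemented, and it ignores cluster-state immutability and serialization entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class PerShardQueueSketch {
    // Hypothetical repository-level shard key: index name + shard number.
    record RepoShardKey(String indexName, int shard) {}

    // One FIFO queue of operations per repository shard; the head is the running operation.
    private final Map<RepoShardKey, Deque<String>> operations = new HashMap<>();

    // Adds an operation; returns true if it may start right away (the shard was idle).
    boolean enqueue(RepoShardKey key, String operation) {
        Deque<String> queue = operations.computeIfAbsent(key, k -> new ArrayDeque<>());
        queue.addLast(operation);
        return queue.size() == 1;
    }

    // Removes the running operation and returns the next one to start, or null if none is queued.
    String finishRunning(RepoShardKey key) {
        Deque<String> queue = operations.get(key);
        if (queue == null || queue.isEmpty()) {
            throw new IllegalStateException("no running operation for " + key);
        }
        queue.removeFirst();
        if (queue.isEmpty()) {
            operations.remove(key);
            return null;
        }
        return queue.peekFirst();
    }

    public static void main(String[] args) {
        PerShardQueueSketch state = new PerShardQueueSketch();
        RepoShardKey shard = new RepoShardKey("index-A", 0);

        System.out.println(state.enqueue(shard, "snapshot-1")); // prints "true"
        System.out.println(state.enqueue(shard, "snapshot-2")); // prints "false"
        System.out.println(state.finishRunning(shard));         // prints "snapshot-2"
    }
}
```

With this shape, "what runs next for this shard" is a single lookup on one key, rather than a scan across a list of per-snapshot maps.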

closes #75423

relates (#75339 ... should also fix this, but I have to verify by testing with a backport to 7.x)

backport of #75501


elasticmachine · 2021-08-15T19:09:17Z

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear · 2021-08-16T03:46:34Z

Jenkins test this

original-brownbear · 2021-08-16T04:21:38Z

Jenkins run elasticsearch-ci/packaging-tests-unix-sample

original-brownbear · 2021-08-16T04:21:49Z

Jenkins run elasticsearch-ci/part-1

original-brownbear · 2021-08-16T04:22:10Z

Jenkins run elasticsearch-ci/packaging-tests-windows-sample

original-brownbear · 2021-08-16T04:42:41Z

Jenkins run elasticsearch-ci/part-1

…#75501) (#76539) (#76547)

original-brownbear added the :Distributed/Snapshot/Restore and backport labels Aug 15, 2021

elasticmachine added the Team:Distributed label Aug 15, 2021

elasticsearchmachine added the v7.15.0 label Aug 15, 2021

original-brownbear merged commit 57092cc into elastic:7.x Aug 16, 2021

original-brownbear deleted the 75501-7.x branch August 16, 2021 05:25

henningandersen mentioned this pull request Aug 16, 2021

[CI] MixedClusterClientYamlTestSuiteIT failing (crash due to assertion) #76552

Closed

original-brownbear merged 1 commit into elastic:7.x from original-brownbear:75501-7.x
