Fix RoutingTable Lookup by Index#75530
Merged
original-brownbear merged 3 commits intoelastic:masterfrom Jul 21, 2021
original-brownbear:fix-routing-table-lookup
Merged
Fix RoutingTable Lookup by Index#75530original-brownbear merged 3 commits intoelastic:masterfrom original-brownbear:fix-routing-table-lookup
original-brownbear merged 3 commits intoelastic:masterfrom
original-brownbear:fix-routing-table-lookup
Conversation
This is likely one source of bugs in at least snapshotting as it can lead to looking up the wrong index from an old shard id (if an index has been deleted and a new index is created in its place concurrently)
Collaborator
|
Pinging @elastic/es-distributed (Team:Distributed) |
Contributor
Author
|
Thanks David! |
This was referenced Jul 21, 2021
original-brownbear
added a commit
that referenced
this pull request
Jul 21, 2021
original-brownbear
added a commit
that referenced
this pull request
Jul 21, 2021
ywangd
pushed a commit
to ywangd/elasticsearch
that referenced
this pull request
Jul 30, 2021
This is likely one source of bugs in at least snapshotting as it can lead to looking up the wrong index from an old shard id (if an index has been deleted and a new index is created in its place concurrently)
original-brownbear
added a commit
that referenced
this pull request
Jul 30, 2021
…#75501) This refactors the snapshots-in-progress logic to work from `RepositoryShardId` when working out what parts of the repository are in-use by writes for snapshot concurrency safety. This change does not go all the way yet on this topic and there are a number of possible follow-up further improvements to simplify the logic that I'd work through over time. But for now this allows fixing the remaining known issues that snapshot stress testing surfaced when combined with the fix in #75530. These issues all come from the fact that `ShardId` is not a stable key across multiple snapshots if snapshots are partial. The scenarios that are broken are all roughly this: * snapshot-1 for index-A with uuid-A runs and is partial * index-A is deleted and re-created and now has uuid-B * snapshot-2 for index-A is started and we now have it queued up behind snapshot-1 for the index * snapshot-1 finishes and the logic tries to start the next snapshot for the same shard-id * this fails because the shard-id is not the same, we can't compare index uuids, just index name + shard id * this change fixes all these spots by always taking the round trip via `RepositoryShardId` planned follow-ups here are: * dry up logic across cloning and snapshotting more as both now essentially run the same code in many state-machine steps * serialize snapshots-in-progress efficiently instead of re-computing the index and by-repository-shard-id lookups in the constructor every time * refactor the logic in snapshots-in-progress away from maps keyed by shard-id in almost all spots to this end, just keep an index name to `Index` map to work out what exactly is being snapshotted * refactoring snapshots-in-progress to be a map of list of operations keyed by repository shard id instead of a list of maps as it currently is to make the concurrency simpler and more obviously correct closes #75423 relates (#75339 ... should also fix this, but I have to verify by testing with a backport to 7.x)
original-brownbear
added a commit
that referenced
this pull request
Aug 16, 2021
…#75501) (#76539) This refactors the snapshots-in-progress logic to work from `RepositoryShardId` when working out what parts of the repository are in-use by writes for snapshot concurrency safety. This change does not go all the way yet on this topic and there are a number of possible follow-up further improvements to simplify the logic that I'd work through over time. But for now this allows fixing the remaining known issues that snapshot stress testing surfaced when combined with the fix in #75530. These issues all come from the fact that `ShardId` is not a stable key across multiple snapshots if snapshots are partial. The scenarios that are broken are all roughly this: * snapshot-1 for index-A with uuid-A runs and is partial * index-A is deleted and re-created and now has uuid-B * snapshot-2 for index-A is started and we now have it queued up behind snapshot-1 for the index * snapshot-1 finishes and the logic tries to start the next snapshot for the same shard-id * this fails because the shard-id is not the same, we can't compare index uuids, just index name + shard id * this change fixes all these spots by always taking the round trip via `RepositoryShardId` planned follow-ups here are: * dry up logic across cloning and snapshotting more as both now essentially run the same code in many state-machine steps * serialize snapshots-in-progress efficiently instead of re-computing the index and by-repository-shard-id lookups in the constructor every time * refactor the logic in snapshots-in-progress away from maps keyed by shard-id in almost all spots to this end, just keep an index name to `Index` map to work out what exactly is being snapshotted * refactoring snapshots-in-progress to be a map of list of operations keyed by repository shard id instead of a list of maps as it currently is to make the concurrency simpler and more obviously correct closes #75423 relates (#75339 ... should also fix this, but I have to verify by testing with a backport to 7.x)
original-brownbear
added a commit
that referenced
this pull request
Aug 16, 2021
…#75501) (#76539) (#76547) This refactors the snapshots-in-progress logic to work from `RepositoryShardId` when working out what parts of the repository are in-use by writes for snapshot concurrency safety. This change does not go all the way yet on this topic and there are a number of possible follow-up further improvements to simplify the logic that I'd work through over time. But for now this allows fixing the remaining known issues that snapshot stress testing surfaced when combined with the fix in #75530. These issues all come from the fact that `ShardId` is not a stable key across multiple snapshots if snapshots are partial. The scenarios that are broken are all roughly this: * snapshot-1 for index-A with uuid-A runs and is partial * index-A is deleted and re-created and now has uuid-B * snapshot-2 for index-A is started and we now have it queued up behind snapshot-1 for the index * snapshot-1 finishes and the logic tries to start the next snapshot for the same shard-id * this fails because the shard-id is not the same, we can't compare index uuids, just index name + shard id * this change fixes all these spots by always taking the round trip via `RepositoryShardId` planned follow-ups here are: * dry up logic across cloning and snapshotting more as both now essentially run the same code in many state-machine steps * serialize snapshots-in-progress efficiently instead of re-computing the index and by-repository-shard-id lookups in the constructor every time * refactor the logic in snapshots-in-progress away from maps keyed by shard-id in almost all spots to this end, just keep an index name to `Index` map to work out what exactly is being snapshotted * refactoring snapshots-in-progress to be a map of list of operations keyed by repository shard id instead of a list of maps as it currently is to make the concurrency simpler and more obviously correct closes #75423 relates (#75339 ... should also fix this, but I have to verify by testing with a backport to 7.x)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is likely one source of bugs in at least snapshots (could lead to
org.elasticsearch.snapshots.SnapshotsService#waitingShardsStartedOrUnassignedmissing a relevant index change)as it can lead to looking up the wrong index from an old shard id (if an index has been
deleted and a new index is created in its place for the same name).