Fix Snapshot Abort Not Waiting for Data Nodes (#58214) by original-brownbear · Pull Request #58229 · elastic/elasticsearch

original-brownbear · 2020-06-17T08:50:20Z

This was a really subtle bug that we introduced a long time ago.
If a shard snapshot is in the aborted state but hasn't started snapshotting on a node,
that node may only send the failed notification for it if the shard was actually
supposed to execute on the local node.
Without this fix, if shard snapshots were spread across at least two data nodes
(so that no data node holds all the primaries), the abort would never actually
wait on the data nodes. This isn't a big deal with UUID shard generations,
but it could lead to corruption on S3 when using numeric shard generations
(albeit very unlikely now that we have the 3 minute wait there).
Another negative side effect of this bug was that master would receive many more
shard status update messages for aborted shards, since each data node not assigned
a primary would send one message for that primary.
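
To illustrate the condition described above, here is a minimal sketch in Java. It is not the code from this change: `AbortedShardFilter`, `ShardSnapshotStatus`, `shardsToReportFailed` and the string shard keys are hypothetical names made up for the example, and the real data structures are considerably richer. It only models the rule that a data node reports an aborted, not-yet-started shard as failed when that shard was assigned to the node itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model; these are not the real Elasticsearch classes.
final class AbortedShardFilter {

    enum ShardState { INIT, ABORTED, SUCCESS, FAILED }

    /** Per-shard entry as seen in the cluster state: assigned data node and current state. */
    static final class ShardSnapshotStatus {
        final String nodeId;
        final ShardState state;

        ShardSnapshotStatus(String nodeId, ShardState state) {
            this.nodeId = nodeId;
            this.state = state;
        }
    }

    /**
     * Returns the shards for which the local node should send a failed notification:
     * shards that are aborted, were never started on this node, and are assigned to it.
     */
    static List<String> shardsToReportFailed(String localNodeId,
                                             Map<String, ShardSnapshotStatus> shards,
                                             Set<String> startedLocally) {
        List<String> toReport = new ArrayList<>();
        for (Map.Entry<String, ShardSnapshotStatus> entry : shards.entrySet()) {
            ShardSnapshotStatus status = entry.getValue();
            boolean abortedBeforeStart = status.state == ShardState.ABORTED
                && startedLocally.contains(entry.getKey()) == false;
            // Assumed shape of the fix: only the node the shard was assigned to reports it.
            boolean assignedHere = localNodeId.equals(status.nodeId);
            if (abortedBeforeStart && assignedHere) {
                toReport.add(entry.getKey());
            }
        }
        return toReport;
    }
}
```

With two aborted shards assigned to two different nodes, each node in this sketch reports only its own shard. Dropping the `assignedHere` check would make every node report every aborted shard, which matches the extra status update messages described above.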

backport of #58214


elasticmachine · 2020-06-17T08:50:22Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear added the :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) and backport labels Jun 17, 2020

elasticmachine added the Team:Distributed (Meta label for distributed team) label Jun 17, 2020

original-brownbear merged commit ff8eed4 into elastic:7.8 Jun 17, 2020

original-brownbear deleted the 58214-7.8 branch June 17, 2020 09:40
