Skip to content

Fix Snapshot Abort Not Waiting for Data Nodes (#58214)#58228

Merged
original-brownbear merged 1 commit intoelastic:7.xfrom
original-brownbear:58214-7.x
Jun 17, 2020
Merged

Fix Snapshot Abort Not Waiting for Data Nodes (#58214)#58228
original-brownbear merged 1 commit intoelastic:7.xfrom
original-brownbear:58214-7.x

Conversation

@original-brownbear
Copy link
Copy Markdown
Contributor

This was a really subtle bug that we introduced a long time ago.
If a shard snapshot is in aborted state but hasn't started snapshotting on a node
we can only send the failed notification for it if the shard was actually supposed
to execute on the local node.
Without this fix, if shard snapshots were spread out across at least two data nodes
(so that each data node does not have all the primaries) the abort would actually
never wait on the data nodes. This isn't a big deal with uuid shard generations
but could lead to potential corruption on S3 when using numeric shard generations
(albeit very unlikely now that we have the 3 minute wait there).
Another negative side-effect of this bug was that master would receive a lot more
shard status update messages for aborted shards since each data node not assigned
a primary would send one message for that primary.

backport of #58214

This was a really subtle bug that we introduced a long time ago.
If a shard snapshot is in aborted state but hasn't started snapshotting on a node
we can only send the failed notification for it if the shard was actually supposed
to execute on the local node.
Without this fix, if shard snapshots were spread out across at least two data nodes
(so that each data node does not have all the primaries) the abort would actually
never wait on the data nodes. This isn't a big deal with uuid shard generations
but could lead to potential corruption on S3 when using numeric shard generations
(albeit very unlikely now that we have the 3 minute wait there).
Another negative side-effect of this bug was that master would receive a lot more
shard status update messages for aborted shards since each data node not assigned
a primary would send one message for that primary.
@original-brownbear original-brownbear added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs backport labels Jun 17, 2020
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Jun 17, 2020
@original-brownbear original-brownbear merged commit 85be78b into elastic:7.x Jun 17, 2020
@original-brownbear original-brownbear deleted the 58214-7.x branch June 17, 2020 09:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants