
Fix Snapshot Abort Not Waiting for Data Nodes#58214

Merged
original-brownbear merged 3 commits into elastic:master from original-brownbear:fix-abort-bug
Jun 17, 2020

Conversation

@original-brownbear
Contributor

This was a really subtle bug that we introduced a long time ago.
If a shard snapshot is in the aborted state but hasn't started snapshotting on a node,
we can only send the failed notification for it if the shard was actually supposed
to execute on the local node.
Without this fix, if shard snapshots were spread out across at least two data nodes
(so that no single data node held all the primaries), the abort would never actually
wait on the data nodes. This isn't a big deal with UUID shard generations,
but it could lead to potential corruption on S3 when using numeric shard generations
(albeit very unlikely now that we have the 3-minute wait there).
Another negative side effect of this bug was that master would receive many more
shard status update messages for aborted shards, since each data node not assigned
a primary would send one message for that primary.
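The guard the fix adds can be sketched as follows. This is a minimal, illustrative model, not Elasticsearch's actual `SnapshotShardsService` code: the `Shard`, `Stage`, and `failuresToReport` names are hypothetical stand-ins for the real shard snapshot status bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the abort-handling invariant this PR restores.
class AbortSync {

    enum Stage { QUEUED, STARTED, ABORTED }

    // A shard snapshot entry as seen in the cluster state: which node it is
    // assigned to and what stage it is in.
    record Shard(String id, String assignedNode, Stage stage) {}

    /**
     * Returns the shard ids for which THIS node should notify master of a
     * failure. Before the fix, every data node reported every aborted shard
     * that had not started locally; with the assignment check, only the node
     * the shard was supposed to execute on reports it, so master waits on the
     * right data node and receives one message per shard instead of one per node.
     */
    static List<String> failuresToReport(String localNodeId, List<Shard> shards) {
        List<String> toReport = new ArrayList<>();
        for (Shard s : shards) {
            if (s.stage() == Stage.ABORTED
                    && localNodeId.equals(s.assignedNode())) { // the fix: check local assignment
                toReport.add(s.id());
            }
        }
        return toReport;
    }
}
```

With two data nodes each holding one primary, each node now reports only its own aborted shard rather than both of them.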

@original-brownbear original-brownbear added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.8.1 v7.9.0 labels Jun 17, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/bwc

Contributor

@ywelsch ywelsch left a comment


Fix looks good to me. I've left one question on the test. Please double-check that we also properly handle the case where a snapshot is aborted but the node dropped out of the cluster (and that we have tests for that).

client().admin().cluster().prepareDeleteSnapshot(repoName, snapshotName).execute();

logger.info("--> wait for 5s to give data nodes some time to process the updated shard snapshot status");
TimeUnit.SECONDS.sleep(5L);
Contributor


Can we avoid the sleep here? For example by waiting for the data node to have applied the cluster state with the aborted snapshots, and showing that no outgoing message was sent to the master?

Contributor Author


Sure thing, sorry for being lazy there => how about 192508c?
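The pattern the reviewer is asking for can be sketched generically: instead of sleeping a fixed 5 seconds, register a callback that trips a latch once the data node has applied the cluster state carrying the aborted snapshot, then await the latch. The class and method names below are hypothetical, not the actual test helper from commit 192508c.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Generic sketch: replace a fixed Thread.sleep with an event-driven wait.
class AwaitStateApplied {
    private final CountDownLatch applied = new CountDownLatch(1);

    // Hypothetical callback, wired to the node's cluster-state applier in a
    // real test; invoked once the aborted-snapshot state has been processed.
    void onClusterStateApplied() {
        applied.countDown();
    }

    // Returns true as soon as the state was applied, false on timeout, so the
    // test waits exactly as long as needed instead of a hard-coded 5 seconds.
    boolean awaitApplied(long timeout, TimeUnit unit) {
        try {
            return applied.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

After the wait, the test can then assert that no outgoing shard-status message was sent to master, which is a positive signal rather than "nothing happened within 5 seconds".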

@original-brownbear
Contributor Author

Please double-check that we also properly handle the case where a snapshot is aborted but the node dropped out of the cluster (and that we have tests for that).

Yup, this is covered by org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT#testSnapshotWithStuckNode, as well as by the occasional SnapshotResiliencyTest run that hits this condition.

Contributor

@ywelsch ywelsch left a comment


LGTM

@original-brownbear
Contributor Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit 1d3032f into elastic:master Jun 17, 2020
@original-brownbear original-brownbear deleted the fix-abort-bug branch June 17, 2020 08:37
original-brownbear added a commit that referenced this pull request Jun 17, 2020
original-brownbear added a commit that referenced this pull request Jun 17, 2020
original-brownbear added a commit that referenced this pull request Jun 17, 2020
Forgot the brackets here in #58214, so in the rare case where the
first update seen by the listener doesn't match, it will still remove
itself and never be invoked again -> timeout.
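The missing-braces bug described in that follow-up commit can be illustrated in miniature. This is a toy model, not the actual Elasticsearch listener code: without braces, only the first statement after the `if` is guarded, so the self-removal runs unconditionally and a listener whose first update doesn't match is dropped before it can ever fire.

```java
import java.util.List;
import java.util.function.Predicate;

// Toy model of the listener bug: listeners should remove themselves only
// once an update matches; the buggy variant removes them unconditionally.
class BracketsBug {

    // Buggy shape: the missing braces mean listeners.remove(l) is NOT inside
    // the if, so a non-matching first update silently drops the listener.
    static boolean notifyBuggy(List<Predicate<String>> listeners, Predicate<String> l, String update) {
        boolean invoked = false;
        if (l.test(update))
            invoked = true;
        listeners.remove(l); // runs unconditionally: the bug -> timeout
        return invoked;
    }

    // Fixed shape: braces keep the removal inside the condition, so the
    // listener stays registered until an update actually matches.
    static boolean notifyFixed(List<Predicate<String>> listeners, Predicate<String> l, String update) {
        boolean invoked = false;
        if (l.test(update)) {
            invoked = true;
            listeners.remove(l); // removed only once it has fired
        }
        return invoked;
    }
}
```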
@mfussenegger mfussenegger mentioned this pull request Jun 22, 2020
@original-brownbear original-brownbear restored the fix-abort-bug branch August 6, 2020 18:35

Labels

>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team. v7.8.1 v7.9.0 v8.0.0-alpha1


4 participants