
Fix Snapshot Abort Not Waiting for Data Nodes#58214

Merged
original-brownbear merged 3 commits into elastic:master from original-brownbear:fix-abort-bug
Jun 17, 2020

Conversation

@original-brownbear
Contributor

This was a really subtle bug that we introduced a long time ago.
If a shard snapshot is in the aborted state but hasn't started snapshotting on a node,
we can only send the failed notification for it if the shard was actually supposed
to execute on the local node.
Without this fix, if shard snapshots were spread out across at least two data nodes
(so that no single data node held all the primaries), the abort would never actually
wait on the data nodes. This isn't a big deal with UUID shard generations,
but it could lead to potential corruption on S3 when using numeric shard generations
(albeit very unlikely now that we have the 3-minute wait there).
Another negative side effect of this bug was that master would receive many more
shard status update messages for aborted shards, since each data node not assigned
a primary would send one message for that primary.
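The guard the fix adds can be sketched as follows. This is a minimal, illustrative model, not Elasticsearch's actual `SnapshotShardsService` code: the `Shard`, `Stage`, and `failuresToReport` names are hypothetical stand-ins for the real shard snapshot status bookkeeping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the abort-handling invariant this PR restores.
class AbortSync {

    enum Stage { QUEUED, STARTED, ABORTED }

    // A shard snapshot entry as seen in the cluster state: which node it is
    // assigned to and what stage it is in.
    record Shard(String id, String assignedNode, Stage stage) {}

    /**
     * Returns the shard ids for which THIS node should notify master of a
     * failure. Before the fix, every data node reported every aborted shard
     * that had not started locally; with the assignment check, only the node
     * the shard was supposed to execute on reports it, so master waits on the
     * right data node and receives one message per shard instead of one per node.
     */
    static List<String> failuresToReport(String localNodeId, List<Shard> shards) {
        List<String> toReport = new ArrayList<>();
        for (Shard s : shards) {
            if (s.stage() == Stage.ABORTED
                    && localNodeId.equals(s.assignedNode())) { // the fix: check local assignment
                toReport.add(s.id());
            }
        }
        return toReport;
    }
}
```

With two data nodes each holding one primary, each node now reports only its own aborted shard rather than both of them.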

@original-brownbear original-brownbear added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.8.1 v7.9.0 labels Jun 17, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/bwc

Contributor

@ywelsch ywelsch left a comment


Fix looks good to me. I've left one question on the test. Please double-check that we also properly handle the case where a snapshot is aborted but the node dropped out of the cluster (and that we have tests for that).

client().admin().cluster().prepareDeleteSnapshot(repoName, snapshotName).execute();

logger.info("--> wait for 5s to give data nodes some time to process the updated shard snapshot status");
TimeUnit.SECONDS.sleep(5L);
Contributor


Can we avoid the sleep here? For example by waiting for the data node to have applied the cluster state with the aborted snapshots, and showing that no outgoing message was sent to the master?

Contributor Author


Sure thing, sorry for being lazy there => how about 192508c?
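The pattern the reviewer is asking for can be sketched generically: instead of sleeping a fixed 5 seconds, register a callback that trips a latch once the data node has applied the cluster state carrying the aborted snapshot, then await the latch. The class and method names below are hypothetical, not the actual test helper from commit 192508c.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Generic sketch: replace a fixed Thread.sleep with an event-driven wait.
class AwaitStateApplied {
    private final CountDownLatch applied = new CountDownLatch(1);

    // Hypothetical callback, wired to the node's cluster-state applier in a
    // real test; invoked once the aborted-snapshot state has been processed.
    void onClusterStateApplied() {
        applied.countDown();
    }

    // Returns true as soon as the state was applied, false on timeout, so the
    // test waits exactly as long as needed instead of a hard-coded 5 seconds.
    boolean awaitApplied(long timeout, TimeUnit unit) {
        try {
            return applied.await(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

After the wait, the test can then assert that no outgoing shard-status message was sent to master, which is a positive signal rather than "nothing happened within 5 seconds".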

@original-brownbear
Contributor Author

Please double-check that we also properly handle the case where a snapshot is aborted but the node dropped out of the cluster (and that we have tests for that).

Yup, this is covered by org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT#testSnapshotWithStuckNode, as well as by the occasional SnapshotResiliencyTest run that hits this condition.

Contributor

@ywelsch ywelsch left a comment


LGTM

@original-brownbear
Contributor Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit 1d3032f into elastic:master Jun 17, 2020
@original-brownbear original-brownbear deleted the fix-abort-bug branch June 17, 2020 08:37
original-brownbear added a commit that referenced this pull request Jun 17, 2020
original-brownbear added a commit that referenced this pull request Jun 17, 2020
original-brownbear added a commit that referenced this pull request Jun 17, 2020
Forgot the brackets here in #58214, so in the rare case where the
first update seen by the listener doesn't match, it will still remove
itself and never be invoked again -> timeout.
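The missing-braces bug described in that follow-up commit can be illustrated in miniature. This is a toy model, not the actual Elasticsearch listener code: without braces, only the first statement after the `if` is guarded, so the self-removal runs unconditionally and a listener whose first update doesn't match is dropped before it can ever fire.

```java
import java.util.List;
import java.util.function.Predicate;

// Toy model of the listener bug: listeners should remove themselves only
// once an update matches; the buggy variant removes them unconditionally.
class BracketsBug {

    // Buggy shape: the missing braces mean listeners.remove(l) is NOT inside
    // the if, so a non-matching first update silently drops the listener.
    static boolean notifyBuggy(List<Predicate<String>> listeners, Predicate<String> l, String update) {
        boolean invoked = false;
        if (l.test(update))
            invoked = true;
        listeners.remove(l); // runs unconditionally: the bug -> timeout
        return invoked;
    }

    // Fixed shape: braces keep the removal inside the condition, so the
    // listener stays registered until an update actually matches.
    static boolean notifyFixed(List<Predicate<String>> listeners, Predicate<String> l, String update) {
        boolean invoked = false;
        if (l.test(update)) {
            invoked = true;
            listeners.remove(l); // removed only once it has fired
        }
        return invoked;
    }
}
```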
@mfussenegger mfussenegger mentioned this pull request Jun 22, 2020
@original-brownbear original-brownbear restored the fix-abort-bug branch August 6, 2020 18:35

Labels

>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team. v7.8.1 v7.9.0 v8.0.0-alpha1


4 participants