Fix Snapshot Abort Not Waiting for Data Nodes #58214
original-brownbear merged 3 commits into elastic:master from original-brownbear:fix-abort-bug
Conversation
This was a really subtle bug that we introduced a long time ago. If a shard snapshot is in the aborted state but hasn't started snapshotting on a node, we must only send the failed notification for it if the shard was actually supposed to execute on the local node. Without this fix, if shard snapshots were spread out across at least two data nodes (so that no single data node held all the primaries), the abort would never actually wait on the data nodes. This isn't a big deal with UUID shard generations, but it could lead to potential corruption on S3 when using numeric shard generations (albeit very unlikely now that we have the 3-minute wait there). Another negative side effect of this bug was that master would receive many more shard status update messages for aborted shards, since each data node not assigned a given primary would send one message for that primary.
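A dependency-free sketch of the guard the fix introduces (class and method names below are illustrative, not the actual Elasticsearch code): on abort, a data node must only report a not-yet-started shard snapshot as failed when that shard is assigned to the local node; the buggy behavior reported every unstarted primary from every data node.

```java
import java.util.ArrayList;
import java.util.List;

public class AbortNotificationSketch {

    // Simplified stand-in for a shard snapshot entry in the cluster state.
    record ShardSnapshot(String shardId, String assignedNodeId, boolean started) {}

    /** Shards this node should report to master as FAILED after an abort. */
    static List<String> failuresToReport(List<ShardSnapshot> shards, String localNodeId) {
        List<String> toReport = new ArrayList<>();
        for (ShardSnapshot shard : shards) {
            if (shard.started()) {
                continue; // running shard snapshots are aborted via a different path
            }
            // The fix: only send the failed notification for shards that were
            // actually supposed to execute on the local node.
            if (shard.assignedNodeId().equals(localNodeId)) {
                toReport.add(shard.shardId());
            }
        }
        return toReport;
    }

    public static void main(String[] args) {
        List<ShardSnapshot> shards = List.of(
            new ShardSnapshot("[idx][0]", "node-1", false),
            new ShardSnapshot("[idx][1]", "node-2", false));
        // node-1 reports only its own shard; node-2's primary is left to node-2.
        System.out.println(failuresToReport(shards, "node-1")); // prints [[idx][0]]
    }
}
```

Without the `assignedNodeId` check, both data nodes would report both shards, which is the flood of duplicate status updates described above.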
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
Jenkins run elasticsearch-ci/bwc |
ywelsch
left a comment
Fix looks good to me. I've left one question on the test. Please double-check that we also properly handle the case where a snapshot is aborted but the node dropped out of the cluster (and that we have tests for that).
Reviewed lines (from the test):

    client().admin().cluster().prepareDeleteSnapshot(repoName, snapshotName).execute();

    logger.info("--> wait for 5s to give data nodes some time to process the updated shard snapshot status");
    TimeUnit.SECONDS.sleep(5L);
Can we avoid the sleep here? For example by waiting for the data node to have applied the cluster state with the aborted snapshots, and showing that no outgoing message was sent to the master?
Sure thing, sorry for being lazy there => how about 192508c ?
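The reviewer's suggestion is the usual wait-for-condition pattern: instead of a fixed 5-second sleep, poll until the observed state holds and fail on timeout. Elasticsearch tests express this with helpers like `assertBusy` or a `ClusterStateObserver`; the following is a dependency-free stand-in showing the shape of the idea.

```java
import java.util.function.BooleanSupplier;

public class AwaitCondition {

    /** Poll a condition until it holds, failing if the deadline passes. */
    static void awaitTrue(BooleanSupplier condition, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + "ms");
            }
            Thread.sleep(10); // short poll interval instead of one long blind sleep
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        // In the real test the condition would be "the data node has applied the
        // cluster state with the aborted snapshot"; here it is a timed stand-in.
        awaitTrue(() -> System.currentTimeMillis() - start >= 50, 1_000);
        System.out.println("condition observed");
    }
}
```

The test can then additionally assert that no outgoing shard-status message was sent to master once the condition is reached, rather than hoping 5 seconds was long enough.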
Jup, this is covered by
Thanks Yannick!
Forgot the brackets here in #58214, so in the rare case where the first update seen by the listener doesn't match, it will still remove itself and never be invoked again -> timeout.
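The follow-up bug is a classic braceless-`if` mistake: without braces, only the first statement is conditional, so the listener's self-removal runs on every update, including non-matching ones. A minimal, self-contained sketch (listener names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class BracketBugSketch {

    interface Listener { void onUpdate(String update); }

    static final List<Listener> listeners = new ArrayList<>();

    /** Register a listener that fires once, on the first matching update. */
    static void addOneShotListener(Predicate<String> matcher, Runnable action) {
        listeners.add(new Listener() {
            @Override public void onUpdate(String update) {
                // Buggy braceless version: the removal is NOT inside the if,
                // so a non-matching first update detaches the listener forever:
                //   if (matcher.test(update))
                //       action.run();
                //   listeners.remove(this); // always executed -> timeout
                //
                // Fixed version, with braces scoping both statements:
                if (matcher.test(update)) {
                    action.run();
                    listeners.remove(this);
                }
            }
        });
    }

    public static void main(String[] args) {
        boolean[] fired = { false };
        addOneShotListener("match"::equals, () -> fired[0] = true);
        // A non-matching update must leave the listener registered.
        new ArrayList<>(listeners).forEach(l -> l.onUpdate("no-match"));
        System.out.println(listeners.size()); // prints 1
        // The matching update fires the action and removes the listener.
        new ArrayList<>(listeners).forEach(l -> l.onUpdate("match"));
        System.out.println(fired[0]); // prints true
    }
}
```

With the buggy braceless form, the first `onUpdate("no-match")` would have emptied `listeners`, and the later matching update would never be delivered.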