Abort snapshots on a node that leaves the cluster #21084
abeyad merged 9 commits into elastic:master
Conversation
Previously, if a node left the cluster (for example, due to a long GC), during a snapshot, the master node would mark the snapshot as failed, but the node itself could continue snapshotting the data on its shards to the repository. If the node rejoins the cluster, the master may assign it to hold the replica shard (where it held the primary before getting kicked off the cluster). The initialization of the replica shard would repeatedly fail with a ShardLockObtainFailedException until the snapshot thread finally finishes and relinquishes the lock on the Store. This commit resolves the situation by ensuring that the shard snapshot is aborted when the node responsible for that shard's snapshot leaves the cluster. When the node rejoins the cluster, it will see in the cluster state that the snapshot for that shard is failed and abort the snapshot locally, allowing the shard data directory to be freed for allocation of a replica shard on the same node. Closes elastic#20876
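The abort-on-leave bookkeeping described above can be sketched with a toy model. All types here are simplified hypothetical stand-ins, not Elasticsearch classes: the point is only that when a node leaves, its in-progress shard snapshots are moved to a failed state so the rejoining node can abort them and release the store lock.

```java
// Hypothetical sketch (not Elasticsearch source): master-side snapshot
// bookkeeping reduced to a map of shard snapshot statuses. When a node
// leaves, every shard snapshot assigned to it is marked FAILED, which the
// node observes on rejoining and uses to abort its local snapshot thread.
import java.util.HashMap;
import java.util.Map;

public class SnapshotAbortSketch {
    enum State { INIT, STARTED, FAILED, SUCCESS }

    static class ShardSnapshot {
        final String nodeId;
        State state;
        ShardSnapshot(String nodeId, State state) { this.nodeId = nodeId; this.state = state; }
    }

    // shardId -> status of that shard's snapshot
    final Map<String, ShardSnapshot> shards = new HashMap<>();

    void start(String shardId, String nodeId) {
        shards.put(shardId, new ShardSnapshot(nodeId, State.STARTED));
    }

    // Called when a node leaves the cluster: fail its in-progress shard snapshots
    // so they are aborted instead of stalling and holding the store lock.
    void onNodeLeft(String nodeId) {
        for (ShardSnapshot s : shards.values()) {
            if (s.nodeId.equals(nodeId) && s.state == State.STARTED) {
                s.state = State.FAILED;
            }
        }
    }

    State stateOf(String shardId) { return shards.get(shardId).state; }
}
```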
ywelsch
left a comment
I wonder if the better change would be to treat aborting snapshots in the same way as we abort outgoing peer recoveries of a primary: by registering an IndexEventListener listening for beforeIndexShardClosed and cancelling recoveries / abort ongoing snapshots at that time. This ensures that snapshots are aborted whenever we close the shard, simplifying the logic here. WDYT?
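A rough sketch of this suggestion, using simplified stand-in types rather than the real Elasticsearch IndexEventListener and SnapshotShardsService APIs: hook the abort into the shard-close path, the same place outgoing peer recoveries get cancelled.

```java
// Hedged sketch of the reviewer's suggestion. IndexEventListener and
// SnapshotShardsService here are minimal stand-ins, not the real classes:
// the idea is that closing a shard (e.g. because the node no longer holds
// that copy) aborts any ongoing snapshot of it.
import java.util.HashSet;
import java.util.Set;

public class ShardCloseListenerSketch {
    interface IndexEventListener {
        void beforeIndexShardClosed(String shardId);
    }

    // Stand-in for the service tracking in-progress shard snapshots.
    static class SnapshotShardsService {
        final Set<String> snapshotting = new HashSet<>();
        void startSnapshot(String shardId) { snapshotting.add(shardId); }
        void abortSnapshot(String shardId) { snapshotting.remove(shardId); }
        boolean isSnapshotting(String shardId) { return snapshotting.contains(shardId); }
    }

    // Register a listener that aborts the shard's snapshot whenever the
    // shard is closed, so the store lock is released promptly.
    static IndexEventListener abortOnClose(SnapshotShardsService service) {
        return service::abortSnapshot;
    }
}
```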
    for (DiscoveryNode node : this) {
        sb.append(node).append(',');
    }
    if (sb.length() > 1) {
Maybe simpler to replace

    for (DiscoveryNode node : this) { sb.append(node).append(','); }

with

    sb.append(Strings.collectionToDelimitedString(this, ","));
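The suggested simplification can be illustrated with plain JDK calls. Strings.collectionToDelimitedString is an Elasticsearch utility; String.join is used below only to show the same idea in self-contained form:

```java
// Self-contained illustration of replacing a manual append loop with a
// single delimited-join call (mirroring the reviewer's suggestion).
import java.util.List;

public class JoinSketch {
    // Before: manual loop that leaves a trailing ',' to trim afterwards.
    static String joinWithLoop(List<String> nodes) {
        StringBuilder sb = new StringBuilder();
        for (String node : nodes) {
            sb.append(node).append(',');
        }
        if (sb.length() > 0) {
            sb.setLength(sb.length() - 1); // drop the trailing comma
        }
        return sb.toString();
    }

    // After: one call, no trailing-delimiter bookkeeping.
    static String joinWithUtility(List<String> nodes) {
        return String.join(",", nodes);
    }
}
```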
@ywelsch the PR has been updated to abort ongoing shard snapshots when the shard is closed, and the network disruption stuff has been removed.
ywelsch
left a comment
The change is good but the test needs a bit more work (I'm not sure it's testing the right thing).
    String nodeWithPrimary = clusterState.nodes().get(indexRoutingTable.shard(0).primaryShard().currentNodeId()).getName();
    assertNotNull("should be at least one node with a primary shard", nodeWithPrimary);
    IndicesService indicesService = internalCluster().getInstance(IndicesService.class, nodeWithPrimary);
    indicesService.deleteIndex(resolveIndex(index), "trigger shard removal");
removeIndex might be good enough here.
    if (snapshotsInProgress != null && snapshotsInProgress.entries().size() > 0) {
        assertEquals(State.SUCCESS, snapshotsInProgress.entries().get(0).state());
    }
    }, 10, TimeUnit.SECONDS);
assertBusy uses 10 seconds by default, no need to specify it again here.
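For context, ESTestCase's assertBusy retries an assertion until it passes or a timeout expires, defaulting to 10 seconds. A minimal sketch of such a helper (an assumption-laden stand-in, not the real implementation) shows why passing 10, TimeUnit.SECONDS explicitly is redundant:

```java
// Minimal assertBusy-style helper: the no-timeout overload delegates to the
// full overload with a 10-second default, which is why callers don't need
// to pass 10, TimeUnit.SECONDS themselves.
import java.util.concurrent.TimeUnit;

public class AssertBusySketch {
    public static void assertBusy(Runnable assertion) throws InterruptedException {
        assertBusy(assertion, 10, TimeUnit.SECONDS); // default timeout: 10 seconds
    }

    public static void assertBusy(Runnable assertion, long timeout, TimeUnit unit)
            throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        AssertionError last = null;
        do {
            try {
                assertion.run();
                return; // assertion passed
            } catch (AssertionError e) {
                last = e;
            }
            Thread.sleep(50); // back off briefly before retrying
        } while (System.nanoTime() < deadline);
        throw last; // timed out: rethrow the most recent failure
    }
}
```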
    Settings.builder().put("number_of_shards", numPrimaries).put("number_of_replicas", numReplicas)));

    logger.info("--> indexing some data");
    Client client = client();
why not use a random client every time?
    SnapshotsInProgress snapshotsInProgress =
        client.admin().cluster().prepareState().get().getState().custom(SnapshotsInProgress.TYPE);
    if (snapshotsInProgress != null && snapshotsInProgress.entries().size() > 0) {
        assertEquals(State.SUCCESS, snapshotsInProgress.entries().get(0).state());
I think this will succeed even without the change in this PR? I'm not sure what exactly is tested here.
Without the change here, the snapshot stalls forever and the test times out, because the snapshot is never aborted. This asserts that we abort the snapshot, bringing the snapshotting to a successful conclusion.
    /**
     * This test ensures that if a node that holds a primary that is being snapshotted leaves the cluster,
     * when it returns, the node aborts the snapshotting on the now removed shard.
the description does not match what the test does.
    logger.info("--> waiting for snapshot to be in progress on all nodes");
    assertBusy(() -> {
        for (String node : internalCluster().nodesInclude(index)) {
            final Client nodeClient = client(node);
why use this particular client?
I made a mistake here: this only ensures the snapshot cluster state update has reached the master, so I changed it to use internalCluster().clusterService(node).state() instead, to ensure each node knows that the snapshot is in progress.
        }
    }, 10, TimeUnit.SECONDS);

    // Pick a node with a primary shard and remove the shard from the node
Pick node with THE primary shard
    waitForCompletion(repo, snapshot, TimeValue.timeValueSeconds(10));

    // make sure snapshot is aborted and the aborted shard was marked as failed
    assertBusy(() -> {
no assertBusy needed with waitForCompletion above?
        assertEquals(State.SUCCESS, snapshotsInProgress.entries().get(0).state());
        }
    }, 10, TimeUnit.SECONDS);
    List<SnapshotInfo> snapshotInfos = client().admin().cluster().prepareGetSnapshots(repo).setSnapshots(snapshot).get().getSnapshots();
waitForCompletion returns SnapshotInfo
Previously, if a node left the cluster (for example, due to a long GC), during a snapshot, the master node would mark the snapshot as failed, but the node itself could continue snapshotting the data on its shards to the repository. If the node rejoins the cluster, the master may assign it to hold the replica shard (where it held the primary before getting kicked off the cluster). The initialization of the replica shard would repeatedly fail with a ShardLockObtainFailedException until the snapshot thread finally finishes and relinquishes the lock on the Store. This commit resolves the situation by ensuring that when a shard is removed from a node (such as when a node rejoins the cluster and realizes it no longer holds the active shard copy), any snapshotting of the removed shards is aborted. In the scenario above, when the node rejoins the cluster, it will see in the cluster state that the node no longer holds the primary shard, so IndicesClusterStateService will remove the shard, thereby causing any snapshots of that shard to be aborted. Closes #20876
5.x commit: 1d278d2
5.0 commit: 22ee78c