Do not cancel ongoing recovery for noop copy on broken node#48265
Do not cancel ongoing recovery for noop copy on broken node#48265dnhatn merged 14 commits intoelastic:masterfrom
Conversation
|
Pinging @elastic/es-distributed (:Distributed/Allocation) |
server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java
Show resolved
Hide resolved
DaveCTurner
left a comment
There was a problem hiding this comment.
My concern with using a simple boolean is that it will suppress no-op recoveries to other nodes that might succeed. I think we should track which node failed the no-op allocation.
What would happen if instead we kept track of all the nodes on which this shard failed during initialization (whether no-op or otherwise) and ignored them all in the ReplicaShardAllocator? I am thinking particularly of #18417 - with a list of past failures we could also prefer to avoid those nodes in the BalancedShardsAllocator.
|
@DaveCTurner Thanks for looking. We would make better decisions with a list of failed nodes. We need to make sure that the list is bounded. I used a boolean to avoid adding more load to the cluster state. |
|
I think a |
|
@DaveCTurner Good point. I will apply your feedback. Thank you. |
server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java
Outdated
Show resolved
Hide resolved
|
@DaveCTurner It's ready again. Can you please take another look? Thank you. |
DaveCTurner
left a comment
There was a problem hiding this comment.
Sorry for the delayed review @dnhatn, I thought I submitted this a few days ago but apparently not.
server/src/main/java/org/elasticsearch/cluster/routing/UnassignedInfo.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/routing/UnassignedInfo.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java
Outdated
Show resolved
Hide resolved
henningandersen
left a comment
There was a problem hiding this comment.
Thanks for working on this @dnhatn, I left a few comments to consider.
server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/gateway/ReplicaShardAllocator.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/gateway/ReplicaShardAllocatorIT.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/routing/UnassignedInfo.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/routing/UnassignedInfo.java
Outdated
Show resolved
Hide resolved
DaveCTurner
left a comment
There was a problem hiding this comment.
I left a couple of responses to threads in an earlier review and duplicated them here.
server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/routing/allocation/AllocationService.java
Outdated
Show resolved
Hide resolved
|
@DaveCTurner I've addressed your feedback. Would you mind taking another look? Thank you. |
|
Thanks @gwbrown. @elasticmachine run elasticsearch-ci/2 |
|
@DaveCTurner @henningandersen @original-brownbear Thanks for reviewing :). |
This change fixes a poisonous situation where an ongoing recovery was canceled because a better copy was found on a node that the cluster had previously tried allocating the shard to but failed. The solution is to keep track of the set of nodes that an allocation was failed on so that we can avoid canceling the current recovery for a copy on failed nodes. Closes #47974
This change fixes a poisonous situation where an ongoing recovery was canceled because a better copy was found on a node that the cluster had previously tried allocating the shard to but failed. The solution is to keep track of the set of nodes that an allocation was failed on so that we can avoid canceling the current recovery for a copy on failed nodes. Closes #47974
Today the replica allocator can repeatedly cancel an ongoing recovery for a copy on a broken node if that copy can perform a noop recovery. This loop can be happen endlessly (see testDoNotInfinitelyWaitForMapping). We should detect and avoid canceling in this situation.
Closes #47974