The explanation for a shard whose allocation is blocked by the RestoreInProgressAllocationDecider describes the situation accurately, but not in a way that helps users understand the problem:
"shard has failed to be restored from the snapshot [%s] - manually close or delete the index [%s] in order to retry "
+ "to restore the snapshot again or use the reroute API to force the allocation of an empty primary shard. Details: [%s]",
source.snapshot(),
shardRouting.getIndexName(),
shardRouting.unassignedInfo().getDetails()
If we simply could not assign the shard because other deciders prevented it, then shardRouting.unassignedInfo().getNumFailedAllocations() will be zero and shardRouting.unassignedInfo().getDetails() may not contain any useful information. In that case we should explain that other allocation deciders prevented allocating the shard on every node, rather than simply saying the shard has "failed to be restored from the snapshot".
If we assigned the shard and the recovery failed then shardRouting.unassignedInfo().getNumFailedAllocations() will be positive and I expect there'll be an exception in this message, and also in the logs, so this case is a little clearer. I think we could still say that the recovery started and then failed, and maybe guide users towards the extra detail in the logs.
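To make the two cases concrete, here is a rough sketch of how the explanation could branch on the failed-allocation count. It is standalone and does not use the real Elasticsearch classes: RestoreExplanationSketch, UnassignedInfoStub, and explain are hypothetical names invented for illustration, and the message wording is only a suggestion, not the text the decider should necessarily use.

```java
import java.util.Locale;

public class RestoreExplanationSketch {

    // Hypothetical stand-in for the two UnassignedInfo values discussed above;
    // the real decider would read them from shardRouting.unassignedInfo().
    static final class UnassignedInfoStub {
        final int numFailedAllocations;
        final String details;

        UnassignedInfoStub(int numFailedAllocations, String details) {
            this.numFailedAllocations = numFailedAllocations;
            this.details = details;
        }
    }

    // Builds a different explanation depending on whether a recovery was ever attempted.
    static String explain(String snapshot, String indexName, UnassignedInfoStub info) {
        if (info.numFailedAllocations == 0) {
            // Never assigned: the details are unlikely to be useful, so leave them out.
            return String.format(
                Locale.ROOT,
                "shard could not be restored from the snapshot [%s] because other allocation deciders "
                    + "prevented its allocation on every node",
                snapshot
            );
        }
        // Assigned at least once and the recovery failed: include the details and point at the logs.
        return String.format(
            Locale.ROOT,
            "restore of shard from the snapshot [%s] started and then failed after [%d] attempt(s) - "
                + "close or delete the index [%s] and restore the snapshot again, or use the reroute API "
                + "to force the allocation of an empty primary shard. See the node logs for more detail. "
                + "Details: [%s]",
            snapshot,
            info.numFailedAllocations,
            indexName,
            info.details
        );
    }

    public static void main(String[] args) {
        System.out.println(explain("repo:snap-1", "my-index", new UnassignedInfoStub(0, null)));
        System.out.println(explain("repo:snap-1", "my-index", new UnassignedInfoStub(2, "IOException[disk full]")));
    }
}
```

Splitting on getNumFailedAllocations() keeps the existing remediation advice for the case where it applies, and avoids telling users to close or delete the index when the real fix is to address whatever the other deciders are objecting to.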