A method to reduce the time cost to update cluster state #46941
Description
ES_VERSION: 5.6.8
JVM version : JDK1.8.0_112
OS version : linux
Description of the problem including expected versus actual behavior:
As is well known, updating the cluster state on the master node can take a long time, which seriously limits the size and stability of a cluster. In our product, a cluster with 50 nodes, 3,000 indices and 60,000 shards needs 15s or more for each cluster state update, so creating and deleting indices is very slow.
To find out why updating the cluster state takes so long, I captured the stack of the updateTask thread:
"elasticsearch[node1][clusterService#updateTask][T#1]" #32 daemon prio=5 os_prio=0 tid=0x00007f5d703a2800 nid=0x8252 runnable [0x00007f5c22b71000]
java.lang.Thread.State: RUNNABLE
at java.util.Collections$UnmodifiableCollection$1.hasNext(Collections.java:1041)
at org.elasticsearch.cluster.routing.RoutingNode.shardsWithState(RoutingNode.java:148)
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.sizeOfRelocatingShards(DiskThresholdDecider.java:90)
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.getDiskUsage(DiskThresholdDecider.java:320)
at org.elasticsearch.cluster.routing.allocation.decider.DiskThresholdDecider.canRemain(DiskThresholdDecider.java:265)
at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canRemain(AllocationDeciders.java:105)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.decideMove(BalancedShardsAllocator.java:687)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator$Balancer.moveShards(BalancedShardsAllocator.java:648)
at org.elasticsearch.cluster.routing.allocation.allocator.BalancedShardsAllocator.allocate(BalancedShardsAllocator.java:123)
at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:329)
at org.elasticsearch.cluster.routing.allocation.AllocationService.applyStartedShards(AllocationService.java:100)
I tried several times and got the same thread stack each time. It seems that most of the time is spent in DiskThresholdDecider.sizeOfRelocatingShards. The code is as follows:
    static long sizeOfRelocatingShards(RoutingNode node, RoutingAllocation allocation,
                                       boolean subtractShardsMovingAway, String dataPath) {
        ClusterInfo clusterInfo = allocation.clusterInfo();
        long totalSize = 0;
        for (ShardRouting routing : node.shardsWithState(ShardRoutingState.RELOCATING,
                                                         ShardRoutingState.INITIALIZING)) {
            ......
        }
        ......
    }
In other words: to decide whether a shard can remain on its node, we compute the size of the relocating shards on that node, which means walking through all of the node's shards (about 6,000 shards on one node) and checking whether each one is RELOCATING or INITIALIZING. That is the cost for a single shard. There are 60,000 shards to test, so this adds up to about 60,000 * 6,000 = 360,000,000 checks, which takes far too long.
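To make the arithmetic concrete, here is a minimal sketch that counts the state checks under the assumption suggested by the stack trace, namely that each call to shardsWithState is a linear scan over the node's shards. The class ScanCostSketch and its methods are hypothetical and only model the access pattern; they are not Elasticsearch code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the access pattern, not Elasticsearch code.
final class ScanCostSketch {
    enum State { STARTED, INITIALIZING, RELOCATING }

    // Stand-in for a linear shardsWithState: touches every shard on the node.
    static long countMatching(List<State> shardsOnNode, long[] checks) {
        long matches = 0;
        for (State s : shardsOnNode) {
            checks[0]++; // one state check per shard on the node
            if (s == State.RELOCATING || s == State.INITIALIZING) {
                matches++;
            }
        }
        return matches;
    }

    // One reroute pass: every shard in the cluster triggers one scan of its node.
    static long checksForReroute(int nodes, int shardsPerNode) {
        long[] checks = {0};
        List<State> node = new ArrayList<>();
        for (int i = 0; i < shardsPerNode; i++) {
            node.add(State.STARTED);
        }
        for (int n = 0; n < nodes; n++) {
            for (int i = 0; i < shardsPerNode; i++) {
                countMatching(node, checks);
            }
        }
        return checks[0];
    }

    public static void main(String[] args) {
        // Scaled-down run: 5 nodes * 100 shards * 100 checks per scan.
        System.out.println(checksForReroute(5, 100)); // 50000
        // The figures from this issue, multiplied directly.
        System.out.println(60_000L * 6_000L);         // 360000000
    }
}
```

The count grows with the total number of shards multiplied by the number of shards per node, which is why the cost becomes noticeable only on clusters with many shards.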
I found that this check can be avoided with the setting "cluster.routing.allocation.disk.include_relocations": "false". With it set to false, the time to update the cluster state dropped from 15s to 3s, which is a much better result.
Could cluster.routing.allocation.disk.include_relocations be set to false by default? Most of us never look at the default setting. Alternatively, could the cluster state keep the relocating and initializing shards of every node, so that they do not have to be found by checking every shard on every cluster state update?
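The second idea could look roughly like the sketch below: each node keeps its shards grouped by state, and the groups are updated whenever a shard changes state, so looking up the relocating or initializing shards no longer requires a scan. The class NodeShardIndexSketch, its methods, and the use of plain strings as shard ids are all hypothetical; this only illustrates the data structure and is not how RoutingNode is implemented.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical per-node index of shards by state, not Elasticsearch code.
final class NodeShardIndexSketch {
    enum State { INITIALIZING, RELOCATING, STARTED }

    private final Map<String, State> stateByShard = new HashMap<>();
    private final Map<State, Set<String>> shardsByState = new EnumMap<>(State.class);

    NodeShardIndexSketch() {
        for (State s : State.values()) {
            shardsByState.put(s, new HashSet<>());
        }
    }

    // Add a shard or move it to a new state; both maps stay in sync.
    void put(String shardId, State newState) {
        State old = stateByShard.put(shardId, newState);
        if (old != null) {
            shardsByState.get(old).remove(shardId);
        }
        shardsByState.get(newState).add(shardId);
    }

    // Remove a shard from the node entirely.
    void remove(String shardId) {
        State old = stateByShard.remove(shardId);
        if (old != null) {
            shardsByState.get(old).remove(shardId);
        }
    }

    // Lookup by state without iterating over the node's other shards.
    Set<String> shardsWithState(State state) {
        return Collections.unmodifiableSet(shardsByState.get(state));
    }

    public static void main(String[] args) {
        NodeShardIndexSketch node = new NodeShardIndexSketch();
        node.put("index1[0]", State.STARTED);
        node.put("index1[1]", State.INITIALIZING);
        node.put("index1[1]", State.STARTED); // initialization finished
        node.put("index2[0]", State.RELOCATING);
        System.out.println(node.shardsWithState(State.RELOCATING).size());   // 1
        System.out.println(node.shardsWithState(State.INITIALIZING).size()); // 0
    }
}
```

The trade-off is some extra bookkeeping on every shard state change in exchange for cheap lookups during allocation decisions.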