We have a problem with our Elasticsearch cluster in every environment: the cluster sometimes gets into an unstable state in which certain ES nodes carry load several times greater than the load on the other nodes.
We can reproduce this every time, and quickly, by hitting the cluster with about 40 concurrent search threads while data indexing is running; it also happens, much more slowly, under constant searches alone.
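For reference, a minimal sketch of the kind of concurrency harness we use to reproduce this (the URL, index name, and request counts here are illustrative, not our actual test code):

```python
# Minimal sketch of the reproduction harness: N threads each firing
# search requests at a client node and recording per-request latency.
# The host/index in the example call are hypothetical.
import threading
import time
import urllib.request

def run_load(url, num_threads=40, requests_per_thread=25, fetch=None):
    """Hit `url` from `num_threads` threads; return all observed latencies."""
    if fetch is None:                      # real HTTP GET unless stubbed out
        def fetch(u):
            with urllib.request.urlopen(u) as resp:
                return resp.read()
    latencies = []
    lock = threading.Lock()

    def worker():
        for _ in range(requests_per_thread):
            start = time.time()
            fetch(url)                     # one search round-trip
            with lock:
                latencies.append(time.time() - start)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies

# Example (hypothetical client node and query):
# run_load("http://search-client-1:9200/product/_search?q=*:*")
```

With 40 threads sustained against the search client nodes, the load imbalance shows up within minutes.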
Our cluster setup:
- 4 no-data client nodes that service search requests
- 1 no-data client node that sends data indexing requests.
- 10 data nodes that are queried by those 5 client nodes for indexing and searching.
We have verified that all the boxes have:
- the same hardware
- CPU Intel Xeon X5550 (2.67GHz, 64-bit, 16 core)
- 96GB of physical RAM (64GB JVM heap; ~20GB baseline memory usage).
- the same OS
- Red Hat 4.1.2-54 (Linux version 2.6.18-348.1.1.el5)
- the same JVM
- Sun/Oracle 1.7.0_09 (64-bit)
- the same ES version
- Currently on 0.20.4
- The problem has existed since 0.19.8, when we started with ES
- no other software running
- the same number of shards
- 1 product shard per node
- 13M product documents
- (product schema has thousands of fields).
- 1 entity shard per node
- 135M entity documents
- (dozens of entity data types, each with several fields).
- approximately the same amount of data
- product index is 25GB on disk.
- entity index is 12GB on disk.
- All data loads into about 20GB baseline RAM.
- the exact same logical configuration
{
    product: {
        settings: {
            index.translog.flush_threshold_size: 500mb
            index.refresh_interval: 30s
            index.number_of_replicas: 1
            index.translog.disable_flush: false
            index.version.created: 190999
            index.number_of_shards: 5
            index.routing.allocation.total_shards_per_node: 1
            index.translog.flush_threshold_period: 60m
            index.translog.flush_threshold_ops: 5000
        }
    },
    entity: {
        settings: {
            index.translog.flush_threshold_size: 500mb
            index.refresh_interval: 30s
            index.number_of_replicas: 1
            index.translog.disable_flush: false
            index.version.created: 190999
            index.number_of_shards: 5
            index.routing.allocation.total_shards_per_node: 1
            index.translog.flush_threshold_period: 60m
            index.translog.flush_threshold_ops: 5000
        }
    }
}
Here is what our load graphs look like.
Notice the one ES node with abnormally high load. Odd.

Here is our cluster distribution.
All the shards are similar in size and evenly distributed.
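One quick way we can confirm which node is hot is the nodes stats API. The sketch below assumes the 0.20-era response layout, where `os.load_average` is a three-element `[1m, 5m, 15m]` array, and uses a hypothetical client-node host:

```python
# Sketch: pull per-node load averages from the nodes stats API so the
# hot node stands out. Assumes os.load_average is a [1m, 5m, 15m] array
# (the pre-2.0 response shape); the host name is hypothetical.
import json
import urllib.request

def node_load(stats):
    """Map node name -> 1-minute load average from parsed /_nodes/stats JSON."""
    return {
        info["name"]: info["os"]["load_average"][0]
        for info in stats["nodes"].values()
        if "load_average" in info.get("os", {})
    }

def fetch_node_load(host="search-client-1:9200"):   # hypothetical host
    url = "http://%s/_nodes/stats?os=true" % host
    with urllib.request.urlopen(url) as resp:
        return node_load(json.loads(resp.read().decode("utf-8")))
```

Sorting the resulting dict by value makes the outlier obvious even when the shard distribution itself looks even.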
