Setup :
5 identically configured nodes :
btrainer-1.182 (192.168.1.182) (Current Master before incident)
btrainer-1.186 (192.168.1.186)
btrainer-1.136 (192.168.1.136)
btrainer-13.137 (192.168.13.137)
btrainer-1.138 (192.168.1.138)
ES Configs : (version : 0.19.8)
cluster.name: btrainer
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [ "192.168.1.182:10300", "192.168.1.186:10300", "192.168.1.136:10300", "192.168.13.137:10300", "192.168.1.138:10300" ]
http.port: 10200
index.number_of_replicas: 4
transport.tcp.port: 10300
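For reference, we have not set discovery.zen.minimum_master_nodes anywhere. If a quorum guard is relevant to this kind of split, the usual value for our 5 master-eligible nodes would presumably be (this is a sketch, not something we currently run):

```yaml
# Not set in our current config; with 5 master-eligible nodes a quorum
# would be floor(5/2) + 1 = 3, i.e. in each node's elasticsearch.yml:
discovery.zen.minimum_master_nodes: 3
```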
Java Options :
-Des-foreground=yes
-Des.path.home=/elasticsearch
-Xms4096m
-Xmx20480m
-Djline.enabled=true
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-cp /elasticsearch/lib/*:/elasticsearch/lib/sigar/*
org.elasticsearch.bootstrap.ElasticSearch
Problem :
This problem repeats itself every 5-12 hours. While everything is running smoothly (cluster is green), one node goes down and every node then forms its own cluster (not a 1/4 split, but a 1/1/1/1/1 split). The sample incident happened at exactly 22:06; we have a job checking the cluster state every minute. This cluster is mainly used for training, so we get heavy traffic spikes on both reads and writes when jobs are triggered (plus some continuous small reads).
- What happened to btrainer-1.138 ?
- Even if one node (btrainer-1.138) misbehaves, why didn't the cluster split 1/4? Why did the other nodes lose the master, btrainer-1.182 ?
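The minute-interval cluster-state check mentioned above might look like the following sketch. The URL, thresholds, and function names here are assumptions for illustration, not our exact script; it just calls the standard _cluster/health endpoint on the HTTP port from the config above.

```python
import json
import urllib.request

# Hypothetical endpoint: master node's HTTP port from the config above.
HEALTH_URL = "http://192.168.1.182:10200/_cluster/health"

def evaluate_health(health):
    """Return (ok, reason) from a parsed _cluster/health response."""
    status = health.get("status", "red")
    nodes = health.get("number_of_nodes", 0)
    if status != "green":
        return False, "cluster status is %s" % status
    if nodes < 5:
        return False, "only %d of 5 nodes joined" % nodes
    return True, "green with all 5 nodes"

def check_cluster(url=HEALTH_URL):
    """Fetch the health endpoint and evaluate it (needs a live cluster)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return evaluate_health(json.load(resp))
```

A 1/1/1/1/1 split shows up immediately in such a check: each node still reports status, but `number_of_nodes` drops to 1 on every node at once.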
Logs :
You can check the logs from all the nodes here : https://gist.github.com/3510448