stuck initializing shards & NumberFormatException #8684

@jillesvangurp

Description

I have a cluster that is currently stuck in yellow state: a logstash index that was created at midnight has one shard it is unable to assign and two shards stuck initializing. I examined the elasticsearch logs and found the following messages appearing repeatedly:

[2014-11-27 10:15:12,585][WARN ][cluster.action.shard     ] [192.168.1.13] [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to start shard, message [RecoveryFailedException[[logstash-2014.11.27][4]: Recovery failed from [192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io][inet[/192.168.1.14:9300]] into [192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}]; nested: RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution failed]; nested: RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]]; nested: NumberFormatException[For input string: "finished"]; ]]

and

[2014-11-27 10:17:54,187][WARN ][cluster.action.shard     ] [192.168.1.14] [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]]; nested: NumberFormatException[For input string: "finished"]; ]]

The NumberFormatException looks like a possible cause. One possible explanation is that we have a dynamically mapped logstash field that is sometimes a string and sometimes a number; since we roll over the index daily, there's a chance that this field gets mapped incorrectly depending on which type comes in first. However, I don't see why this should block shard initialization. Since midnight we've accumulated roughly 300 MB of errors like the above; normally our logs for a whole day are in the range of a few KB.
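If the dynamic-mapping theory is right, it should be reproducible along these lines (index and field names here are made up for illustration; this is a sketch against a throwaway cluster, not something to run in production). The first document fixes the field's type as a numeric, and the second document then fails numeric parsing, which would match the `NumberFormatException[For input string: "finished"]` in the logs:

```shell
# First document seen: "status" gets dynamically mapped as a long.
curl -XPOST 'localhost:9200/mapping-test/event' -d '{"status": 42}'

# Later document sends a string for the same field; indexing it should
# fail with a parse error, because the mapping is already numeric.
curl -XPOST 'localhost:9200/mapping-test/event' -d '{"status": "finished"}'

# Inspect the dynamically created mapping to confirm the field's type:
curl -XGET 'localhost:9200/mapping-test/_mapping?pretty'
```

Whether that same parse failure can also poison replica recovery (phase 2 translog replay, per the stack trace above) is the part I can't explain.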

The index is actually available and accepting writes (i.e. kibana works as you would expect). However, it may be missing updates on some shards; if so, that is not apparent from the logs.

[linko@es3 elasticsearch]$ curl -XGET 'localhost:9200/_cluster/health/logstash-2014.11.27/?pretty'
{
  "cluster_name" : "linko_elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 1
}

Our cluster has been running for a few weeks and we haven't made any config changes lately, at least not to our logstash indices. This issue has happened before, and at the time I resolved it with a rolling restart.
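For anyone wanting to narrow this down on a cluster in the same state, a couple of read-only calls show which shard copies are stuck and where recovery is failing (both APIs are available in 1.4; the index name below matches this cluster):

```shell
# Per-shard view: which copies of today's index are INITIALIZING or
# UNASSIGNED, and on which nodes.
curl -XGET 'localhost:9200/_cat/shards/logstash-2014.11.27?v'

# Per-shard recovery status, including source/target nodes and the
# recovery stage (phase 2 / translog is where the exception above occurs).
curl -XGET 'localhost:9200/logstash-2014.11.27/_recovery?pretty'
```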

I'm running 1.4.0 and have not upgraded to 1.4.1 yet. I plan to do so later today and hope the problem goes away with a rolling restart. Meanwhile, I'm available for the next two hours or so to run more diagnostics on this cluster if that would be useful; if so, please let me know.
