stuck initializing shards & NumberFormatException #8684

@jillesvangurp

Description

I have a cluster that is currently stuck in yellow state: a logstash index that was created at midnight has one shard it is unable to assign and two shards stuck initializing. I examined the elasticsearch logs and found the following messages appearing repeatedly:

[2014-11-27 10:15:12,585][WARN ][cluster.action.shard     ] [192.168.1.13] [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to start shard, message [RecoveryFailedException[[logstash-2014.11.27][4]: Recovery failed from [192.168.1.14][sE51TBxfQ2q6pD5k7G7piA][es2.inbot.io][inet[/192.168.1.14:9300]] into [192.168.1.13][o9vhU4BhSCuQ4BmLJjPtfA][es1.inbot.io][inet[/192.168.1.13:9300]]{master=true}]; nested: RemoteTransportException[[192.168.1.14][inet[/192.168.1.14:9300]][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[[logstash-2014.11.27][4] Phase[2] Execution failed]; nested: RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][internal:index/shard/recovery/translog_ops]]; nested: NumberFormatException[For input string: "finished"]; ]]

and

[2014-11-27 10:17:54,187][WARN ][cluster.action.shard     ] [192.168.1.14] [logstash-2014.11.27][4] sending failed shard for [logstash-2014.11.27][4], node[o9vhU4BhSCuQ4BmLJjPtfA], [R], s[INITIALIZING], indexUUID [-mMLqYjAQuCUDcczYf5SHA], reason [Failed to perform [indices:data/write/bulk[s]] on replica, message [RemoteTransportException[[192.168.1.13][inet[/192.168.1.13:9300]][indices:data/write/bulk[s][r]]]; nested: NumberFormatException[For input string: "finished"]; ]]

The NumberFormatException looks like a possible cause. One possible explanation is that we have a dynamically mapped logstash field that is sometimes a string and sometimes a number; since we roll over the index daily, there's a chance that this field gets mapped incorrectly depending on which type comes in first. However, I don't see why this should block shard initialization. Since midnight we've accumulated roughly 300 MB of errors like the above; normally our logs for a whole day are in the range of a few KB.
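If the dynamic-mapping theory is right, it should be reproducible along these lines (index and field names here are made up for illustration; this is a sketch against a throwaway cluster, not something to run in production). The first document fixes the field's type as a numeric, and the second document then fails numeric parsing, which would match the `NumberFormatException[For input string: "finished"]` in the logs:

```shell
# First document seen: "status" gets dynamically mapped as a long.
curl -XPOST 'localhost:9200/mapping-test/event' -d '{"status": 42}'

# Later document sends a string for the same field; indexing it should
# fail with a parse error, because the mapping is already numeric.
curl -XPOST 'localhost:9200/mapping-test/event' -d '{"status": "finished"}'

# Inspect the dynamically created mapping to confirm the field's type:
curl -XGET 'localhost:9200/mapping-test/_mapping?pretty'
```

Whether that same parse failure can also poison replica recovery (phase 2 translog replay, per the stack trace above) is the part I can't explain.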

The index is actually available and accepting writes (i.e. kibana works as you would expect). However, it may be missing updates on some shards; if so, that is not apparent from the logs.

[linko@es3 elasticsearch]$ curl -XGET 'localhost:9200/_cluster/health/logstash-2014.11.27/?pretty'
{
  "cluster_name" : "linko_elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 1
}

Our cluster has been running for a few weeks and we haven't made any config changes lately, at least not to our logstash indices. This issue has happened before, and at the time I resolved it with a rolling restart.
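For anyone wanting to narrow this down on a cluster in the same state, a couple of read-only calls show which shard copies are stuck and where recovery is failing (both APIs are available in 1.4; the index name below matches this cluster):

```shell
# Per-shard view: which copies of today's index are INITIALIZING or
# UNASSIGNED, and on which nodes.
curl -XGET 'localhost:9200/_cat/shards/logstash-2014.11.27?v'

# Per-shard recovery status, including source/target nodes and the
# recovery stage (phase 2 / translog is where the exception above occurs).
curl -XGET 'localhost:9200/logstash-2014.11.27/_recovery?pretty'
```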

I'm running 1.4.0 and have not upgraded to 1.4.1 yet. I plan to do so later today and hope the problem goes away with a rolling restart. Meanwhile, I'm available for the next two hours or so to run more diagnostics on this cluster if that would be useful; if so, please let me know.
