Thread leak with autodiscover after adding / removing a master node #8057

@fdv

I hit a bug that leaks threads until ES dies because it has no memory left to create new ones. Whole story here.

Bug description

4-node ES cluster on EC2: 1 routing node and 3 data nodes. The routing node acts as master (configuration provided below).

After I added and then removed a master node, one of the data nodes kept sending auto discovery requests to the node that was gone. This caused the remaining routing node to create about 1 thread per second (for sending auto discovery requests) and never close them.
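For anyone trying to confirm a similar leak, here is a minimal sketch of how the thread growth rate could be measured from outside the JVM. It assumes a Linux host, where `/proc/<pid>/status` exposes a `Threads:` field. The function names (`parse_thread_count`, `growth_per_second`, `sample_process`) are hypothetical helpers written for this illustration and are not part of Elasticsearch or any plugin.

```python
import time
from pathlib import Path


def parse_thread_count(status_text):
    """Return the integer value of the 'Threads:' field in /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def growth_per_second(samples):
    """Average thread growth rate between the first and last sample.

    samples is a list of (timestamp_in_seconds, thread_count) tuples.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, n0), (t1, n1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must be increasing")
    return (n1 - n0) / (t1 - t0)


def sample_process(pid, interval=5.0, count=12):
    """Read the thread count of process `pid` `count` times, `interval` seconds apart.

    Assumption: Linux only, and the caller may read /proc/<pid>/status.
    """
    samples = []
    for i in range(count):
        text = Path(f"/proc/{pid}/status").read_text()
        samples.append((time.monotonic(), parse_thread_count(text)))
        if i < count - 1:
            time.sleep(interval)
    return samples
```

Calling `growth_per_second(sample_process(pid))` with the PID of the ES process on the routing node should return a value close to 1.0 if the leak matches what I observed, and a value near 0 on a healthy node.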

Configuration

  • ES 1.0.1 (old, I know)
  • ES Transport Thrift: elasticsearch-transport-thrift-2.0.0
  • AWS cloud plugin: cloud-aws-2.0.0

Routing node

bootstrap:
  mlockall: true
cloud:
  aws:
    access_key: something
    region: us-east-1
    secret_key: something
cluster:
  name: robots
discovery:
  ec2:
    ping_timeout: 360
    tag:
      Cluster: production
  type: ec2
  zen:
    minimum_master_nodes: 1
gateway:
  expected_nodes: 4
  recover_after_nodes: 4
  recover_after_time: 5m
http:
  max_content_length: 100mb
index:
  query:
    bool:
      max_clause_count: 1000000
  refresh_interval: 300
  store:
    type: mmapfs
indices:
  fielddata:
    cache:
      expire: 10m
      size: 30%
  memory:
    index_buffer_size: 10%
network:
  host: 0.0.0.0
node:
  data: false
  master: true
  name: something
path:
  data: /mnt/elasticsearch
  logs: /var/log/elasticsearch

Data nodes

bootstrap:
  mlockall: true
cloud:
  aws:
    access_key: something
    region: us-east-1
    secret_key: something
cluster:
  name: robots
discovery:
  ec2:
    ping_timeout: 360
    tag:
      Cluster: production
  type: ec2
  zen:
    minimum_master_nodes: 1
gateway:
  expected_nodes: 4
  recover_after_nodes: 4
  recover_after_time: 5m
http:
  max_content_length: 100mb
index:
  query:
    bool:
      max_clause_count: 1000000
  refresh_interval: 300
  store:
    type: mmapfs
indices:
  fielddata:
    cache:
      expire: 10m
      size: 30%
  memory:
    index_buffer_size: 10%
network:
  host: 0.0.0.0
node:
  data: true
  master: false
  name: something
path:
  data: /mnt/elasticsearch
  logs: /var/log/elasticsearch

JVM

  • xmx and xms to 4G
  • no fancy GC tuning

Tell me if you need anything else.
