Reindex API: Reindex task es_rejected_execution_exception search queue failure #26153

@berglh

Environment

I have a 10 data node, 5 master only node cluster with the request handled by a data node.
The index in question has 5 primary shards and 1 replica.

| Product | Version |
| --- | --- |
| Elasticsearch (Official Container) | 5.5.1 (Official Elastic Container) |
| Plugins | X-Pack Removed |
| OpenJDK | 1.8.0_141 |
| Docker | 17.05.0-ce, build 89658be |
| Oracle Linux Server | 7.3 |
| Kernel | Linux 3.10.0-514.21.2.el7.x86_64 |

Problem Description

When I request a task to reindex an index created in Elasticsearch 2.x, with a size of ~50 GB and a doc count of 133047546, the task finishes with "completed" set to true even though Elasticsearch produced an error. The EsThreadPoolExecutor error indicates that the search queue capacity was exceeded, presumably by the Scroll search requests contending for queue space.

To me, it appears that the Reindex API stops the task on any error. I can appreciate that you don't want something to fail silently, but I believe it should be the job of the Scroll client (in this case the Reindex API) to recognize that the search queue capacity has been exceeded and to keep retrying.

This is partly touched on in this GitHub issue: Reindex API : improve robustness in case of error

Increasing the Scroll size of the Reindex improves the ability of the Reindex API to make it most of the way through the process. However, on a large enough index, I continue to hit this problem.

My suggestion is that the Reindex API should retry on this soft error.
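
The retry behaviour suggested here can be sketched on the client side. This is only an illustration of the idea, not how the Reindex API is implemented: `RejectedExecution`, `fetch_with_retry` and the `fetch_batch` callable are hypothetical names, and the backoff values are arbitrary assumptions.

```python
import time


class RejectedExecution(Exception):
    """Hypothetical stand-in for an es_rejected_execution_exception."""


def fetch_with_retry(fetch_batch, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fetch_batch(), retrying with exponential backoff on rejection.

    fetch_batch is any zero-argument callable that returns one scroll
    batch or raises RejectedExecution when the search queue is full.
    """
    attempt = 0
    while True:
        try:
            return fetch_batch()
        except RejectedExecution:
            if attempt >= max_retries:
                # Give up and surface the error once the retry budget is spent.
                raise
            # Wait 0.5 s, 1 s, 2 s, ... so the search queue has time to drain.
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

The point of the backoff is that a full search queue is a transient condition, so waiting and retrying is usually cheaper than restarting a multi-hour reindex.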

Supplementary Problem

In addition to this, I have a problem with the description of the requests_per_second URI parameter in the documentation: Reindex API: URL Parameters.

requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc) and throttles the number of requests per second that the reindex issues or it can be set to -1 to disabled throttling. The throttling is done waiting between bulk batches so that it can manipulate the scroll timeout. The wait time is the difference between the time it took the batch to complete and the time requests_per_second * requests_in_the_batch. Since the batch isn’t broken into multiple bulk requests large batch sizes will cause Elasticsearch to create many requests and then wait for a while before starting the next set. This is "bursty" instead of "smooth". The default is -1.

I read this as requests_per_second limiting the number of Scroll searches or ES bulk writes (the "requests") issued per second.

What I actually experienced was that setting requests_per_second to 0.5 resulted in a wait time of ~15500 seconds for a bulk size of 10000. It seems that this setting actually restricts the number of search results per second, or that a Scroll size of 10000 creates bucket loads of write requests.

I tried to use this to limit the impact of the Reindex on the search queues, but only when I set this value to 5000 for a Scroll size of 10000 did I start to see the kind of rate limiting that I am after. I can open another issue for this if required; I am not sure whether there is a bug in ES bulk writing or whether the documentation is simply ambiguous.
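
The two observations above are consistent with the throttle counting each document in a batch as one "request". The sketch below works through that arithmetic under this assumption, which is my reading rather than something the quoted paragraph states; the function name and parameters are illustrative only.

```python
def throttle_wait_seconds(batch_size, requests_per_second, batch_took_seconds=0.0):
    """Estimate the pause after one batch, assuming one request per document."""
    if requests_per_second <= 0:
        # -1 disables throttling, so there is no wait.
        return 0.0
    # Time the batch is "allowed" to take at the requested rate.
    target_seconds = batch_size / requests_per_second
    # Only the part not already spent on the batch itself is waited out.
    return max(0.0, target_seconds - batch_took_seconds)


print(throttle_wait_seconds(10000, 0.5))   # 20000.0
print(throttle_wait_seconds(10000, 5000))  # 2.0
```

Under this assumption a value of 0.5 gives a target of 20000 seconds per batch of 10000, the same order of magnitude as the ~15500 seconds I saw, while 5000 gives a pause of about 2 seconds per batch.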

Reproduction Command

POST _reindex?wait_for_completion=false&wait_for_active_shards=all
{
  "source": {
    "index": "largeindex",
    "size": 10000,
    "query": {
      "bool": {
        "must": [
          {
            "range": {
              "@timestamp": {
                "gte": "2016-01-01T00:00:00.000",
                "lte": "2017-10-11T00:00:00.000",
                "time_zone": "+10:00"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "largeindex.es5"
  }
}

Output

{
  "_index" : ".tasks",
  "_type" : "task",
  "_id" : "fmVI6xlZQCmhqZqVPIjfXA:71121809",
  "_score" : 1.0,
  "_source" : {
    "completed" : true,
    "task" : {
      "node" : "fmVI6xlZQCmhqZqVPIjfXA",
      "id" : 71121809,
      "type" : "transport",
      "action" : "indices:data/write/reindex",
      "status" : {
        "total" : 133047546,
        "updated" : 0,
        "created" : 74140000,
        "deleted" : 0,
        "batches" : 7414,
        "version_conflicts" : 0,
        "noops" : 0,
        "retries" : {
          "bulk" : 0,
          "search" : 0
        },
        "throttled_millis" : 0,
        "requests_per_second" : -1.0,
        "throttled_until_millis" : 0
      },
      "description" : "reindex from [largeindex] to [largeindex.es5]",
      "start_time_in_millis" : 1502331869750,
      "running_time_in_nanos" : 9825436625697,
      "cancellable" : true
    },
    "response" : {
      "took" : 9825436,
      "timed_out" : false,
      "total" : 133047546,
      "updated" : 0,
      "created" : 74140000,
      "deleted" : 0,
      "batches" : 7414,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0,
      "failures" : [
        {
          "shard" : -1,
          "reason" : {
            "type" : "es_rejected_execution_exception",
            "reason" : "rejected execution of org.elasticsearch.transport.TcpTransport$RequestHandler@63806e47 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@229e062f[Running, pool size = 49, active threads = 49, queued tasks = 1000, completed tasks = 6373350]]"
          }
        }
      ]
    }
  }
}
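
Since "completed" : true only says that the task finished, a caller has to inspect the stored response to find out whether the reindex actually succeeded. A minimal sketch, assuming the task document has the shape shown in the output above; `reindex_summary` is a hypothetical helper, not part of any Elasticsearch client.

```python
def reindex_summary(task_doc):
    """Summarise a stored reindex task document from the .tasks index."""
    response = task_doc["_source"].get("response", {})
    failures = response.get("failures", [])
    # Documents not accounted for by created, updated or noops.
    handled = sum(response.get(key, 0) for key in ("created", "updated", "noops"))
    return {
        "failed": bool(failures),
        "failure_types": [f["reason"]["type"] for f in failures],
        "unprocessed": response.get("total", 0) - handled,
    }


# Values copied from the output above.
doc = {
    "_source": {
        "completed": True,
        "response": {
            "total": 133047546,
            "created": 74140000,
            "updated": 0,
            "noops": 0,
            "failures": [
                {"shard": -1, "reason": {"type": "es_rejected_execution_exception"}}
            ],
        },
    }
}
print(reindex_summary(doc))
```

For the task above this reports a failure of type es_rejected_execution_exception with 58907546 documents left unprocessed.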
