Environment
I have a cluster with 10 data nodes and 5 master-only nodes, with the request handled by a data node.
The index in question has 5 primary shards and 1 replica.
| Product | Version |
| --- | --- |
| Elasticsearch (Official Elastic Container) | 5.5.1 |
| Plugins | X-Pack removed |
| OpenJDK | 1.8.0_141 |
| Docker | 17.05.0-ce, build 89658be |
| Oracle Linux Server | 7.3 |
| Kernel | Linux 3.10.0-514.21.2.el7.x86_64 |
Problem Description
Upon requesting a task to reindex an Elasticsearch 2.x-created index with a size of ~50 GB and a doc count of 133047546, the task completes with a status of `completed: true` even though Elasticsearch produced an error. The `EsThreadPoolExecutor` error reported points to the search queue capacity being exceeded, presumably by the Scroll search requests contending for queue space.
To me, it appears that the Reindex API stops the task on any error. I can appreciate that you don't want something to fail silently, but I believe it should be the job of the Scroll client (in this case the Reindex API) to recognise that the search queue has been exceeded and retry.
This is touched on in this GitHub issue: Reindex API: improve robustness in case of error.
Increasing the Scroll size of the Reindex allows the Reindex API to get most of the way through the process; however, on a large enough index I still hit this problem.
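To illustrate why a larger Scroll size helps: the number of search requests the Reindex must put on the queue is roughly the total document count divided by the batch size. A quick sanity check against the numbers in the task output (my own sketch, not the actual Reindex accounting):

```python
import math

def batches_needed(total_docs: int, scroll_size: int) -> int:
    """Number of scroll/search requests needed to page through an index."""
    return math.ceil(total_docs / scroll_size)

# Full index at the Scroll size used in the reproduction command:
print(batches_needed(133_047_546, 10_000))  # 13305 search requests

# The failed task reports 74140000 docs created in 7414 batches,
# consistent with a batch size of 10000:
print(batches_needed(74_140_000, 10_000))   # 7414

# Doubling the Scroll size roughly halves the queue pressure:
print(batches_needed(133_047_546, 20_000))  # 6653
```

Fewer batches means fewer opportunities to contend with other searches for the 1000-entry queue, which matches the behaviour I observed.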
My suggestion is that the Reindex API should retry on this soft error.
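A minimal sketch of the retry behaviour I'm suggesting, assuming the client can distinguish a queue-full rejection (`es_rejected_execution_exception`, i.e. HTTP 429) from a hard failure. The exception class and `fetch_page` callable here are hypothetical stand-ins, not the actual Reindex internals:

```python
import time

class EsRejectedExecutionError(Exception):
    """Hypothetical stand-in for a search-queue-full rejection (HTTP 429)."""

def scroll_with_retry(fetch_page, max_retries=10, base_delay=0.5):
    """Retry a scroll page fetch with exponential backoff on queue rejection.

    fetch_page: zero-argument callable that returns one scroll page, or
    raises EsRejectedExecutionError when the search queue is full.
    """
    for attempt in range(max_retries):
        try:
            return fetch_page()
        except EsRejectedExecutionError:
            # Soft error: the queue was full. Back off and try again
            # instead of failing the whole reindex task.
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"search queue still full after {max_retries} retries")
```

Notably, the task status already reports a `retries.search` counter (0 in my output), so some retry accounting clearly exists in the Reindex machinery; it just doesn't appear to cover this rejection.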
Supplementary Problem
In addition to this, I have a problem with the description of the `requests_per_second` URI parameter in the documentation (Reindex API: URL Parameters):
> requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc) and throttles the number of requests per second that the reindex issues, or it can be set to -1 to disable throttling. The throttling is done by waiting between bulk batches so that it can manipulate the scroll timeout. The wait time is the difference between the time it took the batch to complete and the time requests_per_second * requests_in_the_batch. Since the batch isn't broken into multiple bulk requests, large batch sizes will cause Elasticsearch to create many requests and then wait for a while before starting the next set. This is "bursty" instead of "smooth". The default is -1.
I interpret this as saying that the value of `requests_per_second` limits the number of Scroll searches or ES bulk writes ("requests") performed per second.
What I actually experienced was that setting `requests_per_second` to 0.5 resulted in a wait time of ~15500 seconds for a bulk size of 10000. It seems this setting actually restricts the number of search results per second, or else it is creating enormous numbers of write requests for a Scroll size of 10000.
I tried to use this to limit the impact of the Reindex on the search queues, but not until I set the value to 5000 for a Scroll size of 10000 did I start to see the kind of rate limiting I am after. I can open another issue for this if required; I'm not sure whether this is a bug in ES bulk writing or just an ambiguity in the documentation.
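The numbers I saw are consistent with each document in a scroll batch counting as one "request" in the throttling formula. A sketch of the wait-time arithmetic under that reading (`requests_in_the_batch / requests_per_second` minus the time the batch took; this is my interpretation, not the actual implementation):

```python
def throttle_wait(batch_docs: int, requests_per_second: float,
                  batch_took_seconds: float = 0.0) -> float:
    """Seconds the reindex would sleep between batches, if every document
    in the batch counts as one request against requests_per_second."""
    target = batch_docs / requests_per_second
    return max(0.0, target - batch_took_seconds)

# Scroll size 10000 at requests_per_second=0.5: 20000 s between batches,
# the same order of magnitude as the ~15500 s wait I observed.
print(throttle_wait(10_000, 0.5))    # 20000.0

# Only at requests_per_second=5000 does the wait become modest:
print(throttle_wait(10_000, 5000))   # 2.0
```

If "requests" here really means documents rather than search/bulk requests, the documentation wording is the problem rather than the throttling itself.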
Reproduction Command
POST _reindex?wait_for_completion=false&wait_for_active_shards=all
{
  "source": {
    "index": "largeindex",
    "size": 10000,
    "query": {
      "bool": {
        "must": [
          {
            "range": {
              "@timestamp": {
                "gte": "2016-01-01T00:00:00.000",
                "lte": "2017-10-11T00:00:00.000",
                "time_zone": "+10:00"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "largeindex.es5"
  }
}
Output
{
  "_index" : ".tasks",
  "_type" : "task",
  "_id" : "fmVI6xlZQCmhqZqVPIjfXA:71121809",
  "_score" : 1.0,
  "_source" : {
    "completed" : true,
    "task" : {
      "node" : "fmVI6xlZQCmhqZqVPIjfXA",
      "id" : 71121809,
      "type" : "transport",
      "action" : "indices:data/write/reindex",
      "status" : {
        "total" : 133047546,
        "updated" : 0,
        "created" : 74140000,
        "deleted" : 0,
        "batches" : 7414,
        "version_conflicts" : 0,
        "noops" : 0,
        "retries" : {
          "bulk" : 0,
          "search" : 0
        },
        "throttled_millis" : 0,
        "requests_per_second" : -1.0,
        "throttled_until_millis" : 0
      },
      "description" : "reindex from [largeindex] to [largeindex.es5]",
      "start_time_in_millis" : 1502331869750,
      "running_time_in_nanos" : 9825436625697,
      "cancellable" : true
    },
    "response" : {
      "took" : 9825436,
      "timed_out" : false,
      "total" : 133047546,
      "updated" : 0,
      "created" : 74140000,
      "deleted" : 0,
      "batches" : 7414,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : {
        "bulk" : 0,
        "search" : 0
      },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0,
      "failures" : [
        {
          "shard" : -1,
          "reason" : {
            "type" : "es_rejected_execution_exception",
            "reason" : "rejected execution of org.elasticsearch.transport.TcpTransport$RequestHandler@63806e47 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@229e062f[Running, pool size = 49, active threads = 49, queued tasks = 1000, completed tasks = 6373350]]"
          }
        }
      ]
    }
  }
}