Reindex API: Disambiguation of requests_per_second #26185
nik9000 merged 2 commits into elastic:5.5 from
Conversation
Proposal for disambiguation of `requests_per_second` as discussed in [Reindex API: Reindex task es_rejected_execution_exception search queue failure elastic#26153](elastic#26153 (comment)).
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

I'm not sure about going into so much detail on how throttling is implemented internally, but @nik9000 should probably look at this.
nik9000 left a comment
I like this a lot better than what I had. I left a note about changing the calculation to make it more clear that the batch write time counts against the wait time. Other things also count against the wait time, but they are mostly very fast.
> if the `requests_per_second` is set to `500`:
>
> `wait_time_in_seconds` = `1000` / `500` = `2` seconds
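The documented calculation above can be sketched in a few lines (a sketch only; the variable names are mine, not part of the reindex API):

```python
# Sketch of the documented throttle calculation for requests_per_second.
batch_size = 1000          # documents in the batch just written
requests_per_second = 500  # throttle setting on the reindex request

# Padding inserted between batches so throughput averages out to the target:
wait_time_in_seconds = batch_size / requests_per_second
print(wait_time_in_seconds)  # 2.0
```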
What about something like:
`target_total_time` = `1000` / `500 per second` = `2 seconds`
`wait_time` = `target_total_time` - `batch_write_time` = `2 seconds` - `.5 seconds` = `1.5 seconds`
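The suggested calculation can be sketched as follows (a sketch with values from the comment above; variable names are mine, and I clamp the wait at zero so a slow batch never produces a negative sleep):

```python
# Sketch: the time spent writing the batch counts against the wait time.
batch_size = 1000
requests_per_second = 500
batch_write_time = 0.5  # seconds spent performing the bulk write

target_total_time = batch_size / requests_per_second        # 2.0 seconds
wait_time = max(0.0, target_total_time - batch_write_time)  # 1.5 seconds
print(wait_time)
```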
Updated as per @nik9000's suggestions.
In #26185 we made the description of `requests_per_second` sane for reindex. This improves on the description by using some more common vocabulary ("batch size", etc) and improving the formatting of the example calculation so it stands out and doesn't require scrolling.
Reindex's docs were somewhere between unclear and inaccurate around `requests_per_second`. This makes them much more clear and accurate.
Thanks for doing this. I really didn't have a good way to talk about it.
* master: (458 commits)
  - Prevent cluster internal `ClusterState.Custom` impls to leak to a client (elastic#26232)
  - Add packaging test for systemd runtime directive
  - [TEST] Reenable RareClusterStateIt#testDeleteCreateInOneBulk
  - Serialize and expose timeout of acknowledged requests in REST layer (elastic#26189)
  - (refactor) some opportunities to use diamond operator (elastic#25585)
  - [DOCS] Clarified readme for testing a single page
  - Settings: Add keystore.seed auto generated secure setting (elastic#26149)
  - Update version information (elastic#25226)
  - "result" : created -> "result" : "created" (elastic#25446)
  - Set RuntimeDirectory (elastic#23526)
  - Drop upgrade from full cluster restart tests (elastic#26224)
  - Further improve docs for requests_per_second
  - Docs disambiguate reindex's requests_per_second (elastic#26185)
  - [DOCS] Cleanup link for ec2 discovery (elastic#26222)
  - Fix document field equals and hash code test
  - Use holder pattern for lazy deprecation loggers
  - Settings: Add keystore creation to add commands (elastic#26126)
  - Docs: Cleanup docs for ec2 discovery (elastic#26065)
  - Fix NPE when `values` is omitted on percentile_ranks agg (elastic#26046)
  - Several internal improvements to internal test cluster infra (elastic#26214)
  - ...
Proposal for disambiguation of `requests_per_second` as discussed in Reindex API: Reindex task es_rejected_execution_exception search queue failure #26153.

@nik9000 As per our discussion in the aforementioned Elasticsearch issue, I am looking to disambiguate the `requests_per_second` wait function. I compared some of my results and have a proposal for updating the instructions to reflect my experience and your brief explanation.

Here are the results of this unsuccessful reindex task with a scroll batch size of `10000` and `requests_per_second` of `10000`:

I interpret this formula as `size` / `requests_per_second` = `wait_time` in seconds. This indeed roughly matches the time in the original post. So in this case I perform the `wait_time` calculation and append it to the Time Working per Scroll:

That fits nicely. So now I will try how I interpret the documentation. I read that formula as either:

- `batch_time_read_write` - (`requests_per_second` * `requests_in_the_batch`)
- (`requests_per_second` * `requests_in_the_batch`) - `batch_time_read_write`

Neither of these really matches; even converting the seconds to nanoseconds doesn't produce anything meaningful. Of course, this will be different if you are indeed timing the bulk write (~~as you may have suggested~~ you definitely said this was the case).

10000 Failed Task Output
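My `size` / `requests_per_second` interpretation for the failed run can be checked with a quick sketch (variable names are mine, not from the API):

```python
# wait_time = size / requests_per_second, for the unsuccessful run above.
size = 10000               # scroll batch size
requests_per_second = 10000

wait_time = size / requests_per_second
print(wait_time)  # 1.0 second between bulk requests
```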
So let's consider the results of a successful reindex task with a scroll batch size of `10000` and `requests_per_second` of `5000`. `size` / `requests_per_second` = `wait_time` in seconds between each `bulk`:

Again, the formula suits the metrics perfectly. You'll also notice the Working EPS is very similar, as you'd expect with the same scroll size of `10000`. However, when applying my interpretation of the formula in the documentation:

- `batch_time_read_write` - (`requests_per_second` * `requests_in_the_batch`)
- (`requests_per_second` * `requests_in_the_batch`) - `batch_time_read_write`

Same kind of result again: not what I am experiencing.

5000 Successful Task Output
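The same check for the successful run (a sketch with my own variable names): halving `requests_per_second` at the same scroll size doubles the wait between bulk requests.

```python
# wait_time = size / requests_per_second, for the successful run above.
size = 10000               # scroll batch size, unchanged from the failed run
requests_per_second = 5000 # half the previous throttle target

wait_time = size / requests_per_second
print(wait_time)  # 2.0 seconds between bulk requests
```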