the transport/http worker thread will not be released for a long time #57284

Closed
hanbj wants to merge 1 commit into elastic:master from hanbj:same

Conversation

@hanbj
Contributor

@hanbj hanbj commented May 28, 2020

When the cluster is very large, calls to interfaces such as _cat/shards, _cat/indices, or _all/_mapping get no response for a long time, and sometimes even a gateway timeout, because the response is too large.
For example, when a REST request hits _all/_mapping, ES ultimately calls the toXContent() method to build the JSON response. This is a very time-consuming operation, and it executes on the http_server_worker thread.
The SAME thread pool behaves no differently: a task submitted to the SAME pool still runs on the transport_worker / http_server_worker thread. If a large task is submitted to the SAME pool, the worker thread is not released for a long time, reducing overall system throughput.
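The SAME-pool behavior described above can be illustrated with a minimal sketch. This is not Elasticsearch's actual code (the class and field names here are hypothetical); it only demonstrates the semantics of a "same" executor, which runs tasks directly on the submitting thread:

```java
import java.util.concurrent.Executor;

public class SameExecutorDemo {
    // A "same" executor in the spirit of ThreadPool.Names.SAME:
    // it runs the task directly on the calling thread instead of
    // handing it off to a separate pool.
    public static final Executor SAME = Runnable::run;

    public static void main(String[] args) {
        Thread caller = Thread.currentThread();
        final Thread[] ranOn = new Thread[1];
        SAME.execute(() -> ranOn[0] = Thread.currentThread());
        // The task ran on the submitting thread. If that thread is a
        // transport_worker / http_server_worker, it stays blocked for
        // the task's full duration.
        System.out.println(ranOn[0] == caller); // prints "true"
    }
}
```

So a long-running serialization task dispatched to the SAME pool from a network worker keeps that worker occupied until the task finishes.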

@cla-checker-service

cla-checker-service bot commented May 28, 2020

💚 CLA has been signed

@dliappis
Contributor

@hanbj Thank you for the PR. Can you please sign the Contributor Agreement?

@dliappis dliappis added the :Distributed/Network Http and internode communication implementations label May 28, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Network)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label May 28, 2020
@Tim-Brooks
Contributor

Can you provide the context behind this PR? My understanding is that you are concerned that the XContent response is serialized on the transport thread for _cat/shards, _cat/indices, or _all/_mapping calls. How large are the responses we are discussing, and what kind of latency does that introduce?

@hanbj
Contributor Author

hanbj commented May 29, 2020

@tbrooks8 Thank you. The background is as follows:
cluster info:
{
  "cluster_name" : "ecm",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 159,
  "number_of_data_nodes" : 136,
  "active_primary_shards" : 31283,
  "active_shards" : 57878,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

I added timing output before and after the messageReceived method of the InboundHandler class. In the logs I found that messageReceived sometimes takes a few seconds, or even tens of seconds. When the cluster metadata is very large (many indices and shards), there is almost no response when I call the _cat/shards interface.

curl -XGET http://127.0.0.1:9200/_mappings > mappings
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 93.4M  100 93.4M    0     0  14.3M      0  0:00:06  0:00:06 --:--:-- 20.8M

At this time, the messageReceived method will also take a lot of time.
[2020-05-29T11:38:54,097][WARN ][o.e.m.r.RequestTracker ] [ecm-0] tcp cost too long,outRequestId=null, requestId=10804437, action=indices:admin/mappings/get, cost=6111ms

So I read the relevant code and found that some interfaces execute on the SAME thread pool. As a result, the transport_worker / http_server_worker thread may not be released for a long time and cannot process new requests, which hurts the overall throughput of the system.

@hanbj
Contributor Author

hanbj commented Jun 1, 2020

@hanbj Thank you for the PR. Can you please sign the Contributor Agreement?
@dliappis Yes, thank you. I have signed the Contributor Agreement. Why do all checks still fail?

@dliappis
Contributor

dliappis commented Jun 1, 2020

@dliappis Yes, thank you. I have signed the Contributor Agreement. Why do all checks still fail?

@hanbj the problem is that your first commit used a different email (hanbj <hanbaojun@didi<REDACTING>.com>) from your GitHub email (@163.com).

One easy option is to re-sign with the email you committed with.

@hanbj hanbj closed this Jun 1, 2020
@hanbj hanbj reopened this Jun 1, 2020
@hanbj
Contributor Author

hanbj commented Jun 5, 2020

@tbrooks8 @dliappis Do you have any other ideas about this PR? I see the _cat/indices interface is submitted to the MANAGEMENT thread pool for execution; alternatively, we could add a thread pool to the ThreadPool class to handle large responses.

@original-brownbear
Contributor

original-brownbear commented Jun 5, 2020

@hanbj how did you measure that 6s and interpret it as serialization time (mainly, I'm wondering whether your measurement includes non-blocking IO time as well)? Even though there are some expensive operations going on when serializing the mapping response, I find it somewhat unlikely that it would take 6s, even with almost 100MB of mappings.

I think if we decide to fix this it might be fine to simply move the serialization to the generic pool and be done with it (since this shouldn't be called at a high frequency, I don't think there's much point in optimizing the logic itself), but I'm having a hard time reproducing a multi-second serialization time.
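The offloading idea above can be sketched as follows. This is a simplified, hypothetical illustration (the class, method, and pool names are not Elasticsearch's): the expensive serialization is performed on a background pool and the result handed back via a future, so the network worker is never blocked on it:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OffloadSerialization {
    // Stand-in for a "generic" background pool; the real system would
    // size and manage this pool itself.
    public static final ExecutorService GENERIC = Executors.newFixedThreadPool(4);

    // Build the (potentially huge) response body off the network thread.
    public static CompletableFuture<String> serializeOffThread(Object response) {
        return CompletableFuture.supplyAsync(() -> {
            // Expensive toXContent()-style serialization would happen here,
            // on a GENERIC pool thread rather than a transport_worker.
            return response.toString();
        }, GENERIC);
    }

    public static void main(String[] args) throws Exception {
        String body = serializeOffThread(java.util.Map.of("acknowledged", true)).get();
        System.out.println(body);
        GENERIC.shutdown();
    }
}
```

The caller attaches a completion callback (or blocks on a non-network thread) and only the cheap byte-flushing step returns to the event loop.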

@hanbj
Contributor Author

hanbj commented Jun 10, 2020

@original-brownbear
I used the arthas tool: https://alibaba.github.io/arthas/trace.html

Using the trace command, we can search the method call path matching a class pattern / method pattern, render and count the performance overhead across the whole call link, and trace the call chain.

[arthas@1669]$ trace org.elasticsearch.transport.TcpTransport handleResponse '#cost > 5000'
Press Q or Ctrl+C to abort.
Affect(class count: 5 , method count: 1) cost in 513 ms, listenerId: 3
---ts=2020-06-10 22:42:05;thread_name=elasticsearch[ecm-clientnode-19][transport_worker][T#19];id=3c;is_daemon=true;priority=5;TCCL=sun.misc.Launcher$AppClassLoader@70dea4e ---[6939.108529ms] org.elasticsearch.transport.TcpTransport:handleResponse()
+---[26.661807ms] org.elasticsearch.transport.TransportResponseHandler:read() # 982
+---[0.011376ms] org.elasticsearch.common.transport.TransportAddress:() # 983
+---[0.010936ms] org.elasticsearch.transport.TransportResponse:remoteAddress() # 983
+---[0.03432ms] org.elasticsearch.transport.TransportResponseHandler:executor() # 989
+---[0.010652ms] org.elasticsearch.threadpool.ThreadPool:executor() # 989
`---[0.011082ms] org.elasticsearch.transport.TcpTransport$1:() # 989

The handleResponse() method took 6939ms.

@original-brownbear
Contributor

Thanks @hanbj that makes sense. I don't think the suggested fix here is viable (we can't simply put all kinds of response handling on the management pool just because of a large response) but the problem seems valid to me. I opened #57937 with a suggested fix.

@original-brownbear
Contributor

I hope you don't mind that I'll close this one for now since the change in this PR isn't something we want to go with as explained above.
There is a possible (though still debatable and debated) mitigation open in #57937 and a performance improvement for compression in #60953 that should also help here.

Thanks so much for bringing this to our attention @hanbj

@hanbj hanbj deleted the same branch August 11, 2020 14:54
original-brownbear added a commit that referenced this pull request Aug 11, 2020
Use thread-local buffers and deflater and inflater instances to speed up
compressing and decompressing from in-memory bytes.
Not manually invoking `end()` on these should be safe since their off-heap memory
will eventually be reclaimed by the finalizer thread which should not be an issue for thread-locals
that are not instantiated at a high frequency.
This significantly reduces the amount of byte copying and object creation relative to the previous approach
which had to create a fresh temporary buffer (that was then resized multiple times during operations), copied
bytes out of that buffer to a freshly allocated `byte[]`, used 4k stream buffers needlessly when working with
bytes that are already in arrays (`writeTo` handles efficient writing to the compression logic now) etc.

Relates #57284 which should be helped by this change to some degree.
Also, I expect this change to speed up mapping/template updates a little as those make heavy use of these
code paths.
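The thread-local Deflater/Inflater reuse described in the commit message above can be sketched roughly as follows. This is a simplified illustration, not the actual Elasticsearch implementation (class and method names are hypothetical, and real code would loop until deflate/inflate report completion for arbitrary sizes):

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ThreadLocalCompression {
    // Each thread keeps one Deflater/Inflater and resets it between uses,
    // avoiding the cost of allocating (and later finalizing) fresh
    // instances for every message.
    static final ThreadLocal<Deflater> DEFLATER =
            ThreadLocal.withInitial(() -> new Deflater(Deflater.DEFAULT_COMPRESSION));
    static final ThreadLocal<Inflater> INFLATER = ThreadLocal.withInitial(Inflater::new);

    public static byte[] compress(byte[] input) {
        Deflater deflater = DEFLATER.get();
        deflater.reset();                 // reuse, don't reallocate
        deflater.setInput(input);
        deflater.finish();
        // Generous buffer for this sketch; fine for small, compressible input.
        byte[] buffer = new byte[input.length + 64];
        int len = deflater.deflate(buffer);
        byte[] out = new byte[len];
        System.arraycopy(buffer, 0, out, 0, len);
        return out;
    }

    public static byte[] decompress(byte[] compressed, int originalLength) throws Exception {
        Inflater inflater = INFLATER.get();
        inflater.reset();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength];
        int len = inflater.inflate(out);
        if (len != originalLength) throw new IllegalStateException("short read");
        return out;
    }
}
```

Since the instances live in thread-locals on long-lived worker threads, they are created once per thread rather than once per request, which is where the allocation and byte-copying savings come from.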
original-brownbear added a commit that referenced this pull request Aug 12, 2020
