Added option to do busy-wait on epoll #8267
Conversation
@merlimat just curious what you use this for... Do you try to minimise the latency of some app that is really network heavy?
|
@normanmaurer main goal is to reduce context switches in the critical path. Currently there can be multiple tasks executed in the Netty event loop. For example, when writing to a channel from a different thread we have to "jump" to the event loop (to retain ordering and avoid mutex contention). By having the event loop thread spinning, when you submit the task it will be picked up immediately (generally in ~100ns), compared to waking up the thread from `epoll_wait()`. While a single context switch is not the end of the world, in a complex application the number of these "thread jumps" and socket read chains can be significant and become a dominating factor in the overall request/response latency. Of course this is only really useful when trying to squeeze latency to the very minimum, at the expense of CPU usage (and max throughput, since it keeps a CPU busy for the event loop while it could be used for other threads). Also, this would typically be useful paired with CPU affinity as well. The effects of the 2 are:
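The hand-off described above can be sketched outside Netty as a consumer thread that busy-spins on its own task queue, so a task submitted from another thread is picked up without waking the consumer from a blocking wait. This is an illustrative sketch only, not Netty code; the class name is hypothetical and `Thread.onSpinWait()` (Java 9+) stands in for a CPU spin hint:

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch (not Netty code): the consumer thread busy-spins on
// its task queue instead of blocking, so a task submitted from another
// thread is picked up without a wakeup/context switch.
public class SpinLoopDemo {
    public static void main(String[] args) throws Exception {
        ConcurrentLinkedQueue<Runnable> tasks = new ConcurrentLinkedQueue<>();
        AtomicBoolean running = new AtomicBoolean(true);
        CountDownLatch done = new CountDownLatch(1);

        Thread loop = new Thread(() -> {
            while (running.get()) {
                Runnable t = tasks.poll();
                if (t != null) {
                    t.run();
                } else {
                    Thread.onSpinWait(); // hint to the CPU that we're in a spin loop
                }
            }
        });
        loop.start();

        // Submit a task from the main thread; the spinning consumer runs it.
        tasks.add(() -> {
            System.out.println("task ran on " + Thread.currentThread().getName());
            done.countDown();
        });
        done.await();        // returns as soon as the spinning thread ran the task
        running.set(false);  // stop the spin loop
        loop.join();
    }
}
```

The trade-off is exactly the one discussed in the thread: the consumer core stays at 100% even when the queue is empty.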
@merlimat makes sense... do you have any benchmarks to share (just curious)?
```java
    }
    return ready;
}

private static native int epollBusyWait0(int efd, long address, int len);
```
...ort-native-epoll/src/test/java/io/netty/channel/epoll/EpollSocketStringEchoBusyWaitTest.java

```java
return new SelectStrategy() {
    @Override
    public int calculateStrategy(IntSupplier selectSupplier, boolean hasTasks)
            throws Exception {
```
@normanmaurer Yes, I did some brief measurements. This is the latency for an RPC request/response application. There are additional thread switches (though they are already using a local queue and busy-waiting on that as well). Conditions:

I have included disabling hyper-threading and CPU isolation as they are the next steps (and easily configured in Linux settings). This is a very simple example, but in a more complex app there would be more thread jumps and possibly more RPCs involved.
@merlimat Have you considered busy polling on reads or writes, rather than on polls? Even if poll does return something, you need to follow it up with a subsequent read, which is going to add a syscall of additional latency before you get your data. How many active sockets do you have, and are you trying to busy poll on writes or reads more?
@carl-mastrangelo It's not just the read or write. The Netty event loop is also used to execute tasks in the same connection's thread. For example, to ensure ordering of operations and to avoid mutex contention, we typically do:

```java
ctx.channel().eventLoop().execute(() -> {
    ctx.writeAndFlush(buffer);
});
```

So the event loop blocks on the `epoll_wait()` call.

Sure, the read will involve a syscall, but the data will be available and it won't be blocking. The main goal here is to make sure the IO thread is never (or to the least possible extent) descheduled. The cost in terms of latency of the context switch (~10 usec) is much higher than the cost of the syscall. Also, we always have multiple connections to manage, with the direct …
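The ordering argument can be illustrated with a plain single-threaded executor standing in for Netty's event loop (a self-contained sketch, not actual Netty code): because every "write" is funneled through one thread, no mutex is needed on the shared state and submission order per producer is preserved.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Two producer threads hand their "writes" to a single consumer thread;
// the single-threaded executor serializes them, so the shared buffer
// needs no lock and no append is ever lost.
public class OrderingDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        StringBuilder wire = new StringBuilder(); // only touched by the eventLoop thread

        Runnable producerA = () -> eventLoop.execute(() -> wire.append("A"));
        Runnable producerB = () -> eventLoop.execute(() -> wire.append("B"));

        Thread t1 = new Thread(() -> { for (int i = 0; i < 3; i++) producerA.run(); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 3; i++) producerB.run(); });
        t1.start(); t2.start();
        t1.join(); t2.join();

        eventLoop.shutdown();
        eventLoop.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(wire.length()); // all 6 appends applied, none lost
    }
}
```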
@normanmaurer addressed comments
@merlimat two things:
There is always more than 1 connection, from 10s to 100s or 1000s, so dedicating one core per connection is not really feasible.

How did you measure that time? 4 usec for a syscall seems very high to me. I just did some experiments and was seeing timings of < 0.8 usec for epoll syscalls with no timeout (including JNI overhead).

The latency win on the benchmark. Also, I frequently run ptrace with timing output, so I think 4 us is not uncommon for syscalls. I guess I don't have anything really against this, I just don't believe it will be the solution you actually need.
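The sub-microsecond claim is easy to sanity-check from plain Java with `Selector.selectNow()`, which on Linux boils down to a non-blocking epoll poll plus JNI overhead. This is a rough sketch, not the measurement from the thread, and absolute numbers vary by machine:

```java
import java.nio.channels.Selector;

// Rough timing of a non-blocking selector poll (on Linux this exercises
// the epoll path with no timeout, plus JNI overhead).
public class SelectNowTiming {
    public static void main(String[] args) throws Exception {
        try (Selector selector = Selector.open()) {
            // Warm up to get past JIT and first-call costs.
            for (int i = 0; i < 10_000; i++) {
                selector.selectNow();
            }
            int iters = 100_000;
            long start = System.nanoTime();
            for (int i = 0; i < iters; i++) {
                selector.selectNow();
            }
            long perCallNs = (System.nanoTime() - start) / iters;
            System.out.println("selectNow ~" + perCallNs + " ns/call");
        }
    }
}
```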
@carl-mastrangelo @merlimat so do we want to pull this in or not? I am happy either way :)

@normanmaurer I'll defer to you, but I can confidently say Google cannot use this (busy looping is extremely expensive in power and in opportunity cost to run something else). But @merlimat may be willing to make the trade-off.

Sure, this setting is definitely not for general use, or meant to be a default. We plan to give the option to enable busy wait in BookKeeper and Pulsar, for cases where one wants to maximize latency at expense of throughput and CPU consumption.

Can one of the admins verify this patch?

@netty-bot test this please

@merlimat thanks a lot!

Thanks for getting this in!

@merlimat you mean minimise latency? ;)

Ouch. Or maybe “maximize the responsiveness” :)
### Motivation

* Upgrade to latest Netty version which brings in perf improvements and some new features (eg: netty/netty#8267)
* Broke down the dependencies from `netty-all` into individual components, as discussed at #1755 (comment)

Reviewers: Ivan Kelly <ivank@apache.org>, Enrico Olivelli <eolivelli@gmail.com>, Sijie Guo <sijie@apache.org>

This closes #1784 from merlimat/netty-4.1.31
Will this solve #327?
Hi all, we are looking for some help on applying the resolution you reached here to another project. Please see:
Motivation:

Add an option (through a `SelectStrategy` return code) to have the Netty event loop thread do busy-wait on the epoll.

The reason for this change is to avoid the context switch cost that comes when the event loop thread is blocked on the `epoll_wait()` call. On average, the context switch has a penalty of ~13 usec.

This benefits both:

The tradeoff, when enabling this feature, is that the event loop thread will be using 100% CPU, even when inactive.

Modification:

* `SelectStrategy` option to return `BUSY_WAIT`
* `epoll_wait()` with no timeout
* `pause` instruction to hint to processor that we're in a busy loop

Result:
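For a rough, self-contained sense of the context-switch penalty the change avoids, one can measure the park-to-unpark wakeup time between two plain Java threads. This is only a sketch under loose assumptions (absolute numbers depend heavily on kernel, scheduler, and hardware) and is not the benchmark discussed in the thread:

```java
import java.util.concurrent.locks.LockSupport;

// Measure the time from unpark() in one thread to wakeup in another;
// this includes the thread-wakeup/context-switch cost discussed above.
public class WakeupCost {
    static volatile long wokeAt;

    public static void main(String[] args) throws Exception {
        final int iters = 200;
        long total = 0;
        for (int i = 0; i < iters; i++) {
            Thread sleeper = new Thread(() -> {
                LockSupport.park();
                wokeAt = System.nanoTime();
            });
            sleeper.start();
            Thread.sleep(1); // give the sleeper time to reach park()
            long t0 = System.nanoTime();
            LockSupport.unpark(sleeper);
            sleeper.join();  // join() also makes wokeAt safely visible
            total += (wokeAt - t0);
        }
        System.out.println("avg wakeup ~" + (total / iters) + " ns");
    }
}
```

A busy-waiting event loop pays this cost never (or rarely), at the price of pinning a core at 100%.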