
Added option to do busy-wait on epoll#8267

Merged
normanmaurer merged 4 commits into netty:4.1 from merlimat:spin-epoll
Sep 28, 2018

Conversation

@merlimat (Contributor) commented Sep 6, 2018

Motivation:

Add an option (through a SelectStrategy return code) to have the Netty event loop thread busy-wait on epoll.

The reason for this change is to avoid the context switch cost that comes when the event loop thread is blocked on the epoll_wait() call.

On average, the context switch has a penalty of ~13usec.

This benefits both:

  • Latency when reading from a socket
  • Latency when scheduling tasks to be executed on the event loop thread

The tradeoff, when enabling this feature, is that the event loop thread will use 100% CPU, even when inactive.

Modification:

  • Added a SelectStrategy option to return BUSY_WAIT
  • The epoll loop will do an epoll_wait() with no timeout
  • Use the pause instruction to hint to the processor that we're in a busy loop

Result:

  • When enabled, minimizes impact of context switch in the critical path

@@ -0,0 +1,102 @@
/*
* Copyright 2014 The Netty Project
Review comment (Member): 2018

@normanmaurer (Member):

@merlimat just curious what you use this for... Do you try to minimise the latency of some app that is really network-heavy?

@merlimat (Contributor, Author) commented Sep 6, 2018

@normanmaurer main goal is to reduce context switches in the critical path. Currently there can be multiple tasks executed in the Netty event-loop. For example, when writing to a channel from a different thread we have to "jump" to the event loop (to retain ordering and avoid mutex contention).

By having the eventLoop thread spinning, when you submit the task it will be picked up immediately (generally in ~100ns) compared to waking up the thread from epoll_wait through the eventfd which will take ~10usec. The same latency reduction applies when reading something from a channel, if the thread is spinning it will "see" the message earlier.

While a single context switch is not the end of the world, in a complex application the number of these "thread jumps" and socket read chains can be significant and become a dominating factor in the overall request/response latency.

Of course, this is only really useful when trying to squeeze latency to the very minimum, at the expense of CPU usage (and max throughput, since the event loop would keep a CPU busy that could otherwise be used by other threads).

Also, this would typically be useful paired with CPU affinity. The effects of the two are:

  • Busy-wait brings down avg/median latency
  • CPU affinity lowers the 99pct (and above) long tails by removing intermittent jitter

@normanmaurer (Member):

@merlimat makes sense... do you have any benchmarks to share (just curious)?

}
return ready;
}
private static native int epollBusyWait0(int efd, long address, int len);
Review comment (Member): nit: add empty line above.

};
}

private BootstrapFactory<Bootstrap> clientSocket() {
Review comment (Member): nit: static?

return list;
}

private BootstrapFactory<ServerBootstrap> serverSocket() {
Review comment (Member): nit: static?

return new SelectStrategy() {
@Override
public int calculateStrategy(IntSupplier selectSupplier, boolean hasTasks)
throws Exception {
Review comment (Member): nit: remove throws ....

@merlimat (Contributor, Author) commented Sep 7, 2018

@normanmaurer Yes, I did some brief measurements. This is the latency for an RPC request/response application. There are additional thread switches involved (though those are already using a local queue and busy-waiting on it as well).

Conditions:

  • Client/server on same machine with TCP connection
  • Tested on a bare-metal node
  • Client sends 1K rps (size 1KB) and waits for response
  • Multiple request outstanding, responses in order
  • Latency measured in milliseconds

                        50pct   95pct   99pct   99.9pct  99.99pct  99.999pct  max
  Baseline              0.049   0.059   0.066   0.095    0.449    0.465      0.466
  Busy-Wait IO Threads  0.036   0.047   0.052   0.075    0.436    0.445      0.452
  Disable HyperThread   0.029   0.038   0.041   0.054    0.204    0.388      0.396
  CPU-Isolation         0.029   0.037   0.040   0.050    0.066    0.090      0.104

I have included disabling hyper-threading and CPU isolation as they are the next steps (and easily configured in Linux settings).
The remaining bulk of the median latency is, I think, related to SO_BUSY_POLL (#8268) being ineffective on the loopback interface.

This is a very simple example, but in a more complex app, there would be more thread jumps and possibly more RPCs involved.

@carl-mastrangelo (Member):

@merlimat Have you considered busy polling on reads or writes, rather than on polls? Even if poll does return something, you need to follow it up with a subsequent read, which is going to add a syscall of additional latency before you get your data. How many active sockets do you have, and are you trying to busy poll on writes or reads more?

@merlimat (Contributor, Author) commented Sep 8, 2018

@carl-mastrangelo It's not just the read or write. The Netty event loop is also used to execute tasks on the same connection's thread.

For example, to ensure ordering of operations and to avoid mutex contention, we typically do:

ctx.channel().eventLoop().execute(() -> {
    ctx.writeAndFlush(buffer); 
});

So the event loop blocks on the epoll_wait but also keeps an eye on the tasks that are posted, using the eventfd mechanism to wake up epoll_wait immediately when a task is posted.

Even if poll does return something, you need to follow it up with a subsequent read, which is going to add a syscall of additional latency before you get your data.

Sure, the read will involve a syscall, but the data will be available and it won't be blocking. The main goal here is to make sure the IO thread is never (or to the least possible extent) descheduled. The cost in terms of latency of the context switch (~10 usec) is much higher than the cost of the syscall.

Also, we always have multiple connections to manage, from 10s to 1000s. With the direct read() approach (which I don't think is currently possible through Netty), we'd need to spin on one thread per connection, while with epoll we can spin on one (or a few) IO threads and leave the other CPUs for the rest of the application.

@merlimat (Contributor, Author) commented Sep 8, 2018

@normanmaurer addressed comments

@carl-mastrangelo (Member):

@merlimat two things:

  1. I asked how many sockets you have because if you do have multiple, it might make more sense to dedicate a single thread to each one, one per core. Then you could do reads directly.
  2. On a recent PR I made (Don't re-arm timerfd each epoll_wait #7816) it was measured that changing the timerfd took about 4us, so I'll pick that as a baseline for syscall time. Why do you not care about the followup read, if you do care about the context switch? They are in the same order of magnitude.

@merlimat (Contributor, Author) commented Sep 17, 2018

@carl-mastrangelo

  1. I asked how many sockets you have because if you do have multiple, it might make more sense to dedicate a single thread to each one, one per core. Then you could do reads directly.

There is always more than 1 connection, from 10s to 100s or 1000s, so dedicating one core per connection is not really feasible.

  2. On a recent PR I made (Don't re-arm timerfd each epoll_wait #7816) it was measured that changing the timerfd took about 4us, so I'll pick that as a baseline for syscall time. Why do you not care about the followup read, if you do care about the context switch? They are in the same order of magnitude.

How did you measure that time? 4 usec for a syscall seems very high to me.

Just did some experiments and was seeing timings of < 0.8usec for epoll syscalls with no timeout (including JNI overhead).

@carl-mastrangelo (Member):

How did you measure that time? 4 usec for a syscall seems very high to me.

The latency win on the benchmark. Also, I frequently run ptrace with timing output, so I think 4 us is not uncommon for syscalls.

I guess I don't have anything really against this, I just don't believe it will be the solution you actually need.

@normanmaurer (Member):

@carl-mastrangelo @merlimat so do we want to pull this in or not? I am happy either way :)

@carl-mastrangelo (Member):

@normanmaurer I'll defer to you, but I can confidently say Google cannot use this (busy looping is extremely expensive, both in power and in the opportunity cost of running something else).

But, @merlimat may be willing to make the trade off.

@merlimat (Contributor, Author):

Sure, this setting is definitely not for general use, or meant to be a default.

We plan to give the option to enable busy wait in BookKeeper and Pulsar, for cases where one wants to maximize latency at expense of throughput and CPU consumption.

@netty-bot:

Can one of the admins verify this patch?

@normanmaurer (Member):

@netty-bot test this please

@normanmaurer normanmaurer added this to the 4.1.31.Final milestone Sep 28, 2018
@normanmaurer normanmaurer self-assigned this Sep 28, 2018
@normanmaurer normanmaurer merged commit 3a96e73 into netty:4.1 Sep 28, 2018
@normanmaurer (Member):

@merlimat thanks a lot!

@merlimat (Contributor, Author):

Thanks for getting this in!

@johnou (Contributor) commented Sep 30, 2018

@merlimat you mean minimise latency? ;)

@merlimat (Contributor, Author) commented Oct 1, 2018

Ouch. Or maybe “maximize the responsiveness” :)

merlimat added a commit to apache/bookkeeper that referenced this pull request Nov 6, 2018
### Motivation

 * Upgrade to latest Netty version which brings in perf improvements and some new features (eg: netty/netty#8267) 

 * Broke down the dependencies from `netty-all` into individual components, as discussed at #1755 (comment)



Reviewers: Ivan Kelly <ivank@apache.org>, Enrico Olivelli <eolivelli@gmail.com>, Sijie Guo <sijie@apache.org>

This closes #1784 from merlimat/netty-4.1.31
@rohitsahay2000:

Will this solve #327?

@njhill njhill mentioned this pull request Jun 21, 2019
@VaughnVernon:

Hi all, we are looking for help applying the resolution you reached here to another project. Please see:

#327 (comment)
