Don't re-arm timerfd each epoll_wait #7816
Conversation
Can you add a comment what the 0x80000000 stands for ?
I considered calling this JINT_MIN or something like that, but didn't know how to namespace this properly.
Sure, though.
FYI, if there are no scheduled tasks pending, this speedup is lost. The core problem is that
@carl-mastrangelo nice one... I guess it also helps that we don't need to memset each time :)
@carl-mastrangelo btw if you want to expose some methods to improve things that's definitely possible :)
@normanmaurer That would involve potentially exposing ScheduledFutureTask, which is prying open the API a little too much for this PR. I can look to see if there is a clean way to do so, but I am not entirely clear how much API you are comfortable exposing.
@carl-mastrangelo yeah that's for sure a follow-up and I would need to see the code I guess.
@carl-mastrangelo nit: can you move it to the other variables ?
can we initialize to AbstractScheduledEventExecutor.nanoTime()?
also move this to top of the class with other variable declarations.
can we just use -1 instead? This would be a more accessible value for C/JNI.
Sounds fine. I wanted to use something obviously wrong as a sentinel.
This topic comes up frequently enough that I usually forget the answer .... "why do we need to have a max wakeup time of 1 second instead of unbounded". IIRC the answer is when we schedule tasks we don't check if the new time results in a sooner deadline than the previous wakeup on the EventLoop. I think I investigated other events from outside the EventLoop and they should be invoking wakeup (and we have an atomic boolean protecting wakeup to avoid too many system calls). I wonder if, now that we are saving the prevDeadlineNanos and we have a timer mechanism independent of epoll_wait, we can avoid waking up every second (and additional timerfd_settime system calls), and instead only wake up when necessary.
Potential benefits:
- if there are no scheduled tasks, we only wake up when I/O occurs
- we will be more responsive to scheduled tasks (because we can adjust the wakeup to the minimum delay as new tasks are scheduled)
Potential negatives:
- in the degenerate case, if someone schedules tasks in decreasing order of expiration (10 seconds from now, 9 seconds, 8 seconds, etc...) we will have to touch the `timerfd_settime` on each schedule. For outside the EventLoop we wouldn't have to wake up the `epoll_wait` call (assuming `timerfd_settime` is thread safe).
@normanmaurer @carl-mastrangelo - wdyt? We could do the same thing for kqueue too (we would need to use a timerfd equivalent there too though).
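The core decision being discussed above can be sketched as a pure predicate: only issue the `timerfd_settime` syscall when the nearest scheduled deadline actually differs from the one the timer is already armed for. This is a hypothetical illustration (the class and method names are not Netty's actual code), assuming a `-1` sentinel for "not armed":

```java
// Hypothetical sketch, not Netty's actual implementation: decide whether
// the event loop must re-arm the timerfd before the next epoll_wait.
public final class TimerRearmSketch {
    // Sentinel meaning "no deadline" (matches the -1 suggested in review).
    public static final long NONE = -1L;

    // Returns true if a timerfd_settime call would be needed before blocking.
    public static boolean needsRearm(long armedDeadlineNanos, long nextDeadlineNanos) {
        if (nextDeadlineNanos == NONE) {
            return false; // no scheduled tasks: leave the timer alone
        }
        // Re-arm only when the head-of-queue deadline actually changed.
        return nextDeadlineNanos != armedDeadlineNanos;
    }

    public static void main(String[] args) {
        System.out.println(needsRearm(1_000L, 1_000L)); // false: same deadline, skip the syscall
        System.out.println(needsRearm(1_000L, 500L));   // true: sooner deadline, must re-arm
        System.out.println(needsRearm(NONE, NONE));     // false: nothing scheduled
    }
}
```

Note that in the degenerate case from the list above (tasks scheduled in decreasing order of expiration), `needsRearm` returns true on every schedule, which is exactly the extra-syscall downside being weighed.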
I had assumed the periodic wakeup was some sort of safety net when I saw it. If the intent was to periodically check the scheduledTasks, then yeah, wakeup should probably be the correct mechanism, even if it results in timerfd_settime calls. Arguably there are already a ton of timerfd_settime calls when the group is idle. Doesn't seem all that bad of a downside to me.
@Scottmitch that sounds fine to me... That said I think we should be careful and test this carefully :)
nit: Add whitespace like ((jint) -1)
@carl-mastrangelo please address my comment and fix checkstyle
I may remove this to address #7829 ... so let's hold off on releasing this, or agree this API can be broken in future releases.
@carl-mastrangelo can you tell me what exact params you use for the
Motivation:

Optimizing the Epoll channel needs an objective measure of how fast it is.

Modification:

Add a simple, closed loop, ping-pong benchmark.

Result:

Benchmark can be used to measure netty#7816

Initial numbers:

```
Result "io.netty.microbench.channel.epoll.EpollSocketChannelBenchmark.pingPong":
  22614.403 ±(99.9%) 797.263 ops/s [Average]
  (min, avg, max) = (21093.160, 22614.403, 24977.387), stdev = 918.130
  CI (99.9%): [21817.140, 23411.666] (assumes normal distribution)

Benchmark                              Mode  Cnt      Score     Error  Units
EpollSocketChannelBenchmark.pingPong  thrpt   20  22614.403 ± 797.263  ops/s
```
@normanmaurer I have since lost the original configuration, so I sent out #8249
So I ran the benchmark code that you provided and here is the result:

4.1:
PR #7834 (Scott's proposal):
PR #7816 (Carl's proposal):

The results are very stable and can be reproduced every time. With these results I think I would pull in #7816 changes for now as these give the largest improvement and are also very small in scope, which is great as it reduces the risk of regressions etc. If we feel it is worth it we can investigate why #7834 does not give us the improvement we hoped for though. @Scottmitch @carl-mastrangelo WDYT ?
Benchmarks were done on a real machine with 24 cores:
Also @ejona86
Interesting. I'm assuming the benchmark code is from #8249? There are (potential) benefits outside of ping/pong throughput use cases for #7834, and some of the interesting cases are:

It doesn't look like the benchmark in #8249 would cover the above elements. At most it executes a single write outside the EventLoop and waits for the results, and there is just a single timer event during setup. In this case I would expect PR #7834 to trade fewer calls to

I'm assuming both PRs are run based off the same commit (looks like this PR has some merge conflicts)?
@Scottmitch For your last case, doesn't that get triggered from the event fd, not the timer fd? Also I rebased to head.
@Scottmitch I did run it from the commit by cherry-picking myself and fixed merge conflicts... I will respond to the rest later on when I have a bit more time.
@Scottmitch I will need to dig more into the changes you did but this will take some time. That said I still think it would be a good first step to merge @carl-mastrangelo's changes and then we can still investigate your changes. I don't see why this is not possible. I think the only thing we need to do is put
@normanmaurer - as long as the new API is breakable, that sgtm. Having a more representative benchmark, or a real-world scenario which simulates the scenarios from my previous comment, will be necessary to analyze the total impact.
@carl-mastrangelo can you rebase + add the

@Scottmitch agree... that said I think this change here can only improve things and not make things worse so I think it's a good first step.
Motivation:

The Epoll transport checks to see if there are any scheduled tasks before entering epoll_wait, and resets the timerfd just before. This causes an extra syscall to timerfd_settime before doing any actual work. When scheduled tasks aren't added frequently, or tasks are added with later deadlines, this is unnecessary.

Modification:

Check the *deadline* of the peeked task in EpollEventLoop, rather than the *delay*. If it hasn't changed since last time, don't re-arm the timer.

Result:

About 2us faster on gRPC RTT 50pct latency benchmarks.

Before (2 runs for 5 minutes, 1 minute of warmup):
```
50.0%ile Latency (in nanos): 64267
90.0%ile Latency (in nanos): 72851
95.0%ile Latency (in nanos): 78903
99.0%ile Latency (in nanos): 92327
99.9%ile Latency (in nanos): 119691
100.0%ile Latency (in nanos): 13347327
QPS: 14933

50.0%ile Latency (in nanos): 63907
90.0%ile Latency (in nanos): 73055
95.0%ile Latency (in nanos): 79443
99.0%ile Latency (in nanos): 93739
99.9%ile Latency (in nanos): 123583
100.0%ile Latency (in nanos): 14028287
QPS: 14936
```

After:
```
50.0%ile Latency (in nanos): 62123
90.0%ile Latency (in nanos): 70795
95.0%ile Latency (in nanos): 76895
99.0%ile Latency (in nanos): 90887
99.9%ile Latency (in nanos): 117819
100.0%ile Latency (in nanos): 14126591
QPS: 15387

50.0%ile Latency (in nanos): 61021
90.0%ile Latency (in nanos): 70311
95.0%ile Latency (in nanos): 76687
99.0%ile Latency (in nanos): 90887
99.9%ile Latency (in nanos): 119527
100.0%ile Latency (in nanos): 6351615
QPS: 15571
```
@normanmaurer done
@carl-mastrangelo merged... Thanks!
Motivation

The recently-introduced event loop scheduling hooks can be exploited by the epoll transport to avoid waking the event loop when scheduling future tasks if there is a timer already set to wake up sooner. There is also a "default" timeout which will wake the event loop after 1 second if there are no pending future tasks. The performance impact of these wakeups themselves is likely negligible but there's significant overhead in having to re-arm the timer every time the event loop goes to sleep (see netty#7816). It's not 100% clear why this timeout was there originally but we're sure it's no longer needed.

Modification

Combine the existing volatile wakenUp and non-volatile prevDeadlineNanos fields into a single AtomicLong that stores the next scheduled wakeup time while the event loop is in epoll_wait, and is -1 while it is awake. Use this as a guard to debounce wakeups from both immediate scheduled tasks and future scheduled tasks, the latter using the new before/afterScheduledTaskSubmitted overrides and based on whether the new deadline occurs prior to an already-scheduled timer.

A similar optimization was already added to NioEventLoop, but it still uses two separate volatiles. We should consider similar streamlining of that in a future update.

Result

Fewer event loop wakeups when scheduling future tasks, greatly reduced overhead when no future tasks are scheduled.
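The single-AtomicLong guard described in this commit message can be sketched roughly as follows. This is a simplified, hypothetical illustration (class and method names are invented, not the actual EpollEventLoop code): the field holds the armed wakeup deadline while the loop blocks in epoll_wait and a `-1` sentinel while it is awake, so a scheduling thread can cheaply decide whether a wakeup is needed at all:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the wakeup-debounce guard, not Netty's actual code.
public final class WakeupGuardSketch {
    static final long AWAKE = -1L;

    // Next scheduled wakeup time while blocked in epoll_wait; AWAKE otherwise.
    final AtomicLong nextWakeupNanos = new AtomicLong(AWAKE);

    // Called from a scheduling thread: a wakeup is only needed if the loop is
    // asleep AND the new deadline is sooner than the timer already armed.
    boolean wakeupNeeded(long newDeadlineNanos) {
        long cur = nextWakeupNanos.get();
        return cur != AWAKE && newDeadlineNanos < cur;
    }

    // Event loop: record the armed deadline just before blocking.
    void beforeBlock(long deadlineNanos) {
        nextWakeupNanos.set(deadlineNanos);
    }

    // Event loop: mark itself awake after epoll_wait returns.
    void afterBlock() {
        nextWakeupNanos.lazySet(AWAKE);
    }

    public static void main(String[] args) {
        WakeupGuardSketch g = new WakeupGuardSketch();
        System.out.println(g.wakeupNeeded(100L));   // false: loop is awake, it will see the task
        g.beforeBlock(1_000L);
        System.out.println(g.wakeupNeeded(500L));   // true: sooner than the armed timer
        System.out.println(g.wakeupNeeded(2_000L)); // false: existing timer fires first
    }
}
```

The real change additionally has to handle races (e.g. compare-and-set so only one scheduling thread actually issues the wakeup), which this sketch omits for brevity.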
Motivation:
The Epoll transport checks to see if there are any scheduled tasks
before entering epoll_wait, and resets the timerfd just before.
This causes an extra syscall to timerfd_settime before doing any
actual work. When scheduled tasks aren't added frequently, or
tasks are added with later deadlines, this is unnecessary.
Modification:
Check the deadline of the peeked task in EpollEventLoop, rather
than the delay. If it hasn't changed since last time, don't
re-arm the timer.
Result:
About 2us faster on gRPC RTT 50pct latency benchmarks.
Before (2 runs for 5 minutes, 1 minute of warmup):
After:
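The deadline-versus-delay distinction in this commit message is the crux of the fix: a task's *delay* shrinks on every loop iteration as time passes, so comparing delays would always look "changed" and force a timerfd_settime, whereas its *deadline* is fixed at scheduling time and only changes when a different task reaches the head of the queue. A small hypothetical illustration (invented names, not the EpollEventLoop code):

```java
// Hypothetical illustration of why comparing deadlines (not delays) lets
// the loop skip re-arming the timerfd. Not Netty's actual code.
public final class DeadlineVsDelaySketch {
    // Remaining delay of a task with a fixed deadline, observed at 'now'.
    public static long delayNanos(long deadlineNanos, long nowNanos) {
        return Math.max(0L, deadlineNanos - nowNanos);
    }

    public static void main(String[] args) {
        long deadline = 10_000L; // fixed once the task is scheduled

        // Two successive event loop iterations observe different delays...
        long d1 = delayNanos(deadline, 1_000L); // 9_000 ns remaining at t=1000
        long d2 = delayNanos(deadline, 2_000L); // 8_000 ns remaining at t=2000
        System.out.println(d1 != d2); // true: delay "changed", would force a re-arm

        // ...but the deadline itself never moved, so the already-armed
        // timerfd is still correct and no timerfd_settime call is needed.
        System.out.println(deadline == 10_000L); // true
    }
}
```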