
Don't re-arm timerfd each epoll_wait#7816

Merged
normanmaurer merged 1 commit intonetty:4.1from
carl-mastrangelo:notimer
Sep 11, 2018

Conversation

@carl-mastrangelo
Member

Motivation:
The Epoll transport checks to see if there are any scheduled tasks
before entering epoll_wait, and resets the timerfd just before.
This causes an extra syscall to timerfd_settime before doing any
actual work. When scheduled tasks aren't added frequently, or
tasks are added with later deadlines, this is unnecessary.

Modification:
Check the deadline of the peeked task in EpollEventLoop, rather
than the delay. If it hasn't changed since last time, don't
re-arm the timer.
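The check described in the Modification can be sketched roughly like this (class and method names are hypothetical; per the discussion below, the PR tracks the previous deadline in a `prevDeadlineNanos` field inside `EpollEventLoop`):

```java
// Rough sketch of the deadline-comparison idea (hypothetical names):
// only re-arm the timerfd when the earliest scheduled deadline changed.
public final class TimerRearmGuard {
    // Sentinel meaning "timer not armed yet".
    private long prevDeadlineNanos = Long.MIN_VALUE;
    private int settimeCalls; // stands in for real timerfd_settime syscalls

    /** Called before blocking; returns true if the timer was re-armed. */
    public boolean beforeEpollWait(long nextDeadlineNanos) {
        if (nextDeadlineNanos == prevDeadlineNanos) {
            return false; // same deadline as last time: skip the syscall
        }
        prevDeadlineNanos = nextDeadlineNanos;
        settimeCalls++;   // here the real code would call timerfd_settime
        return true;
    }

    public int settimeCalls() {
        return settimeCalls;
    }
}
```

With a guard like this, repeatedly entering epoll_wait while the head of the scheduled-task queue is unchanged costs no extra syscall.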

Result:
About 2us faster on gRPC RTT 50pct latency benchmarks.

Before (2 runs for 5 minutes, 1 minute of warmup):

50.0%ile Latency (in nanos):		64267
90.0%ile Latency (in nanos):		72851
95.0%ile Latency (in nanos):		78903
99.0%ile Latency (in nanos):		92327
99.9%ile Latency (in nanos):		119691
100.0%ile Latency (in nanos):		13347327
QPS:                           14933

50.0%ile Latency (in nanos):		63907
90.0%ile Latency (in nanos):		73055
95.0%ile Latency (in nanos):		79443
99.0%ile Latency (in nanos):		93739
99.9%ile Latency (in nanos):		123583
100.0%ile Latency (in nanos):		14028287
QPS:                           14936

After:

50.0%ile Latency (in nanos):		62123
90.0%ile Latency (in nanos):		70795
95.0%ile Latency (in nanos):		76895
99.0%ile Latency (in nanos):		90887
99.9%ile Latency (in nanos):		117819
100.0%ile Latency (in nanos):		14126591
QPS:                           15387

50.0%ile Latency (in nanos):		61021
90.0%ile Latency (in nanos):		70311
95.0%ile Latency (in nanos):		76687
99.0%ile Latency (in nanos):		90887
99.9%ile Latency (in nanos):		119527
100.0%ile Latency (in nanos):		6351615
QPS:                           15571

Member

Can you add a comment what the 0x80000000 stands for ?

Member

like "Integer.MIN_VALUE"

Member Author

I considered calling this JINT_MIN or something like that, but didn't know how to namespace this properly.

Sure, though.
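A tiny check of the reviewer's point (illustrative only, not code from the PR): the literal 0x80000000, viewed as a Java int, is exactly Integer.MIN_VALUE.

```java
public class SentinelDemo {
    public static void main(String[] args) {
        // 0x80000000 as a Java int literal wraps to the most negative int:
        int sentinel = 0x80000000;
        assert sentinel == Integer.MIN_VALUE : "sentinel mismatch";
        System.out.println(sentinel == Integer.MIN_VALUE); // prints "true"
    }
}
```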

@carl-mastrangelo
Member Author

FYI, if there are no scheduled tasks pending, this speedup is lost. The core problem is that AbstractScheduledEventExecutor is hiding all the useful methods from the child classes.

@normanmaurer
Member

@carl-mastrangelo nice one... I guess it also helps that we don't need to memset each time :)

@normanmaurer normanmaurer self-assigned this Mar 29, 2018
@normanmaurer normanmaurer added this to the 4.1.23.Final milestone Mar 29, 2018
Member

s/rescedule/reschedule/

Member Author

Done.

@carl-mastrangelo carl-mastrangelo force-pushed the notimer branch 2 times, most recently from 2867589 to 4bdb97b Compare March 29, 2018 19:53
@normanmaurer
Member

@carl-mastrangelo btw if you want to expose some methods to improve things that's definitely possible :)

@carl-mastrangelo
Member Author

@normanmaurer That would involve potentially exposing ScheduledFutureTask, which is prying open the API a little too much for this PR. I can look to see if there is a clean way to do so, but I am not entirely clear how much API you are comfortable exposing.

@normanmaurer
Member

@carl-mastrangelo yeah, that's for sure a followup, and I would need to see the code I guess.

@normanmaurer normanmaurer requested a review from Scottmitch March 29, 2018 19:56
Member

@carl-mastrangelo can you add some java docs ?

Member Author

Done.

Member

@carl-mastrangelo nit: can you move it to the other variables ?

Member Author

Done.

Member

can we initialize to AbstractScheduledEventExecutor.nanoTime()?

also move this to top of the class with other variable declarations.

Member Author

Done.

Member

can we just use -1 instead? This would be a more accessible value for C/JNI.

Member Author

Sounds fine. I wanted to use something obviously wrong as a sentinel.
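A small illustration (not from the PR) of why -1 is a convenient sentinel across the Java/JNI boundary: it keeps the same value through int/long narrowing and sign-extension, unlike 0x80000000. The helper below is hypothetical, not Netty API.

```java
public class MinusOneSentinel {
    // Illustrative helper: checks whether a long survives narrowing to int
    // (as a jint would when crossing JNI) and widening back unchanged as -1.
    public static boolean roundTripsAsMinusOne(long v) {
        int asInt = (int) v;                     // narrowing, as in a jint
        return asInt == -1 && (long) asInt == v; // sign-extension restores it
    }
}
```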

Member

This topic comes up infrequently enough that I usually forget the answer: "why do we need to have a max wakeup time of 1 second instead of unbounded?" IIRC the answer is that when we schedule tasks we don't check if the new time results in a sooner deadline than the previous wakeup on the EventLoop. I think I investigated other events from outside the EventLoop and they should be invoking wakeup (and we have an atomic boolean protecting wakeup to avoid too many system calls). I wonder if, now that we are saving prevDeadlineNanos and have a timer mechanism independent of epoll_wait, we can avoid waking up every second (and the additional timerfd_settime system calls), and instead only wake up when necessary.

Potential benefits:

  • if there are no scheduled tasks, we only wake up when I/O occurs
  • we will be more responsive to scheduled tasks (because we can adjust the wakeup to the minimum delay as new tasks are scheduled)

Potential negatives:

  • in the degenerate case, if someone schedules tasks in decreasing order of expiration (10 seconds from now, 9 seconds, 8 seconds, etc...) we will have to call timerfd_settime on each schedule. For outside the EventLoop we wouldn't have to wake up the epoll_wait call (assuming timerfd_settime is thread safe).

@normanmaurer @carl-mastrangelo - wdyt? We could do the same thing for kqueue too (we would need to use a timerfd equivalent there too though).

Member Author

I had assumed the periodic wakeup was some sort of safety net when I saw it. If the intent was to periodically check the scheduledTasks, then yeah, wakeup should probably be the correct mechanism, even if it results in timerfd_settime calls. Arguably there are already a ton of timerfd_settime calls when the group is idle. Doesn't seem all that bad of a downside to me.

Member

@Scottmitch that sounds fine to me... That said I think we should be careful and test this carefully :)

Member

Followup issue / PR ftw ... #7829

Member

nit: Add whitespace like ((jint) -1)

@normanmaurer
Member

@carl-mastrangelo please address my comment and fix checkstyle

Member

@Scottmitch Apr 1, 2018

I may remove this to address #7829 ... so let's hold off on releasing this, or agree this API can be broken in future releases.


@normanmaurer normanmaurer added this to the 4.1.29.Final milestone Jul 27, 2018
@normanmaurer
Member

@carl-mastrangelo can you tell me what exact params you use for the qps_client and qps_server here ?

carl-mastrangelo added a commit to carl-mastrangelo/netty that referenced this pull request Aug 31, 2018
Motivation:
Optimizing the Epoll channel needs an objective measure of how fast
it is.

Modification:
Add a simple, closed-loop ping-pong benchmark.

Result:
Benchmark can be used to measure netty#7816

Initial numbers:

```
Result "io.netty.microbench.channel.epoll.EpollSocketChannelBenchmark.pingPong":
  22614.403 ±(99.9%) 797.263 ops/s [Average]
  (min, avg, max) = (21093.160, 22614.403, 24977.387), stdev = 918.130
  CI (99.9%): [21817.140, 23411.666] (assumes normal distribution)

Benchmark                              Mode  Cnt      Score     Error  Units
EpollSocketChannelBenchmark.pingPong  thrpt   20  22614.403 ± 797.263  ops/s
```
@carl-mastrangelo
Member Author

@normanmaurer I have since lost the original configuration, so I sent out #8249

@normanmaurer
Member

normanmaurer commented Sep 3, 2018

So I ran the benchmark code that you provided and here is the result:

4.1:

EpollSocketChannelBenchmark.pingPong  thrpt   20  22257.647 ± 1572.435  ops/s

PR #7834 (Scott's proposal):

EpollSocketChannelBenchmark.pingPong  thrpt   20  22314.322 ± 1410.634  ops/s

PR #7816 (Carl's proposal):

EpollSocketChannelBenchmark.pingPong  thrpt   20  24386.525 ± 1426.995  ops/s

The results are very stable and can be reproduced every time. With these results I think I would pull in the #7816 changes for now, as these give the largest improvement and are also very small in scope, which is great as it reduces the risk of regressions etc.

If we feel it is worth it we can investigate why #7834 does not give us the improvement we hoped for, though.

@Scottmitch @carl-mastrangelo WDYT ?

@normanmaurer
Member

Benchmarks were done on a real machine with 24 cores:

cat /proc/cpuinfo  | grep processor | wc -l
24

normanmaurer pushed a commit that referenced this pull request Sep 4, 2018
@normanmaurer
Member

Also @ejona86

@Scottmitch
Member

Scottmitch commented Sep 6, 2018

Interesting. I'm assuming the benchmark code is from #8249? There are (potential) benefits outside of ping/pong throughput use cases for #7834, and some of the interesting cases are:

  • timer accuracy (inside and outside the event loop). Decoupling the "task queue" from the "timer queue" enables us to independently manage timers and schedule them more accurately, and doesn't require waking up the EventLoop to take effect. IIRC previously timers outside the EventLoop were added to the execute queue, and wouldn't take effect until they were popped from the queue (see the next point).
  • less frequent wakeup of the EventLoop in the presence of timers. Previously every timer event had to wake up the event loop to be processed. We no longer have to wake up the event loop explicitly because the timer processing is decoupled.
  • more aggressive draining of the task queue. The task queue draining was previously proportional to the amount of time we spent processing I/O. Depending on I/O frequency vs task queue insertion rate, this may lead to the insertion rate exceeding the draining rate. This, in combination with decoupling the timer events from this queue, can help to reduce task queue pressure.

It doesn't look like the benchmark in #8249 would cover the above elements. At most it executes a single write outside the EventLoop and waits for the results, and there is just a single timer event during setup. In this case I would expect PR #7834 to trade fewer calls to timerfd_settime for a poll of the scheduledQueue (an MPSC) per epoll_wait, but this PR doesn't have the scheduledQueue and therefore does less work in this case (but doesn't address the above issues).

I'm assuming both PRs are run based off the same commit (looks like this PR has some merge conflicts)?

@carl-mastrangelo
Member Author

@Scottmitch For your last case, doesn't that get triggered from the event fd, not the timer fd?

Also I rebased to head.

@normanmaurer
Member

@Scottmitch I ran it from the commit by cherry-picking it myself and fixing merge conflicts... I will respond to the rest later on when I have a bit more time.

@normanmaurer
Member

@Scottmitch I will need to dig more into the changes you did but this will take some time. That said I still think it would be a good first step to merge @carl-mastrangelo's changes and then we can still investigate your changes. I don't see why this is not possible. I think the only thing we need to do is put @UnstableApi on deadlineNanos.

@Scottmitch
Member

@normanmaurer - as long as the new API is breakable, that sgtm. Having a more representative benchmark, or a real-world scenario which simulates the scenarios from my previous comment, will be necessary to analyze the total impact.

@normanmaurer
Member

@carl-mastrangelo can you rebase + add the @UnstableAPI thingy and fix checkstyle.

@Scottmitch agree... that said I think this change here can only improve things and not make things worse so I think its a good first step.

@carl-mastrangelo
Member Author

@normanmaurer done

@normanmaurer normanmaurer merged commit 1dff107 into netty:4.1 Sep 11, 2018
@normanmaurer
Member

@carl-mastrangelo merged... Thanks!

njhill added a commit to njhill/netty that referenced this pull request Oct 12, 2019
Motivation

The recently-introduced event loop scheduling hooks can be exploited by
the epoll transport to avoid waking the event loop when scheduling
future tasks if there is a timer already set to wake up sooner.

There is also a "default" timeout which will wake the event
loop after 1 second if there are no pending future tasks. The
performance impact of these wakeups themselves is likely negligible but
there's significant overhead in having to re-arm the timer every time
the event loop goes to sleep (see netty#7816). It's not 100% clear why this
timeout was there originally but we're sure it's no longer needed.

Modification

Combine the existing volatile wakenUp and non-volatile prevDeadlineNanos
fields into a single AtomicLong that stores the next scheduled wakeup
time while the event loop is in epoll_wait, and is -1 while it is awake.

Use this as a guard to debounce wakeups from both immediate scheduled
tasks and future scheduled tasks, the latter using the new
before/afterScheduledTaskSubmitted overrides and based on whether the
new deadline occurs prior to an already-scheduled timer.

A similar optimization was already added to NioEventLoop, but it still
uses two separate volatiles. We should consider similar streamlining of
that in a future update.

Result

Fewer event loop wakeups when scheduling future tasks, greatly reduced
overhead when no future tasks are scheduled.
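The single-AtomicLong scheme this commit message describes might be sketched as follows (class and method names are hypothetical; the real logic lives in EpollEventLoop and also covers the eventfd wakeup path):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch (hypothetical names) of the combined wakenUp/prevDeadlineNanos
// state: AWAKE (-1) while the loop is running, otherwise the deadline the
// timer is armed for while the loop sleeps in epoll_wait.
public final class WakeupGuard {
    public static final long AWAKE = -1L;
    private final AtomicLong nextWakeupNanos = new AtomicLong(AWAKE);

    /** Record the armed deadline just before entering epoll_wait. */
    public void beforeSleep(long deadlineNanos) {
        nextWakeupNanos.set(deadlineNanos);
    }

    /** Mark the loop awake again after epoll_wait returns. */
    public void afterWake() {
        nextWakeupNanos.set(AWAKE);
    }

    /**
     * Called when a future task is scheduled from another thread; returns
     * true only if the loop is asleep AND the new deadline is sooner than
     * the timer it is already waiting on, debouncing redundant wakeups.
     */
    public boolean needsWakeup(long newDeadlineNanos) {
        long cur = nextWakeupNanos.get();
        return cur != AWAKE && newDeadlineNanos < cur;
    }
}
```

A task scheduled for later than the already-armed timer never wakes the loop: the existing timer fires first and the task is picked up then.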
normanmaurer pushed a commit that referenced this pull request Oct 12, 2019