Don't re-arm timerfd each epoll_wait #7816
Conversation
Can you add a comment what the 0x80000000 stands for ?
I considered calling this JINT_MIN or something like that, but didn't know how to namespace this properly.
Sure, though.
FYI, if there are no scheduled tasks pending, this speedup is lost. The core problem is that
@carl-mastrangelo nice one... I guess it also helps that we don't need to memset each time :)
@carl-mastrangelo btw if you want to expose some methods to improve things that's definitely possible :)
@normanmaurer That would involve potentially exposing ScheduledFutureTask, which is prying open the API a little too much for this PR. I can look to see if there is a clean way to do so, but I am not entirely clear how much API you are comfortable exposing.
@carl-mastrangelo yeah that's for sure a follow-up and I would need to see the code I guess.
@carl-mastrangelo nit: can you move it to the other variables ?
can we initialize to AbstractScheduledEventExecutor.nanoTime()?
also move this to top of the class with other variable declarations.
can we just use -1 instead? This would be a more accessible value for C/JNI.
Sounds fine. I wanted to use something obviously wrong as a sentinel.
This topic comes up frequently enough that I usually forget the answer .... "why do we need to have a max wakeup time of 1 second instead of unbounded". IIRC the answer is when we schedule tasks we don't check if the new time results in a sooner deadline than the previous wakeup on the EventLoop. I think I investigated other events from outside the EventLoop and they should be invoking wakeup (and we have an atomic boolean protecting wakeup to avoid too many system calls). I wonder if, now that we are saving the prevDeadlineNanos and we have a timer mechanism independent of epoll_wait, we can avoid waking up every second (and additional timerfd_settime system calls), and instead only wake up when necessary.
Potential benefits:
- if there are no scheduled tasks, we only wake up when I/O occurs
- we will be more responsive to scheduled tasks (because we can adjust the wakeup to the minimum delay as new tasks are scheduled)
Potential negatives:
- in the degenerate case, if someone schedules tasks in decreasing order of expiration (10 seconds from now, 9 seconds, 8 seconds, etc...) we will have to touch the `timerfd_settime` on each schedule. For outside the EventLoop we wouldn't have to wake up the `epoll_wait` call (assuming `timerfd_settime` is thread safe).
@normanmaurer @carl-mastrangelo - wdyt? We could do the same thing for kqueue too (we would need to use a timerfd equivalent there too though).
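The core decision being discussed above can be sketched as a pure predicate: only issue the `timerfd_settime` syscall when the nearest scheduled deadline actually differs from the one the timer is already armed for. This is a hypothetical illustration (the class and method names are not Netty's actual code), assuming a `-1` sentinel for "not armed":

```java
// Hypothetical sketch, not Netty's actual implementation: decide whether
// the event loop must re-arm the timerfd before the next epoll_wait.
public final class TimerRearmSketch {
    // Sentinel meaning "no deadline" (matches the -1 suggested in review).
    public static final long NONE = -1L;

    // Returns true if a timerfd_settime call would be needed before blocking.
    public static boolean needsRearm(long armedDeadlineNanos, long nextDeadlineNanos) {
        if (nextDeadlineNanos == NONE) {
            return false; // no scheduled tasks: leave the timer alone
        }
        // Re-arm only when the head-of-queue deadline actually changed.
        return nextDeadlineNanos != armedDeadlineNanos;
    }

    public static void main(String[] args) {
        System.out.println(needsRearm(1_000L, 1_000L)); // false: same deadline, skip the syscall
        System.out.println(needsRearm(1_000L, 500L));   // true: sooner deadline, must re-arm
        System.out.println(needsRearm(NONE, NONE));     // false: nothing scheduled
    }
}
```

Note that in the degenerate case from the list above (tasks scheduled in decreasing order of expiration), `needsRearm` returns true on every schedule, which is exactly the extra-syscall downside being weighed.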
I had assumed the periodic wakeup was some sort of safety net when I saw it. If the intent was to periodically check the scheduledTasks, then yeah, wakeup should probably be the correct mechanism, even if it results in timerfd_settime calls. Arguably there are already a ton of timerfd_settime calls when the group is idle. Doesn't seem all that bad of a downside to me.
@Scottmitch that sounds fine to me... That said I think we should be careful and test this carefully :)
nit: Add whitespace like ((jint) -1)
@carl-mastrangelo please address my comment and fix checkstyle
I may remove this to address #7829 ... so let's hold off on releasing this, or agree this API can be broken in future releases.
@carl-mastrangelo can you tell me what exact params you use for the
Motivation:

Optimizing the Epoll channel needs an objective measure of how fast it is.

Modification:

Add a simple, closed loop, ping-pong benchmark.

Result:

Benchmark can be used to measure netty#7816

Initial numbers:

```
Result "io.netty.microbench.channel.epoll.EpollSocketChannelBenchmark.pingPong":
  22614.403 ±(99.9%) 797.263 ops/s [Average]
  (min, avg, max) = (21093.160, 22614.403, 24977.387), stdev = 918.130
  CI (99.9%): [21817.140, 23411.666] (assumes normal distribution)

Benchmark                              Mode  Cnt      Score     Error  Units
EpollSocketChannelBenchmark.pingPong  thrpt   20  22614.403 ± 797.263  ops/s
```
@normanmaurer I have since lost the original configuration, so I sent out #8249
So I ran the benchmark code that you provided and here is the result:

4.1:
PR #7834 (Scott's proposal):
PR #7816 (Carl's proposal):

The results are very stable and can be reproduced every time. With these results I think I would pull in #7816 changes for now as these give the largest improvement and are also very small in scope, which is great as it reduces the risk of regressions etc. If we feel it is worth it we can investigate why #7834 does not give us the improvement we hoped for though. @Scottmitch @carl-mastrangelo WDYT ?
Benchmarks were done on a real machine with 24 cores:
Also @ejona86
Interesting. I'm assuming the benchmark code is from #8249? There are (potential) benefits outside of ping/pong throughput use cases for #7834, and some of the interesting cases are:

It doesn't look like the benchmark in #8249 would cover the above elements. At most it executes a single write outside the EventLoop and waits for the results, and there is just a single timer event during setup. In this case I would expect PR #7834 to trade fewer calls to

I'm assuming both PRs are run based off the same commit (looks like this PR has some merge conflicts)?
@Scottmitch For your last case, doesn't that get triggered from the event fd, not the timer fd? Also I rebased to head.
@Scottmitch I did run it from the commit by cherry-picking myself and fixed merge conflicts... I will respond to the rest later on when I have a bit more time.
@Scottmitch I will need to dig more into the changes you did but this will take some time. That said I still think it would be a good first step to merge @carl-mastrangelo's changes and then we can still investigate your changes. I don't see why this is not possible. I think the only thing we need to do is put
@normanmaurer - as long as the new API is breakable, that sgtm. Having a more representative benchmark, or a real-world scenario which simulates the scenarios from my previous comment, will be necessary to analyze the total impact.
@carl-mastrangelo can you rebase + add the

@Scottmitch agree... that said I think this change here can only improve things and not make things worse so I think it's a good first step.
Motivation:

The Epoll transport checks to see if there are any scheduled tasks before entering epoll_wait, and resets the timerfd just before. This causes an extra syscall to timerfd_settime before doing any actual work. When scheduled tasks aren't added frequently, or tasks are added with later deadlines, this is unnecessary.

Modification:

Check the *deadline* of the peeked task in EpollEventLoop, rather than the *delay*. If it hasn't changed since last time, don't re-arm the timer.

Result:

About 2us faster on gRPC RTT 50pct latency benchmarks.

Before (2 runs for 5 minutes, 1 minute of warmup):
```
50.0%ile Latency (in nanos): 64267
90.0%ile Latency (in nanos): 72851
95.0%ile Latency (in nanos): 78903
99.0%ile Latency (in nanos): 92327
99.9%ile Latency (in nanos): 119691
100.0%ile Latency (in nanos): 13347327
QPS: 14933

50.0%ile Latency (in nanos): 63907
90.0%ile Latency (in nanos): 73055
95.0%ile Latency (in nanos): 79443
99.0%ile Latency (in nanos): 93739
99.9%ile Latency (in nanos): 123583
100.0%ile Latency (in nanos): 14028287
QPS: 14936
```

After:
```
50.0%ile Latency (in nanos): 62123
90.0%ile Latency (in nanos): 70795
95.0%ile Latency (in nanos): 76895
99.0%ile Latency (in nanos): 90887
99.9%ile Latency (in nanos): 117819
100.0%ile Latency (in nanos): 14126591
QPS: 15387

50.0%ile Latency (in nanos): 61021
90.0%ile Latency (in nanos): 70311
95.0%ile Latency (in nanos): 76687
99.0%ile Latency (in nanos): 90887
99.9%ile Latency (in nanos): 119527
100.0%ile Latency (in nanos): 6351615
QPS: 15571
```
@normanmaurer done
@carl-mastrangelo merged... Thanks!
Motivation

The recently-introduced event loop scheduling hooks can be exploited by the epoll transport to avoid waking the event loop when scheduling future tasks if there is a timer already set to wake up sooner. There is also a "default" timeout which will wake the event loop after 1 second if there are no pending future tasks. The performance impact of these wakeups themselves is likely negligible but there's significant overhead in having to re-arm the timer every time the event loop goes to sleep (see netty#7816). It's not 100% clear why this timeout was there originally but we're sure it's no longer needed.

Modification

Combine the existing volatile wakenUp and non-volatile prevDeadlineNanos fields into a single AtomicLong that stores the next scheduled wakeup time while the event loop is in epoll_wait, and is -1 while it is awake. Use this as a guard to debounce wakeups from both immediate scheduled tasks and future scheduled tasks, the latter using the new before/afterScheduledTaskSubmitted overrides and based on whether the new deadline occurs prior to an already-scheduled timer.

A similar optimization was already added to NioEventLoop, but it still uses two separate volatiles. We should consider similar streamlining of that in a future update.

Result

Fewer event loop wakeups when scheduling future tasks, greatly reduced overhead when no future tasks are scheduled.
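The single-AtomicLong guard described in this commit message can be sketched roughly as follows. This is a simplified, hypothetical illustration (class and method names are invented, not the actual EpollEventLoop code): the field holds the armed wakeup deadline while the loop blocks in epoll_wait and a `-1` sentinel while it is awake, so a scheduling thread can cheaply decide whether a wakeup is needed at all:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the wakeup-debounce guard, not Netty's actual code.
public final class WakeupGuardSketch {
    static final long AWAKE = -1L;

    // Next scheduled wakeup time while blocked in epoll_wait; AWAKE otherwise.
    final AtomicLong nextWakeupNanos = new AtomicLong(AWAKE);

    // Called from a scheduling thread: a wakeup is only needed if the loop is
    // asleep AND the new deadline is sooner than the timer already armed.
    boolean wakeupNeeded(long newDeadlineNanos) {
        long cur = nextWakeupNanos.get();
        return cur != AWAKE && newDeadlineNanos < cur;
    }

    // Event loop: record the armed deadline just before blocking.
    void beforeBlock(long deadlineNanos) {
        nextWakeupNanos.set(deadlineNanos);
    }

    // Event loop: mark itself awake after epoll_wait returns.
    void afterBlock() {
        nextWakeupNanos.lazySet(AWAKE);
    }

    public static void main(String[] args) {
        WakeupGuardSketch g = new WakeupGuardSketch();
        System.out.println(g.wakeupNeeded(100L));   // false: loop is awake, it will see the task
        g.beforeBlock(1_000L);
        System.out.println(g.wakeupNeeded(500L));   // true: sooner than the armed timer
        System.out.println(g.wakeupNeeded(2_000L)); // false: existing timer fires first
    }
}
```

The real change additionally has to handle races (e.g. compare-and-set so only one scheduling thread actually issues the wakeup), which this sketch omits for brevity.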
Motivation:
The Epoll transport checks to see if there are any scheduled tasks
before entering epoll_wait, and resets the timerfd just before.
This causes an extra syscall to timerfd_settime before doing any
actual work. When scheduled tasks aren't added frequently, or
tasks are added with later deadlines, this is unnecessary.
Modification:
Check the deadline of the peeked task in EpollEventLoop, rather
than the delay. If it hasn't changed since last time, don't
re-arm the timer.
Result:
About 2us faster on gRPC RTT 50pct latency benchmarks.
Before (2 runs for 5 minutes, 1 minute of warmup):
After:
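The deadline-versus-delay distinction in this commit message is the crux of the fix: a task's *delay* shrinks on every loop iteration as time passes, so comparing delays would always look "changed" and force a timerfd_settime, whereas its *deadline* is fixed at scheduling time and only changes when a different task reaches the head of the queue. A small hypothetical illustration (invented names, not the EpollEventLoop code):

```java
// Hypothetical illustration of why comparing deadlines (not delays) lets
// the loop skip re-arming the timerfd. Not Netty's actual code.
public final class DeadlineVsDelaySketch {
    // Remaining delay of a task with a fixed deadline, observed at 'now'.
    public static long delayNanos(long deadlineNanos, long nowNanos) {
        return Math.max(0L, deadlineNanos - nowNanos);
    }

    public static void main(String[] args) {
        long deadline = 10_000L; // fixed once the task is scheduled

        // Two successive event loop iterations observe different delays...
        long d1 = delayNanos(deadline, 1_000L); // 9_000 ns remaining at t=1000
        long d2 = delayNanos(deadline, 2_000L); // 8_000 ns remaining at t=2000
        System.out.println(d1 != d2); // true: delay "changed", would force a re-arm

        // ...but the deadline itself never moved, so the already-armed
        // timerfd is still correct and no timerfd_settime call is needed.
        System.out.println(deadline == 10_000L); // true
    }
}
```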