EPOLL - decouple schedule tasks from epoll_wait life cycle#7834
EPOLL - decouple schedule tasks from epoll_wait life cycle#7834normanmaurer merged 3 commits intonetty:4.1from
Conversation
|
more testing / analysis is required, but figured I would push this upstream to see if everyone is on board. @carl-mastrangelo - would you mind running the same benchmarks you ran for PR #7816? |
There was a problem hiding this comment.
Can't we just use Delayed ?
https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Delayed.html
There was a problem hiding this comment.
We want to get the deadline, not the delay.
There was a problem hiding this comment.
@Scottmitch that said should we just make this part of the Abstract base class and mark as unstable ?
There was a problem hiding this comment.
What abstract base class? The issue is we need it to be accessible outside the package (PromiseTask and ScheduledFutureTask are package private). I considered putting a protected interface in AbstractScheduledEventExecutor and can do this to reduce the exposure of this API. I will push this change and we can discuss more if necessary.
On a related note ... the relationship between ScheduledFutureTask and AbstractScheduledEventExecutor is very tightly coupled. I would like to tease these two apart in general as there a assumptions needed about using the "correct" nanoTime that may be made less "leaky" for the next major release (I added something to the doc).
transport/src/main/java/io/netty/channel/SingleThreadEventLoop.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
This should have a throws as it may throw a ClosedChannelException
There was a problem hiding this comment.
Good catch (its actually ChannelException at the moment).
There was a problem hiding this comment.
However this is a runtime exception ... would you prefer a IOException using the errno like in epollWait?
There was a problem hiding this comment.
yeah I think this would be better (using NativeIOException)
There was a problem hiding this comment.
I just went with IOException for now ... avoids additional conditional checking and the string description of the error is already included in the IoExecution thrown by the JNI code.
There was a problem hiding this comment.
Just use an AtomicBoolean as we use it as a boolean.
There was a problem hiding this comment.
I considered this, however it involves additional conditional behind the scenes. Since this is internal I think we can just use AtomicInteger.
I also moved away from the AtomicFieldUpdater because:
- the excepted number of instances for this class is low
- the atomic field updaters have an additional "instance check" which can be avoided.
|
@Scottmitch Thanks for this... in general this looks ok. That said I would be interested to see how it compares to #7816 and if it worth it compared to the "easier" solution that was used in #7816 |
|
Agreed more analysis is necessary. This PR does a lot of things PR #7816 doesn't do, so lets see if I can quantify it some how :) |
There was a problem hiding this comment.
nit: I think you need to allow negative values here. If you were sorting task based on their delay rather than their deadline, you still need to keep track of which task is most tardy.
There was a problem hiding this comment.
I agree this could likely be changed, but is pre-existing behavior and doesn't impact this PR. Note that we are sorting, and determining "next to expire", based upon deadline (not delay).
Lets discuss in a followup issue: #7840
common/src/main/java/io/netty/util/concurrent/SingleThreadEventExecutor.java
Outdated
Show resolved
Hide resolved
common/src/main/java/io/netty/util/concurrent/SingleThreadEventExecutor.java
Outdated
Show resolved
Hide resolved
transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java
Outdated
Show resolved
Hide resolved
carl-mastrangelo
left a comment
There was a problem hiding this comment.
LGTM, no other concerns from me.
|
@carl-mastrangelo - thanks for review! would you be able to run the same benchmark you did in PR #7816? I'm also planning on doing some analysis, but the more data points the better. |
659ab6c to
b363996
Compare
|
Sorry I didn't have a chance to review your PR yet, @Scottmitch. Will find some time soon. |
|
How can we get this artifact jar or artificatId for a maven dependency? |
|
@M-AJ I can build one for you if you are interested otherwise you would just apply the patch and build it yourself. |
|
@normanmaurer If you can build one, that'd be great! Really appreciate it! I've tried building it locally but the process takes a while for me. |
b363996 to
5e74a0d
Compare
|
I just rebased to make getting recent fixes easier ... sorry super busy at the moment ... this is still on the backlog and hopefully will get to this soon. However it would be much appreciated if others could also try this out though 👍 |
|
@M-AJ you want a netty-all jar or just the epoll jar ? |
|
@normanmaurer netty-all would be perfect. Appreciate it! |
5e74a0d to
9c380c4
Compare
|
Hey @normanmaurer, any update on the build? Can't wait to try it out and see how it performs! |
|
Sorry it took me some time :( Try this one: |
|
@normanmaurer - I updated the commit message. let me know if it makes more sense now. |
There was a problem hiding this comment.
Doesn't this method return false on removal?
There was a problem hiding this comment.
I'm curious if the parameter type of this method can be widen to ScheduledTask or even Runnable. If so, we could hide ScheduledTaskRunnable (and even ScheduledTask) completely from the public API, i.e. less maintenance burden.
If a subclass needs the information like deadline, we could just add it as an additional parameter.
There was a problem hiding this comment.
The way this suggestion was implemented makes things more complex IMHO. We are trying to communicate from here that the type of task is for scheduled execution, and this information is used conditionally at higher layers (e.g. should we wake up). This has been changed to introduce 3 methods (e.g. beforeScheduledTaskSubmitted, afterScheduledTaskSubmitted, wakesUpForScheduledRunnable) and 1 additional type is exposed NonWakeupRunnable publicly which may require conditional allocation/wrapping at higher layers. IOW we are losing information by just allocating a Runnable here and have to compensate by complicating method signatures and type hierarchy at other layers.
I'll submit a followup PR (since this plus an additional PR was recently reverted) so we can take a look. /cc @normanmaurer @njhill
There was a problem hiding this comment.
Thanks @Scottmitch. I gave this quite a lot of thought so would be good to discuss more. A few initial responses though:
- The prior communication depended on the notion of an intermediate task which happens to involve enqueing some other scheduled task onto the internal scheduled-task queue (i.e. not the scheduled task itself), which is something we should not be exposing at all imho... that seems like an internal detail which EL impls shouldn't be aware of.
- I don't see
wakesUpForScheduledRunnablethat you referenced, do you meanwakesUpForRunnableorexecuteScheduledRunnable? - The only reason I left
wakesUpForRunnablethere is that it's protected and so would be a breaking change to remove it, I'd suggest that we deprecate it. A publicNonWakeupRunnableinterface serves a similar purpose but makes much more sense imho. Whether a given task is required to be run immediately is in general a property of that task and should be independent of the executor/EL implementation. We could consider alazyExecute(Runnable)method in addition to this (might be be nice to have on theEventExecutorinterface in netty 5). - I actually think we should remove
executeScheduledRunnablealtogether (possible since it's package-private), and instead just overrideschedule(...)inSingleThreadEventExecutor. This would further simplify things and avoid the doubleRunnablewrapping. Not sure it's also what you had in mind but it's a small change, I'll also push an example of that - Only the changes to
EpollEventLoopitself were reverted for the release, the reworked superclass abstractions are still there. I'm still very interested to find out what the issues were ... @normanmaurer any luck with a reproducer or more clues? :)
There was a problem hiding this comment.
@Scottmitch I made the changes alluded to above here 3923176 but took things a bit further still, uplifting the before/after hooks to AbstractScheduledEventExecutor and introducing higher level idea of lazyExecute / LazyRunnable which could have general applicability.
I really think this is both simpler to exploit and more generally useful for subclasses (i.e. event loop impls).
There was a problem hiding this comment.
Ditto - can ScheduledTask be widen to Runnable?
|
@netty-bot test this please |
d23b852 to
dc2cddf
Compare
Motivation: EPOLL supports decoupling the timed wakeup mechanism from the selector call. The EPOLL transport takes advantage of this in order to offer more fine grained timer resolution. However we are current calling timerfd_settime on each call to epoll_wait and this is expensive. We don't have to re-arm the timer on every call to epoll_wait and instead only have to arm the timer when a task is scheduled with an earlier expiration than any other existing scheduled task. Modifications: - Before scheduled tasks are added to the task queue, we determine if the new duration is the soonest to expire, and if so update with timerfd_settime. We also drain all the tasks at the end of the event loop to make sure we service any expired tasks and get an accurate next time delay. - EpollEventLoop maintains a volatile variable which represents the next deadline to expire. This variable is modified inside the event loop thread (before calling epoll_wait) and out side the event loop thread (immediately to ensure proper wakeup time). - Execute the task queue before the schedule task priority queue. This means we may delay the processing of scheduled tasks but it ensures we transfer all pending tasks from the task queue to the scheduled priority queue to run the soonest to expire scheduled task first. - Deprecate IORatio on EpollEventLoop, and drain the executor and scheduled queue on each event loop wakeup. Coupling the amount of time we are allowed to drain the executor queue to a proportion of time we process inbound IO may lead to unbounded queue sizes and unpredictable latency. Result: Fixes netty#7829 - In most cases this results in less calls to timerfd_settime - Less event loop wakeups just to check for scheduled tasks executed outside the event loop - More predictable executor queue and scheduled task queue draining - More accurate and responsive scheduled task execution
dc4ceba to
834763e
Compare
|
This is sitting here for a long time now... Let me merge this. |
njhill
left a comment
There was a problem hiding this comment.
@Scottmitch @normanmaurer apologies for the belated review of this. I think the general change is great, as well as much of the impl.
I made a few comments inline but in general have some reservations regarding the new abstractions exposed by the superclasses to allow EventLoop impls to integrate with the common timed-task scheduling logic.
I'll put more detail as a comment directly in the PR but my immediate concern is that once these APIs ship we can't easily change them, and I think they could be simpler.
| // See also https://stackoverflow.com/a/12492308/1074097 | ||
| } else if (fd == timerFd.intValue()) { | ||
| // consume wakeup event, necessary because the timer is added with ET mode. | ||
| Native.timerFdRead(fd); |
There was a problem hiding this comment.
May be my misunderstanding, but wouldn't ET imply this read isn't needed?
| * @return {@code true} if at least {@link Runnable#run()} was called. | ||
| */ | ||
| private boolean runExistingTasksFrom(Queue<Runnable> taskQueue) { | ||
| return taskQueue.offer(BOOKEND_TASK) ? runExistingTasksUntilBookend(taskQueue) |
There was a problem hiding this comment.
I was thinking the same as @carl-mastrangelo's earlier comment here - that the bookend could be easily avoided by just polling size() times. It would be cheaper and simpler, e.g. there's no need to deal specially with the bounded/full case. The JCTools impls won't involve any iteration when called from the consumer.
There was a problem hiding this comment.
good idea! calling from consumer thread for MPSC shouldn't require looping.
| return efd; | ||
| } | ||
|
|
||
| static void netty_epoll_native_timerFdSetTime(JNIEnv* env, jclass clazz, jint timerFd, jint tvSec, jint tvNsec) { |
There was a problem hiding this comment.
Did you ever consider changing to use TFD_TIMER_ABSTIME? We could just measure/store the offset of System.nanoTime() from the system clock during initialization and then use deadlines rather than delays across the board to further avoid continual absolute/relative conversion.
There was a problem hiding this comment.
No I didn't consider this.
| // lower value. | ||
| nextDeadline = nextDeadlineNanos.get(); | ||
| if (nextDeadline - candidateNextDeadline < 0) { | ||
| setTimerFd(deadlineToDelayNanos(nextDeadline)); |
There was a problem hiding this comment.
I'm not sure that this fully addresses the race condition. I think access to setTimerFd probably needs to be synchronized.
| @Override | ||
| public void run() { | ||
| try { | ||
| assertEquals(1, Native.epollWait(epoll, eventArray, timerFd, -1, -1)); |
There was a problem hiding this comment.
Instead of removing this test why not just adjust it to use the new equivalent Native.epollWaitNoTimeout method?
|
Want to do a PR to show the ideas ?
… Am 15.08.2019 um 22:48 schrieb Nick Hill ***@***.***>:
@njhill commented on this pull request.
@Scottmitch @normanmaurer apologies for the belated review of this. I think the general change is great, as well as much of the impl.
I made a few comments inline but in general have some reservations regarding the new abstractions exposed by the superclasses to allow EventLoop impls to integrate with the common timed-task scheduling logic.
I'll put more detail as a comment directly in the PR but my immediate concern is that once these APIs ship we can't easily change them, and I think they could be simpler.
In transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java:
> // Just ignore as we use ET mode for the eventfd and timerfd.
//
// See also https://stackoverflow.com/a/12492308/1074097
+ } else if (fd == timerFd.intValue()) {
+ // consume wakeup event, necessary because the timer is added with ET mode.
+ Native.timerFdRead(fd);
May be my misunderstanding, but wouldn't ET imply this read isn't needed?
In common/src/main/java/io/netty/util/concurrent/SingleThreadEventExecutor.java:
> @@ -397,6 +455,55 @@ protected final boolean runAllTasksFrom(Queue<Runnable> taskQueue) {
}
}
+ /**
+ * What ever tasks are present in ***@***.*** taskQueue} when this method is invoked will be ***@***.*** Runnable#run()}.
+ * @param taskQueue the task queue to drain.
+ * @return ***@***.*** true} if at least ***@***.*** Runnable#run()} was called.
+ */
+ private boolean runExistingTasksFrom(Queue<Runnable> taskQueue) {
+ return taskQueue.offer(BOOKEND_TASK) ? runExistingTasksUntilBookend(taskQueue)
I was thinking the same as @carl-mastrangelo's earlier comment here - that the bookend could be easily avoided by just polling size() times. It would be cheaper and simpler, e.g. there's no need to deal specially with the bounded/full case. The JCTools impls won't involve any iteration when called from the consumer.
In transport-native-epoll/src/main/c/netty_epoll_native.c:
> @@ -186,6 +186,31 @@ static jint netty_epoll_native_epollCreate(JNIEnv* env, jclass clazz) {
return efd;
}
+static void netty_epoll_native_timerFdSetTime(JNIEnv* env, jclass clazz, jint timerFd, jint tvSec, jint tvNsec) {
Did you ever consider changing to use TFD_TIMER_ABSTIME? We could just measure/store the offset of System.nanoTime() from the system clock during initialization and then use deadlines rather than delays across the board to further avoid continual absolute/relative conversion.
In transport-native-epoll/src/main/java/io/netty/channel/epoll/EpollEventLoop.java:
> + }
+
+ private void trySetTimerFd(long candidateNextDeadline) throws IOException {
+ for (;;) {
+ long nextDeadline = nextDeadlineNanos.get();
+ if (nextDeadline - candidateNextDeadline <= 0) {
+ break;
+ }
+ if (nextDeadlineNanos.compareAndSet(nextDeadline, candidateNextDeadline)) {
+ setTimerFd(deadlineToDelayNanos(candidateNextDeadline));
+ // We are setting the timerFd outside of the EventLoop so it is possible that we raced with another call
+ // to set the timer and temporarily increased the value, in which case we should set it back to the
+ // lower value.
+ nextDeadline = nextDeadlineNanos.get();
+ if (nextDeadline - candidateNextDeadline < 0) {
+ setTimerFd(deadlineToDelayNanos(nextDeadline));
I'm not sure that this fully addresses the race condition. I think access to setTimerFd probably needs to be synchronized.
In transport-native-epoll/src/test/java/io/netty/channel/epoll/EpollTest.java:
> - @test(timeout = 5000)
- public void testEpollWaitWithTimeOutMinusOne() throws Exception {
- final EpollEventArray eventArray = new EpollEventArray(8);
- try {
- final FileDescriptor epoll = Native.newEpollCreate();
- final FileDescriptor timerFd = Native.newTimerFd();
- final FileDescriptor eventfd = Native.newEventFd();
- Native.epollCtlAdd(epoll.intValue(), timerFd.intValue(), Native.EPOLLIN);
- Native.epollCtlAdd(epoll.intValue(), eventfd.intValue(), Native.EPOLLIN);
-
- final AtomicReference<Throwable> ref = new AtomicReference<Throwable>();
- Thread t = new Thread(new Runnable() {
- @OverRide
- public void run() {
- try {
- assertEquals(1, Native.epollWait(epoll, eventArray, timerFd, -1, -1));
Instead of removing this test why not just adjust it to use the new equivalent Native.epollWaitNoTimeout method?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
|
@normanmaurer sure, stand by :) |
|
@normanmaurer see #9470 |
| if (scheduledTask.deadlineNanos() <= nanoTime) { | ||
| scheduledTaskQueue.remove(); | ||
| ScheduledFutureTask<?> scheduledTask = scheduledTaskQueue.peek(); | ||
| if (scheduledTask != null && scheduledTask.deadlineNanos() <= nanoTime) { |
There was a problem hiding this comment.
Replace direct comparison using nanoTime with differences ie scheduledTask.deadlineNanos() - nanoTime <= 0
There was a problem hiding this comment.
Netty's ScheduledFutureTask offsets all time stamps and they are generally compared using equality for deadlines while giving up half of the domain (e.g. overflow earlier). I agree it is confusing relative to typical use of System.nanoTime() and we may want to revisit how time duration/expiration is done in Netty 5 in this area.
There was a problem hiding this comment.
"earlier" being after only 300 years, rather than 600 :)
Motivation The epoll transport was updated in netty#7834 to decouple setting of the timerFd from the event loop, so that scheduling delayed tasks does not require waking up epoll_wait. To achieve this, new overridable hooks were added in the AbstractScheduledEventExecutor and SingleThreadEventExecutor superclasses. However, the minimumDelayScheduledTaskRemoved hook has no current purpose and I can't envisage a _practical_ need for it. Removing it would reduce complexity and avoid supporting this specific API indefinitely. We can add something similar later if needed but the opposite is not true. There also isn't a _nice_ way to use the abstractions for wakeup-avoidance optimizations in other EventLoops that don't have a decoupled timer. This PR replaces executeScheduledRunnable and wakesUpForScheduledRunnable with two new methods before/afterFutureTaskScheduled that have slightly different semantics: - They only apply to additions; given the current internals there's no practical use for removals - They allow per-submission wakeup decisions via a boolean return val, which makes them easier to exploit from other existing EL impls (e.g. NIO/KQueue) - They are subjectively "cleaner", taking just the deadline parameter and not exposing Runnables - For current EL/queue impls, only the "after" hook is really needed, but specialized blocking queue impls can conditionally wake on task submission (I have one lined up) Also included are further optimization/simplification/fixes to the timerFd manipulation logic. Modifications - Remove AbstractScheduledEventExecutor#minimumDelayScheduledTaskRemoved() and supporting methods - Uplift NonWakeupRunnable and corresponding default wakesUpForTask() impl from SingleThreadEventLoop to SingleThreadEventExecutor - Change executeScheduledRunnable() to be package-private, and have a final impl in SingleThreadEventExecutor which triggers new overridable hooks before/afterFutureTaskScheduled() - Remove unnecessary use of bookend tasks while draining the task queue - Use new hooks to add simpler wake-up avoidance optimization to NioEventLoop (primarily to demonstrate utility/simplicity) - Reinstate removed EpollTest class In EpollEventLoop: - Refactor to use only the new afterFutureTaskScheduled() hook for updating timerFd - Fix setTimerFd race condition using a monitor - Set nextDeadlineNanos to a negative value while the EL is awake and use this to block timer changes from outside the EL. Restore the known-set value prior to sleeping, updating timerFd first if necessary - Don't read from timerFd when processing expiry event Result - Cleaner API for integrating with different EL/queue timing impls - Fixed race condition to avoid missing scheduled wakeups - Eliminate unnecessary timerFd updates while EL is awake, and unnecessary expired timerFd reads - Avoid unnecessary scheduled-task wakeups when using NIO transport I did not yet further explore the suggestion of using TFD_TIMER_ABSTIME for the timerFd.
Motivation The epoll transport was updated in #7834 to decouple setting of the timerFd from the event loop, so that scheduling delayed tasks does not require waking up epoll_wait. To achieve this, new overridable hooks were added in the AbstractScheduledEventExecutor and SingleThreadEventExecutor superclasses. However, the minimumDelayScheduledTaskRemoved hook has no current purpose and I can't envisage a _practical_ need for it. Removing it would reduce complexity and avoid supporting this specific API indefinitely. We can add something similar later if needed but the opposite is not true. There also isn't a _nice_ way to use the abstractions for wakeup-avoidance optimizations in other EventLoops that don't have a decoupled timer. This PR replaces executeScheduledRunnable and wakesUpForScheduledRunnable with two new methods before/afterFutureTaskScheduled that have slightly different semantics: - They only apply to additions; given the current internals there's no practical use for removals - They allow per-submission wakeup decisions via a boolean return val, which makes them easier to exploit from other existing EL impls (e.g. NIO/KQueue) - They are subjectively "cleaner", taking just the deadline parameter and not exposing Runnables - For current EL/queue impls, only the "after" hook is really needed, but specialized blocking queue impls can conditionally wake on task submission (I have one lined up) Also included are further optimization/simplification/fixes to the timerFd manipulation logic. Modifications - Remove AbstractScheduledEventExecutor#minimumDelayScheduledTaskRemoved() and supporting methods - Uplift NonWakeupRunnable and corresponding default wakesUpForTask() impl from SingleThreadEventLoop to SingleThreadEventExecutor - Change executeScheduledRunnable() to be package-private, and have a final impl in SingleThreadEventExecutor which triggers new overridable hooks before/afterFutureTaskScheduled() - Remove unnecessary use of bookend tasks while draining the task queue - Use new hooks to add simpler wake-up avoidance optimization to NioEventLoop (primarily to demonstrate utility/simplicity) - Reinstate removed EpollTest class In EpollEventLoop: - Refactor to use only the new afterFutureTaskScheduled() hook for updating timerFd - Fix setTimerFd race condition using a monitor - Set nextDeadlineNanos to a negative value while the EL is awake and use this to block timer changes from outside the EL. Restore the known-set value prior to sleeping, updating timerFd first if necessary - Don't read from timerFd when processing expiry event Result - Cleaner API for integrating with different EL/queue timing impls - Fixed race condition to avoid missing scheduled wakeups - Eliminate unnecessary timerFd updates while EL is awake, and unnecessary expired timerFd reads - Avoid unnecessary scheduled-task wakeups when using NIO transport I did not yet further explore the suggestion of using TFD_TIMER_ABSTIME for the timerFd.
Motivation:
EPOLL supports decoupling the timed wakeup mechanism from the selector call. The EPOLL transport takes advantage of this in order to offer more fine grained timer resolution. However we are current calling timerfd_settime on each call to epoll_wait and this is expensive. We don't have to re-arm the timer on every call to epoll_wait and instead only have to arm the timer when a task is scheduled with an earlier expiration than any other existing scheduled task.
Modifications:
duration is the soonest to expire, and if so update with timerfd_settime. We
also drain all the tasks at the end of the event loop to make sure we service
any expired tasks and get an accurate next time delay.
may delay the processing of scheduled tasks but it ensures we transfer all
pending tasks from the task queue to the scheduled priority queue to run the
soonest to expire scheduled task first.
Result:
Fixes #7829