Fix event loop shutdown timing fragility#9616
Conversation
Motivation The current event loop shutdown logic is quite fragile and in the epoll/NIO cases relies on the default 1 second wait/select timeout that applies when there are no scheduled tasks. Without this default timeout the shutdown would hang indefinitely. The timeout only takes effect in this case because queued scheduled tasks are first cancelled in SingleThreadEventExecutor#confirmShutdown(), but I _think_ even this isn't robust, since the main task queue is subsequently serviced which could result in some new scheduled task being queued with much later deadline. It also means shutdowns are unnecessarily delayed by up to 1 second. Modifications - Add/extend unit tests to expose the issue - Adjust SingleThreadEventExecutor shutdown and confirmShutdown methods to explicitly add no-op tasks to the taskQueue so that the subsequent event loop iteration doesn't enter blocking wait (as looks like was originally intended) Results Faster and more robust shutdown of event loops, allows removal of the default wait timeout
|
Can one of the admins verify this patch? |
|
@netty-bot test this please |
|
@njhill you missed to adjust |
|
Oops, that's because I don't have a mac :), let me fix and then go to sleep... |
|
@netty-bot test this please |
| wakeup(true); | ||
| taskQueue.offer(WAKEUP_TASK); | ||
| try { | ||
| Thread.sleep(100); |
There was a problem hiding this comment.
do we still need this sleep?
There was a problem hiding this comment.
@johnou I think so yes since it's how the graceful shutdown / quiet period is currently implemented. That said polling with sleep for this does seem unnecessarily cumbersome (as hinted by the TODO comment above)... probably something for a follow-on improvement.
| @Test | ||
| public void shutdownGracefullyZeroQuietBeforeStart() throws Exception { | ||
| EventLoopGroup group = newEventLoopGroup(); | ||
| assertTrue(group.shutdownGracefully(0L, 2L, TimeUnit.SECONDS).await(200L)); |
There was a problem hiding this comment.
200ms seems a bit low, I see this failing on a loaded CI server.
There was a problem hiding this comment.
Maybe it could be bumped a little, but 200ms seems to me like it should be long enough. The timeouts are intentionally low (< 1 sec) since that's what's currently needed to expose the underlying problem. As explained in the description the bug is largely "masked" by the 1sec default wait/select timeout.
johnou
left a comment
There was a problem hiding this comment.
left a few comments, timeouts in the tests seem a little low.
|
@njhill amazing work! Thanks so much for this |
Motivation The current event loop shutdown logic is quite fragile and in the epoll/NIO cases relies on the default 1 second wait/select timeout that applies when there are no scheduled tasks. Without this default timeout the shutdown would hang indefinitely. The timeout only takes effect in this case because queued scheduled tasks are first cancelled in SingleThreadEventExecutor#confirmShutdown(), but I _think_ even this isn't robust, since the main task queue is subsequently serviced which could result in some new scheduled task being queued with much later deadline. It also means shutdowns are unnecessarily delayed by up to 1 second. Modifications - Add/extend unit tests to expose the issue - Adjust SingleThreadEventExecutor shutdown and confirmShutdown methods to explicitly add no-op tasks to the taskQueue so that the subsequent event loop iteration doesn't enter blocking wait (as looks like was originally intended) Results Faster and more robust shutdown of event loops, allows removal of the default wait timeout. This is a port of #9616
Motivation The current event loop shutdown logic is quite fragile and in the epoll/NIO cases relies on the default 1 second wait/select timeout that applies when there are no scheduled tasks. Without this default timeout the shutdown would hang indefinitely. The timeout only takes effect in this case because queued scheduled tasks are first cancelled in SingleThreadEventExecutor#confirmShutdown(), but I _think_ even this isn't robust, since the main task queue is subsequently serviced which could result in some new scheduled task being queued with much later deadline. It also means shutdowns are unnecessarily delayed by up to 1 second. Modifications - Add/extend unit tests to expose the issue - Adjust SingleThreadEventExecutor shutdown and confirmShutdown methods to explicitly add no-op tasks to the taskQueue so that the subsequent event loop iteration doesn't enter blocking wait (as looks like was originally intended) Results Faster and more robust shutdown of event loops, allows removal of the default wait timeout. This is a port of #9616
Motivation
The current event loop shutdown logic is quite fragile and in the epoll/NIO cases relies on the default 1 second wait/select timeout that applies when there are no scheduled tasks. Without this default timeout the shutdown would hang indefinitely.
The timeout only takes effect in this case because queued scheduled tasks are first cancelled in
SingleThreadEventExecutor#confirmShutdown(), but I think even this isn't robust, since the main task queue is subsequently serviced which could result in some new scheduled task being queued with much later deadline.It also means shutdowns are unnecessarily delayed by up to 1 second.
Modifications
SingleThreadEventExecutorshutdown and confirmShutdown methodsto explicitly add no-op tasks to the taskQueue so that the subsequent event loop iteration doesn't enter blocking wait (as looks like was originally intended)
Results
Faster and more robust shutdown of event loops, allows removal of the default wait timeout