Skip to content

Fix event loop shutdown timing fragility#9616

Merged
normanmaurer merged 2 commits intonetty:4.1from
njhill:el-shutdown-fix
Oct 7, 2019
Merged

Fix event loop shutdown timing fragility#9616
normanmaurer merged 2 commits intonetty:4.1from
njhill:el-shutdown-fix

Conversation

@njhill
Copy link
Copy Markdown
Member

@njhill njhill commented Sep 27, 2019

Motivation

The current event loop shutdown logic is quite fragile and in the epoll/NIO cases relies on the default 1 second wait/select timeout that applies when there are no scheduled tasks. Without this default timeout the shutdown would hang indefinitely.

The timeout only takes effect in this case because queued scheduled tasks are first cancelled in
SingleThreadEventExecutor#confirmShutdown(), but I think even this isn't robust, since the main task queue is subsequently serviced which could result in some new scheduled task being queued with much later deadline.

It also means shutdowns are unnecessarily delayed by up to 1 second.

Modifications

  • Add/extend unit tests to expose the issue
  • Adjust SingleThreadEventExecutor shutdown and confirmShutdown methods
    to explicitly add no-op tasks to the taskQueue so that the subsequent event loop iteration doesn't enter blocking wait (as looks like was originally intended)

Results

Faster and more robust shutdown of event loops, allows removal of the default wait timeout

Motivation

The current event loop shutdown logic is quite fragile and in the
epoll/NIO cases relies on the default 1 second wait/select timeout that
applies when there are no scheduled tasks. Without this default timeout
the shutdown would hang indefinitely.

The timeout only takes effect in this case because queued scheduled
tasks are first cancelled in
SingleThreadEventExecutor#confirmShutdown(), but I _think_ even this
isn't robust, since the main task queue is subsequently serviced which
could result in some new scheduled task being queued with much later
deadline.

It also means shutdowns are unnecessarily delayed by up to 1 second.

Modifications

- Add/extend unit tests to expose the issue
- Adjust SingleThreadEventExecutor shutdown and confirmShutdown methods
to explicitly add no-op tasks to the taskQueue so that the subsequent
event loop iteration doesn't enter blocking wait (as looks like was
originally intended)

Results

Faster and more robust shutdown of event loops, allows removal of the
default wait timeout
@netty-bot
Copy link
Copy Markdown

Can one of the admins verify this patch?

@normanmaurer
Copy link
Copy Markdown
Member

@netty-bot test this please

@normanmaurer
Copy link
Copy Markdown
Member

@njhill you missed to adjust KQueueEventLoopTest.

@njhill
Copy link
Copy Markdown
Member Author

njhill commented Sep 27, 2019

Oops, that's because I don't have a mac :), let me fix and then go to sleep...

@normanmaurer normanmaurer added this to the 4.1.43.Final milestone Sep 27, 2019
@normanmaurer
Copy link
Copy Markdown
Member

@netty-bot test this please

wakeup(true);
taskQueue.offer(WAKEUP_TASK);
try {
Thread.sleep(100);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this sleep?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnou I think so yes since it's how the graceful shutdown / quiet period is currently implemented. That said polling with sleep for this does seem unnecessarily cumbersome (as hinted by the TODO comment above)... probably something for a follow-on improvement.

@Test
public void shutdownGracefullyZeroQuietBeforeStart() throws Exception {
EventLoopGroup group = newEventLoopGroup();
assertTrue(group.shutdownGracefully(0L, 2L, TimeUnit.SECONDS).await(200L));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

200ms seems a bit low, I see this failing on a loaded CI server.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could be bumped a little, but 200ms seems to me like it should be long enough. The timeouts are intentionally low (< 1 sec) since that's what's currently needed to expose the underlying problem. As explained in the description the bug is largely "masked" by the 1sec default wait/select timeout.

Copy link
Copy Markdown
Contributor

@johnou johnou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left a few comments, timeouts in the tests seem a little low.

@njhill
Copy link
Copy Markdown
Member Author

njhill commented Sep 27, 2019

I have a hunch that this might be the cause of or implicated in many of the mystery shutdown delays/issues reported over the years, such as: #2545, #8278, #8306, #7694, #5308, #4241, #6596, #7901.

@normanmaurer
Copy link
Copy Markdown
Member

@njhill amazing work! Thanks so much for this

@normanmaurer normanmaurer self-requested a review October 7, 2019 07:05
@normanmaurer normanmaurer merged commit 170e4de into netty:4.1 Oct 7, 2019
@njhill njhill deleted the el-shutdown-fix branch October 7, 2019 07:24
normanmaurer added a commit that referenced this pull request Oct 7, 2019
Motivation

The current event loop shutdown logic is quite fragile and in the
epoll/NIO cases relies on the default 1 second wait/select timeout that
applies when there are no scheduled tasks. Without this default timeout
the shutdown would hang indefinitely.

The timeout only takes effect in this case because queued scheduled
tasks are first cancelled in
SingleThreadEventExecutor#confirmShutdown(), but I _think_ even this
isn't robust, since the main task queue is subsequently serviced which
could result in some new scheduled task being queued with much later
deadline.

It also means shutdowns are unnecessarily delayed by up to 1 second.

Modifications

- Add/extend unit tests to expose the issue
- Adjust SingleThreadEventExecutor shutdown and confirmShutdown methods
to explicitly add no-op tasks to the taskQueue so that the subsequent
event loop iteration doesn't enter blocking wait (as looks like was
originally intended)

Results

Faster and more robust shutdown of event loops, allows removal of the default wait timeout.
This is a port of #9616
normanmaurer added a commit that referenced this pull request Oct 8, 2019
Motivation

The current event loop shutdown logic is quite fragile and in the
epoll/NIO cases relies on the default 1 second wait/select timeout that
applies when there are no scheduled tasks. Without this default timeout
the shutdown would hang indefinitely.

The timeout only takes effect in this case because queued scheduled
tasks are first cancelled in
SingleThreadEventExecutor#confirmShutdown(), but I _think_ even this
isn't robust, since the main task queue is subsequently serviced which
could result in some new scheduled task being queued with much later
deadline.

It also means shutdowns are unnecessarily delayed by up to 1 second.

Modifications

- Add/extend unit tests to expose the issue
- Adjust SingleThreadEventExecutor shutdown and confirmShutdown methods
to explicitly add no-op tasks to the taskQueue so that the subsequent
event loop iteration doesn't enter blocking wait (as looks like was
originally intended)

Results

Faster and more robust shutdown of event loops, allows removal of the default wait timeout.
This is a port of #9616
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants