Complete all transport handlers at shutdown #85131

Merged
DaveCTurner merged 16 commits into elastic:master from
DaveCTurner:2022-03-19-complete-transport-handlers-at-shutdown
Mar 22, 2022

Conversation

DaveCTurner (Member) commented Mar 19, 2022

Transport handlers may hold resources that need to be released, so we
must complete them all to avoid leaks. Also, the completion of a handler
may fork onto a different threadpool. Today we fork using a plain
Runnable in various places, but this means that the completion may be
rejected if the threadpool queue is full or the executor is shutting
down. This commit changes the behaviour to force execution even if the
queue is full, and to ensure that all handlers' completions are enqueued
with their respective executors before stopping the threadpool.

Closes #84948

Transport handlers may hold resources that need to be released, so we
must complete them all to avoid leaks. Also, the completion of a handler
may fork onto a different threadpool. Today we fork using a plain
`Runnable` in various places, but this means that the completion may be
rejected if the threadpool queue is full or the executor is shutting
down. This commit changes the behaviour to force execution even if the
queue is full, and to complete the handler directly if the target
executor is shutting down.

Closes elastic#84948
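Elasticsearch expresses "force execution" via `AbstractRunnable#isForceExecution()`, which its thread-pool executors honour. As a simplified, self-contained analogue using only `java.util.concurrent` (the executor setup and the fallback behaviour here are illustrative, not the actual Elasticsearch mechanism), the difference between a plain `Runnable` and a forced completion can be sketched as:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ForcedCompletionSketch {
    public static void main(String[] args) throws Exception {
        // One worker, one queue slot: easy to saturate.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        CountDownLatch block = new CountDownLatch(1);
        pool.execute(() -> {
            try {
                block.await(); // occupies the only worker thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        pool.execute(() -> {}); // fills the single queue slot

        // A plain Runnable is rejected here, so a handler whose completion
        // it carried would leak; the "forced" behaviour completes the
        // handler regardless (here: inline, on the submitting thread).
        try {
            pool.execute(() -> System.out.println("completed on pool"));
        } catch (RejectedExecutionException e) {
            System.out.println("rejected; completing handler anyway");
        }
        block.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```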
@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Distributed/Network Http and internode communication implementations v8.2.0 labels Mar 19, 2022
DaveCTurner (Member, Author) commented Mar 19, 2022

Labelling this >test because AFAICT we don't use any thread pools with bounded queues here, so the only possibility of a leak is at shutdown, and that doesn't really matter in production since we're shutting down anyway.

Also it's a draft because I think it deserves some more specific tests of its own - just opening it now for discussion and to see if CI finds anything interesting.

DaveCTurner (Member, Author) commented Mar 20, 2022

On reflection I don't really like the solution as of b1eea15. We should be able to fully close down the transport service (including completing any pending handlers) before stopping any thread pools, avoiding the need to complete handlers on SAME at all. The trouble appears to be that we remove handlers from the map before queueing up their completion so we aren't waiting for these. I suspect it's ok for remote responses because the acquire-handler-and-enqueue-complete flow happens on a transport thread and we wait for these threads to finish when shutting the transport service down, but that's not the case for direct responses which therefore might race against shutdown. Maybe we need a little bit of extra ref-counting to track these things. See 14b2bce.
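The "little bit of extra ref-counting" can be sketched roughly as follows. This is a hypothetical stand-in (the tracker class and the names `withRef`/`stopAndAwait` are illustrative, not the PR's actual implementation): the dispatching thread acquires a ref before forking, the forked completion releases it, and shutdown blocks until all refs drain, so a direct response cannot race past shutdown.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountSketch {
    // Hypothetical tracker: one ref per in-flight direct handler,
    // plus one ref owned by the tracker itself until shutdown begins.
    static class PendingHandlersTracker {
        private final AtomicInteger refs = new AtomicInteger(1);
        private final CountDownLatch drained = new CountDownLatch(1);

        AutoCloseable withRef() {
            refs.incrementAndGet();
            return this::release;
        }

        private void release() {
            if (refs.decrementAndGet() == 0) {
                drained.countDown();
            }
        }

        void stopAndAwait() throws InterruptedException {
            release();       // drop the tracker's own ref
            drained.await(); // block until every handler's ref is released
        }
    }

    public static void main(String[] args) throws Exception {
        PendingHandlersTracker tracker = new PendingHandlersTracker();
        // Acquire the ref on the dispatching thread, release it only from
        // the forked completion, so shutdown cannot overtake the handler.
        AutoCloseable ref = tracker.withRef();
        new Thread(() -> {
            System.out.println("completing direct handler");
            try {
                ref.close();
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        }).start();
        tracker.stopAndAwait();
        System.out.println("all direct handlers drained");
    }
}
```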

DaveCTurner (Member, Author) commented:

@elasticmachine test this please

@DaveCTurner DaveCTurner marked this pull request as ready for review March 21, 2022 08:14
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Mar 21, 2022
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner changed the title Complete transport handlers at shutdown Complete all transport handlers at shutdown Mar 21, 2022
original-brownbear (Contributor) left a comment

Thanks David, sorry this took me a little while to get my head around. Basically just one question; it looks great in general.

```java
final TransportResponseHandler<?> handler = service.responseHandlers.onResponseReceived(requestId, service);
// ignore if its null, the service logs it
if (handler != null) {
    try (var shutdownBlock = service.pendingDirectHandlers.withRef()) {
```
original-brownbear (Contributor) commented:

I'm struggling with how this would work in the case of a handler running on the generic pool. We increment the pending handlers here, then enqueue the task on the generic queue, and then decrement.
Wouldn't we still complete the stop of the transport service before that task has been executed, and potentially just quietly kill the generic pool without ever running our `ForkingResponseHandlerRunnable`? If it's enqueued and shutdown is called on that queue, I don't think anything notifies our task that it will never run.
Wouldn't we have to push the decrementing of the ref on `shutdownBlock` into the task on the generic pool to actually be 100% safe?

DaveCTurner (Member, Author) replied:

My thinking is that it's enough to enqueue everything; we don't have to wait for them all to execute, because we start the shutdown by calling `ThreadPool#shutdown()`, which should drain the threadpool queue first. It's true that this doesn't properly guarantee that everything executes (we call `shutdownNow()` after 10s regardless), but in practice this only matters for tests, which should be cleaning everything up in good time anyway.

I suppose we could assert that `shutdownNow()` returns an empty list in most tests to be more explicit about that. Not sure I want to do that here though.
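The distinction relied on here is standard `ExecutorService` behaviour: `shutdown()` stops accepting new tasks but still runs everything already queued, while `shutdownNow()` discards the queue and returns the tasks that will never run. A minimal demonstration (plain JDK, not Elasticsearch's `ThreadPool`):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownSemantics {
    public static void main(String[] args) throws Exception {
        // shutdown(): queued tasks still execute before termination.
        ExecutorService graceful = Executors.newSingleThreadExecutor();
        graceful.execute(() -> System.out.println("queued task still ran"));
        graceful.shutdown();
        graceful.awaitTermination(5, TimeUnit.SECONDS);

        // shutdownNow(): queued tasks are returned, never executed.
        ExecutorService abrupt = Executors.newSingleThreadExecutor();
        abrupt.execute(ShutdownSemantics::sleepQuietly); // occupies the worker
        abrupt.execute(() -> System.out.println("never printed"));
        List<Runnable> dropped = abrupt.shutdownNow();
        abrupt.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("dropped " + dropped.size() + " queued task(s)");
    }

    private static void sleepQuietly() {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```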

DaveCTurner (Member, Author) commented:

FWIW this mirrors the behaviour for handling remote responses which enqueue the work from the transport thread but don't guarantee that the handler is completed before the transport thread is stopped.

original-brownbear (Contributor) commented:

> I suppose we could assert that `shutdownNow()` returns an empty list in most tests to be more explicit about that. Not sure I want to do that here though.

Right ... that's what we actually want in a follow-up. I'm 95% sure that will trip here and there ...

DaveCTurner (Member, Author) replied:

Ah, I checked and it looks like we almost have this already. We already throw an `AssertionError` if graceful shutdown fails in `ESSingleNodeTestCase`, but apparently only an `IOException` in `ESIntegTestCase`. I think that would already result in test failures, but for the avoidance of doubt I opened #85238 to make this a proper assertion.

original-brownbear (Contributor) left a comment

LGTM, makes sense to me. Thanks David! :)

@DaveCTurner DaveCTurner merged commit 8f9d2fa into elastic:master Mar 22, 2022
@DaveCTurner DaveCTurner deleted the 2022-03-19-complete-transport-handlers-at-shutdown branch March 22, 2022 17:25
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 22, 2022
We rely on nodes stopping gracefully in integ tests, rather than timing
out and using `ThreadPool#shutdownNow` to ignore any remaining cleanup
work. Today we throw an `IOException` if the shutdown wasn't graceful.
With this commit we move to an `AssertionError` so we can be sure that
any such problems result in a test failure.

Relates elastic#85131
elasticsearchmachine pushed a commit that referenced this pull request Mar 22, 2022
We rely on nodes stopping gracefully in integ tests, rather than timing
out and using `ThreadPool#shutdownNow` to ignore any remaining cleanup
work. Today we throw an `IOException` if the shutdown wasn't graceful.
With this commit we move to an `AssertionError` so we can be sure that
any such problems result in a test failure.

Relates #85131
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request May 17, 2022
We introduced an assertion in elastic#85131 to check the assumption that there
are no pending non-local response handlers when the `TransportService`
shuts down. We later discovered that this assertion rarely tripped,
which we fixed in elastic#86315, but that fix did not go into 8.2 so there are
still rare failures on this branch. This commit drops the faulty
assertion in 8.2.

Relates elastic#86293
elasticsearchmachine pushed a commit that referenced this pull request May 17, 2022
We introduced an assertion in #85131 to check the assumption that there
are no pending non-local response handlers when the `TransportService`
shuts down. We later discovered that this assertion rarely tripped,
which we fixed in #86315, but that fix did not go into 8.2 so there are
still rare failures on this branch. This commit drops the faulty
assertion in 8.2.

Relates #86293

Labels

:Distributed/Network Http and internode communication implementations Team:Distributed Meta label for distributed team. >test Issues or PRs that are addressing/adding tests v8.2.0


Development

Successfully merging this pull request may close these issues.

LEAK resource not cleaned up RelocationIT testRelocationEstablishedPeerRecoveryRetentionLeases

3 participants