Complete all transport handlers at shutdown #85131

DaveCTurner merged 16 commits into elastic:master
Conversation
Transport handlers may hold resources that need to be released, so we must complete them all to avoid leaks. Also, the completion of a handler may fork onto a different threadpool. Today we fork using a plain `Runnable` in various places, but this means that the completion may be rejected if the threadpool queue is full or the executor is shutting down. This commit changes the behaviour to force execution even if the queue is full, and to complete the handler directly if the target executor is shutting down. Closes elastic#84948
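The fallback behaviour described above can be sketched roughly as follows. This is a hypothetical illustration, not Elasticsearch's actual API: the class and method names are invented, and Elasticsearch additionally supports force-executing past a full queue (via its own executor implementation), which a plain `ExecutorService` cannot express, so this sketch only shows the "complete directly if the executor is shutting down" half.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical sketch: dispatch a handler's completion to its target
// executor, but never lose it. If the executor rejects the task (e.g.
// because it is shutting down), run the completion on the calling thread
// so that any resources the handler holds are still released.
final class ForkingCompletionSketch {
    static void complete(ExecutorService executor, Runnable completion) {
        try {
            executor.execute(completion); // normal case: runs on the target pool
        } catch (RejectedExecutionException e) {
            completion.run(); // executor shutting down: complete directly, never drop
        }
    }
}
```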
Labelling this. Also it's a draft because I think it deserves some more specific tests of its own; just opening it now for discussion and to see if CI finds anything interesting.
On reflection I don't really like the solution as of b1eea15. We should be able to fully close down the transport service (including completing any pending handlers) before stopping any thread pools, avoiding the need to complete handlers on
@elasticmachine test this please |
Pinging @elastic/es-distributed (Team:Distributed) |
original-brownbear
left a comment
Thanks David, sorry it took me a little while to get my head around this. Only one question basically; looks great in general.
```java
final TransportResponseHandler<?> handler = service.responseHandlers.onResponseReceived(requestId, service);
// ignore if its null, the service logs it
if (handler != null) {
    try (var shutdownBlock = service.pendingDirectHandlers.withRef()) {
```
I'm struggling with how this would work in the case of a handler running on the generic pool. We increment the pending handlers here, then enqueue the task on the generic queue, and then decrement.
Wouldn't we still complete the stop of the transport service before that task has been executed, and potentially just quietly kill the generic pool without ever running our `ForkingResponseHandlerRunnable`? If it's enqueued and shutdown is called on that queue, I don't think there's anything that notifies our task of the fact that it will never run.
Wouldn't we have to push the decrementing of the ref on `shutdownBlock` to the generic pool in this example to actually be 100% safe?
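The ref-counting gate being discussed (`pendingDirectHandlers.withRef()`) can be sketched in miniature with a `Phaser` standing in for Elasticsearch's own ref-counting machinery. This is an invented illustration of the pattern, not the actual implementation: each in-flight dispatch holds a ref, and shutdown blocks until every ref has been released.

```java
import java.util.concurrent.Phaser;

// Hypothetical sketch of a "pending handlers" gate: withRef() acquires a
// ref for the duration of a dispatch, and awaitIdle() blocks until all
// outstanding refs have been released before shutdown may proceed.
final class PendingHandlersSketch {
    private final Phaser refs = new Phaser(1); // one party held by the owner

    AutoCloseable withRef() {
        refs.register();                  // acquire: one more pending dispatch
        return refs::arriveAndDeregister; // release when closed
    }

    void awaitIdle() {
        // drop the owner's party and wait for every open withRef() to close
        refs.arriveAndAwaitAdvance();
    }
}
```

The question in the comment above is exactly where the `close()` call happens: if the ref is released before the forked task runs, `awaitIdle()` no longer guarantees the task has executed.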
My thinking is that it's enough to enqueue everything; we don't have to wait for them all to execute, because we start the shutdown by calling `ThreadPool#shutdown()`, which should drain the threadpool queue first. It's true that this doesn't properly guarantee that everything executes (we call `shutdownNow()` after 10s regardless) but in practice this only matters for tests, which should be cleaning everything up in good time anyway.
I suppose we could assert that `shutdownNow()` returns an empty list in most tests to be more explicit about that. Not sure I want to do that here tho.
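The shutdown sequence described here can be sketched with standard `java.util.concurrent` semantics (the 10-second timeout is taken from the discussion above; the method name `stop` is invented): `shutdown()` stops accepting new tasks but lets the queue drain, while `shutdownNow()` discards and returns whatever is still queued — which is exactly what the proposed empty-list assertion would check.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the graceful-then-forceful shutdown sequence discussed above.
final class ShutdownSketch {
    static List<Runnable> stop(ExecutorService pool) {
        pool.shutdown(); // reject new tasks, but drain the existing queue
        try {
            if (pool.awaitTermination(10, TimeUnit.SECONDS)) {
                return List.of(); // graceful: every enqueued task executed
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return pool.shutdownNow(); // forceful: returns the tasks that never ran
    }
}
```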
FWIW this mirrors the behaviour for handling remote responses, which enqueue the work from the transport thread but don't guarantee that the handler is completed before the transport thread is stopped.
> I suppose we could assert that `shutdownNow()` returns an empty list in most tests to be more explicit about that. Not sure I want to do that here tho.
Right ... that's what we actually want in a follow-up. I'm 95% sure that will trip here and there ...
Ah, I checked and it looks like we almost have this already. We throw an `AssertionError` if graceful shutdown fails in `ESSingleNodeTestCase`, but apparently only an `IOException` in `ESIntegTestCase`. I think that would result in test failures already, but for the avoidance of doubt I opened #85238 to make this a proper assertion.
original-brownbear
left a comment
LGTM, makes sense to me thanks David! :)
We rely on nodes stopping gracefully in integ tests, rather than timing out and using `ThreadPool#shutdownNow` to ignore any remaining cleanup work. Today we throw an `IOException` if the shutdown wasn't graceful. With this commit we move to an `AssertionError` so we can be sure that any such problems result in a test failure. Relates elastic#85131
We introduced an assertion in elastic#85131 to check the assumption that there are no pending non-local response handlers when the `TransportService` shuts down. We later discovered that this assertion rarely tripped, which we fixed in elastic#86315, but that fix did not go into 8.2 so there are still rare failures on this branch. This commit drops the faulty assertion in 8.2. Relates elastic#86293
Transport handlers may hold resources that need to be released, so we must complete them all to avoid leaks. Also, the completion of a handler may fork onto a different threadpool. Today we fork using a plain `Runnable` in various places, but this means that the completion may be rejected if the threadpool queue is full or the executor is shutting down. This commit changes the behaviour to force execution even if the queue is full, and to ensure that all handlers' completions are enqueued with their respective executors before stopping the threadpool.

Closes #84948