Complete all transport handlers at shutdown #85131

Merged
DaveCTurner merged 16 commits into elastic:master from
DaveCTurner:2022-03-19-complete-transport-handlers-at-shutdown
Mar 22, 2022

Conversation

DaveCTurner (Member) commented Mar 19, 2022

Transport handlers may hold resources that need to be released, so we
must complete them all to avoid leaks. Also, the completion of a handler
may fork onto a different threadpool. Today we fork using a plain
Runnable in various places, but this means that the completion may be
rejected if the threadpool queue is full or the executor is shutting
down. This commit changes the behaviour to force execution even if the
queue is full, and to ensure that all handlers' completions are enqueued
with their respective executors before stopping the threadpool.

Closes #84948

Transport handlers may hold resources that need to be released, so we
must complete them all to avoid leaks. Also, the completion of a handler
may fork onto a different threadpool. Today we fork using a plain
`Runnable` in various places, but this means that the completion may be
rejected if the threadpool queue is full or the executor is shutting
down. This commit changes the behaviour to force execution even if the
queue is full, and to complete the handler directly if the target
executor is shutting down.

Closes elastic#84948
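Elasticsearch expresses "force execution" via `AbstractRunnable#isForceExecution()`, which its thread-pool executors honour. As a simplified, self-contained analogue using only `java.util.concurrent` (the executor setup and the fallback behaviour here are illustrative, not the actual Elasticsearch mechanism), the difference between a plain `Runnable` and a forced completion can be sketched as:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ForcedCompletionSketch {
    public static void main(String[] args) throws Exception {
        // One worker, one queue slot: easy to saturate.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1));
        CountDownLatch block = new CountDownLatch(1);
        pool.execute(() -> {
            try {
                block.await(); // occupies the only worker thread
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        pool.execute(() -> {}); // fills the single queue slot

        // A plain Runnable is rejected here, so a handler whose completion
        // it carried would leak; the "forced" behaviour completes the
        // handler regardless (here: inline, on the submitting thread).
        try {
            pool.execute(() -> System.out.println("completed on pool"));
        } catch (RejectedExecutionException e) {
            System.out.println("rejected; completing handler anyway");
        }
        block.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```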
@DaveCTurner DaveCTurner added >test Issues or PRs that are addressing/adding tests :Distributed/Network Http and internode communication implementations v8.2.0 labels Mar 19, 2022
DaveCTurner (Member, Author) commented Mar 19, 2022

Labelling this >test because AFAICT we don't use any thread pools with bounded queues here, so the only possibility of a leak is at shutdown, and that doesn't really matter in production since we're shutting down anyway.

Also it's a draft because I think it deserves some more specific tests of its own - just opening it now for discussion and to see if CI finds anything interesting.

DaveCTurner (Member, Author) commented Mar 20, 2022

On reflection I don't really like the solution as of b1eea15. We should be able to fully close down the transport service (including completing any pending handlers) before stopping any thread pools, avoiding the need to complete handlers on SAME at all. The trouble appears to be that we remove handlers from the map before queueing up their completion so we aren't waiting for these. I suspect it's ok for remote responses because the acquire-handler-and-enqueue-complete flow happens on a transport thread and we wait for these threads to finish when shutting the transport service down, but that's not the case for direct responses which therefore might race against shutdown. Maybe we need a little bit of extra ref-counting to track these things. See 14b2bce.
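The "little bit of extra ref-counting" can be sketched roughly as follows. This is a hypothetical stand-in (the tracker class and the names `withRef`/`stopAndAwait` are illustrative, not the PR's actual implementation): the dispatching thread acquires a ref before forking, the forked completion releases it, and shutdown blocks until all refs drain, so a direct response cannot race past shutdown.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountSketch {
    // Hypothetical tracker: one ref per in-flight direct handler,
    // plus one ref owned by the tracker itself until shutdown begins.
    static class PendingHandlersTracker {
        private final AtomicInteger refs = new AtomicInteger(1);
        private final CountDownLatch drained = new CountDownLatch(1);

        AutoCloseable withRef() {
            refs.incrementAndGet();
            return this::release;
        }

        private void release() {
            if (refs.decrementAndGet() == 0) {
                drained.countDown();
            }
        }

        void stopAndAwait() throws InterruptedException {
            release();       // drop the tracker's own ref
            drained.await(); // block until every handler's ref is released
        }
    }

    public static void main(String[] args) throws Exception {
        PendingHandlersTracker tracker = new PendingHandlersTracker();
        // Acquire the ref on the dispatching thread, release it only from
        // the forked completion, so shutdown cannot overtake the handler.
        AutoCloseable ref = tracker.withRef();
        new Thread(() -> {
            System.out.println("completing direct handler");
            try {
                ref.close();
            } catch (Exception e) {
                throw new AssertionError(e);
            }
        }).start();
        tracker.stopAndAwait();
        System.out.println("all direct handlers drained");
    }
}
```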

DaveCTurner (Member, Author) commented:

@elasticmachine test this please

@DaveCTurner DaveCTurner marked this pull request as ready for review March 21, 2022 08:14
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team. label Mar 21, 2022
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner DaveCTurner changed the title Complete transport handlers at shutdown Complete all transport handlers at shutdown Mar 21, 2022
original-brownbear (Contributor) left a comment

Thanks David, sorry this took me a little while to get my head around. Basically just one question; it looks great in general.

```java
final TransportResponseHandler<?> handler = service.responseHandlers.onResponseReceived(requestId, service);
// ignore if its null, the service logs it
if (handler != null) {
    try (var shutdownBlock = service.pendingDirectHandlers.withRef()) {
```
original-brownbear (Contributor) commented:

I'm struggling with how this would work in the case of a handler running on the generic pool. We increment the pending handlers here, then enqueue the task on the generic queue, and then decrement.
Wouldn't we still complete the stop of the transport service before that task has been executed, and potentially just quietly kill the generic pool without ever running our `ForkingResponseHandlerRunnable`? If it's enqueued and shutdown is called on that queue, I don't think anything notifies our task that it will never run.
Wouldn't we have to push the decrementing of the ref on `shutdownBlock` into the task on the generic pool to actually be 100% safe?

DaveCTurner (Member, Author) replied:

My thinking is that it's enough to enqueue everything; we don't have to wait for them all to execute, because we start the shutdown by calling `ThreadPool#shutdown()`, which should drain the threadpool queue first. It's true that this doesn't properly guarantee that everything executes (we call `shutdownNow()` after 10s regardless), but in practice this only matters for tests, which should be cleaning everything up in good time anyway.

I suppose we could assert that `shutdownNow()` returns an empty list in most tests to be more explicit about that. Not sure I want to do that here though.
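The distinction relied on here is standard `ExecutorService` behaviour: `shutdown()` stops accepting new tasks but still runs everything already queued, while `shutdownNow()` discards the queue and returns the tasks that will never run. A minimal demonstration (plain JDK, not Elasticsearch's `ThreadPool`):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownSemantics {
    public static void main(String[] args) throws Exception {
        // shutdown(): queued tasks still execute before termination.
        ExecutorService graceful = Executors.newSingleThreadExecutor();
        graceful.execute(() -> System.out.println("queued task still ran"));
        graceful.shutdown();
        graceful.awaitTermination(5, TimeUnit.SECONDS);

        // shutdownNow(): queued tasks are returned, never executed.
        ExecutorService abrupt = Executors.newSingleThreadExecutor();
        abrupt.execute(ShutdownSemantics::sleepQuietly); // occupies the worker
        abrupt.execute(() -> System.out.println("never printed"));
        List<Runnable> dropped = abrupt.shutdownNow();
        abrupt.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("dropped " + dropped.size() + " queued task(s)");
    }

    private static void sleepQuietly() {
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```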

DaveCTurner (Member, Author) commented:

FWIW this mirrors the behaviour for handling remote responses which enqueue the work from the transport thread but don't guarantee that the handler is completed before the transport thread is stopped.

original-brownbear (Contributor) commented:

> I suppose we could assert that `shutdownNow()` returns an empty list in most tests to be more explicit about that. Not sure I want to do that here though.

Right ... that's what we actually want in a follow-up. I'm 95% sure that will trip here and there ...

DaveCTurner (Member, Author) replied:

Ah, I checked and it looks like we almost have this already. We already throw an `AssertionError` if graceful shutdown fails in `ESSingleNodeTestCase`, but apparently only an `IOException` in `ESIntegTestCase`. I think that would already result in test failures, but for the avoidance of doubt I opened #85238 to make this a proper assertion.

original-brownbear (Contributor) left a comment

LGTM, makes sense to me. Thanks David! :)

@DaveCTurner DaveCTurner merged commit 8f9d2fa into elastic:master Mar 22, 2022
@DaveCTurner DaveCTurner deleted the 2022-03-19-complete-transport-handlers-at-shutdown branch March 22, 2022 17:25
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Mar 22, 2022
We rely on nodes stopping gracefully in integ tests, rather than timing
out and using `ThreadPool#shutdownNow` to ignore any remaining cleanup
work. Today we throw an `IOException` if the shutdown wasn't graceful.
With this commit we move to an `AssertionError` so we can be sure that
any such problems result in a test failure.

Relates elastic#85131
elasticsearchmachine pushed a commit that referenced this pull request Mar 22, 2022
We rely on nodes stopping gracefully in integ tests, rather than timing
out and using `ThreadPool#shutdownNow` to ignore any remaining cleanup
work. Today we throw an `IOException` if the shutdown wasn't graceful.
With this commit we move to an `AssertionError` so we can be sure that
any such problems result in a test failure.

Relates #85131
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request May 17, 2022
We introduced an assertion in elastic#85131 to check the assumption that there
are no pending non-local response handlers when the `TransportService`
shuts down. We later discovered that this assertion rarely tripped,
which we fixed in elastic#86315, but that fix did not go into 8.2 so there are
still rare failures on this branch. This commit drops the faulty
assertion in 8.2.

Relates elastic#86293
elasticsearchmachine pushed a commit that referenced this pull request May 17, 2022
We introduced an assertion in #85131 to check the assumption that there
are no pending non-local response handlers when the `TransportService`
shuts down. We later discovered that this assertion rarely tripped,
which we fixed in #86315, but that fix did not go into 8.2 so there are
still rare failures on this branch. This commit drops the faulty
assertion in 8.2.

Relates #86293

Labels

:Distributed/Network Http and internode communication implementations Team:Distributed Meta label for distributed team. >test Issues or PRs that are addressing/adding tests v8.2.0


Development

Successfully merging this pull request may close these issues.

LEAK resource not cleaned up RelocationIT testRelocationEstablishedPeerRecoveryRetentionLeases

3 participants