Fix Netty4ChunkedContinuationsIT#testClientCancellation #110118
Conversation
...src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
```java
            );
            checkShutdown();
        });
        checkShutdown();
```
I've so far been unable to reproduce the issue locally or on a CI-like server, but it looks like the flakiness began after 0a008ed was merged.
It changed this block from:
to:
The theory
So after the change, the call to `checkShutdown()` ensures that `finishingWrite.onDone()` is completed if the event loop shuts down after we make the call to `channel.eventLoop().execute(...)`, but there's no similar check after the call to `channel.writeAndFlush(...)` that happens some time later, and this can lead to leaks.
I believe we have `safeWriteAndFlush` as an option here, but it would have meant wrapping `finishingWrite.onDone()` in an `ActionListener`.
Or perhaps the `checkShutdown()` is not necessary because the `writeAndFlush` always runs on the `transport_worker` thread now? I don't fully understand the bug we're working around, so I can't be sure.
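The `checkShutdown()` idea can be illustrated outside Netty. This is a minimal analogue (not the Netty or Elasticsearch API; `runOrCompleteInline` is a hypothetical helper): after handing a completion callback to a single-threaded "event loop", we make sure the callback still completes even if the loop has already shut down, so the listener is never leaked.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

public class CheckShutdownSketch {
    // Hypothetical helper: hand a completion callback to an "event loop",
    // falling back to inline execution if the loop has shut down. This
    // mirrors the role checkShutdown() plays after the execute(...) call.
    static void runOrCompleteInline(ExecutorService loop, Runnable onDone) {
        try {
            loop.execute(onDone);
        } catch (RejectedExecutionException e) {
            // The loop refused the task because it already shut down:
            // complete inline so the callback is not lost.
            onDone.run();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        loop.shutdown();
        loop.awaitTermination(1, TimeUnit.SECONDS);
        // Even though the loop is gone, the callback still completes.
        runOrCompleteInline(loop, () -> System.out.println("onDone completed"));
    }
}
```

The real code has the extra wrinkle that the second hand-off (`writeAndFlush`) happens later and on a different path, which is where the missing check was suspected.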
I spent some time looking into this. I feel this is the right change, but I am in a similar situation: I am not sure whether the Netty bug (netty/netty#8007) is still relevant now that `channel.writeAndFlush` is always invoked within the event loop. Maybe the bug can still happen? In that case, this would be the right fix.
Short of being certain, we synced and agreed to split the additional tracking changes to the test code out separately, so that they can be reviewed and merged and will give us more debug info when it fails again.
It turns out this theory was incorrect.
Yes, I believe if we're already on the event loop and we enqueue another task then that task will also execute before shutdown -- shutdown only happens once the task queue is completely drained. See e.g. how io.netty.util.concurrent.SingleThreadEventExecutor#confirmShutdown keeps on returning false while there are more tasks to run. I suspect if this were not true then we'd have to deal with it in lots more places than just this.
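The "tasks enqueued from the event loop still run before shutdown" behaviour can be demonstrated with a plain `ExecutorService` standing in for the Netty event loop (an illustrative analogue, not Netty itself): a task submitted from inside the loop before `shutdown()` is requested still executes, because shutdown only completes once the queue is drained.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class DrainBeforeShutdown {
    public static void main(String[] args) throws Exception {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        AtomicBoolean innerRan = new AtomicBoolean(false);
        loop.execute(() -> {
            // From within the "event loop", enqueue a follow-up task and
            // then request shutdown: the already-queued task still runs,
            // since shutdown() lets previously submitted tasks complete.
            loop.execute(() -> innerRan.set(true));
            loop.shutdown();
        });
        loop.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("inner task ran: " + innerRan.get());
    }
}
```

Netty's `SingleThreadEventExecutor#confirmShutdown` gives the same guarantee for its event loops, which is why the extra `checkShutdown()` after an enqueue made from the loop itself turned out to be unnecessary.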
Pinging @elastic/es-distributed (Team:Distributed)
Raised #110175 in the hope that it might provide more info about the failure and validate the theory.
…e_chunked_continuations_test # Conflicts: # modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
This reverts commit b0a9cb6.
…e_chunked_continuations_test
```java
client.execute(TYPE, new Request(), new RestActionListener<>(channel) {
    @Override
    protected void processResponse(Response response) {
        localRefs.mustIncRef();
```
After looking at a failed build scan with the logging turned on, it became apparent that the unbalanced sequence of events went like this (most recent first):

- `BaseRestHandler` closing the `RestChannelConsumer` (decRef) at `Netty4ChunkedContinuationsIT:641`
- Closing the resource tracker (decRef) at `Netty4ChunkedContinuationsIT:324`
- `RestChannelConsumer.accept` (incRef) at `Netty4ChunkedContinuationsIT:646`
- `prepareRequest` (incRef) at `Netty4ChunkedContinuationsIT:637`
- created (implicit incRef) at `Netty4ChunkedContinuationsIT.java:322`
Which suggests to me the following sequence of events:

- `RestChannelConsumer.accept` is called (and `mustIncRef` is called)
- `TransportInfiniteContinuationsAction.doExecute` is called and schedules the return of the chunked response
- The request is cancelled and the channel is closed
- The chunked response is returned; `RestActionListener#onResponse` fails in the call to `#ensureOpen()` and ends up calling `RestActionListener#onFailure` instead of `RestActionListener#processResponse`, so the `decRef` on line 655 is never called, and we have our imbalance.
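The imbalance can be modelled with a tiny ref-count sketch (illustrative only; the names and structure here are simplified stand-ins, not the test's actual code): the incRef taken when the request is accepted is paired with a decRef on the success path, so when the failure path is taken instead, the ref is leaked.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountImbalance {
    // Simplified model of the test's resource tracker:
    // starts at 1 for the implicit incRef at creation.
    static final AtomicInteger refs = new AtomicInteger(1);

    // Stand-in for the listener's onResponse: the incRef is taken
    // unconditionally, but the matching decRef only happens on the
    // processResponse (channel still open) path.
    static void onResponse(boolean channelOpen) {
        refs.incrementAndGet();          // incRef on accepting the request
        if (!channelOpen) {
            return;                      // ensureOpen() threw: onFailure path, no decRef
        }
        refs.decrementAndGet();          // processResponse path releases the ref
    }

    public static void main(String[] args) {
        onResponse(false);               // request cancelled before the response arrives
        System.out.println("outstanding refs: " + refs.get()); // 2 instead of 1: leaked
    }
}
```

Moving the incRef so it is only taken once the response actually arrives removes the unbalanced path, which is the shape of the fix discussed below.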
Options for fixing
If we move the `mustIncRef` such that it only occurs when we receive the `InfiniteContinuationsPlugin.Response` from the action (as above), the test shouldn't be susceptible to such a failure, but it does indicate that the `Response` object can be created but never closed. I'm not sure if this is an actual bug: the `Response` type parameter doesn't have any type constraints in `RestActionListener`, so it's not necessarily `Releasable`. I'm not clear on what the contract is here, or whether `RestActionListener` has a responsibility to try to close the `Response` when it fails because `#ensureOpen()` throws.
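Because the type parameter is unconstrained, the most a generic listener could do on the failure path is a best-effort close when the response happens to be closeable. This is a hypothetical sketch of that idea (`discard` is an invented helper using `AutoCloseable` for illustration; it is not the actual `RestActionListener` contract):

```java
public class CloseIfReleasable {
    // Hypothetical: release a discarded response only if it is closeable.
    // A generic listener cannot assume more than this about an
    // unconstrained response type.
    static void discard(Object response) {
        if (response instanceof AutoCloseable closeable) {
            try {
                closeable.close();
            } catch (Exception e) {
                // best-effort cleanup; nothing sensible to do on failure
            }
        }
    }

    public static void main(String[] args) {
        final boolean[] closed = {false};
        AutoCloseable resource = () -> closed[0] = true;
        discard(resource);          // closeable responses get released
        discard("plain response");  // non-closeable responses are simply dropped
        System.out.println("closed: " + closed[0]);
    }
}
```

Whether such best-effort cleanup is appropriate is exactly the open contract question raised above; #110309 took a different approach.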
I think the new positioning of `mustIncRef` is OK, because instead of asserting
"if we accepted the request" -> "then we close the chunked response"
we now assert
"if we created the chunked response" -> "then we close the chunked response"
But I may have missed something.
The other plugins (YieldsContinuationsPlugin at least) seem to have the same issue, but I guess it doesn't trigger because we wait for the request to complete when calling those.
Good job tracking this down. I can confirm the test can fail with the following diff:
```diff
--- a/modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
+++ b/modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
@@ -645,6 +645,12 @@ public class Netty4ChunkedContinuationsIT extends ESNetty4IntegTestCase {
         public void accept(RestChannel channel) {
             localRefs.mustIncRef();
             client.execute(TYPE, new Request(), new RestActionListener<>(channel) {
+                @Override
+                protected void ensureOpen() {
+                    safeSleep(500);
+                    super.ensureOpen();
+                }
+
                 @Override
                 protected void processResponse(Response response) {
                     channel.sendResponse(RestResponse.chunked(RestStatus.OK, response.getResponseBodyPart(), () -> {
```

I think this is the right fix; we are doing the same thing in other `RestActionListener` implementations. 👍
Good catch, the analysis looks right to me. However I would still prefer that we didn't just leak the response listener in this case, that's not how production code should behave, and we should be able to assert that it's always completed. I opened #110309 with an alternative fix which I would prefer.
Btw, maybe it's worth updating the PR description now that you have new findings and a fix. Thanks!
With the changes in elastic#109519 we now do one more async step while serving the response, so we need to acquire another ref to track the new step. Relates elastic#109866 Relates elastic#110118 Relates elastic#110175 Relates elastic#110249
Closes #109866
`testClientCancellation` had an issue where the test could fail if the cancellation happened before the mock action returned its response. This change tightens up the assertions to ensure that, in the event we do create a chunked response, we eventually close it.