
Fix Netty4ChunkedContinuationsIT#testClientCancellation#110118

Merged
nicktindall merged 8 commits into elastic:main from nicktindall:fix/109866_investigate_chunked_continuations_test
Jun 27, 2024

Conversation

Contributor

@nicktindall nicktindall commented Jun 25, 2024

Closes #109866

testClientCancellation had an issue where the test could fail if the cancellation happened before the mock action returned its response. This change tightens up the assertions to ensure that in the event we do create a chunked response, we eventually close it.

);
checkShutdown();
});
checkShutdown();
Contributor Author

@nicktindall nicktindall Jun 25, 2024


I've so far been unable to reproduce the issue locally or on a CI-like server, but it looks like the flakiness began after 0a008ed was merged.

It changed this block from:

public void onResponse(ChunkedRestResponseBodyPart continuation) {
    channel.writeAndFlush(
        new Netty4ChunkedHttpContinuation(writeSequence, continuation, finishingWrite.combiner()),
        finishingWrite.onDone() // pass the terminal listener/promise along the line
    );
    checkShutdown();
}

to:

public void onResponse(ChunkedRestResponseBodyPart continuation) {
    // always fork a fresh task to avoid stack overflow
    assert Transports.assertDefaultThreadContext(threadContext);
    channel.eventLoop()
        .execute(
            () -> channel.writeAndFlush(
                new Netty4ChunkedHttpContinuation(writeSequence, continuation, finishingWrite.combiner()),
                finishingWrite.onDone() // pass the terminal listener/promise along the line
            )
        );
    checkShutdown();
}

The theory

After the change, the call to checkShutdown() ensures that finishingWrite.onDone() is completed if the event loop shuts down after we call channel.eventLoop().execute(...), but there's no similar check after the deferred call to channel.writeAndFlush(...) that happens some time later, and this can lead to leaks.

I believe we have safeWriteAndFlush as an option here, but it would have meant wrapping finishingWrite.onDone() in an ActionListener.

Or perhaps the checkShutdown() is not necessary because the writeAndFlush always runs on the transport_worker thread now? I don't fully understand the bug we're working around so I can't be sure.

Member


I spent some time looking into this, and I feel this is the right change. But I am in a similar situation: I am not sure whether the Netty bug (netty/netty#8007) is still relevant now that channel.writeAndFlush is always invoked within the event loop. Maybe the bug can still happen? In that case, this would be the right fix.

Short of being certain, we synced and agreed to split the additional tracking changes to test code out so that they can be reviewed and merged separately, which will give us more debug info when it fails again.

Contributor Author


It turns out this theory was incorrect.

Member


Yes, I believe if we're already on the event loop and we enqueue another task then that task will also execute before shutdown -- shutdown only happens once the task queue is completely drained. See e.g. how io.netty.util.concurrent.SingleThreadEventExecutor#confirmShutdown keeps on returning false while there are more tasks to run. I suspect if this were not true then we'd have to deal with it in lots more places than just this.
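The drain-before-shutdown behaviour can be sketched with a plain java.util.concurrent single-threaded executor. This is only an analogy (Netty's SingleThreadEventExecutor has its own quiet-period shutdown protocol, and all class and method names below are hypothetical): a task enqueued from within the loop before shutdown() is called still runs, because a graceful shutdown drains the queue first.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the event loop: a task enqueued from inside the
// loop (the writeAndFlush analogue) still runs after shutdown() is called,
// because graceful shutdown only completes once the task queue is drained.
public class DrainBeforeShutdown {
    public static boolean innerTaskRuns() throws InterruptedException {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        CountDownLatch innerQueued = new CountDownLatch(1);
        CountDownLatch innerRan = new CountDownLatch(1);
        loop.execute(() -> {
            // From "on the loop", enqueue a follow-up task before shutdown begins.
            loop.execute(innerRan::countDown);
            innerQueued.countDown();
        });
        innerQueued.await();            // the follow-up task is now on the queue
        loop.shutdown();                // graceful shutdown: queued tasks still run
        return innerRan.await(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("inner task ran after shutdown(): " + innerTaskRuns());
    }
}
```

Note the latch ordering matters: a JDK executor rejects tasks submitted after shutdown(), whereas Netty tolerates submissions during its quiet period, so the analogy only covers the already-queued case discussed above.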

@nicktindall nicktindall changed the title WIP investigate/fix Netty4ChunkedContinuationsIT Investigate/fix Netty4ChunkedContinuationsIT Jun 25, 2024
@nicktindall nicktindall requested a review from ywangd June 25, 2024 07:47
@nicktindall nicktindall added :Distributed/Network Http and internode communication implementations >test-failure Triaged test failures from CI labels Jun 25, 2024
@nicktindall nicktindall marked this pull request as ready for review June 25, 2024 08:44
@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team. needs:risk Requires assignment of a risk label (low, medium, blocker) labels Jun 25, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@nicktindall nicktindall changed the title Investigate/fix Netty4ChunkedContinuationsIT Fix Netty4ChunkedContinuationsIT.testClientCancellation Jun 25, 2024
@nicktindall nicktindall changed the title Fix Netty4ChunkedContinuationsIT.testClientCancellation Fix Netty4ChunkedContinuationsIT#testClientCancellation Jun 25, 2024
@nicktindall nicktindall added medium-risk An open issue or test failure that is a medium risk to future releases >test Issues or PRs that are addressing/adding tests and removed needs:risk Requires assignment of a risk label (low, medium, blocker) >test-failure Triaged test failures from CI medium-risk An open issue or test failure that is a medium risk to future releases labels Jun 25, 2024
@nicktindall
Contributor Author

Raised #110175 in the hope that it might provide more info about the failure and validate the theory

…e_chunked_continuations_test

# Conflicts:
#	modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
client.execute(TYPE, new Request(), new RestActionListener<>(channel) {
@Override
protected void processResponse(Response response) {
localRefs.mustIncRef();
Contributor Author

@nicktindall nicktindall Jun 27, 2024


After looking at a failed build scan with the logging turned on, it became apparent that the unbalanced sequence of events went like (most recent first):

  1. BaseRestHandler closing the RestChannelConsumer (decRef)
    • at Netty4ChunkedContinuationsIT:641
  2. Closing the resource tracker (decRef)
    • at Netty4ChunkedContinuationsIT:324
  3. RestChannelConsumer.accept (incRef)
    • at Netty4ChunkedContinuationsIT:646
  4. prepareRequest (incRef)
    • at Netty4ChunkedContinuationsIT:637
  5. created (implicit incRef)
    • at Netty4ChunkedContinuationsIT.java:322

Which suggests to me the following sequence of events:

  1. RestChannelConsumer.accept is called (and mustIncRef is called)
  2. TransportInfiniteContinuationsAction.doExecute is called and schedules the return of the chunked response
  3. The request is cancelled, the channel is closed
  4. The chunked response is returned; RestActionListener#onResponse fails in the call to #ensureOpen() and ends up calling RestActionListener#onFailure instead of RestActionListener#processResponse, so the decRef on line 655 is never called, and we have our imbalance.

Options for fixing

If we move the mustIncRef so that it only occurs when we receive the InfiniteContinuationsPlugin.Response from the action (as above), the test shouldn't be susceptible to such a failure, but it does indicate that the Response object can be created but never closed. I'm not sure whether this is an actual bug: the Response type parameter doesn't have any type constraints in RestActionListener, so it's not necessarily Releasable. I'm not clear on what the contract is here and whether RestActionListener has a responsibility to try to close the Response when it fails because #ensureOpen() throws.
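The imbalance described above can be reduced to a small self-contained sketch (all names hypothetical; refs stands in for the test's resource tracker, and channelOpen models ensureOpen() succeeding): the ref is acquired unconditionally in accept(), but released only on the processResponse path, so a closed channel routes to onFailure and leaks exactly one ref.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the ref-count imbalance the test caught. Not the real
// Elasticsearch classes: a bare counter stands in for the resource tracker.
public class RefLeakSketch {
    static final AtomicInteger refs = new AtomicInteger();

    static void mustIncRef() { refs.incrementAndGet(); }
    static void decRef() { refs.decrementAndGet(); }

    /** Simulates the listener; channelOpen models ensureOpen() succeeding. */
    static void handleResponse(boolean channelOpen) {
        mustIncRef();       // acquired in the RestChannelConsumer.accept analogue
        if (channelOpen) {
            decRef();       // released when processResponse completes
        }
        // else: the onFailure path runs and nothing releases the ref -- the leak
    }

    /** Returns the number of refs left dangling after one request. */
    public static int leakedRefs(boolean channelOpen) {
        refs.set(0);
        handleResponse(channelOpen);
        return refs.get();
    }

    public static void main(String[] args) {
        System.out.println("open channel, leaked refs:   " + leakedRefs(true));
        System.out.println("closed channel, leaked refs: " + leakedRefs(false));
    }
}
```

Moving the mustIncRef inside the open-channel branch (next to the work it tracks) makes the acquire and release paths symmetric, which is the shape of the fix in this PR.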

Contributor Author


I think the new positioning of mustIncRef is OK, because instead of

"if we accepted the request" -> "then we close the chunked response"

we now assert

"if we created the chunked response" -> "then we close the chunked response"

But I may have missed something.

Contributor Author


The other plugins (YieldsContinuationsPlugin at least) seem to have the same issue, but I guess it doesn't trigger because we wait for the request to complete when calling those.

Member


Good job on tracking this down. I can confirm the test can fail with the following diff

--- a/modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
+++ b/modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
@@ -645,6 +645,12 @@ public class Netty4ChunkedContinuationsIT extends ESNetty4IntegTestCase {
                             public void accept(RestChannel channel) {
                                 localRefs.mustIncRef();
                                 client.execute(TYPE, new Request(), new RestActionListener<>(channel) {
+                                    @Override
+                                    protected void ensureOpen() {
+                                        safeSleep(500);
+                                        super.ensureOpen();
+                                    }
+
                                     @Override
                                     protected void processResponse(Response response) {
                                         channel.sendResponse(RestResponse.chunked(RestStatus.OK, response.getResponseBodyPart(), () -> {

I think this is the right fix; we do the same thing in other RestActionListener implementations. 👍
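The effect of that injected safeSleep(500) can be mimicked with plain threads (hypothetical names, no Netty involved): delaying the open-check gives the cancellation time to close the channel first, so the release that only happens on the processResponse path is skipped.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of why the injected sleep reproduces the failure: the "canceller"
// closes the channel ~100ms in; if the listener checks openness before that,
// the ref is released, and if it checks after (the safeSleep(500) analogue),
// the failure path runs and the ref leaks.
public class CancellationRaceSketch {
    public static boolean refLeaked(long checkDelayMillis) throws InterruptedException {
        AtomicBoolean channelOpen = new AtomicBoolean(true);
        Thread canceller = new Thread(() -> {
            try {
                Thread.sleep(100);      // the client cancellation arriving
            } catch (InterruptedException ignored) {
            }
            channelOpen.set(false);     // the channel closes
        });
        canceller.start();
        Thread.sleep(checkDelayMillis); // delay before the ensureOpen() analogue
        boolean refReleased = channelOpen.get(); // release only happens if still open
        canceller.join();
        return !refReleased;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("no delay, leaked:    " + refLeaked(0));
        System.out.println("500ms delay, leaked: " + refLeaked(500));
    }
}
```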

Member


Good catch, the analysis looks right to me. However I would still prefer that we didn't just leak the response listener in this case, that's not how production code should behave, and we should be able to assert that it's always completed. I opened #110309 with an alternative fix which I would prefer.

@nicktindall nicktindall removed the request for review from original-brownbear June 27, 2024 02:22
@nicktindall nicktindall requested a review from DaveCTurner June 27, 2024 02:22
Member

@ywangd ywangd left a comment


LGTM 👍

@ywangd
Member

ywangd commented Jun 27, 2024

Btw, maybe it's worth updating the PR description now that you have new findings and a fix. Thanks!

@nicktindall nicktindall removed the request for review from DaveCTurner June 27, 2024 04:33
@nicktindall nicktindall merged commit 944f2da into elastic:main Jun 27, 2024
@nicktindall nicktindall deleted the fix/109866_investigate_chunked_continuations_test branch June 27, 2024 04:35
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 30, 2024
With the changes in elastic#109519 we now do one more async step while serving
the response, so we need to acquire another ref to track the new step.

Relates elastic#109866
Relates elastic#110118
Relates elastic#110175
Relates elastic#110249
DaveCTurner added a commit that referenced this pull request Jul 1, 2024
With the changes in #109519 we now do one more async step while serving
the response, so we need to acquire another ref to track the new step.

Relates #109866
Relates #110118
Relates #110175
Relates #110249

Labels

:Distributed/Network Http and internode communication implementations Team:Distributed Meta label for distributed team. >test Issues or PRs that are addressing/adding tests v8.15.0


Development

Successfully merging this pull request may close these issues.

[CI] Netty4ChunkedContinuationsIT testClientCancellation failing

4 participants