
Fix Netty4ChunkedContinuationsIT#testClientCancellation#110118

Merged
nicktindall merged 8 commits into elastic:main from nicktindall:fix/109866_investigate_chunked_continuations_test
Jun 27, 2024

Conversation

Contributor

@nicktindall nicktindall commented Jun 25, 2024

Closes #109866

testClientCancellation had an issue where the test could fail if the cancellation happened before the mock action returned its response. This change tightens up the assertions to ensure that in the event we do create a chunked response, we eventually close it.

);
checkShutdown();
});
checkShutdown();
Contributor Author

@nicktindall nicktindall Jun 25, 2024


I've so far been unable to reproduce the issue locally or on a CI-like server, but it looks like the flakiness began after 0a008ed was merged.

It changed this block from:

public void onResponse(ChunkedRestResponseBodyPart continuation) {
    channel.writeAndFlush(
        new Netty4ChunkedHttpContinuation(writeSequence, continuation, finishingWrite.combiner()),
        finishingWrite.onDone() // pass the terminal listener/promise along the line
    );
    checkShutdown();
}

to:

public void onResponse(ChunkedRestResponseBodyPart continuation) {
    // always fork a fresh task to avoid stack overflow
    assert Transports.assertDefaultThreadContext(threadContext);
    channel.eventLoop()
        .execute(
            () -> channel.writeAndFlush(
                new Netty4ChunkedHttpContinuation(writeSequence, continuation, finishingWrite.combiner()),
                finishingWrite.onDone() // pass the terminal listener/promise along the line
            )
        );
    checkShutdown();
}

The theory

After the change, the call to checkShutdown() ensures that finishingWrite.onDone() is completed if the event loop shuts down after we call channel.eventLoop().execute(...), but there's no similar check after the deferred call to channel.writeAndFlush(...) that happens some time later, and this can lead to leaks.

I believe we have safeWriteAndFlush as an option here, but it would have meant wrapping finishingWrite.onDone() in an ActionListener.

Or perhaps the checkShutdown() is not necessary because the writeAndFlush always runs on the transport_worker thread now? I don't fully understand the bug we're working around so I can't be sure.

Member


I spent some time looking into this, and I feel this is the right change. But I am in a similar situation: I am not sure whether the Netty bug (netty/netty#8007) is still relevant now that channel.writeAndFlush is always invoked within the event loop. Maybe the bug can still happen? In that case, this would be the right fix.

Short of being certain, we synced and agreed to split the additional tracking changes to test code out so that they can be reviewed and merged separately, which will give us more debug info when it fails again.

Contributor Author


It turns out this theory was incorrect.

Member


Yes, I believe if we're already on the event loop and we enqueue another task then that task will also execute before shutdown -- shutdown only happens once the task queue is completely drained. See e.g. how io.netty.util.concurrent.SingleThreadEventExecutor#confirmShutdown keeps on returning false while there are more tasks to run. I suspect if this were not true then we'd have to deal with it in lots more places than just this.
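The drain-before-shutdown behaviour can be sketched with a plain java.util.concurrent single-threaded executor. This is only an analogy (Netty's SingleThreadEventExecutor has its own quiet-period shutdown protocol, and all class and method names below are hypothetical): a task enqueued from within the loop before shutdown() is called still runs, because a graceful shutdown drains the queue first.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the event loop: a task enqueued from inside the
// loop (the writeAndFlush analogue) still runs after shutdown() is called,
// because graceful shutdown only completes once the task queue is drained.
public class DrainBeforeShutdown {
    public static boolean innerTaskRuns() throws InterruptedException {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        CountDownLatch innerQueued = new CountDownLatch(1);
        CountDownLatch innerRan = new CountDownLatch(1);
        loop.execute(() -> {
            // From "on the loop", enqueue a follow-up task before shutdown begins.
            loop.execute(innerRan::countDown);
            innerQueued.countDown();
        });
        innerQueued.await();            // the follow-up task is now on the queue
        loop.shutdown();                // graceful shutdown: queued tasks still run
        return innerRan.await(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("inner task ran after shutdown(): " + innerTaskRuns());
    }
}
```

Note the latch ordering matters: a JDK executor rejects tasks submitted after shutdown(), whereas Netty tolerates submissions during its quiet period, so the analogy only covers the already-queued case discussed above.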

@nicktindall nicktindall changed the title WIP investigate/fix Netty4ChunkedContinuationsIT Investigate/fix Netty4ChunkedContinuationsIT Jun 25, 2024
@nicktindall nicktindall requested a review from ywangd June 25, 2024 07:47
@nicktindall nicktindall added :Distributed/Network Http and internode communication implementations >test-failure Triaged test failures from CI labels Jun 25, 2024
@nicktindall nicktindall marked this pull request as ready for review June 25, 2024 08:44
@elasticsearchmachine elasticsearchmachine added Team:Distributed Meta label for distributed team. needs:risk Requires assignment of a risk label (low, medium, blocker) labels Jun 25, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@nicktindall nicktindall changed the title Investigate/fix Netty4ChunkedContinuationsIT Fix Netty4ChunkedContinuationsIT.testClientCancellation Jun 25, 2024
@nicktindall nicktindall changed the title Fix Netty4ChunkedContinuationsIT.testClientCancellation Fix Netty4ChunkedContinuationsIT#testClientCancellation Jun 25, 2024
@nicktindall nicktindall added medium-risk An open issue or test failure that is a medium risk to future releases >test Issues or PRs that are addressing/adding tests and removed needs:risk Requires assignment of a risk label (low, medium, blocker) >test-failure Triaged test failures from CI medium-risk An open issue or test failure that is a medium risk to future releases labels Jun 25, 2024
@nicktindall
Contributor Author

Raised #110175 in the hope that it might provide more info about the failure and validate the theory

…e_chunked_continuations_test

# Conflicts:
#	modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
client.execute(TYPE, new Request(), new RestActionListener<>(channel) {
@Override
protected void processResponse(Response response) {
localRefs.mustIncRef();
Contributor Author

@nicktindall nicktindall Jun 27, 2024


After looking at a failed build scan with the logging turned on, it became apparent that the unbalanced sequence of events went like (most recent first):

  1. BaseRestHandler closing the RestChannelConsumer (decRef)
    • at Netty4ChunkedContinuationsIT:641
  2. Closing the resource tracker (decRef)
    • at Netty4ChunkedContinuationsIT:324
  3. RestChannelConsumer.accept (incRef)
    • at Netty4ChunkedContinuationsIT:646
  4. prepareRequest (incRef)
    • at Netty4ChunkedContinuationsIT:637
  5. created (implicit incRef)
    • at Netty4ChunkedContinuationsIT.java:322

Which suggests to me the following sequence of events:

  1. RestChannelConsumer.accept is called (and mustIncRef is called)
  2. TransportInfiniteContinuationsAction.doExecute is called and schedules the return of the chunked response
  3. The request is cancelled, the channel is closed
  4. The chunked response is returned; RestActionListener#onResponse fails in the call to #ensureOpen() and ends up calling RestActionListener#onFailure instead of RestActionListener#processResponse, so the decRef on line 655 is never called, and we have our imbalance.

Options for fixing

If we move the mustIncRef so that it only occurs when we receive the InfiniteContinuationsPlugin.Response from the action (as above), the test shouldn't be susceptible to such a failure, but it does indicate that the Response object can be created but never closed. I'm not sure whether this is an actual bug: the Response type parameter doesn't have any type constraints in RestActionListener, so it's not necessarily Releasable. I'm not clear on what the contract is here and whether RestActionListener has a responsibility to try to close the Response when it fails because #ensureOpen() throws.
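The imbalance described above can be reduced to a small self-contained sketch (all names hypothetical; refs stands in for the test's resource tracker, and channelOpen models ensureOpen() succeeding): the ref is acquired unconditionally in accept(), but released only on the processResponse path, so a closed channel routes to onFailure and leaks exactly one ref.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the ref-count imbalance the test caught. Not the real
// Elasticsearch classes: a bare counter stands in for the resource tracker.
public class RefLeakSketch {
    static final AtomicInteger refs = new AtomicInteger();

    static void mustIncRef() { refs.incrementAndGet(); }
    static void decRef() { refs.decrementAndGet(); }

    /** Simulates the listener; channelOpen models ensureOpen() succeeding. */
    static void handleResponse(boolean channelOpen) {
        mustIncRef();       // acquired in the RestChannelConsumer.accept analogue
        if (channelOpen) {
            decRef();       // released when processResponse completes
        }
        // else: the onFailure path runs and nothing releases the ref -- the leak
    }

    /** Returns the number of refs left dangling after one request. */
    public static int leakedRefs(boolean channelOpen) {
        refs.set(0);
        handleResponse(channelOpen);
        return refs.get();
    }

    public static void main(String[] args) {
        System.out.println("open channel, leaked refs:   " + leakedRefs(true));
        System.out.println("closed channel, leaked refs: " + leakedRefs(false));
    }
}
```

Moving the mustIncRef inside the open-channel branch (next to the work it tracks) makes the acquire and release paths symmetric, which is the shape of the fix in this PR.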

Contributor Author


I think the new positioning of mustIncRef is OK, because instead of

"if we accepted the request" -> "then we close the chunked response"

we now assert

"if we created the chunked response" -> "then we close the chunked response"

But I may have missed something.

Contributor Author


The other plugins (YieldsContinuationsPlugin at least) seem to have the same issue, but I guess it doesn't trigger because we wait for the request to complete when calling those.

Member


Good job on tracking this down. I can confirm the test can fail with the following diff

--- a/modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
+++ b/modules/transport-netty4/src/internalClusterTest/java/org/elasticsearch/http/netty4/Netty4ChunkedContinuationsIT.java
@@ -645,6 +645,12 @@ public class Netty4ChunkedContinuationsIT extends ESNetty4IntegTestCase {
                             public void accept(RestChannel channel) {
                                 localRefs.mustIncRef();
                                 client.execute(TYPE, new Request(), new RestActionListener<>(channel) {
+                                    @Override
+                                    protected void ensureOpen() {
+                                        safeSleep(500);
+                                        super.ensureOpen();
+                                    }
+
                                     @Override
                                     protected void processResponse(Response response) {
                                         channel.sendResponse(RestResponse.chunked(RestStatus.OK, response.getResponseBodyPart(), () -> {

I think this is the right fix; we do the same thing in other RestActionListener implementations. 👍
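The effect of that injected safeSleep(500) can be mimicked with plain threads (hypothetical names, no Netty involved): delaying the open-check gives the cancellation time to close the channel first, so the release that only happens on the processResponse path is skipped.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of why the injected sleep reproduces the failure: the "canceller"
// closes the channel ~100ms in; if the listener checks openness before that,
// the ref is released, and if it checks after (the safeSleep(500) analogue),
// the failure path runs and the ref leaks.
public class CancellationRaceSketch {
    public static boolean refLeaked(long checkDelayMillis) throws InterruptedException {
        AtomicBoolean channelOpen = new AtomicBoolean(true);
        Thread canceller = new Thread(() -> {
            try {
                Thread.sleep(100);      // the client cancellation arriving
            } catch (InterruptedException ignored) {
            }
            channelOpen.set(false);     // the channel closes
        });
        canceller.start();
        Thread.sleep(checkDelayMillis); // delay before the ensureOpen() analogue
        boolean refReleased = channelOpen.get(); // release only happens if still open
        canceller.join();
        return !refReleased;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("no delay, leaked:    " + refLeaked(0));
        System.out.println("500ms delay, leaked: " + refLeaked(500));
    }
}
```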

Member


Good catch, the analysis looks right to me. However I would still prefer that we didn't just leak the response listener in this case, that's not how production code should behave, and we should be able to assert that it's always completed. I opened #110309 with an alternative fix which I would prefer.

@nicktindall nicktindall removed the request for review from original-brownbear June 27, 2024 02:22
@nicktindall nicktindall requested a review from DaveCTurner June 27, 2024 02:22
Member

@ywangd ywangd left a comment


LGTM 👍

@ywangd
Member

ywangd commented Jun 27, 2024

Btw, maybe it's worth updating the PR description now that you have new findings and a fix. Thanks!

@nicktindall nicktindall removed the request for review from DaveCTurner June 27, 2024 04:33
@nicktindall nicktindall merged commit 944f2da into elastic:main Jun 27, 2024
@nicktindall nicktindall deleted the fix/109866_investigate_chunked_continuations_test branch June 27, 2024 04:35
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jun 30, 2024
With the changes in elastic#109519 we now do one more async step while serving
the response, so we need to acquire another ref to track the new step.

Relates elastic#109866
Relates elastic#110118
Relates elastic#110175
Relates elastic#110249
DaveCTurner added a commit that referenced this pull request Jul 1, 2024
With the changes in #109519 we now do one more async step while serving
the response, so we need to acquire another ref to track the new step.

Relates #109866
Relates #110118
Relates #110175
Relates #110249

Labels

:Distributed/Network Http and internode communication implementations Team:Distributed Meta label for distributed team. >test Issues or PRs that are addressing/adding tests v8.15.0


Development

Successfully merging this pull request may close these issues.

[CI] Netty4ChunkedContinuationsIT testClientCancellation failing

4 participants