Handle InternalSendException inline for non-forking handlers by ywangd · Pull Request #114375 · elastic/elasticsearch

ywangd · 2024-10-09T03:02:32Z

When TransportService fails to send a transport action, it can complete the listener's onFailure with the generic executor. If the listener is a PlainActionFuture and also waits to be completed with a generic thread, it will trip the assertCompleteAllowed assertion.

elasticsearch/server/src/main/java/org/elasticsearch/transport/TransportService.java

Lines 1062 to 1064 in fb482f8

    
    } else if (handlerExecutor == EsExecutors.DIRECT_EXECUTOR_SERVICE) {
        // if the handler is non-forking then dispatch to GENERIC to avoid a possible stack overflow
        return threadPool.generic();

With this PR, we no longer fork to the generic thread pool and instead just handle the exception inline on the current thread. The expectation is that the downstream handler should take care of potential stack overflow issues. This is similar to what is done in #109236.
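As a rough illustration of the difference between forking and handling inline, here is a minimal JDK-only sketch. It does not use any Elasticsearch classes: `InlineFailureSketch`, `DIRECT`, `threadThatFails`, and the thread name `generic-0` are all hypothetical stand-ins, and a `CompletableFuture` plays the role of the listener. It only shows which thread ends up delivering the failure in each case.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class InlineFailureSketch {
    /** Hypothetical stand-in for a direct (non-forking) executor: runs the task on the calling thread. */
    public static final Executor DIRECT = Runnable::run;

    /**
     * Fails a listener-like future via the given executor and returns the name
     * of the thread that actually delivered the failure.
     */
    public static String threadThatFails(Executor failureExecutor) {
        CompletableFuture<String> deliveredOn = new CompletableFuture<>();
        CompletableFuture<Void> listener = new CompletableFuture<>();
        // Registered before completion, so this callback runs on the completing thread.
        listener.whenComplete((r, e) -> deliveredOn.complete(Thread.currentThread().getName()));
        failureExecutor.execute(
            () -> listener.completeExceptionally(new IOException("simulated exception in sendRequest"))
        );
        return deliveredOn.join();
    }

    public static void main(String[] args) {
        // A one-thread pool standing in for a shared "generic" pool (an assumption for this sketch).
        ExecutorService generic = Executors.newSingleThreadExecutor(r -> new Thread(r, "generic-0"));
        try {
            System.out.println("forked: " + threadThatFails(generic));
            System.out.println("inline: " + threadThatFails(DIRECT));
        } finally {
            generic.shutdown();
        }
    }
}
```

With the direct executor the failure is delivered on the caller's own thread, which is the behaviour this PR adopts for non-forking handlers.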

elasticsearchmachine · 2024-10-09T03:02:56Z

Pinging @elastic/es-distributed (Team:Distributed)

henningandersen

I think I'd rather avoid the dispatch, just like we did in:

#109236

I think the stack overflow breaking should be done in the handler instead if necessary.
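One way a handler could take on that responsibility itself is to fork its own failure path, so that the caller never needs to. The following is only a sketch under that assumption, written against the JDK rather than the Elasticsearch listener API; `ForkingHandlerSketch`, `forkingOnFailure`, and `threadRunningDelegate` are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;

public class ForkingHandlerSketch {
    /**
     * Wraps a failure callback so that it always runs on the given executor,
     * which unwinds the caller's stack before the delegate runs.
     */
    public static Consumer<Exception> forkingOnFailure(Executor executor, Consumer<Exception> delegate) {
        return e -> executor.execute(() -> delegate.accept(e));
    }

    /** Invokes a wrapped handler and reports which thread ran the delegate. */
    public static String threadRunningDelegate(Executor executor) {
        CompletableFuture<String> seen = new CompletableFuture<>();
        Consumer<Exception> handler = forkingOnFailure(
            executor,
            e -> seen.complete(Thread.currentThread().getName())
        );
        // The caller invokes the handler inline; the wrapper decides where the work runs.
        handler.accept(new RuntimeException("simulated failure"));
        return seen.join();
    }
}
```

The design point is that the decision to fork lives with the handler that knows whether deep recursion is possible, not with the code that reports the failure.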

ywangd · 2024-10-09T08:31:11Z

Thanks for the really helpful pointer. I'll repurpose this PR to follow your suggestion.

Please note that the issue may not be entirely fixed with the new approach. As said in the PR description, there is another failure path in TransportService#getConnectionOrFail which is before dispatching. We can address that separately.

ywangd · 2024-10-10T04:16:56Z

server/src/test/java/org/elasticsearch/transport/TransportServiceLifecycleTests.java

+    public void testInternalSendExceptionWithNonForkingResponseHandlerCanCompleteFutureWaiterFromGenericThread() throws Exception {
+        try (var nodeA = new TestNode("node-A")) {
+            final var future = new PlainActionFuture<TransportResponse.Empty>();
+            final var latch = new CountDownLatch(1);
+            nodeA.transportService.getThreadPool().generic().execute(() -> {
+                assertEquals("simulated exception in sendRequest", getSendRequestException(future, IOException.class).getMessage());
+                latch.countDown();
+            });
+            nodeA.transportService.sendRequest(
+                nodeA.getThrowingConnection(),
+                TestNode.randomActionName(random()),
+                new EmptyRequest(),
+                TransportRequestOptions.EMPTY,
+                new ActionListenerResponseHandler<>(future.delegateResponse((l, e) -> {
+                    assertThat(Thread.currentThread().getName(), startsWith("TEST-"));
+                    l.onFailure(e);
+                }), unusedReader(), EsExecutors.DIRECT_EXECUTOR_SERVICE)
+            );
+            assertBusy(() -> assertTrue(future.isDone()));
+        }


This is a more direct low-level test for the assertCompleteAllowed issue.

elasticsearchmachine · 2024-10-10T04:17:38Z

Hi @ywangd, I've created a changelog YAML for you.

ywangd · 2024-10-10T04:18:54Z

I updated the PR to avoid forking in handleInternalSendException. Also changed the label from :non-issue to :bug following #109236. This is now ready for another look. Thank you!

DaveCTurner

Change LGTM but I left some comments on the tests. FWIW I'd have written this test as follows:

    public void testInternalSendExceptionWithNonForkingResponseHandlerCanCompleteFutureWaiterFromGenericThread() {
        try (var nodeA = new TestNode("node-A")) {
            final var testThread = Thread.currentThread();
            assertEquals(
                "simulated exception in sendRequest",
                safeAwaitAndUnwrapFailure(
                    IOException.class,
                    TransportResponse.Empty.class,
                    l -> nodeA.transportService.sendRequest(
                        nodeA.getThrowingConnection(),
                        TestNode.randomActionName(random()),
                        new EmptyRequest(),
                        TransportRequestOptions.EMPTY,
                        new ActionListenerResponseHandler<>(
                            ActionListener.runBefore(l, () -> assertSame(testThread, Thread.currentThread())),
                            unusedReader(),
                            EsExecutors.DIRECT_EXECUTOR_SERVICE
                        )
                    )
                ).getMessage()
            );
        }
    }

DaveCTurner · 2024-10-10T07:42:04Z

server/src/test/java/org/elasticsearch/transport/TransportServiceLifecycleTests.java

+            final var latch = new CountDownLatch(1);
+            nodeA.transportService.getThreadPool().generic().execute(() -> {
+                assertEquals("simulated exception in sendRequest", getSendRequestException(future, IOException.class).getMessage());
+                latch.countDown();
+            });


Not sure what this bit of the test is doing - is it needed?

The original reason that led to this change is that `PlainActionFuture#assertCompleteAllowed` fails in `PlainActionFuture#onFailure` when the waiter is a `generic` thread. So I wanted this test to simulate that situation, i.e. waiting for the future on a generic thread.

I do recognize that asserting the completing thread is currentThread instead of generic already proves the change is effective. But I added the above intending to preserve some context on how this issue was identified. If you think that's unnecessary, I can definitely rewrite the test as you suggested. Please let me know.
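For context on why a waiter and a completer sharing one pool can be hazardous, here is a deliberately exaggerated JDK-only sketch. It assumes a pool with a single thread, which is not how the real generic pool is sized, and `PoolStarvationSketch` and `waiterTimesOut` are hypothetical names. It is not a description of what `assertCompleteAllowed` checks, only of one kind of problem that assertions of this sort can guard against.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class PoolStarvationSketch {
    /**
     * Returns true if a waiter blocked on a pool thread times out because the
     * task that would complete the future is queued behind it on the same pool.
     */
    public static boolean waiterTimesOut() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> future = new CompletableFuture<>();
            // The waiter occupies the only pool thread while it blocks on the future.
            Future<Boolean> waiter = pool.submit(() -> {
                try {
                    future.get(200, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    return true;
                }
            });
            // The completion is queued on the same pool and cannot run until the waiter gives up.
            pool.execute(() -> future.complete("done"));
            return waiter.get(10, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter timed out: " + waiterTimesOut());
    }
}
```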

Sorry I just noticed the latch is indeed useless. Removed in ffaefbd

(this is an overall response right, not just related to the few lines at the top of this thread?)

Yeah I worry that this is a very indirect way to test that property, and it'd still be a problem if we called sendRequest on a (different) generic thread from the one that's waiting on the future. I'd rather we just focussed on the actual threading behaviour here in the context of the TransportService tests.

> Yeah I worry that this is a very indirect way to test that property

I agree. It's in addition to the same thread assertion.

You are right that the assertion can still be tripped if we send the request from the generic thread pool. This is already another problem that can happen in stress tests (with a slightly different code path).

I pushed e90005d to rewrite the test as you suggested.

DaveCTurner

LGTM

ywangd · 2024-10-10T08:27:36Z

@elasticmachine update branch

ywangd · 2024-10-10T08:35:11Z

FYI, I plan to backport this PR since it's labelled as bug. Though no test failure has been identified in the stateful code base, it is still a possibility.

henningandersen

LGTM.

elasticsearchmachine · 2024-10-10T10:27:05Z

💚 Backport successful

Status	Branch	Result
✅	8.x

…#114375) When TransportService fails to send a transport action, it can complete the listener's `onFailure` with the `generic` executor. If the listener is a `PlainActionFuture` and also waits to be completed with a `generic` thread, it will trip the `assertCompleteAllowed` assertion. https://github.com/elastic/elasticsearch/blob/fb482f863d5430702b19bd3dd23e9d8652f12ddd/server/src/main/java/org/elasticsearch/transport/TransportService.java#L1062-L1064 With this PR, we no longer fork to the generic thread pool and instead just handle the exception inline on the current thread. The expectation is that the downstream handler should take care of potential stack overflow issues. This is similar to what is done in elastic#109236

#114493) When TransportService fails to send a transport action, it can complete the listener's `onFailure` with the `generic` executor. If the listener is a `PlainActionFuture` and also waits to be completed with a `generic` thread, it will trip the `assertCompleteAllowed` assertion. https://github.com/elastic/elasticsearch/blob/fb482f863d5430702b19bd3dd23e9d8652f12ddd/server/src/main/java/org/elasticsearch/transport/TransportService.java#L1062-L1064 With this PR, we no longer fork to the generic thread pool and instead just handle the exception inline on the current thread. The expectation is that the downstream handler should take care of potential stack overflow issues. This is similar to what is done in #109236

…tic#2966) We do not pass down the executor for the GetVBCCChunk action so that we can retain the bytes on the transport thread. The response is then fulfilled with the dedicated fill_vbcc executor. There is no such concern on the failure path since no bytes are available. With this PR, we fork to the same thread pool before completing the listener exceptionally. This helps avoid occasional PlainActionFuture#assertCompleteAllowed failures. Relates: elastic#114375 Relates: elastic#2933

There are a few edge cases where closing a node can cause test failures:

1. Closing the handling node when looking up the master node name.
2. Closing the coordinating node when a search is ongoing. This can lead to leaking search contexts in MockSearchService on the data nodes.
3. Closing the data node when a search is ongoing. This can lead to leaking resources on the coordinating node.

This PR fixes 1 by avoiding the lookup since the master node does not change and is already known. It fixes 2 by always using the master node as the coordinating node. It fixes 3 by avoiding restarting the search node. With these changes in place (along with elastic#2790, elastic#2966, elastic#2983, elastic#112748, elastic#114375) the test is stable enough (running in a loop for 40+ hours without failure) to be unmuted. Resolves: elastic#2327 Resolves:

always allow generic executor to complete future exceptionally

12f4180

ywangd added the >non-issue, :Distributed/Distributed, and v9.0.0 labels Oct 9, 2024

ywangd requested review from arteam, henningandersen and kingherc October 9, 2024 03:02

elasticsearchmachine added the Team:Distributed label Oct 9, 2024

npe

d93212d

elasticsearchmachine added the serverless-linked label Oct 9, 2024

henningandersen reviewed Oct 9, 2024

do not fork to generic thread pool for non-forking response handler when handling exception

37d520c

ywangd changed the title ~~Always allow generic executor to complete future exceptionally~~ Handle InternalSendException inline for non-forking handlers Oct 10, 2024

ywangd commented Oct 10, 2024

ywangd requested a review from henningandersen October 10, 2024 04:17

ywangd added >bug and removed >non-issue labels Oct 10, 2024

Update docs/changelog/114375.yaml

f012708

ywangd added the :Distributed/Network label and removed the :Distributed/Distributed label Oct 10, 2024

DaveCTurner reviewed Oct 10, 2024

ywangd added 2 commits October 10, 2024 19:07

review feedback

ffaefbd

rewrite test as suggested

e90005d

DaveCTurner reviewed Oct 10, 2024

DaveCTurner approved these changes Oct 10, 2024

Merge branch 'main' into future-onfailure-with-generic-executor

f7fa4e9

ywangd removed the serverless-linked label Oct 10, 2024

ywangd added the v8.16.0 and auto-backport labels Oct 10, 2024

henningandersen approved these changes Oct 10, 2024

spotless

6c4f4e5

ywangd added the auto-merge-without-approval label Oct 10, 2024

elasticsearchmachine merged commit c27bc08 into elastic:main Oct 10, 2024

ywangd deleted the future-onfailure-with-generic-executor branch October 10, 2024 10:26

ywangd mentioned this pull request Oct 10, 2024

[8.x] Handle InternalSendException inline for non-forking handlers (#114375) #114493

Merged


5 participants
