Skip to content

Handle InternalSendException inline for non-forking handlers#114375

Merged
elasticsearchmachine merged 8 commits intoelastic:mainfrom
ywangd:future-onfailure-with-generic-executor
Oct 10, 2024
Merged

Handle InternalSendException inline for non-forking handlers#114375
elasticsearchmachine merged 8 commits intoelastic:mainfrom
ywangd:future-onfailure-with-generic-executor

Conversation

@ywangd
Copy link
Copy Markdown
Member

@ywangd ywangd commented Oct 9, 2024

When TransportService fails to send a transport action, it can complete the listener's onFailure with the generic executor. If the listener is a PlainActionFuture and also waits to be completed with a generic thread, it will trip the assertCompleteAllowed assertion.

} else if (handlerExecutor == EsExecutors.DIRECT_EXECUTOR_SERVICE) {
// if the handler is non-forking then dispatch to GENERIC to avoid a possible stack overflow
return threadPool.generic();

With this PR, we no longer fork to the generic thread pool and instead just handle the exeption inline with the current thread. The expectation is that the downstream handler should take care potential stack overflow issues. This is similar to what is done in #109236

@ywangd ywangd added >non-issue :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. v9.0.0 labels Oct 9, 2024
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Oct 9, 2024
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Oct 9, 2024
Copy link
Copy Markdown
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd rather avoid the dispatch, just like we did in:

#109236

I think the stack overflow breaking should be done in the handler instead if necessary.

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Oct 9, 2024

Thanks for the really helpful pointer. I'll repurpose this PR to follow your suggestion.

Please note that the issue may not be entirely fixed with the new approach. As said in the PR description, there is another failture path in TransportService#getConnectionOrFail which is before dispatching. We can address that separately.

@ywangd ywangd changed the title Always allow generic executor to complete future exceptionally Handle InternalSendException inline for non-forking handlers Oct 10, 2024
Comment on lines +178 to +197
public void testInternalSendExceptionWithNonForkingResponseHandlerCanCompleteFutureWaiterFromGenericThread() throws Exception {
try (var nodeA = new TestNode("node-A")) {
final var future = new PlainActionFuture<TransportResponse.Empty>();
final var latch = new CountDownLatch(1);
nodeA.transportService.getThreadPool().generic().execute(() -> {
assertEquals("simulated exception in sendRequest", getSendRequestException(future, IOException.class).getMessage());
latch.countDown();
});
nodeA.transportService.sendRequest(
nodeA.getThrowingConnection(),
TestNode.randomActionName(random()),
new EmptyRequest(),
TransportRequestOptions.EMPTY,
new ActionListenerResponseHandler<>(future.delegateResponse((l, e) -> {
assertThat(Thread.currentThread().getName(), startsWith("TEST-"));
l.onFailure(e);
}), unusedReader(), EsExecutors.DIRECT_EXECUTOR_SERVICE)
);
assertBusy(() -> assertTrue(future.isDone()));
}
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a more direct low-level test for the assertCompleteAllowed issue.

@ywangd ywangd added >bug and removed >non-issue labels Oct 10, 2024
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Oct 10, 2024

I updated the PR to avoid forking in handleInternalSendException. Also changed the label from :non-issue to :bug following #109236. This is now ready for another look. Thank you!

@ywangd ywangd added :Distributed/Network Http and internode communication implementations and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Oct 10, 2024
Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change LGTM but I left some comments on the tests. FWIW I'd have written this test as follows:

    public void testInternalSendExceptionWithNonForkingResponseHandlerCanCompleteFutureWaiterFromGenericThread() {
        try (var nodeA = new TestNode("node-A")) {
            final var testThread = Thread.currentThread();
            assertEquals(
                "simulated exception in sendRequest",
                safeAwaitAndUnwrapFailure(
                    IOException.class,
                    TransportResponse.Empty.class,
                    l -> nodeA.transportService.sendRequest(
                        nodeA.getThrowingConnection(),
                        TestNode.randomActionName(random()),
                        new EmptyRequest(),
                        TransportRequestOptions.EMPTY,
                        new ActionListenerResponseHandler<>(
                            ActionListener.runBefore(l, () -> assertSame(testThread, Thread.currentThread())),
                            unusedReader(),
                            EsExecutors.DIRECT_EXECUTOR_SERVICE
                        )
                    )
                ).getMessage()
            );
        }
    }

Comment on lines +181 to +185
final var latch = new CountDownLatch(1);
nodeA.transportService.getThreadPool().generic().execute(() -> {
assertEquals("simulated exception in sendRequest", getSendRequestException(future, IOException.class).getMessage());
latch.countDown();
});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what this bit of the test is doing - is it needed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original reason that leads to this change is that PlainActionFuture#assertCompleteAllowedfails inPlainActionFuture#onFailurewhen the waiter is with ageneric` thread. So I wanted to have this test to simulate that situation, i.e. waiting for future with a generic thread.

I do recognize that asserting the completing thread is currentThread instead of generic already proves the change is effective. But I added the above intending to preserve some context on how this issue is identified. If you think that's unnecessary, I can definitely rewrite the test as you suggested. Please let me know.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I just noticed the latch is indeed useless. Removed in ffaefbd

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this is an overall response right, not just related to the few lines at the top of this thread?)

Yeah I worry that this is a very indirect way to test that property, and it'd still be a problem if we called sendRequest on a (different) generic thread from the one that's waiting on the future. I'd rather we just focussed on the actual threading behaviour here in the context of the TransportService tests.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I worry that this is a very indirect way to test that property

I agree. It's in addition to the same thread assertion.

You are right that assertion can still be tripped if we send the request with the generic thread pool. This is already another problem that can happen in stress test (with a slightly different code path).

I pushed e90005d to rewrite the test as you suggested.

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Oct 10, 2024

@elasticmachine update branch

@ywangd ywangd removed the serverless-linked Added by automation, don't add manually label Oct 10, 2024
@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Oct 10, 2024

FYI, I plan to backport this PR since it's labelled as bug. Though no test failure has been identified in the stateful code base, it is still a possibility.

@ywangd ywangd added v8.16.0 auto-backport Automatically create backport pull requests when merged labels Oct 10, 2024
Copy link
Copy Markdown
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 10, 2024
@elasticsearchmachine elasticsearchmachine merged commit c27bc08 into elastic:main Oct 10, 2024
@ywangd ywangd deleted the future-onfailure-with-generic-executor branch October 10, 2024 10:26
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💚 Backport successful

Status Branch Result
8.x

ywangd added a commit to ywangd/elasticsearch that referenced this pull request Oct 10, 2024
…#114375)

When TransportService fails to send a transport action, it can complete
the listener's `onFailure` with the `generic` executor. If the listener
is a `PlainActionFuture` and also waits to be completed with a `generic`
thread, it will trip the `assertCompleteAllowed` assertion. 

https://github.com/elastic/elasticsearch/blob/fb482f863d5430702b19bd3dd23e9d8652f12ddd/server/src/main/java/org/elasticsearch/transport/TransportService.java#L1062-L1064

With this PR, we no longer fork to the generic thread pool and instead
just handle the exeption inline with the current thread. The expectation
is that the downstream handler should take care potential stack overflow
issues. This is similar to what is done in elastic#109236
elasticsearchmachine pushed a commit that referenced this pull request Oct 10, 2024
#114493)

When TransportService fails to send a transport action, it can complete
the listener's `onFailure` with the `generic` executor. If the listener
is a `PlainActionFuture` and also waits to be completed with a `generic`
thread, it will trip the `assertCompleteAllowed` assertion. 

https://github.com/elastic/elasticsearch/blob/fb482f863d5430702b19bd3dd23e9d8652f12ddd/server/src/main/java/org/elasticsearch/transport/TransportService.java#L1062-L1064

With this PR, we no longer fork to the generic thread pool and instead
just handle the exeption inline with the current thread. The expectation
is that the downstream handler should take care potential stack overflow
issues. This is similar to what is done in #109236
matthewabbott pushed a commit to matthewabbott/elasticsearch that referenced this pull request Oct 10, 2024
…#114375)

When TransportService fails to send a transport action, it can complete
the listener's `onFailure` with the `generic` executor. If the listener
is a `PlainActionFuture` and also waits to be completed with a `generic`
thread, it will trip the `assertCompleteAllowed` assertion. 

https://github.com/elastic/elasticsearch/blob/fb482f863d5430702b19bd3dd23e9d8652f12ddd/server/src/main/java/org/elasticsearch/transport/TransportService.java#L1062-L1064

With this PR, we no longer fork to the generic thread pool and instead
just handle the exeption inline with the current thread. The expectation
is that the downstream handler should take care potential stack overflow
issues. This is similar to what is done in elastic#109236
davidkyle pushed a commit to davidkyle/elasticsearch that referenced this pull request Oct 13, 2024
…#114375)

When TransportService fails to send a transport action, it can complete
the listener's `onFailure` with the `generic` executor. If the listener
is a `PlainActionFuture` and also waits to be completed with a `generic`
thread, it will trip the `assertCompleteAllowed` assertion. 

https://github.com/elastic/elasticsearch/blob/fb482f863d5430702b19bd3dd23e9d8652f12ddd/server/src/main/java/org/elasticsearch/transport/TransportService.java#L1062-L1064

With this PR, we no longer fork to the generic thread pool and instead
just handle the exeption inline with the current thread. The expectation
is that the downstream handler should take care potential stack overflow
issues. This is similar to what is done in elastic#109236
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
…tic#2966)

We do not pass down the executor for the GetVBCCChunk action so that we
can retain the bytes on the transport thread. The response is then
fulfilled with the dedicated fill_vbcc executor. There is no such
concern on the failure path since no bytes are available.
With this PR, we fork to the same thread pool before completing the
listener exceptionally. This helps avoid occassional
PlainActionFture#assertCompleteAllowed failures.

Relates: elastic#114375
Relates: elastic#2933
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
There are a few edge cases where closing a node can causes test failures:

Closing the handling node when looking up master node name.
Closing the coordinating node when a search is ongoing. This can lead to leaking search context in MockSearchService on the data nodes.
Closing the data node when a search is ongoing. This can lead to leaking resource on the coordinating node.
This PR fixes 1 by avoiding lookup since the master node does not change and is already known. It fixes 2 by always uses master node as the coordinating node. It fixes 3 by avoid restarting search node. With these changes in place (along with elastic#2790, elastic#2966, elastic#2983, elastic#112748, elastic#114375) the test is stable enough (running in a loop for 40+ hours without failure) to be unmuted.

Resolves: elastic#2327
Resolves:
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
…tic#2966)

We do not pass down the executor for the GetVBCCChunk action so that we
can retain the bytes on the transport thread. The response is then
fulfilled with the dedicated fill_vbcc executor. There is no such
concern on the failure path since no bytes are available.
With this PR, we fork to the same thread pool before completing the
listener exceptionally. This helps avoid occassional
PlainActionFture#assertCompleteAllowed failures.

Relates: elastic#114375
Relates: elastic#2933
breskeby pushed a commit to breskeby/elasticsearch that referenced this pull request Feb 11, 2026
There are a few edge cases where closing a node can causes test failures:

Closing the handling node when looking up master node name.
Closing the coordinating node when a search is ongoing. This can lead to leaking search context in MockSearchService on the data nodes.
Closing the data node when a search is ongoing. This can lead to leaking resource on the coordinating node.
This PR fixes 1 by avoiding lookup since the master node does not change and is already known. It fixes 2 by always uses master node as the coordinating node. It fixes 3 by avoid restarting search node. With these changes in place (along with elastic#2790, elastic#2966, elastic#2983, elastic#112748, elastic#114375) the test is stable enough (running in a loop for 40+ hours without failure) to be unmuted.

Resolves: elastic#2327
Resolves:
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Distributed/Network Http and internode communication implementations Team:Distributed Meta label for distributed team. v8.16.0 v9.0.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants