Skip to content

Keep track and release Reactor Netty 4 Transport accepted Http Channels during the Node shutdown#20106

Merged
andrross merged 14 commits intoopensearch-project:mainfrom
fdesu:issue/20034-fix-flaky-msearch-error-reporting-smoke-tests
Dec 2, 2025
Merged

Keep track and release Reactor Netty 4 Transport accepted Http Channels during the Node shutdown#20106
andrross merged 14 commits intoopensearch-project:mainfrom
fdesu:issue/20034-fix-flaky-msearch-error-reporting-smoke-tests

Conversation

@fdesu
Copy link
Copy Markdown
Contributor

@fdesu fdesu commented Nov 26, 2025

Description

It appears that after executing Multi Search API, the HttpChannel stays registered with the RestCancellableNodeClient as the underlying channel may not always get closed.

The junit rule that supposed to close the Rest Client (the OpenSearchTestClusterRule) during the cleanup phase doesn't always close the server-side channel which then pollutes the RestCancellableNodeClient and potentially causes more havoc.

The solution is rather simple - close the Rest Client within the test case and reliably release all underlying resources. Running relevant tests multiple times with the fix applied seems to eliminate the issue.

Related Issues

Resolves #20034

Check List

  • Functionality includes testing.
  • [ ] API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Bug Fixes

    • Reactor Netty 4 Transport now properly tracks and releases accepted HTTP channels during node shutdown, improving resource management.
  • Tests

    • Added integration tests to verify HTTP channel lifecycle and cleanup behavior under concurrent requests.
    • Enhanced connection handling tests for HTTP transport stability.

✏️ Tip: You can customize this high-level summary in your review settings.

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
@fdesu fdesu requested a review from a team as a code owner November 26, 2025 16:40
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Nov 26, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Nov 26, 2025

Walkthrough

Added channel tracking and release functionality for Reactor Netty 4 HTTP transport during node shutdown. Introduced integration tests to validate HTTP channel lifecycle management under concurrent requests, modified channel closure to handle async context coordination, and updated request consumers to register accepted channels with the transport.

Changes

Cohort / File(s) Summary
Documentation & Testing Setup
CHANGELOG.md, qa/smoke-test-http/src/test/java/org/opensearch/http/DetailedErrorsDisabledIT.java
Added changelog entry for channel tracking fix; minor formatting and static import reorganization in existing test
Channel Lifecycle Integration Tests
modules/transport-netty4/src/internalClusterTest/java/org/opensearch/transport/netty4/Netty4HttpChannelsReleaseIntegTests.java, plugins/transport-reactor-netty4/src/internalClusterTest/java/org/opensearch/http/reactor/netty4/ReactorNetty4HttpChannelsReleaseIntegTests.java
New integration test classes validating HTTP channel tracking and cleanup during concurrent requests and node shutdown across both transports
Channel Management & Lifecycle Coordination
plugins/transport-reactor-netty4/src/main/java/org/opensearch/http/reactor/netty4/ReactorNetty4HttpServerChannel.java, plugins/transport-reactor-netty4/src/main/java/org/opensearch/http/reactor/netty4/ReactorNetty4NonStreamingHttpChannel.java, plugins/transport-reactor-netty4/src/main/java/org/opensearch/http/reactor/netty4/ReactorNetty4StreamingHttpChannel.java
Updated toString representation and modified close implementations to handle async context coordination for proper channel lifecycle management
Transport & Consumer Channel Registration
plugins/transport-reactor-netty4/src/main/java/org/opensearch/http/reactor/netty4/ReactorNetty4HttpServerTransport.java, plugins/transport-reactor-netty4/src/main/java/org/opensearch/http/reactor/netty4/ReactorNetty4NonStreamingRequestConsumer.java, plugins/transport-reactor-netty4/src/main/java/org/opensearch/http/reactor/netty4/ReactorNetty4StreamingRequestConsumer.java
Added serverAcceptedChannel override, updated transport type specificity, and modified consumer constructors to register channels and pass transport instance through request pipeline
Streaming Request Tests
plugins/transport-reactor-netty4/src/test/java/org/opensearch/http/reactor/netty4/ReactorNetty4HttpServerTransportStreamingTests.java, plugins/transport-reactor-netty4/src/test/java/org/opensearch/http/reactor/netty4/ReactorNetty4HttpServerTransportTests.java
Refactored streaming test with helper method extraction and added new test for connection lifecycle validation during streaming and non-streaming scenarios

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Transport as ReactorNetty4HttpServerTransport
    participant Consumer as ReactorNetty4StreamingRequestConsumer
    participant Channel as HttpChannel
    participant CloseContext

    Client->>Transport: Incoming HTTP request
    Transport->>Consumer: Create consumer (pass transport)
    Consumer->>Transport: serverAcceptedChannel(channel)
    Transport->>Channel: Register accepted channel
    
    Consumer->>Consumer: Process request
    Consumer->>Channel: Close on completion
    
    alt closeContext not completed
        Channel->>CloseContext: addListener(close future)
        CloseContext->>Channel: Complete → triggers close
    else closeContext already completed
        Channel->>Channel: Close immediately
    end
    
    Note over Transport,Channel: On node shutdown
    Transport->>Transport: Track and release all<br/>accepted channels
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Files to focus on:
    • ReactorNetty4StreamingRequestConsumer.java and ReactorNetty4NonStreamingRequestConsumer.java — verify correct transport type narrowing and channel registration timing
    • ReactorNetty4StreamingHttpChannel.java and ReactorNetty4NonStreamingHttpChannel.java — carefully review async close coordination logic with closeContext completion state
    • ReactorNetty4HttpServerTransport.java — confirm serverAcceptedChannel override integrates properly with the request pipeline
    • Integration test classes — validate test assertions accurately capture channel lifecycle expectations

Suggested labels

transport, netty, test

Suggested reviewers

  • andrross
  • reta

Poem

A rabbit hops through channels bright,
Tracking requests, day and night,
Close contexts bind with async grace,
As channels find their resting place. 🐰✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 26.19% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: tracking and releasing Reactor Netty 4 Transport HTTP channels during node shutdown to address channel lifecycle management issues.
Description check ✅ Passed The description provides a clear explanation of the problem (HttpChannel not getting closed after Multi Search API), the root cause (channel registration issue), and the solution approach. It includes a reference to the related issue (#20034) and confirms testing was added.
Linked Issues check ✅ Passed The PR addresses the root cause of flaky DetailedErrorsDisabledIT tests by implementing proper channel lifecycle tracking in ReactorNetty4HttpServerTransport, ensuring accepted channels are registered and released on node shutdown as required by issue #20034.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the HTTP channel lifecycle issue: channel tracking infrastructure in transport classes, channel release logic, test infrastructure for validation, and related integration tests. No unrelated modifications detected.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d91400a and 93a3457.

📒 Files selected for processing (2)
  • modules/transport-netty4/src/internalClusterTest/java/org/opensearch/transport/netty4/Netty4HttpChannelsReleaseIntegTests.java (1 hunks)
  • plugins/transport-reactor-netty4/src/internalClusterTest/java/org/opensearch/http/reactor/netty4/ReactorNetty4HttpChannelsReleaseIntegTests.java (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • modules/transport-netty4/src/internalClusterTest/java/org/opensearch/transport/netty4/Netty4HttpChannelsReleaseIntegTests.java
  • plugins/transport-reactor-netty4/src/internalClusterTest/java/org/opensearch/http/reactor/netty4/ReactorNetty4HttpChannelsReleaseIntegTests.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: gradle-check
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, windows-2025, true)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: detect-breaking-change
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: precommit (25, macos-15-intel)
  • GitHub Check: precommit (25, ubuntu-latest)
  • GitHub Check: precommit (21, ubuntu-24.04-arm)
  • GitHub Check: precommit (21, ubuntu-latest)
  • GitHub Check: precommit (21, macos-15)
  • GitHub Check: precommit (25, ubuntu-24.04-arm)
  • GitHub Check: precommit (25, macos-15)
  • GitHub Check: precommit (21, windows-latest)
  • GitHub Check: precommit (21, macos-15-intel)
  • GitHub Check: precommit (25, windows-latest)

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
…ky-msearch-error-reporting-smoke-tests

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
@andrross
Copy link
Copy Markdown
Member

Thanks for the quick follow up here, @fdesu! Why does this channel not get closed? Manually closing the client seems to violate the intent of the test framework per this comment:

* The returned client gets automatically closed when needed, it shouldn't be closed as part of tests otherwise
* it cannot be reused by other tests anymore.

@fdesu
Copy link
Copy Markdown
Contributor Author

fdesu commented Nov 26, 2025

@andrross oh, good point, wasn’t aware of that!

I’m not entirely sure why the channel doesn’t get closed automatically but it just stays open since the deregistration doesn’t fire up which would have caused the RestCancellableNodeClient housekeeping to kick in. Maybe it’s somehow related to the channel type which is Reactor Netty4 kind of channel but I’m not really sure.

Another thought that I had in mind was that the test(s) don’t fully consume the Response body and hence the receiving side just keeps the channel alive but it doesn’t seem to be the case since we use EntityUtils.toString which tries-with-resources the underlying InputStream.

I’ll look deeper and would really appreciate if anyone can point me in the right direction.

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 25166b1: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@peterzhuamazon
Copy link
Copy Markdown
Member

Thanks for the quick follow up here, @fdesu! Why does this channel not get closed? Manually closing the client seems to violate the intent of the test framework per this comment:

* The returned client gets automatically closed when needed, it shouldn't be closed as part of tests otherwise
* it cannot be reused by other tests anymore.

@coderabbitai do you find @andrross comment above helpful to improve your review?

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Nov 26, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Fetch the referenced file to understand the framework's intent
cat test/framework/src/main/java/org/opensearch/test/OpenSearchIntegTestCase.java | sed -n '2370,2395p'

Length of output: 933


🏁 Script executed:

#!/bin/bash
# Look for how the rest client is created and managed in the test framework
rg -n -A5 -B5 "createRestClient|getRestClient" test/framework/src/main/java/org/opensearch/test/OpenSearchIntegTestCase.java | head -80

Length of output: 693


🏁 Script executed:

#!/bin/bash
# Check the OpenSearchTestClusterRule class to understand how it manages resources
fd -e java "OpenSearchTestClusterRule" --exec cat {}

Length of output: 17682


🏁 Script executed:

#!/bin/bash
# Look at how RestCancellableNodeClient manages channel registrations
rg -n -A10 -B5 "class RestCancellableNodeClient" --type java

Length of output: 3789


@peterzhuamazon Yes, @andrross's comment is very helpful and raises a valid architectural concern that should influence the review.

After examining the framework code, I can confirm that the test framework is explicitly designed to:

  1. Reuse the RestClient across tests: The getRestClient() method creates a singleton client that's shared within the test scope
  2. Handle cleanup centrally: The clearClusters() method in OpenSearchTestClusterRule is responsible for closing the client and verifying all channels are deregistered from RestCancellableNodeClient

The framework documentation at lines 2381-2382 explicitly states: "The returned client gets automatically closed when needed, it shouldn't be closed as part of tests otherwise it cannot be reused by other tests anymore."

The Core Issue:

The current fix uses try-with-resources to close the client in individual test methods, which:

  • ❌ Violates the framework's design intent
  • ❌ Breaks client reusability across tests in the same scope
  • ❌ Treats the symptom (channels not closing) rather than the root cause

What Needs Investigation:

The real question is: Why doesn't the HttpChannel close automatically through the normal framework lifecycle? As @fdesu noted, it could be related to:

  • Reactor Netty4 channel behavior
  • Channel lifecycle management differences
  • Incomplete response consumption (though EntityUtils.toString should handle this)

Recommendation:

Rather than working around the framework's design, I'd suggest:

  1. Investigating why the channel remains open after clearClusters() is called
  2. Determining if the framework's cleanup logic needs enhancement for Reactor Netty4 channels
  3. Fixing the root cause so the framework's intended lifecycle works correctly

This would be a more robust solution that maintains the framework's architecture and benefits all tests using the framework.

@fdesu
Copy link
Copy Markdown
Contributor Author

fdesu commented Nov 26, 2025

I mean, I don't think it's the test harness's design that causes the issue, it's something else. My expectation is, after the restClient.performRequest(MultisearchRequest) is called and the response is fully consumed, the channel should get closed which doesn't happen. Am I missing anything?

Note that the:

  1. org.opensearch.client.SearchIT#testMultiSearch works okay with the high level client
  2. and the SearchRestCancellationIT#testAutomaticCancellationMultiSearchDuringQueryPhase works okay with getRestClient().performRequestAsync() which gets cancelled.

@andrross
Copy link
Copy Markdown
Member

the channel should get closed which doesn't happen. Am I missing anything?

@fdesu I don't think you're missing anything. I think there might be a latent bug here with the Reactor Netty implementation. The channel gets added to a collection in AbstractHttpServerTransport#serverAcceptedChannel, which will guarantee that on node shutdown those channels get closed here. However, I don't see that AbstractHttpServerTransport#serverAcceptedChannel is ever called by ReactorNetty4HttpServerTransport, so when using the reactor netty implementation that httpChannels collection is always empty. I don't understand why the failure is intermittent though. @reta Do you have any idea what might be happening?

@fdesu fdesu marked this pull request as draft November 27, 2025 08:47
Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
…ky-msearch-error-reporting-smoke-tests

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
@fdesu fdesu changed the title Fix flaky DetailedErrorsDisabledIT and DetailedErrorsEnabledIT Keep track and release Reactor Netty 4 Transport accepted Http Channels during the Node shutdown Nov 27, 2025
Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 4abed82: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@reta
Copy link
Copy Markdown
Contributor

reta commented Nov 27, 2025

However, I don't see that AbstractHttpServerTransport#serverAcceptedChannel is ever called by ReactorNetty4HttpServerTransport, so when using the reactor netty implementation that httpChannels collection is always empty. I don't understand why the failure is intermittent though. @reta Do you have any idea what might be happening?

Looking @andrross , thanks for taking it @fdesu

@fdesu
Copy link
Copy Markdown
Contributor Author

fdesu commented Nov 27, 2025

@andrross @reta great point, lots of thanks for this hint!

I can confirm the source of failures was randomness in what transport gets picked for the smoke tests:

nodeHttpTypeKey = getHttpTypeKey(randomFrom(Netty4ModulePlugin.class, ReactorNetty4Plugin.class));

when ReactorNetty4Plugin was picked, some request channels might not get closed on the node shutdown which would cause issues in the OpenSearchTestClusterRule#clearClusters rule.

I've added a channel init lifecycle hook for the ReactorNetty4HttpServerTransport which is similar to Netty4HttpServerTransport. The added channel wrapper gets registered with the AbstractHttpServerTransport#httpChannels when a new channel gets accepted. That would cause all channels to get closed on the node shutdown which is expected behaviour.

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
…r proper context access

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
@reta
Copy link
Copy Markdown
Contributor

reta commented Nov 27, 2025

@fdesu I just figured out that I think we do have much simpler solution for the problem, by modifying ReactorNetty4NonStreamingRequestConsumer::process method and adding transport.serverAcceptedChannel(channel):

void process(HttpContent in, FluxSink<HttpContent> emitter) {
        // Consume request body in full before dispatching it
        content.addComponent(true, in.content().retain());

        if (in instanceof LastHttpContent) {
            final ReactorNetty4NonStreamingHttpChannel channel = new ReactorNetty4NonStreamingHttpChannel(request, response, emitter);
            final HttpRequest r = createRequest(request, content);

            try {
                transport.serverAcceptedChannel(channel);
                transport.incomingRequest(r, channel);
            } catch (Exception ex) {
                emitter.error(ex);
                transport.onException(channel, ex);
            } finally {
                r.release();
                if (disposed.compareAndSet(false, true)) {
                    this.content.release();
                }
            }
        }
    }

The serverAcceptedChannel is not visible (protected) but we could add an override to ReactorNetty4HttpServerTransport and use ReactorNetty4HttpServerTransport instead of AbstractHttpServerTransport.

    @Override
    public void serverAcceptedChannel(HttpChannel httpChannel) {
        super.serverAcceptedChannel(httpChannel);
    }

We don't need wrappers and initializers in this case

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
@fdesu
Copy link
Copy Markdown
Contributor Author

fdesu commented Nov 27, 2025

@reta interesting, I assume the streaming request consumer would need to do the same for the LastHttpContent branch, correct?

I'm not entirely sure but the ReactorNetty4HttpChannel solution also looks pretty simple and being applied on the ReactorNetty4HttpServerTransport level. That makes it pretty independent of the consumers that are, I guess, only responsible for consuming the inbound request. What do you think?

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 93a3457: SUCCESS

Copy link
Copy Markdown
Contributor

@reta reta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @fdesu ! @andrross anything you would like to mention?

@andrross andrross merged commit 7d97f77 into opensearch-project:main Dec 2, 2025
34 of 35 checks passed
@fdesu fdesu deleted the issue/20034-fix-flaky-msearch-error-reporting-smoke-tests branch December 2, 2025 08:24
rgsriram pushed a commit to rgsriram/OpenSearch that referenced this pull request Dec 5, 2025
…ls during the Node shutdown (opensearch-project#20106)

* Close RestClient after MSearch API calls to reliably close HttpChannels

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Add the CHANGELOG entry

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Keep track of Reactor Netty 4 Transport accepted Http Channels

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Update CHANGELOG

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Add missing visibility modifier to RestCancellableNodeClient#getNumTasks

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Extract channel wrapper into a separate class and extend the get() for proper context access

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Add missing javadocs

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Hook into AbstractHttpServerTransport directly from Reactor Request Consumers

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Make sure the CloseContext is always called even if the channel is disposed already

Signed-off-by: Andriy Redko <drreta@gmail.com>

* Remove extra code

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Add more coverage for streaming and non-streaming netty connections

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Revert visibility modifier of the  RestCancellableNodeClient#getNumTasks

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

---------

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
Signed-off-by: Andriy Redko <drreta@gmail.com>
Co-authored-by: Andriy Redko <drreta@gmail.com>
liuguoqingfz pushed a commit to liuguoqingfz/OpenSearch that referenced this pull request Dec 15, 2025
…ls during the Node shutdown (opensearch-project#20106)

* Close RestClient after MSearch API calls to reliably close HttpChannels

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Add the CHANGELOG entry

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Keep track of Reactor Netty 4 Transport accepted Http Channels

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Update CHANGELOG

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Add missing visibility modifier to RestCancellableNodeClient#getNumTasks

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Extract channel wrapper into a separate class and extend the get() for proper context access

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Add missing javadocs

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Hook into AbstractHttpServerTransport directly from Reactor Request Consumers

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Make sure the CloseContext is always called even if the channel is disposed already

Signed-off-by: Andriy Redko <drreta@gmail.com>

* Remove extra code

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Add more coverage for streaming and non-streaming netty connections

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

* Revert visibility modifier of the  RestCancellableNodeClient#getNumTasks

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>

---------

Signed-off-by: Sergei Ustimenko <fdesu@proton.me>
Signed-off-by: Andriy Redko <drreta@gmail.com>
Co-authored-by: Andriy Redko <drreta@gmail.com>
@andrross andrross mentioned this pull request Dec 15, 2025
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for DetailedErrorsDisabledIT

4 participants