Skip to content

ESQL: Fix flaky EmployeeFlightServerTests shutdown race#143657

Merged
costin merged 2 commits intoelastic:mainfrom
costin:fix/grpc-flight-test-shutdown-race
Mar 5, 2026
Merged

ESQL: Fix flaky EmployeeFlightServerTests shutdown race#143657
costin merged 2 commits intoelastic:mainfrom
costin:fix/grpc-flight-test-shutdown-race

Conversation

@costin
Copy link
Copy Markdown
Member

@costin costin commented Mar 5, 2026

Fixes a flaky test failure in EmployeeFlightServerTests.testMultiEndpointReturnsCorrectEndpointCount caused by a race condition during gRPC/Netty server shutdown.

When the test's FlightClient closes, gRPC's NettyServerHandler processes the disconnect asynchronously on Netty's event loop. A ChannelFutureListener callback (closeStreamWhenDone) can throw if the HTTP/2 stream was already closed by the time it fires. Netty's DefaultPromise logs this as a WARN, which ESTestCase.checkStaticState() captures and treats as a test failure.

The fix replaces the manual shutdown() + awaitTermination() with FlightServer.close() which provides a more robust shutdown sequence (graceful shutdown → awaitTerminationshutdownNow fallback), and adds a brief sleep to let any remaining Netty event loop callbacks drain before the test framework checks for logged warnings.

Closes #143636

@costin costin added >test Issues or PRs that are addressing/adding tests :Analytics/ES|QL AKA ESQL ES|QL|DS ES|QL datasources labels Mar 5, 2026
@costin costin requested a review from bpintea March 5, 2026 06:55
@elasticsearchmachine elasticsearchmachine added v9.4.0 Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Mar 5, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

costin added 2 commits March 5, 2026 12:58
Use FlightServer.close() for robust gRPC shutdown and add a brief
sleep to let Netty event loop callbacks drain before the test ends.

Closes elastic#143636
@costin costin force-pushed the fix/grpc-flight-test-shutdown-race branch from e96c3fa to 82b3bfc Compare March 5, 2026 10:59
Copy link
Copy Markdown
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@costin costin merged commit ac250b7 into elastic:main Mar 5, 2026
35 checks passed
@costin costin deleted the fix/grpc-flight-test-shutdown-race branch March 5, 2026 16:09
spinscale pushed a commit to spinscale/elasticsearch that referenced this pull request Mar 6, 2026
Fixes a flaky test failure in `EmployeeFlightServerTests.testMultiEndpointReturnsCorrectEndpointCount` caused by a race condition during gRPC/Netty server shutdown.

When the test's `FlightClient` closes, gRPC's `NettyServerHandler` processes the disconnect asynchronously on Netty's event loop. A `ChannelFutureListener` callback (`closeStreamWhenDone`) can throw if the HTTP/2 stream was already closed by the time it fires. Netty's `DefaultPromise` logs this as a WARN, which `ESTestCase.checkStaticState()` captures and treats as a test failure.

The fix replaces the manual `shutdown()` + `awaitTermination()` with `FlightServer.close()` which provides a more robust shutdown sequence (graceful shutdown → `awaitTermination` → `shutdownNow` fallback), and adds a brief sleep to let any remaining Netty event loop callbacks drain before the test framework checks for logged warnings.

Closes elastic#143636
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Mar 6, 2026
Fixes a flaky test failure in `EmployeeFlightServerTests.testMultiEndpointReturnsCorrectEndpointCount` caused by a race condition during gRPC/Netty server shutdown.

When the test's `FlightClient` closes, gRPC's `NettyServerHandler` processes the disconnect asynchronously on Netty's event loop. A `ChannelFutureListener` callback (`closeStreamWhenDone`) can throw if the HTTP/2 stream was already closed by the time it fires. Netty's `DefaultPromise` logs this as a WARN, which `ESTestCase.checkStaticState()` captures and treats as a test failure.

The fix replaces the manual `shutdown()` + `awaitTermination()` with `FlightServer.close()` which provides a more robust shutdown sequence (graceful shutdown → `awaitTermination` → `shutdownNow` fallback), and adds a brief sleep to let any remaining Netty event loop callbacks drain before the test framework checks for logged warnings.

Closes elastic#143636
dnhatn pushed a commit that referenced this pull request Mar 8, 2026
CsvIT doesn't use force merging so there may be small differences in 
results due to float processing.

Unmuting also some tests that were fixed earlier but #143657 muted 
again, likely due to a merge conflict.

Fixes #143758
Fixes #143757
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL ES|QL|DS ES|QL datasources Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] EmployeeFlightServerTests testMultiEndpointReturnsCorrectEndpointCount failing

3 participants