[ES|QL] Harden distributed external source execution #144277
Merged
costin merged 9 commits into elastic:main on Mar 18, 2026
Pinging @elastic/es-analytical-engine (Team:Analytics)
Hi @costin, I've created a changelog YAML for you.
Adds ExternalDistributedStressIT with synthetic CSV generation to verify correctness under 500-1500 splits across all three distribution modes (coordinator_only, round_robin, adaptive).
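One way to make many-split results checkable is to generate content whose aggregates have a closed form. The sketch below is only an illustration of that idea, not the code in `ExternalDistributedStressIT`: the class name, the `id,value` column layout, and the helper methods are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: deterministic CSV splits with predictable COUNT and SUM.
final class SyntheticCsvSketch {

    /** Builds one split: a header line plus rowsPerSplit rows of "id,value". */
    static String splitCsv(int splitIndex, int rowsPerSplit) {
        StringBuilder sb = new StringBuilder("id,value\n");
        for (int r = 0; r < rowsPerSplit; r++) {
            // The id is globally unique and depends only on (splitIndex, r).
            long id = (long) splitIndex * rowsPerSplit + r;
            sb.append(id).append(',').append(id).append('\n');
        }
        return sb.toString();
    }

    /** Expected row count across all splits. */
    static long expectedCount(int splits, int rowsPerSplit) {
        return (long) splits * rowsPerSplit;
    }

    /** Expected sum of value: 0 + 1 + ... + (n - 1) = n * (n - 1) / 2. */
    static long expectedSum(int splits, int rowsPerSplit) {
        long n = expectedCount(splits, rowsPerSplit);
        return n * (n - 1) / 2;
    }

    /** Re-reads the generated bodies and returns {count, sum of value}. */
    static long[] countAndSum(List<String> csvBodies) {
        long count = 0;
        long sum = 0;
        for (String body : csvBodies) {
            String[] lines = body.split("\n");
            for (int i = 1; i < lines.length; i++) { // index 0 is the header
                String[] cols = lines[i].split(",");
                count++;
                sum += Long.parseLong(cols[1]);
            }
        }
        return new long[] { count, sum };
    }

    public static void main(String[] args) {
        int splits = 3;
        int rowsPerSplit = 4;
        List<String> bodies = new ArrayList<>();
        for (int s = 0; s < splits; s++) {
            bodies.add(splitCsv(s, rowsPerSplit));
        }
        long[] actual = countAndSum(bodies);
        System.out.println("count=" + actual[0] + " expected=" + expectedCount(splits, rowsPerSplit));
        System.out.println("sum=" + actual[1] + " expected=" + expectedSum(splits, rowsPerSplit));
    }
}
```

Because every value is a pure function of the split index and row index, the expected aggregates do not depend on which node reads which split, so the same expectations can be reused for each distribution mode.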
Add DNS-aware retry policy, adaptive timeout budgets, fault injection wiring in S3 fixture, and multi-node resilience tests for distributed external source queries.
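The fault injection wiring can be pictured as a switch that sits in front of the real handler and decides, per request, whether to fail. The following is a minimal sketch under stated assumptions: `FaultToggleSketch`, `failNext`, and `handle` are hypothetical names, and the delegate is modeled as an `IntSupplier` returning a status code rather than a real HTTP handler. It does not reflect the actual `FaultInjectingS3HttpHandler` API.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Hypothetical sketch of a test-controlled fault switch in front of a handler.
final class FaultToggleSketch {
    // Number of upcoming requests that should still fail.
    private final AtomicInteger remainingFaults = new AtomicInteger();
    // Status code returned while faults are armed (503 by default).
    private volatile int faultStatus = 503;

    /** Arms the switch: the next `requests` calls return `status`. */
    void failNext(int requests, int status) {
        faultStatus = status;
        remainingFaults.set(requests);
    }

    /** Disarms the switch so all later calls reach the delegate. */
    void clear() {
        remainingFaults.set(0);
    }

    /** Returns the injected status while armed, otherwise the delegate's status. */
    int handle(IntSupplier delegate) {
        // Atomically consume one armed fault, never going below zero.
        int before = remainingFaults.getAndUpdate(n -> n > 0 ? n - 1 : 0);
        if (before > 0) {
            return faultStatus;
        }
        return delegate.getAsInt();
    }

    public static void main(String[] args) {
        FaultToggleSketch toggle = new FaultToggleSketch();
        toggle.failNext(2, 503);
        for (int i = 0; i < 3; i++) {
            System.out.println(toggle.handle(() -> 200)); // 503, 503, then 200
        }
    }
}
```

Counting down armed faults, rather than flipping a boolean, lets a test express "fail transiently, then recover" without racing against the query to switch the fault off.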
Replace `LoggerMessageFormat.format` with `Strings.format` across all distributed ITs. Java overload resolution was picking the `(String prefix, String pattern, Object...)` signature whenever the first argument after the pattern was a `String`, which garbled both query construction and assertion messages. Reduce stress test split counts from 500-1500 to 50-200 to avoid OOM-killing the 2-node CI cluster.
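The overload pitfall can be reproduced in isolation. The two `format` methods below are hypothetical stand-ins that only mirror the signatures named in the commit message; they report which overload ran instead of doing any real formatting.

```java
// Hypothetical stand-ins mirroring the two signatures from the commit message.
final class OverloadPitfallSketch {

    static String format(String pattern, Object... args) {
        return "pattern=" + pattern + " args=" + args.length;
    }

    static String format(String prefix, String pattern, Object... args) {
        return "prefix=" + prefix + " pattern=" + pattern + " args=" + args.length;
    }

    public static void main(String[] unused) {
        String path = "s3://bucket/data.csv";

        // Intended: pattern "path={}" with one argument. Because the second
        // argument is a String, the (String, String, Object...) overload is the
        // more specific match, so "path={}" is treated as the prefix.
        System.out.println(format("path={}", path));

        // An int argument cannot bind to the String pattern parameter, so only
        // the (String, Object...) overload is applicable here.
        System.out.println(format("expected {} rows", 42));
    }
}
```

Both overloads are applicable only through variable-arity invocation, and the compiler then prefers the one whose leading parameter types are more specific, which is why a `String` first argument silently changes the meaning of the call.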
Ensure external-source execution uses the slice queue when a data node receives explicit splits without a resolved FileSet, including the single-coalesced-split case. Add a regression test for the planner.
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request on Mar 23, 2026
This PR hardens the distributed execution path for external source queries (EXTERNAL command) by improving retry resilience, adding fault injection infrastructure, and introducing stress tests for many-split scenarios.

**Retry improvements:**
- DNS resolution failures (`UnknownHostException` inside `ConnectException`) are no longer retried, avoiding wasted retry budget on misconfigured bucket URLs.
- `RetryPolicy` gains a total-duration budget so retries cannot accumulate delays beyond the query's execution time.
- `ExternalSourceDrainUtils` drain timeout is now parameterizable and plumbed from the query deadline through `SourceOperatorContext` and `AsyncExternalSourceOperatorFactory`.

**Fault injection infrastructure:**
- `DataSourcesS3HttpFixture.createHandler()` wraps the S3 handler with `FaultInjectingS3HttpHandler`, enabling tests to toggle HTTP 503/500, connection resets, slow responses, and truncated responses during live queries.
- `AbstractExternalSourceSpecTestCase` exposes the fault handler to subclasses.

**Integration tests:**
- `ExternalDistributedResilienceIT`: verifies transient fault recovery and clean failure across all three distribution modes.
- `ExternalDistributedStressIT`: generates 500–1500 synthetic CSV splits with deterministic content and validates `COUNT(*)`, `SUM(value)`, and `SORT … LIMIT` correctness across distributed execution.

Developed with AI-assisted tooling
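The two retry rules described above (skip DNS failures, cap total retry time) can be sketched as a single decision function. This is an illustration under assumptions, not the `RetryPolicy` from this PR: the class name, the `shouldRetry` signature, and the cause-chain depth limit are all invented for the example.

```java
import java.net.ConnectException;
import java.net.UnknownHostException;
import java.time.Duration;

// Hypothetical sketch of a DNS-aware retry decision with a total-duration budget.
final class RetryBudgetSketch {
    // Assumed bound on cause-chain depth so a cyclic chain cannot loop forever.
    private static final int MAX_CAUSE_DEPTH = 16;

    /** True if an UnknownHostException appears anywhere in the cause chain. */
    static boolean isDnsFailure(Throwable failure) {
        Throwable current = failure;
        for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
            if (current instanceof UnknownHostException) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    /** Decides whether one more attempt is allowed after `failure`. */
    static boolean shouldRetry(Throwable failure, int attempt, int maxAttempts,
                               Duration elapsed, Duration nextDelay, Duration totalBudget) {
        if (isDnsFailure(failure)) {
            return false; // a wrong host name will not fix itself on retry
        }
        if (attempt >= maxAttempts) {
            return false; // attempt count exhausted
        }
        // Retry only if waiting nextDelay still fits inside the total budget.
        return elapsed.plus(nextDelay).compareTo(totalBudget) <= 0;
    }

    public static void main(String[] args) {
        Throwable dns = new ConnectException("connect failed")
            .initCause(new UnknownHostException("no-such-bucket.invalid"));
        Throwable refused = new ConnectException("connection refused");
        Duration budget = Duration.ofSeconds(30);
        Duration delay = Duration.ofMillis(200);

        System.out.println(shouldRetry(dns, 1, 5, Duration.ZERO, delay, budget));                 // false
        System.out.println(shouldRetry(refused, 1, 5, Duration.ZERO, delay, budget));             // true
        System.out.println(shouldRetry(refused, 1, 5, Duration.ofMillis(29_900), delay, budget)); // false
    }
}
```

Checking the budget against `elapsed + nextDelay`, rather than `elapsed` alone, keeps the policy from starting a backoff sleep that would itself overrun the deadline.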