[ES|QL] Harden distributed external source execution#144277

Merged
costin merged 9 commits into elastic:main from costin:esql/byte-based-buffer-backpressure
Mar 18, 2026

@costin costin commented Mar 15, 2026

This PR hardens the distributed execution path for external source queries
(EXTERNAL command) by improving retry resilience, adding fault injection
infrastructure, and introducing stress tests for many-split scenarios.

Retry improvements:

  • DNS resolution failures (UnknownHostException inside ConnectException) are
    no longer retried, avoiding wasted retry budget on misconfigured bucket URLs.
  • RetryPolicy gains a total-duration budget so retries cannot accumulate delays
    beyond the query's execution time.
  • ExternalSourceDrainUtils drain timeout is now parameterizable and plumbed
    from the query deadline through SourceOperatorContext and
    AsyncExternalSourceOperatorFactory.
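
A minimal sketch of the two retry rules above, for illustration only. The class `RetryBudgetSketch` and its method names and parameters are hypothetical and are not the PR's `RetryPolicy` API. It assumes a retry decision that walks the cause chain for `UnknownHostException` and that refuses a retry whose next delay would push the elapsed time past a total budget.

```java
import java.net.ConnectException;
import java.net.UnknownHostException;
import java.time.Duration;

/** Hypothetical sketch; names and signatures are assumptions, not the PR's RetryPolicy. */
final class RetryBudgetSketch {

    /** True if any throwable in the cause chain is an UnknownHostException. */
    static boolean isDnsFailure(Throwable failure) {
        for (Throwable t = failure; t != null; t = t.getCause()) {
            if (t instanceof UnknownHostException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Decides whether another attempt is allowed. DNS failures are never retried,
     * and a retry is refused when elapsed + nextDelay would exceed the total budget.
     */
    static boolean shouldRetry(Throwable failure, int attemptsSoFar, int maxAttempts,
                               Duration elapsed, Duration nextDelay, Duration totalBudget) {
        if (isDnsFailure(failure)) {
            return false; // a misconfigured host name will not fix itself
        }
        if (attemptsSoFar >= maxAttempts) {
            return false;
        }
        return elapsed.plus(nextDelay).compareTo(totalBudget) <= 0;
    }

    public static void main(String[] args) {
        ConnectException dns = new ConnectException("connect failed");
        dns.initCause(new UnknownHostException("bucket.example.invalid"));
        ConnectException refused = new ConnectException("connection refused");
        Duration budget = Duration.ofSeconds(10);

        // DNS failure: not retried even though attempts and budget remain.
        System.out.println(shouldRetry(dns, 1, 5, Duration.ofSeconds(1), Duration.ofSeconds(1), budget));
        // Plain connection failure within budget: retried.
        System.out.println(shouldRetry(refused, 1, 5, Duration.ofSeconds(1), Duration.ofSeconds(1), budget));
        // Same failure, but 9s elapsed + 2s delay exceeds the 10s budget: not retried.
        System.out.println(shouldRetry(refused, 1, 5, Duration.ofSeconds(9), Duration.ofSeconds(2), budget));
    }
}
```

In this sketch the budget is passed in by the caller; deriving it from the query deadline, as the description says the drain timeout now is, would be one way to keep retries inside the query's lifetime.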

Fault injection infrastructure:

  • DataSourcesS3HttpFixture.createHandler() wraps the S3 handler with
    FaultInjectingS3HttpHandler, enabling tests to toggle HTTP 503/500,
    connection resets, slow responses, and truncated responses during live queries.
  • AbstractExternalSourceSpecTestCase exposes the fault handler to subclasses.
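
To illustrate the wrapping idea, here is a hedged sketch of a fault-injecting decorator around a JDK `com.sun.net.httpserver.HttpHandler`. The class `FaultInjectingHandlerSketch`, its `Fault` enum, and `injectFault` are hypothetical names chosen for this example, not the API of `FaultInjectingS3HttpHandler`. Only the HTTP 500/503 modes are modelled; connection resets, slow responses, and truncated responses are left out.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpHandler;
import java.io.IOException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a toggleable fault-injecting wrapper; not the PR's handler. */
final class FaultInjectingHandlerSketch implements HttpHandler {

    enum Fault { NONE, HTTP_500, HTTP_503 }

    private final HttpHandler delegate;
    private final AtomicReference<Fault> fault = new AtomicReference<>(Fault.NONE);
    private final AtomicInteger remaining = new AtomicInteger();

    FaultInjectingHandlerSketch(HttpHandler delegate) {
        this.delegate = delegate;
    }

    /** Arms the handler: the next {@code count} requests fail with the given fault. */
    void injectFault(Fault f, int count) {
        fault.set(f);
        remaining.set(count);
    }

    /** Returns the fault to apply to the next request and consumes one unit of it. */
    Fault nextFault() {
        Fault f = fault.get();
        if (f == Fault.NONE) {
            return Fault.NONE;
        }
        int before = remaining.getAndUpdate(n -> n > 0 ? n - 1 : 0);
        return before > 0 ? f : Fault.NONE;
    }

    @Override
    public void handle(HttpExchange exchange) throws IOException {
        Fault f = nextFault();
        if (f == Fault.NONE) {
            delegate.handle(exchange); // healthy path: pass through untouched
            return;
        }
        try {
            // A response length of -1 means no response body is sent.
            exchange.sendResponseHeaders(f == Fault.HTTP_503 ? 503 : 500, -1);
        } finally {
            exchange.close();
        }
    }

    public static void main(String[] args) {
        // The no-op delegate is a placeholder; a real fixture would wrap the S3 handler.
        FaultInjectingHandlerSketch handler = new FaultInjectingHandlerSketch(exchange -> {});
        handler.injectFault(Fault.HTTP_503, 2);
        for (int i = 0; i < 3; i++) {
            System.out.println(handler.nextFault()); // HTTP_503, HTTP_503, NONE
        }
    }
}
```

Bounding the fault by a request count lets a test model a transient outage: the first few requests fail, then the wrapped handler serves traffic again without the test having to disarm it.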

Integration tests:

  • ExternalDistributedResilienceIT — verifies transient fault recovery and
    clean failure across all three distribution modes.
  • ExternalDistributedStressIT — generates 500–1500 synthetic CSV splits with
    deterministic content and validates COUNT(*), SUM(value), and SORT … LIMIT
    correctness across distributed execution.
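
The value of deterministic content is that expected aggregates can be computed in closed form rather than by re-reading the data. The sketch below shows one way this could work; the `id,value` layout, the choice `value = id`, and all names are assumptions made for illustration and are not taken from `ExternalDistributedStressIT`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of deterministic split generation; not the PR's test code. */
final class SyntheticSplitsSketch {

    /** Builds one CSV split whose rows carry globally unique, consecutive ids. */
    static String csvForSplit(int splitIndex, int rowsPerSplit) {
        StringBuilder sb = new StringBuilder("id,value\n");
        for (int r = 0; r < rowsPerSplit; r++) {
            long id = (long) splitIndex * rowsPerSplit + r;
            sb.append(id).append(',').append(id).append('\n'); // value == id by construction
        }
        return sb.toString();
    }

    /** Expected COUNT(*) over all splits. */
    static long expectedCount(int splits, int rowsPerSplit) {
        return (long) splits * rowsPerSplit;
    }

    /** Expected SUM(value): ids are 0..n-1, so the sum is n(n-1)/2. */
    static long expectedSum(int splits, int rowsPerSplit) {
        long n = expectedCount(splits, rowsPerSplit);
        return n * (n - 1) / 2;
    }

    /** Brute-force scan used to cross-check the closed forms; returns {count, sum}. */
    static long[] scan(List<String> files) {
        long count = 0;
        long sum = 0;
        for (String file : files) {
            String[] lines = file.split("\n");
            for (int i = 1; i < lines.length; i++) { // skip the header line
                sum += Long.parseLong(lines[i].split(",")[1]);
                count++;
            }
        }
        return new long[] { count, sum };
    }

    public static void main(String[] args) {
        int splits = 50;
        int rowsPerSplit = 20;
        List<String> files = new ArrayList<>();
        for (int s = 0; s < splits; s++) {
            files.add(csvForSplit(s, rowsPerSplit));
        }
        long[] actual = scan(files);
        System.out.println("count=" + actual[0] + " expected=" + expectedCount(splits, rowsPerSplit));
        System.out.println("sum=" + actual[1] + " expected=" + expectedSum(splits, rowsPerSplit));
    }
}
```

Because every id appears exactly once across all splits, a lost or duplicated split changes both the count and the sum, which is what makes these aggregates a useful correctness check for distributed execution.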

Developed with AI-assisted tooling

@costin costin requested a review from bpintea March 15, 2026 19:44
@costin costin enabled auto-merge (squash) March 15, 2026 19:45
@elasticsearchmachine

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 15, 2026
@elasticsearchmachine

Hi @costin, I've created a changelog YAML for you.

@costin costin force-pushed the esql/byte-based-buffer-backpressure branch 5 times, most recently from 8af5bd3 to 960b536 Compare March 16, 2026 08:35
costin added 3 commits March 16, 2026 10:47
Adds ExternalDistributedStressIT with synthetic CSV generation
to verify correctness under 500-1500 splits across all three
distribution modes (coordinator_only, round_robin, adaptive).

Add DNS-aware retry policy, adaptive timeout budgets, fault
injection wiring in S3 fixture, and multi-node resilience tests
for distributed external source queries.
@costin costin force-pushed the esql/byte-based-buffer-backpressure branch from 960b536 to 3cb050d Compare March 16, 2026 08:47

@bpintea bpintea left a comment

🤖-assisted review.

@costin costin force-pushed the esql/byte-based-buffer-backpressure branch from 17881e0 to 5b841ba Compare March 16, 2026 17:07
Replace LoggerMessageFormat.format with Strings.format across
all distributed ITs to avoid Java overload resolution picking
the (String prefix, String pattern, Object...) signature which
garbles both query construction and assertion messages.
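
The overload pitfall can be reproduced without Elasticsearch on the classpath. The sketch below uses two hypothetical `format` overloads that mimic the shape described in the commit message; `OverloadPitfallSketch`, `substitute`, and `safeFormat` are illustrative names, and the string handling is a simplification rather than the real formatter.

```java
/** Hypothetical reproduction of the overload-resolution pitfall; not Elasticsearch code. */
final class OverloadPitfallSketch {

    /** Replaces each "{}" in the pattern with the next argument, left to right. */
    static String substitute(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int from = 0;
        int i = 0;
        int at;
        while (i < args.length && (at = pattern.indexOf("{}", from)) >= 0) {
            sb.append(pattern, from, at).append(args[i++]);
            from = at + 2;
        }
        return sb.append(pattern.substring(from)).toString();
    }

    /** Intended entry point: pattern plus arguments. */
    static String format(String pattern, Object... args) {
        return substitute(pattern, args);
    }

    /** Second overload: a literal prefix followed by the formatted pattern. */
    static String format(String prefix, String pattern, Object... args) {
        return prefix + substitute(pattern, args);
    }

    /** A single-signature method leaves the compiler nothing to choose between. */
    static String safeFormat(String pattern, Object... args) {
        return substitute(pattern, args);
    }

    public static void main(String[] args) {
        // The first argument is a String, so Java picks the more specific
        // (String, String, Object...) overload and treats the pattern as a prefix.
        System.out.println(format("{} failed after {} attempts", "split-7", 3));
        // prints: {} failed after {} attemptssplit-7

        System.out.println(safeFormat("{} failed after {} attempts", "split-7", 3));
        // prints: split-7 failed after 3 attempts
    }
}
```

The call compiles without warnings because both overloads are applicable and one is strictly more specific, so the mistake only shows up in the output. That is consistent with the garbled queries and assertion messages the commit describes.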

Reduce stress test split counts from 500-1500 to 50-200 to
prevent OOM-killing the 2-node CI cluster.
@costin costin force-pushed the esql/byte-based-buffer-backpressure branch from 5b841ba to 3d13ce9 Compare March 16, 2026 22:37
costin and others added 5 commits March 17, 2026 00:38
Ensure external-source execution uses the slice queue when a data node receives
explicit splits without a resolved FileSet, including the single-coalesced-split
case. Add a regression test for the planner.
@costin costin disabled auto-merge March 18, 2026 16:55
@costin costin merged commit d78c5e2 into elastic:main Mar 18, 2026
35 of 36 checks passed
@costin costin deleted the esql/byte-based-buffer-backpressure branch March 18, 2026 16:56
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026

Labels

:Analytics/ES|QL (AKA ESQL), >enhancement, ES|QL|DS (ES|QL datasources), Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), v9.4.0
