# ESQL: Distributed execution for external data sources

External data sources (`EXTERNAL` command) currently execute entirely on the coordinator — single driver, single thread, no parallelism. For large datasets, especially with aggregations, this creates a throughput bottleneck and concentrated memory pressure on the coordinator.

This meta issue tracks the work to add proper distribution and parallelism to external source execution. The approach introduces a split-based execution model where external sources are partitioned into independent units of work (`ExternalSplit`), discovered through a pluggable `SplitProvider`, and optionally distributed across data nodes using the existing exchange infrastructure.
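
The split model described above can be sketched roughly as follows. This is an illustration only, not the Elasticsearch implementation: `ExternalSplit` and `SplitProvider` are names from this issue, but their fields and methods here (`path`, `size_bytes`, `discover`), the `StaticListingProvider` class, and the `run_drivers` helper are hypothetical. The point is that once a source is cut into independent splits, any number of drivers can drain them from a shared queue.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from queue import Empty, Queue
from typing import Dict, Iterable, List, Protocol


@dataclass(frozen=True)
class ExternalSplit:
    """One independent unit of work, e.g. a single file (fields are hypothetical)."""
    path: str
    size_bytes: int


class SplitProvider(Protocol):
    """Pluggable discovery: turns a source location into splits."""
    def discover(self, location: str) -> List[ExternalSplit]: ...


class StaticListingProvider:
    """Toy provider backed by an in-memory listing instead of a real object store."""
    def __init__(self, listing: Dict[str, int]):
        self._listing = listing

    def discover(self, location: str) -> List[ExternalSplit]:
        return [ExternalSplit(path, size)
                for path, size in sorted(self._listing.items())
                if path.startswith(location)]


def run_drivers(splits: Iterable[ExternalSplit], drivers: int) -> int:
    """Each driver pulls splits from a shared queue until the queue is drained."""
    queue = Queue()
    for split in splits:
        queue.put(split)

    def drive() -> int:
        total = 0
        while True:
            try:
                split = queue.get_nowait()
            except Empty:
                return total
            # Stand-in for reading the split and computing a partial aggregate.
            total += split.size_bytes

    with ThreadPoolExecutor(max_workers=drivers) as pool:
        futures = [pool.submit(drive) for _ in range(drivers)]
        # Stand-in for the final reduction of the per-driver partial results.
        return sum(future.result() for future in futures)
```

Because no split depends on another, the same list of splits could just as well be divided among nodes instead of threads, which is what makes the model a natural fit for both local parallelism and distribution.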

**Key characteristics:**
- Pluggable partition detection: Hive-style `key=value` paths are auto-detected by default; bare directory layouts (Kinesis Firehose, CloudTrail, etc.) are supported via `{name}` path templates in the `WITH` configuration. Partition-aware glob rewriting narrows the file listing scope before any data is read.
- Virtual partition columns: partition values from file paths exposed as queryable columns (appended at tail, same position as metadata columns). On name conflict with data columns, path-derived values take precedence.
- Three-level filter model: L1 partition pruning (at split discovery), L2 data filter pushdown (per-node, translated to the native format), and the engine filter (the remainder, left in the plan).
- Pluggable distribution strategy with basic query analysis (distribute when aggregations are present and multiple splits exist; keep on coordinator for simple queries where distribution overhead is not justified)
- ES|QL `Expression` objects (already `NamedWriteable`) serve as the canonical filter representation across nodes — no need to serialize connector-native filter types
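
To make the partition-related points above concrete, here is a minimal sketch of path-based partition detection, pruning, and virtual column injection. It is not the actual `PartitionDetector` or `VirtualColumnInjector` code: all function names are hypothetical, partition values are kept as plain strings, and the rule "use the template when one is configured, otherwise fall back to Hive-style detection" is an assumption drawn from the description.

```python
import re
from typing import Callable, Dict, List, Optional

Partition = Dict[str, str]


def detect_hive(path: str) -> Partition:
    """Collect `key=value` directory segments; the file name itself is ignored."""
    parts: Partition = {}
    for segment in path.split("/")[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            parts[key] = value
    return parts


def detect_template(path: str, template: str) -> Optional[Partition]:
    """Match a `{name}` template such as 'logs/{year}/{month}/' against a path prefix."""
    pattern = ""
    for token in re.split(r"(\{[A-Za-z_]\w*\})", template):
        if token.startswith("{") and token.endswith("}"):
            pattern += "(?P<%s>[^/]+)" % token[1:-1]  # one path segment per placeholder
        else:
            pattern += re.escape(token)
    match = re.match(pattern, path)
    return match.groupdict() if match else None


def detect(path: str, template: Optional[str] = None) -> Partition:
    """Assumed policy: a configured template wins, otherwise Hive-style detection."""
    if template is not None:
        return detect_template(path, template) or {}
    return detect_hive(path)


def prune(paths: List[str],
          predicate: Callable[[Partition], bool],
          template: Optional[str] = None) -> List[str]:
    """Partition pruning: keep only paths whose partition values satisfy the predicate."""
    return [p for p in paths if predicate(detect(p, template))]


def inject_virtual_columns(row: Dict[str, object], partition: Partition) -> Dict[str, object]:
    """Append partition values at the tail; path-derived values win on a name conflict."""
    merged = {name: value for name, value in row.items() if name not in partition}
    merged.update(partition)
    return merged
```

For example, `prune(paths, lambda p: p.get("year") == "2024")` drops every file outside `year=2024` before any of them is opened, so the data filter pushdown and the engine filter only ever see rows from the surviving splits.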

## Steps

- [x] **Split SPI and Hive partition detection** — `ExternalSplit` (`NamedWriteable`), `SplitProvider`, `FileSplit`; Hive-style partition column detection from file paths during glob expansion; partition filter hint extraction from parsed plan; glob pattern rewriting for `=`/`>=`/`IN` predicates (https://github.com/elastic/elasticsearch/pull/143005)
- [x] **Split discovery phase** — `ExternalSourceExec` carries splits; split discovery wired into `ComputeService` after physical planning; L1 partition pruning with full expression evaluation against path-derived partition values (https://github.com/elastic/elasticsearch/pull/143114)
- [x] **Pluggable partition detection and virtual columns** — `PartitionDetector` interface with Hive, template, and auto-detect implementations; `TemplatePartitionDetector` for bare directory layouts via `{name}` path templates; partition columns added to schema as virtual columns; `VirtualColumnInjector` produces `ConstantBlock`s per partition value per output Page at the source operator level (https://github.com/elastic/elasticsearch/pull/143120)
- [x] **Local parallelism** — `ExternalSliceQueue`; `LocalExecutionPlanner` creates multiple drivers per external source; each driver processes a different split; operator factories reworked for per-split operation (https://github.com/elastic/elasticsearch/pull/143154)
- [x] **Distribution strategy and plan structure** — `ExternalDistributionStrategy` (pluggable); `AdaptiveStrategy` with basic query analysis; `NodeEligibilityStrategy` hook; `Mapper` inserts `ExchangeExec` for distributable external sources; pragma `external_distribution` (#143194)
- [x] **Data node external source execution** — `DataNodeRequest` carries `ExternalSplit` assignments; `DataNodeComputeHandler` external source branch; `FilterPushdownRegistry` wired on data nodes for per-node L2 filter translation; end-to-end distributed execution through existing exchange infrastructure (#143209)
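
The distribution decision from the last two steps might look roughly like the sketch below. The rule in `should_distribute` follows the description in this issue (distribute only when aggregations are present and more than one split exists). Everything else is an assumption for illustration: the function names, the use of plain strings for splits and nodes, and in particular the round-robin assignment, since the issue does not state how splits are assigned to data nodes.

```python
from typing import Dict, List


def should_distribute(has_aggregation: bool, split_count: int) -> bool:
    """Basic query analysis: distribution pays off only for aggregations over several splits."""
    return has_aggregation and split_count > 1


def assign_splits(splits: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Round-robin assignment of splits to nodes (hypothetical policy)."""
    assignment: Dict[str, List[str]] = {node: [] for node in nodes}
    for index, split in enumerate(splits):
        assignment[nodes[index % len(nodes)]].append(split)
    return assignment


def plan(has_aggregation: bool,
         splits: List[str],
         coordinator: str,
         data_nodes: List[str]) -> Dict[str, List[str]]:
    """Return a node -> splits mapping; fall back to the coordinator when not distributing."""
    if data_nodes and should_distribute(has_aggregation, len(splits)):
        return assign_splits(splits, data_nodes)
    return {coordinator: list(splits)}
```

With three splits and two data nodes, an aggregating query would be spread across both nodes, while a simple query without aggregation would keep all three splits on the coordinator.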

ESQL: Distributed execution for external data sources - Meta Issue #142996

Steps

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ESQL: Distributed execution for external data sources - Meta Issue #142996

Description

ESQL: Distributed execution for external data sources

Steps

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions