Skip to content

ESQL: Distributed execution for external data sources - Meta Issue #142996

@costin

Description

@costin

ESQL: Distributed execution for external data sources

External data sources (EXTERNAL command) currently execute entirely on the coordinator — single driver, single thread, no parallelism. For large datasets, especially with aggregations, this creates a throughput bottleneck and concentrated memory pressure on the coordinator.

This meta issue tracks the work to add proper distribution and parallelism to external source execution. The approach introduces a split-based execution model where external sources are partitioned into independent units of work (ExternalSplit), discovered through a pluggable SplitProvider, and optionally distributed across data nodes using the existing exchange infrastructure.

Key characteristics:

  • Pluggable partition detection: Hive-style key=value paths auto-detected by default, bare directory layouts (Kinesis Firehose, CloudTrail, etc.) supported via {name} path templates in WITH configuration. Partition-aware glob rewriting reduces file listing scope before any data is read.
  • Virtual partition columns: partition values from file paths exposed as queryable columns (appended at tail, same position as metadata columns). On name conflict with data columns, path-derived values take precedence.
  • Three-level filter model: partition pruning (at split discovery), data filter pushdown (per-node, translated to native format), engine filter (remainder in plan)
  • Pluggable distribution strategy with basic query analysis (distribute when aggregations are present and multiple splits exist; keep on coordinator for simple queries where distribution overhead is not justified)
  • ES|QL Expression objects (already NamedWriteable) serve as the canonical filter representation across nodes — no need to serialize connector-native filter types

Steps

Metadata

Metadata

Assignees

Labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions