
ESQL: Add split SPI, partition detection, and filter hint extraction #143005

Merged
costin merged 11 commits into elastic:main from costin:esql/ds-distributed/stage-1 on Feb 25, 2026

@costin
Member

@costin costin commented Feb 24, 2026

Summary

Foundation for external source distribution (stage 1 of the multi-PR plan). Introduces
the core abstractions needed to parallelize and distribute external source reads across
nodes, along with Hive-style partition pruning at the glob-expansion layer.

  • ExternalSplit SPI (ExternalSplit, FileSplit, SplitProvider, SplitDiscoveryContext): serializable work-unit abstraction for cross-node distribution (wired in PR 2)
  • HivePartitionDetector: parses key=value segments from file paths, handles URL-encoded values, infers types via StringUtils.parseIntegral/parseDouble
  • PartitionFilterHintExtractor: walks the unresolved plan, extracting simple predicates (=, !=, >, >=, <, <=, IN) from Filter nodes above UnresolvedExternalRelation; normalizes BytesRef values to String
  • GlobExpander: partition-aware glob rewriting using filter hints, with glob-metacharacter escaping
  • FileSet: extended to carry PartitionMetadata through to split discovery
  • ExternalSourceResolver / EsqlSession: wired hint extraction into the resolution path
  • EsqlPlugin: registered FileSplit as a NamedWriteable
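
To illustrate the partition detection described above, here is a minimal, self-contained sketch of parsing Hive-style key=value path segments with URL decoding and simple type inference. It is not the PR's HivePartitionDetector: the class and method names (HivePartitionSketch, parse, inferType) are hypothetical, and it substitutes the JDK's Long.parseLong/Double.parseDouble for the StringUtils.parseIntegral/parseDouble helpers mentioned in the description, so edge-case behavior may differ.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not the HivePartitionDetector from this PR. */
final class HivePartitionSketch {

    /** Collects key=value directory segments from a path, preserving path order. */
    static Map<String, Object> parse(String path) {
        Map<String, Object> partitions = new LinkedHashMap<>();
        String[] segments = path.split("/");
        // The last segment is treated as the file name, so it is never a partition.
        for (int i = 0; i < segments.length - 1; i++) {
            String segment = segments[i];
            int eq = segment.indexOf('=');
            if (eq <= 0 || eq == segment.length() - 1) {
                continue; // no key or no value: not a partition segment
            }
            String key = segment.substring(0, eq);
            // Note: URLDecoder also maps '+' to a space and rejects malformed
            // percent-escapes; a real detector may need different handling.
            String raw = URLDecoder.decode(segment.substring(eq + 1), StandardCharsets.UTF_8);
            partitions.put(key, inferType(raw));
        }
        return partitions;
    }

    /** Tries Long first, then Double, and falls back to the decoded String. */
    static Object inferType(String value) {
        try {
            return Long.parseLong(value);
        } catch (NumberFormatException e) {
            // not an integral value; try a floating-point value next
        }
        try {
            return Double.parseDouble(value);
        } catch (NumberFormatException e) {
            return value;
        }
    }
}
```

With this sketch, parse("s3://bucket/data/year=2024/region=us%2Dwest/part-0.parquet") yields year as the Long 2024 and region as the String "us-west".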

Relates #142996

Developed using AI-assisted tooling

Foundation for external source distribution (stage 1). Introduces the
split abstraction for parallelizable work units, Hive-style partition
detection with type inference, and filter hint extraction from unresolved
plans for partition pruning during glob expansion.

- ExternalSplit SPI, FileSplit, SplitProvider, SplitDiscoveryContext
- HivePartitionDetector: key=value parsing, URL decoding, type inference
- PartitionFilterHintExtractor: walks Filter nodes above external relations
- GlobExpander: partition-aware glob rewriting with hint escaping
- FileSet: carries PartitionMetadata from detection through to splits
- ExternalSourceResolver/EsqlSession: wired hint extraction into resolution
- EsqlPlugin: registered FileSplit NamedWriteable

Developed using AI-assisted tooling
@costin costin added the >feature, Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), :Analytics/ES|QL (AKA ESQL), and v9.4.0 labels on Feb 24, 2026
@costin costin requested a review from bpintea February 24, 2026 22:32
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Collaborator

Hi @costin, I've created a changelog YAML for you.

@elasticsearchmachine
Collaborator

Hi @costin, I've updated the changelog YAML for you.

@elasticsearchmachine
Collaborator

Hi @costin, I've updated the changelog YAML for you.

Contributor

@bpintea bpintea left a comment

LGTM. I've mostly left notes on naming choices.

*
* <p>Unlike {@link Split} (a marker interface for connector-internal use that is never
* serialized), {@code ExternalSplit} extends {@link NamedWriteable} to support
* cross-node distribution in PR 4/5.
Contributor

:) I guess this level of detail isn't necessary in the docs. (Can/should be removed later).

import java.util.Objects;

/**
* A split representing a single file (or byte range within a file) in a file-based external source.
Contributor

I think the comment should put more emphasis on the fact that a split can be a slice of a file, rather than presenting this as a secondary possibility (even though that's not currently the case, i.e. a file == a file split).

* serialized), {@code ExternalSplit} extends {@link NamedWriteable} to support
* cross-node distribution in PR 4/5.
*/
public interface ExternalSplit extends NamedWriteable {
Contributor

I personally find the binding of the (current?) command name (EXTERNAL) to the concept a bit misleading. Also, "Split" seems a bit unclear, since we also mention "Partitions". "Split" doesn't seem very common in this work-division space: "slice", "partition", "chunk", "shard", "batch"?
I assume this would also apply to non-file-based connectors, like Arrow Flight, JDBC, (generic REST)?

Member Author

Naming is hard :)
I tend to use Partition as the concept and Split as the actual implementation of it; I used ExternalSplit mainly because it's related to the EXTERNAL command (or datasources).
Shard = ES related
Slice = byte/integer/concrete data related


String sourceType();

default long estimatedSizeInBytes() {
Contributor

This is required for file-based connectors, right?

Member Author

In most cases yes; however, it can also work on storage such as S3 or HTTP, which can indicate the range size and thus influence the split size (and therefore the number of splits).
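
As a rough illustration of how a known size could influence the split size and count, the sketch below cuts a file of a given length into byte-range splits of a target size. This is an assumption-laden example, not code from the PR: ByteRangeSplitSketch, Range, and split are hypothetical names, and (as noted in the review) a file currently maps to a single file split.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; the PR currently maps one file to one split. */
final class ByteRangeSplitSketch {

    /** A byte range within a file: starting offset and length in bytes. */
    record Range(long offset, long length) {}

    /** Cuts [0, fileLength) into consecutive ranges of at most targetSplitSize bytes. */
    static List<Range> split(long fileLength, long targetSplitSize) {
        if (targetSplitSize <= 0) {
            throw new IllegalArgumentException("targetSplitSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += targetSplitSize) {
            // The final range may be shorter than the target size.
            ranges.add(new Range(offset, Math.min(targetSplitSize, fileLength - offset)));
        }
        return ranges;
    }
}
```

For example, split(250, 100) produces three ranges: (0, 100), (100, 100), and (200, 50). A smaller target size gives more splits and therefore more potential parallelism.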

…bExpander

Introduces PartitionFilterHintExtractor.Operator enum replacing raw
string comparisons. Refactors rewriteGlobWithHints into smaller methods:
indexRewritableHints and rewriteSegment. Collapses single-value IN with
EQUALS (same glob output). Adds canRewriteGlob() and isSingleValue()
to keep glob-rewriting logic out of string comparisons.

Developed using AI-assisted tooling
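
The refactoring described in this commit can be pictured with the following self-contained sketch of hint-driven glob rewriting. The names Operator, canRewriteGlob, isSingleValue, and rewriteSegment come from the commit message, but their bodies here are guesses; GlobHintSketch, Hint, escape, and rewriteGlob are hypothetical, as is the exact set of escaped metacharacters. It assumes only EQUALS and IN hints can narrow a key=* segment.

```java
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical illustration only; not the GlobExpander from this PR. */
final class GlobHintSketch {

    enum Operator {
        EQUALS, NOT_EQUALS, GREATER_THAN, GREATER_THAN_OR_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, IN;

        /** Assumption: only equality-style operators map onto a glob pattern. */
        boolean canRewriteGlob() {
            return this == EQUALS || this == IN;
        }
    }

    record Hint(Operator operator, List<String> values) {
        boolean isSingleValue() {
            return values.size() == 1;
        }
    }

    /** Backslash-escapes glob metacharacters so a value is matched literally. */
    static String escape(String value) {
        StringBuilder escaped = new StringBuilder();
        for (char c : value.toCharArray()) {
            if ("*?[]{},\\".indexOf(c) >= 0) {
                escaped.append('\\');
            }
            escaped.append(c);
        }
        return escaped.toString();
    }

    /** Rewrites a single "key=*" segment when a usable hint exists for the key. */
    static String rewriteSegment(String segment, Map<String, Hint> hints) {
        int eq = segment.indexOf('=');
        if (eq <= 0 || !segment.substring(eq + 1).equals("*")) {
            return segment; // not a wildcard partition segment
        }
        Hint hint = hints.get(segment.substring(0, eq));
        if (hint == null || !hint.operator().canRewriteGlob() || hint.values().isEmpty()) {
            return segment;
        }
        String prefix = segment.substring(0, eq + 1);
        if (hint.isSingleValue()) {
            // A single-value IN collapses to the same output as EQUALS.
            return prefix + escape(hint.values().get(0));
        }
        StringJoiner alternatives = new StringJoiner(",", "{", "}");
        for (String value : hint.values()) {
            alternatives.add(escape(value));
        }
        return prefix + alternatives;
    }

    static String rewriteGlob(String glob, Map<String, Hint> hints) {
        String[] segments = glob.split("/", -1);
        for (int i = 0; i < segments.length; i++) {
            segments[i] = rewriteSegment(segments[i], hints);
        }
        return String.join("/", segments);
    }
}
```

Given an EQUALS hint year=2024 and an IN hint month in (01, 02), rewriteGlob("data/year=*/month=*/*.parquet", hints) returns "data/year=2024/month={01,02}/*.parquet"; segments whose hints use other operators are left unchanged.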
@costin costin enabled auto-merge (squash) February 25, 2026 16:40
@costin costin merged commit 4b7e191 into elastic:main Feb 25, 2026
36 checks passed
@costin costin deleted the esql/ds-distributed/stage-1 branch February 25, 2026 21:38
@tylerperk tylerperk added the ES|QL|DS (ES|QL datasources) label on Mar 18, 2026

Labels

:Analytics/ES|QL (AKA ESQL), >enhancement, ES|QL|DS (ES|QL datasources), Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo)), v9.4.0