ESQL: Add split SPI, partition detection, and filter hint extraction#143005
costin merged 11 commits into elastic:main
Foundation for external source distribution (stage 1). Introduces the split abstraction for parallelizable work units, Hive-style partition detection with type inference, and filter hint extraction from unresolved plans for partition pruning during glob expansion.

- ExternalSplit SPI, FileSplit, SplitProvider, SplitDiscoveryContext
- HivePartitionDetector: key=value parsing, URL decoding, type inference
- PartitionFilterHintExtractor: walks Filter nodes above external relations
- GlobExpander: partition-aware glob rewriting with hint escaping
- FileSet: carries PartitionMetadata from detection through to splits
- ExternalSourceResolver/EsqlSession: wired hint extraction into resolution
- EsqlPlugin: registered FileSplit NamedWriteable

Developed using AI-assisted tooling
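To make the partition detection step concrete, here is a minimal sketch of Hive-style key=value parsing with URL decoding and type inference. It is not the PR's HivePartitionDetector: the class name HivePartitionSketch and its methods are hypothetical, and it uses plain JDK parsing (Long.parseLong, Double.parseDouble) as a stand-in for whatever helpers the real code calls.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the HivePartitionDetector from this PR.
public class HivePartitionSketch {

    /** Collects key=value directory segments in path order; the last segment is treated as the file name. */
    public static Map<String, Object> detect(String path) {
        Map<String, Object> partitions = new LinkedHashMap<>();
        String[] segments = path.split("/");
        for (int i = 0; i < segments.length - 1; i++) {
            String segment = segments[i];
            int eq = segment.indexOf('=');
            if (eq <= 0 || eq == segment.length() - 1) {
                continue; // no key, no value, or not a key=value segment at all
            }
            String key = segment.substring(0, eq);
            // Assumption: values are percent-encoded; note that URLDecoder also maps '+' to a space.
            String value = URLDecoder.decode(segment.substring(eq + 1), StandardCharsets.UTF_8);
            partitions.put(key, inferType(value));
        }
        return partitions;
    }

    /** Tries integral first, then floating point, and falls back to the raw string. */
    static Object inferType(String value) {
        try {
            return Long.parseLong(value);
        } catch (NumberFormatException e) {
            // not an integral value, try the next candidate type
        }
        try {
            return Double.parseDouble(value);
        } catch (NumberFormatException e) {
            // not a floating point value either
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(detect("s3://bucket/logs/year=2024/region=us%2Feast/score=1.5/part-0.parquet"));
    }
}
```

One caveat of this sketch: Double.parseDouble is lenient (it accepts inputs such as "NaN" or "1d"), so a production detector would likely want stricter rules before treating a partition value as numeric.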
Pinging @elastic/es-analytical-engine (Team:Analytics)
Hi @costin, I've created a changelog YAML for you.
Hi @costin, I've updated the changelog YAML for you.
bpintea left a comment
LGTM. I've mostly left some notes on naming choices.
 *
 * <p>Unlike {@link Split} (a marker interface for connector-internal use that is never
 * serialized), {@code ExternalSplit} extends {@link NamedWriteable} to support
 * cross-node distribution in PR 4/5.
:) I guess this level of detail isn't necessary in the docs. (Can/should be removed later).
import java.util.Objects;

/**
 * A split representing a single file (or byte range within a file) in a file-based external source.
The comment should put more emphasis on the fact that a split can be a slice of a file, rather than presenting this as a secondary possibility, I think. (Even though that's not currently the case, i.e. a file == a file split.)
 * serialized), {@code ExternalSplit} extends {@link NamedWriteable} to support
 * cross-node distribution in PR 4/5.
 */
public interface ExternalSplit extends NamedWriteable {
I personally find the binding of the (current?) command name (EXTERNAL) to the concept a bit misleading. Also, "Split" seems a bit unclear, since we then also have "Partitions" mentioned. "Split" doesn't seem very common in this work-division space: "slice", "partition", "chunk", "shard", "batch"?
I assume this would apply to non-file-based connectors, like Arrow Flight, JDBC, (generic REST)?
Naming is hard :)
I tend to use Partition as a concept and Split as the actual implementation of it; I used ExternalSplit specifically because it's related to the External command (or datasources).
Shard = ES related
Slice = byte/integer/concrete data related

    String sourceType();

    default long estimatedSizeInBytes() {
This is required for file-based connectors, right?
In most cases yes; however, it can also work on storage such as S3 or HTTP that can report the range size and thus influence the split size (and therefore the number of splits).
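As a rough illustration of how a known size could drive the split count, here is a minimal sketch that cuts one object into fixed-size byte ranges. Everything in it is hypothetical (the ByteRangeSplitSketch class, the Range record, the plan method); per the review thread, the PR itself currently treats one file as one file split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the PR currently maps one file to one FileSplit.
public class ByteRangeSplitSketch {

    /** A contiguous byte range of a single object. */
    record Range(String path, long offset, long length) {}

    /** Cuts [0, sizeInBytes) into ranges of at most targetSplitSize bytes; the last range may be shorter. */
    static List<Range> plan(String path, long sizeInBytes, long targetSplitSize) {
        if (targetSplitSize <= 0) {
            throw new IllegalArgumentException("targetSplitSize must be positive");
        }
        List<Range> ranges = new ArrayList<>();
        for (long offset = 0; offset < sizeInBytes; offset += targetSplitSize) {
            ranges.add(new Range(path, offset, Math.min(targetSplitSize, sizeInBytes - offset)));
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 250-byte object with a 100-byte target yields three ranges.
        plan("s3://bucket/data.parquet", 250, 100).forEach(System.out::println);
    }
}
```

A real planner would also need to align ranges with format boundaries (for example row groups), which this sketch ignores.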
…cription Developed using AI-assisted tooling
…bExpander Introduces PartitionFilterHintExtractor.Operator enum replacing raw string comparisons. Refactors rewriteGlobWithHints into smaller methods: indexRewritableHints and rewriteSegment. Collapses single-value IN with EQUALS (same glob output). Adds canRewriteGlob() and isSingleValue() to keep glob-rewriting logic out of string comparisons. Developed using AI-assisted tooling
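The refactoring described in this commit message could look roughly like the sketch below. Only the names Operator, canRewriteGlob, isSingleValue and rewriteSegment come from the commit message; the enum constants, the Hint record, the escaping rules and the exact glob syntax are assumptions made for illustration and may differ from the actual GlobExpander.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration only; not the GlobExpander or PartitionFilterHintExtractor from this PR.
public class GlobHintSketch {

    /** Assumed constants; only equality-style operators can be expressed as a glob. */
    enum Operator {
        EQUALS, NOT_EQUALS, GREATER_THAN, GREATER_THAN_OR_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, IN;

        boolean canRewriteGlob() {
            return this == EQUALS || this == IN;
        }
    }

    /** A filter hint on one partition key (assumed shape). */
    record Hint(String key, Operator operator, List<String> values) {
        boolean isSingleValue() {
            return values.size() == 1;
        }
    }

    /** Backslash-escapes glob metacharacters so a hint value is matched literally. */
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if ("*?[]{},\\".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Rewrites a "key=*" segment when a rewritable hint exists for that key; otherwise returns it unchanged. */
    static String rewriteSegment(String segment, Map<String, Hint> hintsByKey) {
        if (!segment.endsWith("=*")) {
            return segment;
        }
        String key = segment.substring(0, segment.length() - 2);
        Hint hint = hintsByKey.get(key);
        if (hint == null || !hint.operator().canRewriteGlob() || hint.values().isEmpty()) {
            return segment;
        }
        if (hint.isSingleValue()) {
            // EQUALS and single-value IN produce the same glob output.
            return key + "=" + escape(hint.values().get(0));
        }
        return key + "=" + hint.values().stream().map(GlobHintSketch::escape).collect(Collectors.joining(",", "{", "}"));
    }

    static String rewriteGlob(String glob, Map<String, Hint> hintsByKey) {
        String[] segments = glob.split("/", -1);
        for (int i = 0; i < segments.length; i++) {
            segments[i] = rewriteSegment(segments[i], hintsByKey);
        }
        return String.join("/", segments);
    }
}
```

Under these assumptions, a hint year = 2024 together with month IN (1, 2) would turn data/year=*/month=*/*.parquet into data/year=2024/month={1,2}/*.parquet, while a range hint such as year > 2020 would leave the glob untouched and rely on later pruning.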
Summary
Foundation for external source distribution (stage 1 of the multi-PR plan). Introduces
the core abstractions needed to parallelize and distribute external source reads across
nodes, along with Hive-style partition pruning at the glob-expansion layer.
- Split SPI (ExternalSplit, FileSplit, SplitProvider, SplitDiscoveryContext): serializable work-unit abstraction for cross-node distribution (wired in PR 2)
- Hive-style partition detection: parses key=value segments from file paths, handles URL-encoded values, infers types via StringUtils.parseIntegral/parseDouble
- Filter hint extraction: extracts comparison hints (=, !=, >, >=, <, <=, IN) from Filter nodes above UnresolvedExternalRelation; normalizes BytesRef values to String
- FileSet: carries PartitionMetadata through to split discovery
- EsqlPlugin: registers FileSplit as a NamedWriteable

Relates #142996
Developed using AI-assisted tooling