ESQL: Add split coalescing for many small files #143335
Merged
costin merged 1 commit into elastic:main on Mar 2, 2026
Collaborator
Pinging @elastic/es-analytical-engine (Team:Analytics)
Collaborator
Hi @costin, I've created a changelog YAML for you.
bpintea
approved these changes
Mar 1, 2026
Reduces scheduling overhead when querying thousands of tiny files (e.g. Iceberg micro-partitions) by grouping them into composite CoalescedSplit units. Uses greedy bin-packing by size when file sizes are known, count-based grouping otherwise.

- CoalescedSplit: composite ExternalSplit wrapping child splits
- SplitCoalescer: bin-packing logic with 32-split threshold
- ExternalSourceOperatorFactory: expand CoalescedSplit children
- AsyncExternalSourceOperatorFactory: recursive leaf expansion
- ComputeService: coalesce after split discovery
- EsqlPlugin: register CoalescedSplit NamedWriteable

Developed using AI-assisted tooling
319724c to
13c023b
szybia
added a commit
to szybia/elasticsearch
that referenced
this pull request
Mar 2, 2026
tballison
pushed a commit
to tballison/elasticsearch
that referenced
this pull request
Mar 3, 2026
Reduces scheduling overhead when querying thousands of tiny files
(e.g. Iceberg micro-partitions) by grouping them into composite
CoalescedSplit units. Uses greedy bin-packing by size when file
sizes are known, count-based grouping otherwise.
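To illustrate the two grouping modes described above, here is a minimal sketch of a next-fit style greedy packer. It is not the `SplitCoalescer` from this PR: the class `CoalesceSketch`, the `FileSplit` record, the convention that a negative size means "unknown", and the `targetBytes` and `maxPerGroup` parameters are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is not the SplitCoalescer from the PR.
final class CoalesceSketch {
    // A file split; a negative sizeBytes means the size is unknown (assumed convention).
    record FileSplit(String path, long sizeBytes) {}

    // Groups splits in arrival order: by accumulated size when every size is known,
    // by a fixed count per group otherwise.
    static List<List<FileSplit>> coalesce(List<FileSplit> splits, long targetBytes, int maxPerGroup) {
        boolean sizesKnown = splits.stream().allMatch(s -> s.sizeBytes() >= 0);
        List<List<FileSplit>> groups = new ArrayList<>();
        List<FileSplit> current = new ArrayList<>();
        long currentBytes = 0;
        for (FileSplit split : splits) {
            // Decide whether adding this split would overflow the current group.
            boolean full = sizesKnown
                ? currentBytes + split.sizeBytes() > targetBytes
                : current.size() >= maxPerGroup;
            // Close the current group, but never emit an empty one
            // (a single oversized split still gets its own group).
            if (full && !current.isEmpty()) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(split);
            currentBytes += Math.max(0, split.sizeBytes());
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Next-fit keeps files in discovery order and runs in a single pass, at the cost of looser packing than sorted variants such as first-fit decreasing. Which variant the PR uses is not stated in the description.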
Currently, each file discovered by `FileSplitProvider` becomes an
independent scheduling unit. For workloads with thousands of small
files (common with Iceberg micro-partitions), this creates excessive
per-split overhead in the slice queue, distribution strategy, and
operator lifecycle. Coalescing groups small splits into composite
units that are scheduled as one but expanded back into individual
file reads at operator execution time.
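The "scheduled as one, expanded at execution time" idea can be sketched as a composite type whose leaves are recovered recursively. The `Split`, `FileSplit`, and `Coalesced` types below are hypothetical stand-ins for the PR's `ExternalSplit` and `CoalescedSplit`, whose real fields and methods are not shown in this description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for ExternalSplit / CoalescedSplit; not the PR's actual classes.
final class SplitExpansionSketch {
    interface Split {}

    // A leaf: one file to read.
    record FileSplit(String path) implements Split {}

    // A composite: scheduled as a single unit, but made of child splits.
    record Coalesced(List<Split> children) implements Split {}

    // Flattens a split into its leaf file splits, preserving child order.
    static List<FileSplit> leaves(Split split) {
        List<FileSplit> out = new ArrayList<>();
        collect(split, out);
        return out;
    }

    private static void collect(Split split, List<FileSplit> out) {
        if (split instanceof Coalesced c) {
            // Recurse so that nested composites are handled too.
            for (Split child : c.children()) {
                collect(child, out);
            }
        } else if (split instanceof FileSplit f) {
            out.add(f);
        }
    }
}
```

Under this sketch the scheduler only ever sees the outer `Coalesced` object, while the operator calls `leaves(...)` and reads each file individually.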
Coalescing is applied in `ComputeService.discoverSplits()` after
`SplitDiscoveryPhase` completes, keeping coalescing cleanly separated
from the discovery logic.
Developed using AI-assisted tooling
Relates #143329