ESQL: Add split coalescing for many small files by costin · Pull Request #143335 · elastic/elasticsearch

costin · 2026-02-28T12:57:19Z

Reduces scheduling overhead when querying thousands of tiny files
(e.g. Iceberg micro-partitions) by grouping them into composite
CoalescedSplit units. Uses greedy bin-packing by size when file
sizes are known, count-based grouping otherwise.
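
A minimal sketch of the two grouping modes, for illustration only. The names `CoalescingSketch`, `FileSplit`, `packBySize`, `groupByCount`, and the parameters `targetBytes` and `perGroup` are hypothetical and not taken from the PR. It also assumes that an unknown file size is encoded as a negative value, and it uses first-fit-decreasing as one possible greedy packing strategy; the actual SplitCoalescer may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: names and parameters are illustrative, not the PR's code.
final class CoalescingSketch {
    /** One discovered file; sizeBytes < 0 means "size unknown" (an assumption). */
    record FileSplit(String path, long sizeBytes) {}

    /** Greedy first-fit-decreasing: largest file first, into the first bin with room. */
    static List<List<FileSplit>> packBySize(List<FileSplit> splits, long targetBytes) {
        List<FileSplit> sorted = new ArrayList<>(splits);
        sorted.sort((a, b) -> Long.compare(b.sizeBytes(), a.sizeBytes()));
        List<List<FileSplit>> bins = new ArrayList<>();
        List<Long> free = new ArrayList<>(); // remaining capacity per bin
        for (FileSplit s : sorted) {
            int target = -1;
            for (int i = 0; i < bins.size(); i++) {
                if (free.get(i) >= s.sizeBytes()) {
                    target = i;
                    break;
                }
            }
            if (target < 0) { // no bin has room (or the file is oversized): open a new bin
                bins.add(new ArrayList<>());
                free.add(targetBytes);
                target = bins.size() - 1;
            }
            bins.get(target).add(s);
            free.set(target, free.get(target) - s.sizeBytes());
        }
        return bins;
    }

    /** Fallback when sizes are unknown: fixed-size groups in discovery order. */
    static List<List<FileSplit>> groupByCount(List<FileSplit> splits, int perGroup) {
        List<List<FileSplit>> groups = new ArrayList<>();
        for (int i = 0; i < splits.size(); i += perGroup) {
            int end = Math.min(i + perGroup, splits.size());
            groups.add(new ArrayList<>(splits.subList(i, end)));
        }
        return groups;
    }

    /** Picks size-based packing only if every split reports a size. */
    static List<List<FileSplit>> coalesce(List<FileSplit> splits, long targetBytes, int perGroup) {
        boolean sizesKnown = splits.stream().allMatch(s -> s.sizeBytes() >= 0);
        return sizesKnown ? packBySize(splits, targetBytes) : groupByCount(splits, perGroup);
    }

    public static void main(String[] args) {
        List<FileSplit> known = List.of(
            new FileSplit("a", 60), new FileSplit("b", 50), new FileSplit("c", 40),
            new FileSplit("d", 30), new FileSplit("e", 20));
        System.out.println(coalesce(known, 100, 2));
    }
}
```

In this sketch a file larger than `targetBytes` simply ends up alone in its own group, so oversized files are never dropped.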

Currently, each file discovered by FileSplitProvider becomes an
independent scheduling unit. For workloads with thousands of small
files (common with Iceberg micro-partitions), this creates excessive
per-split overhead in the slice queue, distribution strategy, and
operator lifecycle. Coalescing groups small splits into composite
units that are scheduled as one but expanded back into individual
file reads at operator execution time.
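
The "scheduled as one, expanded at execution time" idea can be sketched roughly as follows. The types `Split`, `FileLeaf`, and `Coalesced` are hypothetical stand-ins, not the PR's actual ExternalSplit or CoalescedSplit classes, and the recursive walk is only one way to expand nested composites.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in types for illustration; the real classes in the PR differ.
final class SplitExpansionSketch {
    interface Split {}

    /** A single file to read. */
    record FileLeaf(String path) implements Split {}

    /** A composite that the scheduler sees as one unit. */
    record Coalesced(List<Split> children) implements Split {}

    /** Flattens a split into its leaf file reads, preserving child order. */
    static List<FileLeaf> leaves(Split split) {
        List<FileLeaf> out = new ArrayList<>();
        collect(split, out);
        return out;
    }

    private static void collect(Split split, List<FileLeaf> out) {
        if (split instanceof FileLeaf leaf) {
            out.add(leaf);
        } else if (split instanceof Coalesced composite) {
            // Recurse so that composites nested inside composites are expanded too.
            for (Split child : composite.children()) {
                collect(child, out);
            }
        }
    }

    public static void main(String[] args) {
        Split unit = new Coalesced(List.of(
            new FileLeaf("part-0.parquet"),
            new Coalesced(List.of(new FileLeaf("part-1.parquet"), new FileLeaf("part-2.parquet")))));
        // One scheduling unit, several file reads once expanded.
        System.out.println(leaves(unit));
    }
}
```

The scheduler in this sketch only ever handles the single `Coalesced` value; the per-file work reappears when the operator calls `leaves`.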

Coalescing is applied in ComputeService.discoverSplits() after
SplitDiscoveryPhase completes, which keeps it cleanly separated
from the discovery logic.

Developed using AI-assisted tooling

Relates #143329

elasticsearchmachine · 2026-02-28T12:57:43Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2026-02-28T12:58:05Z

Hi @costin, I've created a changelog YAML for you.

Reduces scheduling overhead when querying thousands of tiny files
(e.g. Iceberg micro-partitions) by grouping them into composite
CoalescedSplit units. Uses greedy bin-packing by size when file
sizes are known, count-based grouping otherwise.

- CoalescedSplit: composite ExternalSplit wrapping child splits
- SplitCoalescer: bin-packing logic with 32-split threshold
- ExternalSourceOperatorFactory: expand CoalescedSplit children
- AsyncExternalSourceOperatorFactory: recursive leaf expansion
- ComputeService: coalesce after split discovery
- EsqlPlugin: register CoalescedSplit NamedWriteable

Developed using AI-assisted tooling

…cations

* upstream/main: (60 commits)
  Use batches for other bulk vector benchmarks (elastic#143167)
  Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {csv-spec:lookup-join.MvJoinKeyOnTheLookupIndexAfterStats} elastic#143388
  Mute org.elasticsearch.snapshots.ConcurrentSnapshotsIT testBackToBackQueuedDeletes elastic#143387
  [Inference API] Parse endpoint metadata from persisted endpoints (elastic#143081)
  Add cluster formation doc to DistributedArchitectureGuide (elastic#143318)
  Fix flattened root block loader null expectation (elastic#143238)
  Unmute ValueSourceReaderTypeConversionTests testLoadAll (elastic#143189)
  ESQL: Add split coalescing for many small files (elastic#143335)
  Unmute mixed-cluster spatial parse warning test (elastic#143186)
  Fix zero-size estimate in BytesRefBlock null test (elastic#143258)
  Make DataType and DataFormat top-level enums (elastic#143312)
  Add support for steps to change the target index name for later steps (elastic#142955)
  Set mayContainDuplicates flag to test deduplication (elastic#143375)
  ESQL: Fix Driver search load millis as nanos bug (elastic#143267)
  Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {csv-spec:lookup-join.LookupJoinWithMixPushableAndUnpushableFilters} elastic#143378
  ESQL: Forbid MV_EXPAND before full text functions (elastic#143249)
  ESQL: Fix unresolved name pattern (elastic#143210)
  Implement boxplot queryDSL aggregation for exponential_histograms (elastic#143026)
  Add prefetching to x64 bulk vector implementations (elastic#142387)
  Make large segment vector tests resilient to memory constraints (elastic#143366)
  ...


costin added the >enhancement and :Analytics/ES|QL labels Feb 28, 2026

costin requested a review from bpintea February 28, 2026 12:57

elasticsearchmachine added the Team:Analytics and v9.4.0 labels Feb 28, 2026

bpintea approved these changes Mar 1, 2026


costin mentioned this pull request Mar 1, 2026

ESQL: Add split coalescing, cost-aware distribution, and file splitting for external sources #143329

Closed

3 tasks

costin force-pushed the ws-c/split-coalescing branch from 319724c to 13c023b Compare March 2, 2026 11:59

costin enabled auto-merge (squash) March 2, 2026 13:22

costin merged commit d6bcc06 into elastic:main Mar 2, 2026
35 checks passed

costin deleted the ws-c/split-coalescing branch March 2, 2026 13:30
