Workloads with many small files (Iceberg micro-partitions, Firehose delivery streams) create excessive scheduling overhead — one split per file means thousands of splits competing for a handful of drivers. Meanwhile, distribution strategies ignore file size entirely, leading to unbalanced node assignments where one node gets a few large files and another gets thousands of tiny ones. This work adds split coalescing, size-aware distribution, and sub-file splitting to make external source execution efficient across diverse file layouts.
Split coalescing — Group many small splits into composite units so the scheduler processes them in batches rather than individually, reducing per-split overhead without changing operator semantics.
Cost-aware distribution — Use file size metadata to balance split assignments across nodes, ensuring each node receives roughly equal total work instead of equal split counts.
File splitting — Enable sub-file parallelism by splitting large files into byte-range chunks, so a single oversized file does not become a sequential bottleneck.
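The split-coalescing idea above can be sketched as a greedy packer: sort splits by size and fill composite groups up to a target byte budget, so the scheduler hands out a few composites instead of thousands of tiny splits. This is an illustrative sketch, not the actual implementation; the `Split` class, `coalesce` function, and the target-size policy are assumptions for demonstration.

```python
from dataclasses import dataclass

@dataclass
class Split:
    path: str
    size: int  # bytes

def coalesce(splits, target_bytes):
    """Greedily pack splits into composite groups of at most target_bytes.

    Sorting small-first keeps tiny files together; a single split larger
    than the budget still gets its own group rather than being dropped.
    """
    groups, current, current_size = [], [], 0
    for s in sorted(splits, key=lambda s: s.size):
        if current and current_size + s.size > target_bytes:
            groups.append(current)          # flush the full composite
            current, current_size = [], 0
        current.append(s)
        current_size += s.size
    if current:
        groups.append(current)
    return groups

splits = [Split("a", 1), Split("b", 1), Split("c", 1), Split("d", 10)]
print(len(coalesce(splits, target_bytes=3)))  # → 2
```

With a 3-byte budget, the three 1-byte splits coalesce into one composite and the 10-byte split stays alone, so the scheduler sees 2 units instead of 4.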
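Cost-aware distribution can be illustrated with the classic longest-processing-time heuristic: place each split, largest first, on the node with the least total bytes so far. A min-heap keyed on assigned bytes makes each placement O(log n). The function name and node-numbering scheme here are hypothetical; real schedulers would also weigh locality and soft affinity.

```python
import heapq

def distribute(split_sizes, num_nodes):
    """Assign split sizes to nodes, balancing total bytes (greedy LPT)."""
    # Heap of (assigned_bytes, node_id); pop always yields the lightest node.
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    assignment = {node: [] for node in range(num_nodes)}
    for size in sorted(split_sizes, reverse=True):
        total, node = heapq.heappop(heap)
        assignment[node].append(size)
        heapq.heappush(heap, (total + size, node))
    return assignment

# One large file plus ten tiny ones: naive round-robin by count would give
# one node ~10x the bytes of the other; size-aware placement balances them.
result = distribute([10] + [1] * 10, num_nodes=2)
print([sum(v) for v in result.values()])  # → [10, 10]
```

By balancing bytes rather than split counts, the node holding the single large file is skipped until the other node catches up, which is exactly the imbalance the intro describes.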