Description
Is your feature request related to a problem or challenge?
Right now DataFusion's planning performance is a bottleneck for our application. We noticed that a non-negligible amount of work is done in get_statistics_with_limit even when statistics collection is disabled. In particular, the work done within multiunzip, which constructs the ColumnStatistics, looks unnecessary to us in that case. We would like to add some logic to skip it when collect_stat is disabled.
Describe the solution you'd like
In the list_files_for_scan method of ListingTable, we would like to choose how the list of PartitionedFiles is built based on the value of self.options.collect_stat.
So going from

    let (files, statistics) = get_statistics_with_limit(files, self.schema(), limit).await?;

to something like

    let (files, statistics) = match self.options.collect_stat {
        true => get_statistics_with_limit(files, self.schema(), limit).await?,
        false => get_files_with_unknown_stats(files, self.schema(), limit).await?,
    };

where get_files_with_unknown_stats avoids the call to multiunzip.
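To make the idea concrete, here is a rough sketch of what get_files_with_unknown_stats could look like. It is not DataFusion code. PartitionedFile, Precision, ColumnStatistics and Statistics below are simplified stand-ins for the real types. The function is synchronous, and it takes a column count instead of a schema and omits the limit argument for brevity. The only point is that the fast path collects the files and returns all-unknown statistics without doing any per-column aggregation.

```rust
// Simplified stand-ins for DataFusion's types; the real definitions differ.
#[derive(Debug, Clone, PartialEq)]
pub struct PartitionedFile {
    pub path: String,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Precision {
    Exact(usize),
    Absent,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStatistics {
    pub null_count: Precision,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Statistics {
    pub num_rows: Precision,
    pub column_statistics: Vec<ColumnStatistics>,
}

impl Statistics {
    /// All-unknown statistics for a table with `num_columns` columns.
    pub fn new_unknown(num_columns: usize) -> Self {
        Statistics {
            num_rows: Precision::Absent,
            column_statistics: vec![
                ColumnStatistics { null_count: Precision::Absent };
                num_columns
            ],
        }
    }
}

/// Hypothetical fast path: keep the files, skip per-column aggregation.
pub fn get_files_with_unknown_stats(
    files: impl IntoIterator<Item = PartitionedFile>,
    num_columns: usize,
) -> (Vec<PartitionedFile>, Statistics) {
    let files: Vec<PartitionedFile> = files.into_iter().collect();
    (files, Statistics::new_unknown(num_columns))
}
```

Calling get_files_with_unknown_stats(files, 3) in this toy model returns the input files unchanged together with a Statistics value whose row count and three column entries are all Absent.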
Describe alternatives you've considered
An alternative approach could be adding a parameter to get_statistics_with_limit for collect_stats and calling multiunzip based on that.
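For comparison, the flag-based alternative might look roughly like the following. Again this is a toy model, not the actual get_statistics_with_limit. FileStats and aggregate_null_counts are hypothetical names, and only null counts are tracked. It shows the shape of the change: a single function with an early return when collect_stats is false, instead of a second function.

```rust
/// Per-file statistics in this toy model: one optional null count per column.
pub struct FileStats {
    pub null_counts: Vec<Option<usize>>,
}

/// Hypothetical aggregation with a `collect_stats` flag.
/// Returns per-column null-count totals, with `None` meaning "unknown".
pub fn aggregate_null_counts(
    files: &[FileStats],
    num_columns: usize,
    collect_stats: bool,
) -> Vec<Option<usize>> {
    if !collect_stats {
        // Early return: no per-column work is done at all.
        return vec![None; num_columns];
    }
    let mut totals = vec![Some(0usize); num_columns];
    for file in files {
        for (total, count) in totals.iter_mut().zip(&file.null_counts) {
            // A single unknown count makes the whole column total unknown.
            *total = match (*total, count) {
                (Some(t), Some(c)) => Some(t + c),
                _ => None,
            };
        }
    }
    totals
}
```

The trade-off is mostly about API surface: the flag keeps one entry point but changes an existing signature, while the separate function leaves existing callers untouched.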
Additional context
Our application has low-latency requirements, and in our current setup DataFusion's planning performance is our bottleneck. We will eventually turn on statistics collection, but right now we can't, so we are looking to improve planning performance where we can.
Based on our internal benchmarks, physical planning performance improved by 16-43% after making the change described above. We have a lot of files, so the impact can be large; for a smaller number of files the impact will probably be smaller and could be negligible.