Improve performance of list_files_for_scan when not collecting statistics #9219

@matthewmturner

Is your feature request related to a problem or challenge?

Right now, DataFusion's planning performance is a bottleneck for our application. We noticed that a non-negligible amount of work is done in get_statistics_with_limit even when statistics collection is disabled. In particular, the work done within multiunzip, which constructs the ColumnStatistics, is unnecessary in that case. We would like to add some logic that skips this work when statistics collection is disabled.

Describe the solution you'd like

In the list_files_for_scan method of ListingTable, we would like to choose how the list of PartitionedFiles is built based on the value of self.options.collect_stat.

So going from

let (files, statistics) = get_statistics_with_limit(files, self.schema(), limit).await?;

to something like

let (files, statistics) = match self.options.collect_stat {
    true => get_statistics_with_limit(files, self.schema(), limit).await?,
    false => get_files_with_unknown_stats(files, self.schema(), limit).await?,
};

where get_files_with_unknown_stats avoids the call to multiunzip.
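To make the idea concrete, here is a minimal, dependency-free sketch of what such a helper could look like. Everything in it is an assumption for illustration: the `PartitionedFile`, `ColumnStatistics`, and `Statistics` types are simplified stand-ins rather than DataFusion's real definitions, the function is synchronous and takes a plain iterator instead of an async stream, and it takes a column count instead of a schema. It also omits the limit argument, on the assumption that a row limit cannot prune files when no row counts are known.

```rust
// Simplified stand-ins for illustration only; NOT DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
struct PartitionedFile {
    path: String,
}

#[derive(Debug, Clone, PartialEq)]
struct ColumnStatistics {
    // `None` means "unknown".
    null_count: Option<usize>,
}

#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    // `None` means "unknown".
    num_rows: Option<usize>,
    column_statistics: Vec<ColumnStatistics>,
}

impl Statistics {
    /// Statistics in which every value is unknown, one entry per column.
    fn new_unknown(num_columns: usize) -> Self {
        Statistics {
            num_rows: None,
            column_statistics: vec![ColumnStatistics { null_count: None }; num_columns],
        }
    }
}

/// Hypothetical helper: collect the files and pair them with unknown
/// statistics, without inspecting or merging any per-file statistics.
fn get_files_with_unknown_stats(
    files: impl IntoIterator<Item = PartitionedFile>,
    num_columns: usize,
) -> (Vec<PartitionedFile>, Statistics) {
    let files: Vec<PartitionedFile> = files.into_iter().collect();
    (files, Statistics::new_unknown(num_columns))
}
```

The point of the sketch is that the cost is a single pass to collect the file list plus one allocation for the placeholder statistics, independent of how many per-column values each file carries.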

Describe alternatives you've considered

An alternative approach could be adding a parameter to get_statistics_with_limit for collect_stats and calling multiunzip based on that.
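A rough sketch of this alternative follows, again with simplified stand-in types rather than DataFusion's real ones. The `FileStats` and `Summary` types, the synchronous signature, the use of a column count in place of a schema, and the rule of stopping once the accumulated row count reaches the limit are all assumptions made for illustration; the sketch uses a plain loop and does not reproduce the multiunzip call itself.

```rust
// Simplified stand-ins for illustration only; NOT DataFusion's real types.
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    num_rows: usize,
    null_counts: Vec<usize>,
}

#[derive(Debug, Clone, PartialEq)]
struct Summary {
    // `None` means "unknown".
    num_rows: Option<usize>,
    null_counts: Option<Vec<usize>>,
}

/// Hypothetical variant with an extra `collect_stats` flag.
/// When the flag is false, per-file statistics are never touched.
fn get_statistics_with_limit(
    files: Vec<(String, FileStats)>,
    num_columns: usize,
    limit: Option<usize>,
    collect_stats: bool,
) -> (Vec<String>, Summary) {
    if !collect_stats {
        // Fast path: keep every file and report unknown statistics.
        let paths = files.into_iter().map(|(path, _)| path).collect();
        return (paths, Summary { num_rows: None, null_counts: None });
    }

    let mut paths = Vec::new();
    let mut num_rows = 0usize;
    let mut null_counts = vec![0usize; num_columns];

    for (path, stats) in files {
        paths.push(path);
        num_rows += stats.num_rows;
        for (total, n) in null_counts.iter_mut().zip(stats.null_counts.iter()) {
            *total += *n;
        }
        // Assumed rule: stop once enough rows are covered to satisfy the limit.
        if let Some(l) = limit {
            if num_rows >= l {
                break;
            }
        }
    }

    (paths, Summary { num_rows: Some(num_rows), null_counts: Some(null_counts) })
}
```

Compared with a separate helper, this keeps a single entry point at the cost of one more parameter and a branch inside the function.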

Additional context

Our application has a low-latency requirement, and in our current setup DataFusion's planning performance is our bottleneck. We will eventually turn on statistics collection, but right now we can't, so we are looking to improve planning performance where we can.

Based on our internal benchmarks, physical planning performance improved by 16-43% after making the change described above. We have a lot of files, so the impact can be large; for a smaller number of files the impact will probably be smaller or could be negligible.

Labels: enhancement
