Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Suggested by @crepererum in #4169 (comment)
Some systems, such as IOx, store parquet files in a particular sort order and then use the fact that the data is sorted for a variety of sort-related optimizations.
Storing sorted data in parquet is often a key performance technique, as it "clusters" data in ways that can make predicate evaluation and other query techniques faster.
The BasicEnforcement rule added in #4122 by @mingmwang allows DataFusion to take advantage of known information about the sort order.
One contrived example: if your parquet file is sorted by `price` and your query is `select * from data order by price limit 10`, DataFusion can avoid scanning the entire file.
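To make the limit example concrete, here is a minimal sketch in plain Python (not DataFusion code). The function names `top_k_sorted` and `top_k_unsorted` are hypothetical and exist only for illustration. It assumes rows arrive as an iterator and counts how many rows each strategy has to look at.

```python
import heapq
from itertools import islice


def top_k_sorted(rows, k):
    """Input is already sorted by the ORDER BY key: read k rows and stop."""
    out = list(islice(rows, k))
    return out, len(out)  # rows scanned == rows returned


def top_k_unsorted(rows, k, key):
    """Input order is unknown: every row must be examined to find the top k."""
    scanned = 0

    def counting():
        nonlocal scanned
        for row in rows:
            scanned += 1
            yield row

    out = heapq.nsmallest(k, counting(), key=key)
    return out, scanned
```

With 1000 rows and `limit 10`, the sorted variant touches 10 rows while the unsorted variant touches all 1000. A real parquet scan works at row-group and page granularity, so the savings there would be coarser than in this row-level sketch.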
Another, more interesting example could be using the sort order to reorder pushdown filters, or using a sort-merge join without actually sorting.
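The sort-merge join case can be sketched the same way. This is an illustrative Python version, not DataFusion's operator, and `merge_join` is a hypothetical name. It assumes both inputs are in-memory lists already sorted by the join key, so the merge phase runs directly with no sort step in front of it.

```python
def merge_join(left, right, key):
    """Inner join of two lists that are ALREADY sorted by `key`."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append((left[i], r))
                i += 1
            j = j_end
    return out
```

If either input were not actually sorted, this would silently drop matches, which is why a planner can only skip the sort when the file's sort order is known and trusted.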
Describe the solution you'd like
- Expose `SortingColumn` when reading and writing parquet metadata (arrow-rs#3090)
- Detect and use this sorted information when creating a `ListingTable` that reads from parquet files
Describe alternatives you've considered
Don't do it
Additional context
Here is a ticket that tracks allowing users of DataFusion to manually specify the sort order: #4169