Closed
Description
Loading metadata from parquet files in the client can be costly; parsing the statistics and looking for ordering can also be costly.
- We should limit when we load metadata at all: read it from a global file if there is one, otherwise from the first data file
- If the pandas metadata says there is an index, or the user specifies an index, we need to parse the statistics for that column only; do not try to guess an index
- While parsing, we can know early that ordering won't work, so bail early
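
A minimal sketch of the "bail early" idea, assuming the per-file (or per-row-group) min/max statistics for the single index column have already been extracted as `(min, max)` pairs. The function name `divisions_from_stats` and the input shape are hypothetical, for illustration only, and are not an existing API:

```python
from typing import Any, Iterable, List, Optional, Tuple


def divisions_from_stats(
    stats: Iterable[Optional[Tuple[Any, Any]]],
) -> Optional[List[Any]]:
    """Build sorted divisions from (min, max) pairs of ONE index column.

    Hypothetical helper: returns None as soon as ordering cannot work,
    without consuming the remaining statistics.
    """
    divisions: List[Any] = []
    last_max = None
    for entry in stats:
        if entry is None:
            return None  # statistics missing for this piece: give up
        lo, hi = entry
        if lo is None or hi is None:
            return None  # incomplete statistics: give up
        if last_max is not None and lo < last_max:
            return None  # ranges overlap, so the data is not ordered: bail early
        divisions.append(lo)
        last_max = hi
    if last_max is None:
        return None  # no statistics at all
    divisions.append(last_max)  # close the final partition with its max
    return divisions


print(divisions_from_stats([(0, 9), (10, 19), (20, 29)]))
print(divisions_from_stats([(0, 9), (5, 19), (20, 29)]))
```

Because the input is consumed lazily, passing a generator means statistics after the first overlap are never parsed. Whether touching boundaries (one piece's max equal to the next piece's min) should count as ordered is a design choice; this sketch allows them.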
As stated, this implies that any filters will be applied at read time. We would not proactively cut down on partitions, so some partitions might come back empty, but that work is done on the worker.