This issue is similar to #8937 (already done) and #9043 in that it aims to remove an unnecessary (and rarely used) option from dd.read_parquet.
TL;DR: I propose that we deprecate the aggregate_files argument of dd.read_parquet.
Although I do believe there is strong motivation for a file-aggregation feature (especially for hive/directory-partitioned datasets), the current implementation of this feature actually aggregates row-groups (rather than files), which is extremely inefficient at scale (or on remote storage). This implementation "snafu" is my own fault. However, rather than directly changing the current behavior, I suggest that we simply remove the option altogether. Removing both aggregate_files and chunksize (see #9043) should allow us to cut out a lot of unnecessarily complex core/engine code and reduce the general maintenance burden.
In the future, we may wish to re-introduce an aggregate_files-like feature, but that (simpler) feature should be designed to aggregate full files (rather than arbitrary row-groups). Users (or downstream libraries) that need more flexibility than simple "per-row-group" or "per-file" partitioning should be able to feed their own custom logic into the new from_map API.