This issue is similar to #8937 (already done) and #9043 in that it aims to remove an unnecessary (and rarely used) option from dd.read_parquet.
TL;DR: I propose that we deprecate the aggregate_files argument of dd.read_parquet.
Although I do believe there is strong motivation for a file-aggregation feature (especially for hive/directory-partitioned datasets), the current implementation of this feature actually aggregates row-groups (rather than files), which is extremely inefficient at scale (or on remote storage). This implementation "snafu" is my own fault. However, rather than directly changing the current behavior, I suggest that we simply remove the option altogether. Removing both aggregate_files and chunksize (see #9043) should allow us to cut out a lot of unnecessarily complex core/engine code and reduce the general maintenance burden.
In the future, we may wish to re-introduce an aggregate_files-like feature, but that (simpler) feature should be designed to aggregate full files (rather than arbitrary row-groups). Users (or downstream libraries) that need more flexibility than simple "per-row-group" or "per-file" partitioning should be able to feed their own custom logic into the new from_map API.