Should we deprecate aggregate_files from read_parquet? #9051

@rjzamora

Description

This issue is similar to #8937 (already done) and #9043 in the sense that it aims to remove unnecessary (and rarely used) options from dd.read_parquet.

TL;DR: I’d like to propose that we deprecate the aggregate_files argument of dd.read_parquet.

Although I do believe there is strong motivation for a file-aggregation feature (especially for hive/directory-partitioned datasets), the current implementation actually aggregates row-groups (rather than files), which is extremely inefficient at scale and on remote storage. This implementation “snafu” is my own fault. However, rather than directly changing the current behavior, I suggest that we simply remove the option altogether. Removing both aggregate_files and chunksize (see #9043) should allow us to cut out a lot of unnecessarily complex core/engine code and reduce the general maintenance burden.
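The usual path here would be a deprecation warning for at least one release cycle before removal. A minimal sketch of what that could look like (the signature and message below are illustrative, not Dask's actual code):

```python
import warnings

# Hypothetical stand-in for dd.read_parquet; the real function has
# many more parameters. Shown only to illustrate the deprecation path.
def read_parquet(path, aggregate_files=None, **kwargs):
    if aggregate_files is not None:
        warnings.warn(
            "The `aggregate_files` argument is deprecated and will be "
            "removed in a future release.",
            FutureWarning,
        )
    ...  # existing read logic would continue unchanged
```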

In the future, we may wish to re-introduce an aggregate_files-like feature, but that (simpler) feature should be designed to aggregate full files (rather than arbitrary row-groups). Users (or downstream libraries) who need more flexibility than simple “per-row-group” or “per-file” partitioning should be able to feed their own custom logic into the new from_map API.

Metadata

Assignees: no one assigned

Labels: dataframe, io, needs attention, parquet
