Skip to content

[DISCUSSION] Column-based "repartitioning"/shuffling API in dask.dataframe #6133

@rjzamora

Description

@rjzamora

Background

The purpose of this issue is to align recent/ongoing API developement in dask_cudf with dask.dataframe. The dask.dataframe API currently provides two general mechanisms for moving data between partitions. These include:

  • repartition: Changes the partition boundaries without rearranging data within a given partition. Current options include divisions (change partition boundaries according to explicit index divisions), npartitions (split or concatenate each partition to change the total partition count), partition_size (split or concatenate each partition to reach a target byte size), and freq (used for datetime indices). In dask_cudf, there is an additional columns option, which effectively results in a hash-based shuffle operation (breaking with the convention of the other options to not rearrange data within partitions).

  • shuffle: Hash grouping of elements along an index. This does not preserve a meaningful divisions/partitioning, but does result in unique index values being in the same partition.

Discussion Question

I am wondering if there is interest in supporting the repartition(..., columns=) option in dask.dataframe, or if we should be removing the option from dask_cudf? I am aware that the same functionality can be achieved using shuffle(df, columns) from dask.dataframe.shuffle. However, the motivation for the API change in dask_cudf was to align with the simlar Spark API for "repartitioning".

Note that I am personally thinking that, due to the existing conventions in Dask, it probably makes the most sense to stick with the shuffle API for hash-based partitioning by column(s) (and to deprecate the repartition(..., columns=) option in dask_cudf). After all, this is Dask (and not Spark). However, if this is indeed the best path forward, is there any interest/support in adding a shuffle method to dask DataFrame to make the existing hash-based shuffling capability more visible to users?

cc @mrocklin @beckernick @VibhuJawa @ayushdg

Note: This question was originally discussed in #5741 and is the focus of the optimization in #6066

Metadata

Metadata

Assignees

No one assigned

    Labels

    dataframediscussionDiscussing a topic with no specific actions yet

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions