[DISCUSSION] Column-based "repartitioning"/shuffling API in dask.dataframe

**Background**

The purpose of this issue is to align recent/ongoing API developement in `dask_cudf` with `dask.dataframe`.  The `dask.dataframe` API currently provides two general mechanisms for moving data between partitions.  These include:

-  `repartition`:  Changes the partition boundaries *without rearranging data within a given partition*.  Current options include `divisions` (change partition boundaries according to explicit index divisions), `npartitions` (split or concatenate each partition to change the total partition count), `partition_size` (split or concatenate each partition to reach a target byte size), and `freq` (used for datetime indices).   In `dask_cudf`, there is an additional `columns` option, which effectively results in a hash-based shuffle operation (breaking with the convention of the other options to  **not** rearrange data within partitions).

- shuffle:  Hash grouping of elements along an `index`.  This does not preserve a meaningful divisions/partitioning, but does result in unique `index` values being in the same partition.

**Discussion Question**

I am wondering if there is interest in supporting the `repartition(..., columns=)` option in `dask.dataframe`, or if we should be removing the option from `dask_cudf`?   I am aware that the same functionality can be achieved using `shuffle(df, columns)` from `dask.dataframe.shuffle`.  However, the motivation for the API change in `dask_cudf` was to align with the simlar [Spark API for "repartitioning"](https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/dataframe.html#DataFrame.repartition).

Note that I am personally thinking that, due to the existing conventions in Dask, it probably makes the most sense to stick with the `shuffle` API for hash-based partitioning by column(s) (and to deprecate the `repartition(..., columns=)`  option in `dask_cudf`).  After all, this is Dask (and not Spark).  However, if this is indeed the best path forward, is there any interest/support in adding a `shuffle` method to dask `DataFrame` to make the existing hash-based shuffling capability more *visible* to users?

cc @mrocklin @beckernick @VibhuJawa @ayushdg 

*Note*:  This question was originally discussed in [#5741](https://github.com/dask/dask/pull/5741#issuecomment-608582321) and is the focus of the optimization in #6066 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[DISCUSSION] Column-based "repartitioning"/shuffling API in dask.dataframe #6133

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[DISCUSSION] Column-based "repartitioning"/shuffling API in dask.dataframe #6133

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions