Background
The purpose of this issue is to align recent/ongoing API development in dask_cudf with dask.dataframe. The dask.dataframe API currently provides two general mechanisms for moving data between partitions. These include:
- `repartition`: Changes the partition boundaries without rearranging data within a given partition. Current options include `divisions` (change partition boundaries according to explicit index divisions), `npartitions` (split or concatenate each partition to change the total partition count), `partition_size` (split or concatenate each partition to reach a target byte size), and `freq` (used for datetime indices). In `dask_cudf`, there is an additional `columns` option, which effectively results in a hash-based shuffle operation (breaking with the convention of the other options to not rearrange data within partitions).
- `shuffle`: Hash grouping of elements along an index. This does not preserve meaningful divisions/partitioning, but does result in unique `index` values being in the same partition.
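To make the distinction concrete, here is a minimal plain-Python sketch (not the dask implementation; partition contents are just lists of `(key, value)` rows, and the helper names are hypothetical) contrasting the two mechanisms: `repartition`-style concatenation never moves a row relative to its neighbors, while a hash-based shuffle reassigns every row by key so that all rows sharing a key land in the same output partition.

```python
# Hypothetical illustration of the two mechanisms; partitions are lists of rows.
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]

def repartition_npartitions(parts, npartitions):
    """Concatenate adjacent partitions to reach a target count.
    Rows never move relative to one another (like repartition(npartitions=...))."""
    step = -(-len(parts) // npartitions)  # ceil division
    return [sum(parts[i:i + step], []) for i in range(0, len(parts), step)]

def hash_shuffle(parts, npartitions):
    """Assign every row to an output partition by hashing its key
    (like a column-based shuffle); rows sharing a key are colocated."""
    out = [[] for _ in range(npartitions)]
    for part in parts:
        for key, value in part:
            out[hash(key) % npartitions].append((key, value))
    return out

merged = repartition_npartitions(partitions, 2)
shuffled = hash_shuffle(partitions, 2)

# Repartitioning preserves the original row order within each output partition.
assert merged[0] == [("a", 1), ("b", 2), ("a", 3), ("c", 4)]

# After shuffling, each key appears in exactly one partition.
for key in "abc":
    holders = [i for i, p in enumerate(shuffled) if any(k == key for k, _ in p)]
    assert len(holders) == 1
```

Note that dask's real shuffle uses deterministic hashing of column values (not Python's builtin `hash`) and supports disk- and task-based backends; the sketch only shows why a shuffle cannot preserve divisions while a plain repartition can.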
Discussion Question
I am wondering if there is interest in supporting the repartition(..., columns=) option in dask.dataframe, or if we should be removing the option from dask_cudf? I am aware that the same functionality can be achieved using shuffle(df, columns) from dask.dataframe.shuffle. However, the motivation for the API change in dask_cudf was to align with the similar Spark API for "repartitioning".
Note that I am personally thinking that, due to the existing conventions in Dask, it probably makes the most sense to stick with the shuffle API for hash-based partitioning by column(s) (and to deprecate the repartition(..., columns=) option in dask_cudf). After all, this is Dask (and not Spark). However, if this is indeed the best path forward, is there any interest/support in adding a shuffle method to dask DataFrame to make the existing hash-based shuffling capability more visible to users?
cc @mrocklin @beckernick @VibhuJawa @ayushdg
Note: This question was originally discussed in #5741 and is the focus of the optimization in #6066