Let's use this issue to coordinate some ongoing efforts to improve client/scheduler graph performance related to large-scale shuffle operations.
In order to rearrange data between partitions in dask.dataframe (for parallel merge/sort/shuffle routines), the rearrange_by_column_tasks routine is used to build a task graph for a staged shuffle. Since the number of tasks in this graph scales as O(n log n) in the number of partitions, the time required for graph creation and execution itself can be quite significant.
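To get a feel for that scaling, here is a rough back-of-the-envelope model of the staged-shuffle task count (this is an illustrative sketch, not the actual accounting inside rearrange_by_column_tasks; the per-stage task breakdown and the `max_branch` default are assumptions):

```python
import math

def approx_shuffle_tasks(npartitions, max_branch=32):
    """Rough task-count model for a staged shuffle (illustrative only).

    With branching factor k per stage, roughly log_k(npartitions)
    stages are needed.  We assume each stage emits about one
    shuffle_group task, k getitem tasks, and one concat task per
    partition.
    """
    stages = max(1, math.ceil(math.log(npartitions) / math.log(max_branch)))
    k = math.ceil(npartitions ** (1 / stages))
    tasks_per_stage = npartitions * (1 + k + 1)  # shuffle_group + getitems + concat
    return stages, stages * tasks_per_stage

# Example: 10,000 partitions with an assumed branching factor of 32
stages, total = approx_shuffle_tasks(10_000)
print(stages, total)
```

Even with hand-wavy constants, the point stands: at ten thousand partitions the graph runs to hundreds of thousands of tasks, so per-task client/scheduler overhead dominates.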
Note that a detailed explanation of a nearly identical "staged shuffle" is described in this discussion. One component of the algorithm that clearly dominates the size of the graph is the repetition of shuffle_group tasks (which output dictionaries of pd/cudf DataFrame objects) and getitem tasks (which select elements of the shuffle-group output). It is my understanding that some people may have promising ideas to improve performance here.
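For anyone less familiar with that part of the graph, the shuffle_group/getitem pattern looks roughly like the following pandas sketch (a simplified stand-in, not dask's actual shuffle_group implementation; the hashing scheme here is assumed for illustration):

```python
import pandas as pd

def shuffle_group(df, col, k):
    # Split one partition into k groups keyed 0..k-1 by hashing the
    # shuffle column -- mirrors the dict-of-DataFrames output described
    # above (simplified sketch, not dask's real shuffle_group).
    codes = pd.util.hash_pandas_object(df[col], index=False) % k
    return {i: df[codes == i] for i in range(k)}

df = pd.DataFrame({"id": range(8), "x": range(8)})
groups = shuffle_group(df, "id", 4)  # one shuffle_group task per partition
part0 = groups[0]                    # one getitem task per output group
print(sorted(groups))
```

Because every partition emits one such dict per stage, and every downstream partition then issues a getitem against it, these two task types are repeated n × k times per stage, which is exactly where the graph bloat comes from.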
cc @kkraus14 @quasiben @mrocklin (Please do cc others as well.)