Skip to content

[DISCUSS] Improve client/scheduler performance during shuffling #6163

@rjzamora

Description

@rjzamora

Lets use this issue to coordinate some ongoing efforts to improve client/scheduler graph performance related to large-scale shuffle operations.

In order to rearrange data between partitions in dask.dataframe (for parallel merge/sort/shuffle routines), the rearrange_by_column_tasks routine is used to build a task graph for staged shuffling. Since this logic represents n log(n) scaling, the time required for graph creation and execution itself can be quite significant.

Note that a detailed explanation of a nearly identical "staged shuffle" is described in this discussion. One component of the algorithm that is clearly dominating the size of the graph is the repetition of shuffle_group tasks (which output dictionaries of pd/cudf DataFrame objects) and getitem tasks (which select elements of the shuffle-group output). It is my understanding that some people may have promising ideas to improve performance here.

cc @kkraus14 @quasiben @mrocklin (Please do cc others as well..)

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions