Let's use this issue to coordinate some ongoing efforts to improve client/scheduler graph performance related to large-scale shuffle operations.
In order to rearrange data between partitions in dask.dataframe (for parallel merge/sort/shuffle routines), the rearrange_by_column_tasks routine is used to build a task graph for a staged shuffle. Since the number of tasks in this graph scales as O(n log n) in the number of partitions, the time required for graph creation and execution itself can be quite significant.
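To get a feel for that scaling, here is a rough back-of-the-envelope model of the staged-shuffle task count (this is an illustrative sketch, not the actual accounting inside rearrange_by_column_tasks; the per-stage task breakdown and the `max_branch` default are assumptions):

```python
import math

def approx_shuffle_tasks(npartitions, max_branch=32):
    """Rough task-count model for a staged shuffle (illustrative only).

    With branching factor k per stage, roughly log_k(npartitions)
    stages are needed.  We assume each stage emits about one
    shuffle_group task, k getitem tasks, and one concat task per
    partition.
    """
    stages = max(1, math.ceil(math.log(npartitions) / math.log(max_branch)))
    k = math.ceil(npartitions ** (1 / stages))
    tasks_per_stage = npartitions * (1 + k + 1)  # shuffle_group + getitems + concat
    return stages, stages * tasks_per_stage

# Example: 10,000 partitions with an assumed branching factor of 32
stages, total = approx_shuffle_tasks(10_000)
print(stages, total)
```

Even with hand-wavy constants, the point stands: at ten thousand partitions the graph runs to hundreds of thousands of tasks, so per-task client/scheduler overhead dominates.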
Note that a detailed explanation of a nearly identical "staged shuffle" is described in this discussion. One component of the algorithm that clearly dominates the size of the graph is the repetition of shuffle_group tasks (which output dictionaries of pd/cudf DataFrame objects) and getitem tasks (which select elements of the shuffle-group output). It is my understanding that some people may have promising ideas to improve performance here.
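For anyone less familiar with that part of the graph, the shuffle_group/getitem pattern looks roughly like the following pandas sketch (a simplified stand-in, not dask's actual shuffle_group implementation; the hashing scheme here is assumed for illustration):

```python
import pandas as pd

def shuffle_group(df, col, k):
    # Split one partition into k groups keyed 0..k-1 by hashing the
    # shuffle column -- mirrors the dict-of-DataFrames output described
    # above (simplified sketch, not dask's real shuffle_group).
    codes = pd.util.hash_pandas_object(df[col], index=False) % k
    return {i: df[codes == i] for i in range(k)}

df = pd.DataFrame({"id": range(8), "x": range(8)})
groups = shuffle_group(df, "id", 4)  # one shuffle_group task per partition
part0 = groups[0]                    # one getitem task per output group
print(sorted(groups))
```

Because every partition emits one such dict per stage, and every downstream partition then issues a getitem against it, these two task types are repeated n × k times per stage, which is exactly where the graph bloat comes from.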
cc @kkraus14 @quasiben @mrocklin (Please do cc others as well.)