Skip to content

When using a local cluster, shuffle with disk #5502

@gjoseph92

Description

@gjoseph92

In feedback for dask/dask#8223, I've noticed users trying to do large shuffles on single-machine clusters: dask/dask#8223 (comment), dask/dask#8294 (comment).

When creating any default Client, the default shuffle mode automatically gets set to "tasks":

if set_as_default:
self._set_config = dask.config.set(
scheduler="dask.distributed", shuffle="tasks"
)

However, on a single machine, a disk-based shuffle is likely to be a lot more efficient, plus much lower load on the scheduler.

I think it would be better to keep using the disk-based shuffle if the Client is connected to a LocalCluster (not sure how to tell this). Most users don't know about the different shuffle modes, and shouldn't have to.

  • Client detects whether or not it is connected to a LocalCluster and sets the default shuffle config to disk
  • The detection should be based on detecting specifically LocalCluster, not on IP ranges or other means
  • Deprecation should be announced via documentation or a proper warning
  • Benchmark should exist verifying this is faster

Note:

  • This could be resolved at graph construction time when using HLG or HLE since the graph materialization is delayed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions