Open
Description
In feedback for dask/dask#8223, I've noticed users trying to do large shuffles on single-machine clusters: dask/dask#8223 (comment), dask/dask#8294 (comment).
When creating any default Client, the default shuffle mode automatically gets set to "tasks":
distributed/distributed/client.py, lines 727 to 730 in 69814b4:

    if set_as_default:
        self._set_config = dask.config.set(
            scheduler="dask.distributed", shuffle="tasks"
        )

However, on a single machine, a disk-based shuffle is likely to be a lot more efficient and to put much less load on the scheduler.
I think it would be better to keep using the disk-based shuffle if the Client is connected to a LocalCluster (not sure how to tell this). Most users don't know about the different shuffle modes, and shouldn't have to.
- `Client` detects whether or not it is connected to a `LocalCluster` and sets the default shuffle config to disk
- The detection should be based on detecting specifically `LocalCluster`, not on IP ranges or other means
- Deprecation should be announced via documentation or a proper warning
- A benchmark should exist verifying that this is faster
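
A rough sketch of the detection described above, to make the proposal concrete. `pick_default_shuffle` is a hypothetical helper, not something that exists in distributed, and the `Fake*` classes are stand-ins so the snippet runs without dask installed. It only does a type check on the cluster object, with no IP-range heuristics:

```python
def pick_default_shuffle(cluster, local_cluster_type):
    """Hypothetical helper: choose a default shuffle mode for a Client.

    Returns "disk" only when `cluster` is an instance of
    `local_cluster_type`; anything else (including None, i.e. no
    cluster object available) falls back to "tasks".
    """
    if cluster is not None and isinstance(cluster, local_cluster_type):
        return "disk"
    return "tasks"


# Stand-in classes (assumptions for illustration, not real distributed types)
class FakeLocalCluster:
    pass


class FakeRemoteCluster:
    pass


print(pick_default_shuffle(FakeLocalCluster(), FakeLocalCluster))
print(pick_default_shuffle(FakeRemoteCluster(), FakeLocalCluster))
print(pick_default_shuffle(None, FakeLocalCluster))
```

In `Client` this would presumably be called with something like `self.cluster` and `distributed.LocalCluster`, with the result passed to `dask.config.set(scheduler="dask.distributed", shuffle=...)` in place of the hard-coded `"tasks"`. That assumes the client still holds a reference to the cluster object it was created with, which may not be the case when connecting by scheduler address.
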
Note:
- This could be resolved at graph construction time when using HLG or HLE since the graph materialization is delayed.