
[Never Merge] Prototype for scalable dataframe shuffle #8223

Draft

gjoseph92 wants to merge 7 commits into dask:main from gjoseph92:shuffle_service

Conversation

@gjoseph92 (Collaborator) commented Oct 5, 2021

This is a duplicate of #8209, but I'm taking over from @fjetter since I may want to occasionally push small fixes here. Original message copied below, with edits.

This is a prototype implementation of a new dask.dataframe shuffle algorithm that runs on a dask.distributed cluster. Unlike the task-based shuffle algorithm, it uses an out-of-band communication and administration approach to circumvent scheduler-imposed bottlenecks.

How to try / Feedback requested

Install this branch of dask/dask (pip install -U git+https://github.com/gjoseph92/dask@shuffle_service) and run a shuffle workload (set_index, groupby, etc.), passing the keyword argument shuffle="service". Until distributed 2021.10.0 or later is released, you'll also need to install dask/distributed from main (pip install -U git+https://github.com/dask/distributed).
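For reference, the intended call pattern looks like this (an illustrative sketch, not runnable without this branch and a distributed cluster; the example DataFrame is arbitrary):

```python
import dask
import dask.dataframe as dd
from distributed import Client

client = Client()  # or connect to an existing cluster

df = dask.datasets.timeseries()  # any dask DataFrame
shuffled = df.set_index("id", shuffle="service")  # opt in to the prototype
```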

With this PR, we've been able to easily run shuffles that currently crash the cluster. Additionally, since this writes intermediate data to disk, you can shuffle larger-than-memory DataFrames. Note that the data written to disk won't show up as spilled-to-disk on the dashboard. Similarly, you'll see high unmanaged memory on workers while the shuffle is running.

As a rule of thumb:

  • total cluster disk space needs to fit the full dataset
  • worker RAM needs to fit ~2GiB + partition_size * nthreads * ~5x safety factor?

Additionally, more threads don't improve performance much (since everything is GIL-bound), so we recommend 2 threads unless other parts of your workload require more.
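As a worked example of the RAM rule of thumb above (the partition size here is an assumption for illustration; the ~5x factor is our rough safety margin):

```python
# Back-of-envelope worker RAM estimate for the shuffle.
GiB = 2**30
MiB = 2**20

base_overhead = 2 * GiB        # ~2 GiB fixed overhead per worker
partition_size = 256 * MiB     # assumed input partition size
nthreads = 2                   # the recommended thread count
safety_factor = 5              # ~5x safety factor

ram_needed = base_overhead + partition_size * nthreads * safety_factor
print(f"~{ram_needed / GiB:.1f} GiB of worker RAM")  # ~4.5 GiB
```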

We would love for you to try this out and report back to us, especially if you have large datasets sitting around. This implementation targets large-scale data processing, and feedback from real workloads is what we need most. If you encounter any stability or performance issues, please open a dedicated ticket and link to this PR so we can keep discussions organized.

⚠️ Warnings ⚠️

This is experimental. We do not expect this PR to ever be merged. Instead, we'll take ideas (and feedback) from this PR into a different one that's better-designed, stable, and maintainable.

With that explained, here are things to look out for:

  • Doesn't work for merge yet, because that requires multiple simultaneous shuffles
  • Requires distributed>=2021.10.0 which doesn't exist yet, so until it does, you need to install distributed from main
  • All workers must have >= 2 threads
  • The cluster must have enough total disk space to hold the entire dataset (but can have much less RAM than that)
  • If a worker runs out of disk space, the whole shuffle will error
  • Workers sometimes run out of memory and die randomly during the transfer phase
  • If a worker dies during the transfer phase, the cluster will typically deadlock for 15 minutes (distributed's 300 s connect timeout × 3 retries), then the task will error
  • If a worker dies during the unpack phase, the cluster will deadlock indefinitely
  • Multiple shuffles at the same time will fail in strange ways. Running a shuffle more than once on a cluster without restarting it could possibly behave oddly too.
  • Mostly tested on synthetic data from dask.datasets.timeseries. Real data with uneven distributions and input partition sizes may behave poorly.
  • It's slower than it should be and mostly GIL-bound (though probably still faster than a standard task-based shuffle); see the discussions in distributed#5258 ("Testing network performance") and pandas-dev/pandas#43155 ("Dask shuffle performance help")
  • Remember to pass shuffle="service"!

Reviews

For all who are brave enough to review this, I would encourage only a high-level pass. There are many moving parts and many open TODOs. We're discussing breaking off some parts of the implementation to allow for easier review (or moving some parts to dask/distributed). This is still TBD, but suggestions are welcome.

High level design

The concurrency model driving this is rather complex: it is made of multiple coroutines and threads that deal with grouping, concatenating, sending, and receiving data. The process is kicked off in the transfer task, which is applied to every input partition. This allows computation and network transfer to overlap efficiently. Data is buffered so that the network overhead of sending many small chunks of data (shards) is minimal.
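To make the grouping step concrete, here is a stdlib-only sketch of how the transfer task might split one input partition into per-destination shards. The function name and the row representation are hypothetical, not the actual implementation (which operates on pandas DataFrames):

```python
from collections import defaultdict

def split_into_shards(rows, key, npartitions):
    """Group the rows of one input partition by their output partition."""
    shards = defaultdict(list)
    for row in rows:
        dest = hash(row[key]) % npartitions  # output partition owning this row
        shards[dest].append(row)
    # Each shard is then buffered and sent out-of-band to the worker that
    # owns its output partition, bypassing the scheduler.
    return dict(shards)

rows = [{"id": i, "x": i * 10} for i in range(6)]
shards = split_into_shards(rows, "id", 2)
assert sorted(shards) == [0, 1]
```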

The receiving end of these submissions is a small extension of the Worker that accepts incoming data and caches it (on disk, see below) for later processing. The task graph currently employs a barrier task for synchronization and buffer flushing. The output partitions are then picked up by the unpack task, which collects the data stored on the given worker and extracts it into a runnable task. From there on, everything is business as usual.

To enable shuffles of datasets larger than (cluster) memory, there is an efficient spill-to-disk implementation that caches all received shards on disk while the shuffle is running. This is currently not optional. There is currently no persistence hierarchy implemented, as is usual for a Worker holding data.
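A minimal stdlib-only sketch of such an append-only on-disk shard cache, in the spirit of (not copied from) the prototype's spill implementation. Class and method names here are hypothetical:

```python
import os
import pickle
import tempfile

class DiskShardCache:
    """Append received shards to one file per output partition."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, partition_id):
        return os.path.join(self.directory, f"shards-{partition_id}.pkl")

    def append(self, partition_id, shard):
        # Appending keeps received shards out of RAM until the unpack phase.
        with open(self._path(partition_id), "ab") as f:
            pickle.dump(shard, f)

    def read_all(self, partition_id):
        # Read back every shard appended for one output partition.
        shards = []
        with open(self._path(partition_id), "rb") as f:
            while True:
                try:
                    shards.append(pickle.load(f))
                except EOFError:
                    return shards

with tempfile.TemporaryDirectory() as tmp:
    cache = DiskShardCache(tmp)
    cache.append(0, [1, 2])
    cache.append(0, [3])
    assert cache.read_all(0) == [[1, 2], [3]]
```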

References

cc @mrocklin , @gjoseph92 , @quasiben , @madsbk , ...?

@FredericOdermatt (Contributor) commented:
I am working with 450 million rows and have to set a new index for a group-by operation on that dataset. The original data is 23 GiB saved as parquet files, and the .set_index() operation was eating all 390 GiB of available RAM on the local multi-CPU cluster I am using. After switching to this PR and running with shuffle="service", the shuffle of the 450 million rows (41 million unique values in the new index) completed in 30 minutes on 30 process workers while never using more than 50 GiB of RAM. My review of this PR is therefore very positive: I just saw it, applied it, and it worked great for me.

@gjoseph92 (Collaborator, Author) replied:
Great to hear, thanks @FredericOdermatt! I'm curious, did you ever try shuffle="disk"? This PR is primarily designed for multi-machine clusters, so on a single machine, the disk-based shuffle should work similarly to this one. I'm curious if one works better than the other.

@bsesar commented Dec 1, 2021:

> Great to hear, thanks @FredericOdermatt! I'm curious, did you ever try shuffle="disk"? This PR is primarily designed for multi-machine clusters, so on a single machine, the disk-based shuffle should work similarly to this one. I'm curious if one works better than the other.

@gjoseph92, I have found that merging is much slower when using shuffle='disk' (#5554).

@terramars commented:
I tried this while trying to parquetize the output of the sort, and got:

```
distributed.worker - WARNING - Compute Failed
Function: unpack
args: (<dask.dataframe.shuffle_service.ShuffleService object at 0x7fb5bf082dd0>, 0, None)
kwargs: {}
Exception: 'AttributeError("'ShuffleService' object has no attribute 'retrieve_futures'")'
```
