
[REVIEW] Generalize rearrange_by_column_tasks and add DataFrame.shuffle#6066

Merged
TomAugspurger merged 20 commits into dask:master from rjzamora:hash-join
May 26, 2020

Conversation

@rjzamora
Member

@rjzamora rjzamora commented Apr 4, 2020

Following the discussion in #5741 (particularly in/after this comment), this draft PR generalizes the rearrange_by_column_tasks implementation to handle hash-based shuffling without the existence of a "_partitions" column (including support for multiple shuffle-index columns).

TODO:

  • Include benchmarking numbers for both cudf and pandas-backed operations

  • Tests added / passed

  • Passes black dask / flake8 dask

@mrocklin
Member

It looks like this hasn't received any review over the past few weeks. My apologies @rjzamora. Is this still live?

@rjzamora
Member Author

> It looks like this hasn't received any review over the past few weeks. My apologies @rjzamora. Is this still live?

Thanks for the ping @mrocklin - I have been meaning to revisit and revise this.

Overall, it would be wonderful to have a generalized rearrange_by_column_tasks that will actually handle more than one column (and simply perform a hash in both shuffle_group and shuffle_group_2 in lieu of using a "_partitions") column. I feel that dask_cudf is currently duplicating code just to avoid creating/assigning the "_partitions" column. We would definitely prefer to use upstream-Dask for all shuffling. Although the simplicity of the single-column routine is likely best for pandas-backed DataFrame shuffling, we are really hoping to introduce a minimal amount of flexibility here. Since device memory is a valuable resource, we are doing our best to shed every byte we can (to reduce the amount of data we need to spill to host memory).

It would also be nice to introduce a mechanism to use the above memory optimization in the merge code path. However, I certainly understand if you would rather dask_cudf implement/maintain its own wrapper to avoid "_partitions" creation.

@rjzamora rjzamora marked this pull request as ready for review April 27, 2020 16:06
@rjzamora
Member Author

Following the comments in #6133, I decided to add the DataFrame.shuffle API in this PR. (cc @TomAugspurger)

In order to allow dask_cudf to minimize the cost of the "_partitions" column creation/assignment (for workflows with high memory pressure), I added the shuffle_dtype argument in a few places. This argument allows the user to explicitly specify the type of the "_partitions" column. If this argument is set to False, "_partitions" will not be created at all (and hashing will be performed within all local shuffle-group operations).
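To make the trade-off concrete, here is a minimal sketch of the two modes being described. This is purely illustrative (assign_partitions is a hypothetical helper, not dask's implementation); the "_partitions" path materializes an extra column, while the hash-in-place path computes partition ids without ever attaching them to the frame:

```python
import pandas as pd

# Hypothetical sketch (not dask's actual code): assigning rows to output
# partitions by hashing the shuffle column(s).
def assign_partitions(df, cols, npartitions, materialize=True):
    # Hash the shuffle columns to a per-row partition id
    part = pd.util.hash_pandas_object(df[cols], index=False) % npartitions
    if materialize:
        # "_partitions"-column approach: store the ids as an extra column
        # (shuffle_dtype would control the dtype used here)
        return df.assign(_partitions=part.astype("uint16"))
    # shuffle_dtype=False-style approach: use the ids directly,
    # without adding a column to the frame
    return {i: df[part.values == i] for i in range(npartitions)}
```

The memory saving comes from never holding the extra column alongside the data being shuffled.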


@TomAugspurger TomAugspurger left a comment


Going through the shuffle_dtype changes now. It seems like an implementation detail that's leaking through to the public API. Is there any reason to not use shuffle_dtype=False when using hash-based partitioning?

""" Rearrange DataFrame into new partitions by index

Uses hashing to map rows to output partitions. After this operation,
rows with the same index element(s) will be in the same partition.

It's implied, but maybe mention that the result will have unknown divisions?


Ah, I see you mention that in the Notes. That's fine too.

@rjzamora
Member Author

Thanks for reviewing Tom!

> Is there any reason to not use shuffle_dtype=False when using hash-based partitioning?

There may be cases in which it is more performant to create the "_partitions" column, but I haven't encountered one. It seems likely that the "_partitions"-based workflow was used so the shuffle algorithm could be used cleanly with set_index. In my experience, avoiding the additional column is a significant memory and communication benefit (hence my motivation for this PR).

Simple benchmark as motivation:

from dask.distributed import LocalCluster, Client, wait
from dask.datasets import timeseries

cluster = LocalCluster(n_workers=8)
client = Client(cluster)
ddf = timeseries(start='2000-01-01', end='2000-12-31', partition_freq='1d')

# Default
%timeit wait(ddf.shuffle("id", shuffle="tasks", shuffle_dtype=None).persist())

# Minimal dtype
%timeit wait(ddf.shuffle("id", shuffle="tasks", shuffle_dtype="uint16").persist())

# No "_partitions"
%timeit wait(ddf.shuffle("id", shuffle="tasks", shuffle_dtype=False).persist())

Default

18.4 s ± 241 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Minimal dtype

19 s ± 247 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

No "_partitions"

16.8 s ± 124 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I am happy to avoid the creation of "_partitions" (whenever possible) if it seems reasonable to others :)

@TomAugspurger
Member

Thanks. I noticed that there's a decent amount of variability in the size of the hashed partitions:

[figure: rows per output partition]

Some of that is expected, since the id column isn't uniform. But there's perhaps more unevenness than I'd expect. For example, partition 214 has values from id = [983, 982, 1046].

@rjzamora
Member Author

> Some of that is expected, since the id column isn't uniform. But there's perhaps more unevenness than I'd expect. For example, partition 214 has values from id = [983, 982, 1046].

Good observation @TomAugspurger - We definitely do not get balanced partitions when the number of uniques in the on column(s) is not much larger than the number of output partitions. In the example above, the number of output partitions is likely to be in the same ballpark as the number of unique values in "id":

print(ddf.npartitions, len(ddf["id"].unique()))

Output: 365 321 (Of course, the second number will vary)

For this reason, I think it makes sense that the output will be unbalanced. You will see much better balance if you do something like:

ddf = timeseries(start='2000-01-01', end='2000-12-31', partition_freq='1d', id_lam=100_000_000)

Perhaps we should add a note to the DataFrame.shuffle docstring that the index must have many uniques to produce balanced partitions?
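The effect being discussed is easy to reproduce without dask at all. The sketch below (bucket_cv is a hypothetical helper, purely for illustration) hashes synthetic keys into 365 buckets and compares the relative spread of bucket sizes when the unique-key count is near, versus far above, the partition count:

```python
import pandas as pd

# Illustration only (not dask code): hash-bucket balance depends on how
# many unique keys feed the buckets.
def bucket_cv(nuniques, npartitions, rows_per_key=100):
    keys = pd.Series(range(nuniques)).repeat(rows_per_key)
    part = pd.util.hash_pandas_object(keys, index=False) % npartitions
    sizes = part.value_counts()
    # Coefficient of variation: relative spread of partition sizes
    return sizes.std() / sizes.mean()

# uniques ~ npartitions: very uneven; uniques >> npartitions: nearly even
print(bucket_cv(400, 365), bucket_cv(100_000, 365))
```

With ~400 uniques, each bucket receives a whole number of 100-row key groups, so sizes jump in large steps; with 100,000 uniques the per-bucket key counts average out.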


@fjetter fjetter left a comment


I agree with @TomAugspurger's remark about shuffle_dtype being part of the public API. If it is included in the public API, there should also be a note about which scenarios would benefit from creating the column. I'm struggling to come up with a scenario that would benefit, and if it is very rare, I would suggest not exposing this parameter (internally it's fine if it makes a difference, e.g. for set_index).

@TomAugspurger the non-uniformity of the buckets is something which didn't change here, did it? The hash bucketing logic is effectively the same, isn't it?

@rjzamora
Member Author

> I agree with @TomAugspurger's remark about shuffle_dtype being part of the public API. If it is included in the public API, there should also be a note about which scenarios would benefit from creating the column. I'm struggling to come up with a scenario that would benefit, and if it is very rare, I would suggest not exposing this parameter (internally it's fine if it makes a difference, e.g. for set_index).

Agreed - The cleanest change is probably to simplify the public API, and to always avoid creating the "_partitions" column in dask.dataframe.shuffle.shuffle. The changes to rearrange_by_column_tasks in this PR will still allow for the "_partitions"-based approach in set_index (and sort_values when added).

> @TomAugspurger the non-uniformity of the buckets is something which didn't change here, did it? The hash bucketing logic is effectively the same, isn't it?

Right - These changes will not actually change the hash statistics. It just changes when hashing is performed (and exposes a new DataFrame.shuffle method)

@rjzamora
Member Author

rjzamora commented Apr 30, 2020

Note: I will need to revise the changes here a bit to align with #6137

@rjzamora rjzamora changed the title [WIP] Generalize rearrange_by_column_tasks and optimize shuffle [REVIEW] Generalize rearrange_by_column_tasks and add DataFrame.shuffle May 11, 2020
@rjzamora
Member Author

Update:

  • Removed shuffle_dtype. Will always avoid creating "_partitions" when the user is passing in a column name (or list of column names) to DataFrame.shuffle.
  • Added the _simple_rearrange_by_column_tasks code path for the case that the output partition count is small (less than max_branch). The primary motivation is the case that the output partition count is small, but different from the input partition count. Unless I am misunderstanding, the current algorithm will first perform a df.npartitions-to-df.npartitions shuffle, and then repartition. A simpler algorithm can do this all in a single shuffle.
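The single-stage idea can be sketched with plain pandas. This is a hypothetical simplification (single_stage_shuffle is not the PR's actual code): each input partition is split directly into the target number of output groups, and each output partition is the concatenation of its group from every input, so no intermediate npartitions-to-npartitions shuffle or follow-up repartition is needed:

```python
import pandas as pd

# Sketch of a single-stage shuffle for a small output partition count.
def single_stage_shuffle(parts, col, npartitions_out):
    groups = []
    for df in parts:  # one pass over the input partitions
        part_id = pd.util.hash_pandas_object(df[col], index=False) % npartitions_out
        # split this input directly into the final output groups
        groups.append({i: df[part_id.values == i] for i in range(npartitions_out)})
    # gather: output partition i collects group i from every input
    return [pd.concat([g[i] for g in groups]) for i in range(npartitions_out)]
```

The multi-stage (max_branch) algorithm exists to cap the number of transfer tasks when the partition count is large; when the output count is small, this direct split-and-gather is sufficient.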

@rjzamora rjzamora changed the title [REVIEW] Generalize rearrange_by_column_tasks and add DataFrame.shuffle [WIP] Generalize rearrange_by_column_tasks and add DataFrame.shuffle May 11, 2020
@rjzamora rjzamora changed the title [WIP] Generalize rearrange_by_column_tasks and add DataFrame.shuffle Generalize rearrange_by_column_tasks and add DataFrame.shuffle May 20, 2020

@TomAugspurger TomAugspurger left a comment


Thanks for working on this. Overall it's looking nice.

IIUC, this hasn't changed the data model at all: there's no indication on the object that the dataframe is now partitioned by one or more columns. Should we? For index-based partitioning, we have .divisions, and known divisions are reflected in the repr. Or perhaps that doesn't make sense to add, since we don't know the hashed values like we know the divisions? I haven't thought through this fully.

@rjzamora
Member Author

rjzamora commented May 21, 2020

> IIUC, this hasn't changed the data model at all: there's no indication on the object that the dataframe is now partitioned by one or more columns. Should we? For index-based partitioning, we have .divisions, and known divisions are reflected in the repr. Or perhaps that doesn't make sense to add, since we don't know the hashed values like we know the divisions? I haven't thought through this fully.

Right - I do think it ultimately makes sense to expand the dask.dataframe data model to allow the divisions attribute to be represented as a pd.DataFrame-like object, so that the divisions can correspond to one or more columns (including or not including the index). With that said, the shuffle method added here does not result in an "order"-based partitioning, so the same concept of "divisions" is not really useful (unless we are interested in eventually storing information beyond lexicographical ordering in the divisions - which seems like a stretch to me).

@TomAugspurger
Member

Yeah, agreed that we can't have a .divisions-like attribute for column shuffle-partitioned datasets. Just wondering if the fact that it's partitioned on that column's values should be reflected in the repr. But that can wait till later!

@rjzamora rjzamora changed the title Generalize rearrange_by_column_tasks and add DataFrame.shuffle [REVIEW] Generalize rearrange_by_column_tasks and add DataFrame.shuffle May 22, 2020

@TomAugspurger TomAugspurger left a comment


This looks good.

@rjzamora do you have strong thoughts on if we should / how to expose (hash-based) column partitioning in the data model? I'd like to have an issue to collect discussion on this topic, but I'm still working through it in my head. If you don't have strong thoughts then I'll assign myself a task to write up an issue.

@rjzamora
Member Author

rjzamora commented May 26, 2020

> @rjzamora do you have strong thoughts on if we should / how to expose (hash-based) column partitioning in the data model? I'd like to have an issue to collect discussion on this topic, but I'm still working through it in my head. If you don't have strong thoughts then I'll assign myself a task to write up an issue.

No strong thoughts from me, but I will be very happy to participate in a discussion :)... Without giving it too much thought, I think we will probably want to allow the divisions to be represented by something like a pd.DataFrame object. This way, the divisions can correspond to either an Index/MultiIndex, or an arbitrary set of columns. We may end up wanting to introduce a new Divisions class to organize the necessary attributes. If the divisions correspond to hashed values rather than literal values, defining an attribute like hash_partitioned=True may do the trick. [EDIT: I guess for the hashed case we wouldn't really want a full pd.DataFrame of division values...]

@rjzamora
Member Author

Looks like failures are unrelated.

@TomAugspurger
Member

Yep, those were just fixed on master. I'll merge this and open an issue on the data model things.



6 participants