
[HACK] Ordering to priorities "shuffle-split" #6051

Closed
madsbk wants to merge 2 commits into dask:main from madsbk:shuffle_split_high_priority

Conversation

@madsbk
Contributor

@madsbk madsbk commented Mar 31, 2020

This PR is a HACK to do breadth-first ordering of shuffle-split tasks.

Dask's scheduling policy is, generally speaking, depth-first, which works great in most cases. However, it can increase memory usage significantly when we split the output of a task into many small tasks (like rearrange_by_column_tasks()'s shuffle-group and shuffle-split tasks). In this case, depth-first delays the freeing of each shuffle-group output until the end of the shuffle, which uses much more memory than a breadth-first ordering, where all the shuffle-split tasks finish immediately and the output of shuffle-group can be freed before continuing.
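For intuition, the memory effect described above can be simulated with a stdlib-only toy (all names and sizes here are made up, not real dask code): four "group" tasks each feed four tiny "split" tasks, and a group's output can be freed once all of its splits have run.

```python
# Toy model of the shuffle-group / shuffle-split memory problem
# (hypothetical sizes; not real dask code).

GROUPS = 4        # number of shuffle-group tasks
SPLITS = 4        # shuffle-split tasks per group
GROUP_SIZE = 100  # "memory" held by one shuffle-group output
SPLIT_SIZE = 1    # "memory" held by one shuffle-split output

def peak_memory(order):
    """Replay an execution order, freeing a group's output once all of
    its splits have run, and report the peak live memory observed."""
    live = {}                                   # task -> size currently held
    remaining = {g: SPLITS for g in range(GROUPS)}
    peak = 0
    for task in order:
        if task[0] == "group":
            live[task] = GROUP_SIZE
        else:                                   # ("split", g, i)
            _, g, _ = task
            live[task] = SPLIT_SIZE
            remaining[g] -= 1
            if remaining[g] == 0:
                del live[("group", g)]          # parent can be freed
        peak = max(peak, sum(live.values()))
    return peak

# Depth-first-ish: run every group before any split finishes its parent.
depth_first = [("group", g) for g in range(GROUPS)] + [
    ("split", g, i) for g in range(GROUPS) for i in range(SPLITS)
]

# Breadth-first: finish each group's splits before starting the next group.
breadth_first = []
for g in range(GROUPS):
    breadth_first.append(("group", g))
    breadth_first += [("split", g, i) for i in range(SPLITS)]

print(peak_memory(depth_first))    # all group outputs alive at once
print(peak_memory(breadth_first))  # at most one group output alive
```

The breadth-first replay never holds more than one large group output, while the depth-first replay holds all of them simultaneously.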

This is a HACK; please don't merge. Let's find a more general solution to this issue.

cc. @rjzamora, @beckernick

@mrocklin
Member

mrocklin commented Mar 31, 2020 via email

@rjzamora
Member

I can confirm that this helps dramatically for our problematic use case. What is the path forward to officially enable an optional breadth-first execution?

@TomAugspurger
Member

@madsbk can you share a code snippet that exercises this, and perhaps a couple of graphs from visualize(color="order", cmap="autumn", node_attr={"penwidth": "4"}) showing the before & after?

@beckernick
Member

beckernick commented Mar 31, 2020

@TomAugspurger I'll provide a clean example in a gist as soon as I can

@beckernick
Member

beckernick commented Mar 31, 2020

@TomAugspurger do you expect this to change the task graph itself, or just the order of execution? Visually, these should be the same before and after, right? cc @rjzamora

@TomAugspurger
Member

Just the order of execution. The color="order" is what would change the visual representation.

@beckernick
Member

beckernick commented Mar 31, 2020

The following example task graph visualizations come from the code snippet at the bottom. This snippet just creates a dataset and runs hash-based repartitioning. One is from this PR, one is from the current release (2.13). Apologies in advance that they use a GPU.

This is the current execution order:

[image: current_dask_graph]

This is the execution order on this PR:

[image: new_dask_graph]

These were generated from the following code snippet:

import sys

import dask_cudf
import cudf

filename = sys.argv[1]

df = cudf.datasets.randomdata(100000)
ddf = dask_cudf.from_cudf(df, 10)

ddf = ddf.repartition(columns="id")

ddf.visualize(
    color="order", cmap="autumn", node_attr={"penwidth": "4"},
    filename=filename,
)

cc @TomAugspurger @madsbk @rjzamora @mrocklin

@TomAugspurger
Member

TomAugspurger commented Mar 31, 2020

Thanks. Is dask_cudf.repartition similar to dask.dataframe.DataFrame.set_index? If so, this may be a reproducer with just Dask:

In [1]: import dask.datasets

In [2]: ts = dask.datasets.timeseries()

In [3]: result = ts.set_index("id")

We shouldn't need anything with a cluster / distributed, since this is the static ordering done prior to sending the task graph to the scheduler.

Edit: the task graph from my example looks quite different, so it may not be representative of the original problem.

@beckernick
Member

Ah, right. Good point on not needing the cluster 😄

This operation is similar in nature to set_index, but we don't end up with a difference due to the lack of an explicit shuffle-split task (this PR only looks for that string in the task key). The set_index op has shuffle-collect tasks instead.

@rjzamora
Member

@TomAugspurger - A good dask-only reproducer is something like this:

import dask.dataframe as dd
import pandas as pd
import numpy as np

size = 48
df = pd.DataFrame(
    {
        "index": np.random.choice(list(range(4)), size),
        "a": np.arange(size),
    }
)
ddf = dd.from_pandas(df, npartitions=4)

result = dd.shuffle.rearrange_by_divisions(
    ddf, "index", (0, 1, 2, 3, 4), shuffle="tasks"
)

Note that the repartition(columns=) API in dask_cudf is using something very similar to Dask's rearrange_by_column_tasks (which also uses the "shuffle-split" name convention).

dask/order.py Outdated
# TODO: Hack to prioritize "shuffle-split"
shuffle_split_keys = []
for k in result.keys():
    if len(k) > 0 and "shuffle-split" in k[0]:
Member

What about using a general label that any API or user could add to the key of a task? For example, something like "dsk-prioritize":

# Prioritize tasks with "dsk-prioritize" annotation
for k in list(result.keys()):
    if k and "dsk-prioritize" in k[0]:
        result[k] = 0

I know you mentioned off-line that the best general solution would be to annotate the priority within the task. Is this what you had in mind - or something more sophisticated? I guess the specific priority could also be included in the annotation if the user/api wants really fine-grained control.
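To make the suggestion above concrete, here is a stdlib-only toy (the key names are invented; `result` stands in for the dict returned by dask.order.order, where a lower number means "run earlier"):

```python
# Toy `result` dict standing in for the output of dask.order.order
# (key names are invented for illustration; lower number = run earlier).
result = {
    ("read-1", 0): 3,
    ("dsk-prioritize-split-1", 0): 7,
    ("dsk-prioritize-split-1", 1): 9,
    ("sum-1", 0): 12,
}

# The suggested scan: pull any annotated key to the front of the order.
for k in list(result.keys()):
    if k and "dsk-prioritize" in k[0]:
        result[k] = 0

print(result)  # annotated keys now have priority 0, the rest are untouched
```

Any API or user that wanted breadth-first behavior would only have to embed the label in its task names.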

Contributor Author

Yes, something like this, but overloading key names is probably too much of a hack to be accepted :)

@beckernick
Member

beckernick commented Mar 31, 2020

@TomAugspurger - A good dask-only reproducer is something like this:

import dask.dataframe as dd
import pandas as pd
import numpy as np

size = 48
df = pd.DataFrame(
    {
        "index": np.random.choice(list(range(4)), size),
        "a": np.arange(size),
    }
)
ddf = dd.from_pandas(df, npartitions=4)

result = dd.shuffle.rearrange_by_divisions(
    ddf, "index", (0, 1, 2, 3, 4), shuffle="tasks"
)

Note that the repartition(columns=) API in dask_cudf is using something very similar to Dask's rearrange_by_column_tasks (which also uses the "shuffle-split" name convention).

For reference, these graphs visualize as the following:

The current execution:

[image: current_dask_graph (2)]

This PR
[image: new_dask_graph (2)]

@madsbk
Contributor Author

madsbk commented Apr 1, 2020

Opened a discussion of a better solution than this hack :)
#6054

@mrocklin
Member

mrocklin commented Apr 1, 2020

Thanks for raising this, @madsbk, and for producing images and reproducers, @beckernick and @rjzamora.

Whenever I come across a case where someone has made a particular workflow faster/better by working around Dask's scheduling heuristics I try to see if there is a way to generalize the improvement, rather than finding ways to make the workaround easier. Early on we had lots of these situations, and being disciplined about learning from special-case improvements and applying those lessons to the global heuristics made it so that our heuristics became decent over time (these sorts of situations are much less common today).

To that end, I'm curious about how Dask's ordering got this wrong. I'd like to dig into this comment from @madsbk :

Dask's scheduling policy is, generally speaking, depth-first, which works great in most cases. However, it can increase memory usage significantly when we split the output of a task into many small tasks (like rearrange_by_column_tasks()'s shuffle-group and shuffle-split tasks). In this case, depth-first delays the freeing of each shuffle-group output until the end of the shuffle, which uses much more memory than a breadth-first ordering, where all the shuffle-split tasks finish immediately and the output of shuffle-group can be freed before continuing.

First, you might want to try #5872 by @eriknw , which changes around ordering. I'm not confident that it will resolve your problem here, but it should be easy to try.

Second, is there anything we can learn from this so that Dask makes the right decision automatically, rather than requiring special input from the user? The ordering code is challenging to get into today, but it may be that @eriknw (who I believe is lurking on this thread) could help if we could find some improvement to make.

@eriknw
Member

eriknw commented Apr 1, 2020

Let me confirm what's happening here. The data from a task such as (1, (0,)) in #6051 (comment) is large. All of the dependents of this task create small data. In this case, we want to compute all the dependents so we can release the parent data. In other words, the total size of all dependents is at most comparable to the size of the parent.

I don't think there is much order can do here without knowing information about which tasks are small. If there was even only one other large dependent, then calculating all dependents is the wrong thing to do. If we had estimates of the size of each task, then both fuse and order could do better.

In this example, are the tasks that create small data getters such as getitem?

@rjzamora
Member

rjzamora commented Apr 1, 2020

...I try to see if there is a way to generalize the improvement, rather than finding ways to make the workaround easier.

We totally agree, @mrocklin. We don't want to implement/maintain any workarounds unless it proves completely necessary :)

In this example, are the tasks that create small data getters such as getitem?

@eriknw - Yes, the tasks returning "large" data are actually taking in a single dataframe partition and then splitting the partition into a dictionary of smaller dataframes. The "shuffle-split" tasks are just calling getitem on a single element of that large dictionary.
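That structure can be sketched without dask at all (key names and the mini executor here are hypothetical stand-ins): one task splits a partition into a dict, and each "shuffle-split"-style task is just operator.getitem on one element of it.

```python
import operator

def split_partition(part, nsplits):
    # "shuffle-group"-like task: one large partition becomes a dict of pieces
    return {i: part[i::nsplits] for i in range(nsplits)}

# A miniature task graph in dask's tuple convention (key names hypothetical)
dsk = {
    "part": list(range(12)),
    "group": (split_partition, "part", 3),
    ("split", 0): (operator.getitem, "group", 0),
    ("split", 1): (operator.getitem, "group", 1),
    ("split", 2): (operator.getitem, "group", 2),
}

def get(dsk, key):
    """Tiny recursive executor standing in for a real dask scheduler
    (no caching: "group" is recomputed for every split)."""
    task = dsk[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        return func(*(get(dsk, a) if isinstance(a, (str, tuple)) and a in dsk else a
                      for a in args))
    return task

print(get(dsk, ("split", 1)))  # -> [1, 4, 7, 10], one small piece of the big dict
```

Each "split" output is tiny relative to the "group" dict it indexes into, which is exactly why running all splits promptly lets the large intermediate be dropped.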

@madsbk madsbk mentioned this pull request Apr 2, 2020
3 tasks
@madsbk
Contributor Author

madsbk commented Apr 2, 2020

Agreed with @eriknw: just by looking at the structure of the task graph, order() has no way of determining when breadth-first is preferable.

I have implemented a non-intrusive solution here: #6059.

@eriknw
Member

eriknw commented Apr 30, 2020

I've continued to ponder a more general solution.

It may not be unreasonable to identify and keep track of tasks that are expected to be much smaller than their dependencies. dask.array.optimize already does something similar by identifying keys (such as those from GETTERS) to not fuse.

By knowing the tiny tasks, both fuse and order can be smarter and should be able to handle the example of this PR. I bet I can whip up a PoC for order if there's interest.

I don't have a strong opinion how this information is created and managed (such as from task annotations vs other). My PoC will probably accept a set of "tiny_keys".
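One possible shape for such a PoC, as a stdlib-only sketch (not the actual implementation; the graph and key names are invented): a plain Kahn-style toposort that, among ready tasks, always prefers the ones in tiny_keys, so a large parent's dependents finish (and the parent becomes releasable) before the next large task starts.

```python
def order_with_tiny_keys(dependents, dependencies, tiny_keys):
    """Sketch of a tiny-keys-aware static ordering: among ready tasks,
    always run the ones marked tiny first."""
    ndeps = {k: len(v) for k, v in dependencies.items()}
    ready = [k for k, n in ndeps.items() if n == 0]
    out = []
    while ready:
        # tiny tasks sort first; str() gives a stable tie-break
        ready.sort(key=lambda k: (k not in tiny_keys, str(k)))
        task = ready.pop(0)
        out.append(task)
        for dep in dependents.get(task, ()):
            ndeps[dep] -= 1
            if ndeps[dep] == 0:
                ready.append(dep)
    return out

# Two "group" tasks, each with two tiny "split" dependents.
dependencies = {
    "group-0": [], "group-1": [],
    ("split", 0, 0): ["group-0"], ("split", 0, 1): ["group-0"],
    ("split", 1, 0): ["group-1"], ("split", 1, 1): ["group-1"],
}
dependents = {
    "group-0": [("split", 0, 0), ("split", 0, 1)],
    "group-1": [("split", 1, 0), ("split", 1, 1)],
}
tiny = {k for k in dependencies if isinstance(k, tuple) and k[0] == "split"}

# group-0's splits are scheduled before group-1 even starts
print(order_with_tiny_keys(dependents, dependencies, tiny))
```

With an empty tiny_keys set this degrades to an ordinary toposort, which is the "order can't know" situation described above.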

@madsbk madsbk force-pushed the shuffle_split_high_priority branch 2 times, most recently from 76a4898 to e788c9f Compare September 30, 2020 14:45
@madsbk madsbk force-pushed the shuffle_split_high_priority branch 2 times, most recently from 30093bf to 888dcdb Compare October 12, 2020 12:08
Base automatically changed from master to main March 8, 2021 20:19
@madsbk madsbk force-pushed the shuffle_split_high_priority branch from 888dcdb to 5ef58ca Compare June 22, 2021 08:04
@mrocklin
Member

I'm excited to see activity on this PR. Now that annotations are in I'm hopeful that this will be an easier win.

@madsbk
Contributor Author

madsbk commented Jun 23, 2021

I'm excited to see activity on this PR. Now that annotations are in I'm hopeful that this will be an easier win.

Yeah, finally got around to looking at this again :)
Proposed a solution that makes use of our new annotation API: #7826

@madsbk madsbk marked this pull request as ready for review June 23, 2021 12:02
@madsbk madsbk closed this Jun 23, 2021