Non-intrusive task annotations by madsbk · Pull Request #6059 · dask/dask

madsbk · 2020-04-02T12:19:26Z

This PR introduces non-intrusive task annotation that makes it possible to annotate tasks with arbitrary information. In order to make this non-intrusive, this implementation annotate the task's callable using functools.partial thus any component in Dask/Distributed can be completely agnostic of the annotation without any performance overhead.

Overhead is introduced only for components that use the annotation. For instance, this PR makes order.order() traverse the graph keys once in order to check annotations, which is negligible compared to the rest of the computation in order.order().

Inspired by dask/distributed#2180, but this PR is less intrusive because it doesn't change a task's arguments.
Fixes #6054
Closes #6051

Docs added
Tests added / passed
Passes black dask / flake8 dask

cc. @rjzamora, @mrocklin, @eriknw, @dhirschfeld

dask/annotation.py

dhirschfeld · 2020-04-02T23:37:56Z

ping @sjperkins - having also worked on this feature you might have some thoughts on the implementation here.

sjperkins · 2020-04-03T08:04:27Z

Thanks for pinging me @dhirschfeld.

@madsbk, I think this is a great improvement on dask/distributed#2180. Placing the annotation on a functools wrapper (rather than at the end of the task tuple in dask/distributed#2180) makes it far easier to set and get the annotation.

A possible (minor, compared to its advantages) downside to the approach might be an increase in the size of the serialized graph due to the functools wrapper. This should be simple to test. However, I've always considered an annotation on every graph task to be a worst-case scenario. I think it's far more likely that only a few tasks will be annotated.

I think the challenge going forward will be handling the interaction between annotations and the optimization code, especially with regard to task fusion.

Initially, I thought some sort of logic hook would be needed to decide how to fuse tasks with differing annotations, but thinking about it a bit more I think that tasks with differing annotations should simply not be fused at all. Firstly, because catering to nested tasks will make the threaded and distributed schedulers more complex and secondly, because I don't think annotations generally combine well. For example:

tasks with differing priorities shouldn't be prioritised the same
tasks assigned to different workers shouldn't be assigned to the same worker.
tasks with different resource requirements shouldn't require one of them.

Another area to think about would be the Blockwise object and the related optimize_blockwise method. Once again, if the encapsulated function has differing annotations I think the layers should simply not be fused.

Tasks with the same annotation should be candidates for fusion.

TomAugspurger · 2020-04-03T11:39:18Z

dask/dataframe/shuffle.py

+    # We know that it is _always_ beneficial to prioritize the
+    # getitem() task because it makes it possible for Dask to free
+    # the output of shuffle_group() as fast as possible.


This statement sounds true of most getitems. Most of the time the output will be smaller than the input. What makes it especially important for the shuffle-split workload?

Agree, if we are getting all of the parent task so that it can be freed, prioritizing getitem() makes most often sense.
But in the general case getitem() could imply any kind of memory/compute usage e.g getting an item from a zict dictionary can be very expensive.

Right, in this case I think that the condition for this being useful is that the memory use of all of the dependents is less than or equal to the size of the dependency (equal in this case).

If we're submitting this graph along with other graphs at the same time (maybe the shuffle is part of a larger computation) will this also prioritize the shuffle code above those other parts of the graph, or will things still operate normally?

If we're submitting this graph along with other graphs at the same time (maybe the shuffle is part of a larger computation) will this also prioritize the shuffle code above those other parts of the graph, or will things still operate normally?

This PR effectively makes the scheduler schedule the getitem() tasks and their parent task shuffle_group() as a single node. The priority of the shuffle code as a whole is not changed thus things should still operate normally,

TomAugspurger · 2020-04-03T11:40:36Z

dask/order.py

+    for k, v in dsk.items():
+        if istask(v):


FYI, the order benchmarks from dask/dask-benchmarks don't show any slowdowns here. I thought that these two lines might be expensive, but if they are we don't catch it in those benchmarks.

I am both pleased and surprised by this

In other words, "I don't trust the benchmarks" :)

I'll see if I can make a benchmark that stresses this, because it really seems like it should have some cost.

Oh, I wasn't trying to cast doubt on the benchmarks. I'm genuinely pleased that this didn't have much of an effect. I haven't really taken much of a look at the benchmarks to know what's going on in them to be able to say either way. I'm mostly trusting that you and Erik have this under control.

👍

FWIW I tried a bit but couldn't make this block take more than ~3% or so of the run time. What's the worst-case scenario here? My thought was something like

diff --git a/dask/benchmarks/order.py b/dask/benchmarks/order.py index 2268fe3..620d3f7 100644 --- a/dask/benchmarks/order.py +++ b/dask/benchmarks/order.py @@ -1,5 +1,6 @@ from dask import array as da from dask.base import collections_to_dsk +from dask.core import get_dependencies from dask.order import order from .common import DaskSuite @@ -162,3 +163,19 @@ class OrderManySubgraphs(DaskSuite): def time_order_many_subgraphs(self, param): order(self.dsk) + + +class TimeOrderMisc(DaskSuite): + def setup(self): + dsk = {'0': (0,)} + for i in range(1, 1_000_000): + dsk[str(i)] = (f, str(i - 1)) + + dependencies = {k: get_dependencies(dsk, k) for k in dsk} + + self.dsk = dsk + self.dependencies = dependencies + + def time_order_silly(self): + order(self.dsk, dependencies=self.dependencies)

A large task graph (since this is making another pass over dsk.items()

A very simple task graph (less time in the rest of the order code)

even with that, I get a slight slowdown (7.58s -> 8.2s)

It wouldn't be the first time a benchmark proved our intuition wrong 😉

mrocklin

I've attached some comments and questions below.

Additionally, this is a big change for the project, and probably one that requires a decent amount of community involvement. It would be good to engage #3783 and figure out what the right long term approach should be. There are people there who, I think, care somewhat deeply about this topic and we should get their approval.

I think that this approach has a lot going for it in terms of being a lightweight modification of the current system, but a change like this is low level enough that I'd also like to make it only once,

mrocklin · 2020-04-03T16:19:22Z

dask/order.py

+    # Check task annotations
+    for k, v in dsk.items():
+        if istask(v):
+            f = v[0]  # We prioritize based on the first task function


What if the task is fused? It seems like this approach, while simple, may silently ignore user annotations if they have gone through task fusion (which is common).

True, in this case we only check the first function. We could use functions_of() to check all nested functions with the extra overhead that entails.
Alternately as @sjperkins suggest, the fusers could also check the annotations and act accordingly.

dask/annotation.py

mrocklin · 2020-04-03T16:22:23Z

dask/dataframe/shuffle.py

+    # We know that it is _always_ beneficial to prioritize the
+    # getitem() task because it makes it possible for Dask to free
+    # the output of shuffle_group() as fast as possible.


Right, in this case I think that the condition for this being useful is that the memory use of all of the dependents is less than or equal to the size of the dependency (equal in this case).

mrocklin · 2020-04-03T16:23:43Z

dask/dataframe/shuffle.py

+    # We know that it is _always_ beneficial to prioritize the
+    # getitem() task because it makes it possible for Dask to free
+    # the output of shuffle_group() as fast as possible.


If we're submitting this graph along with other graphs at the same time (maybe the shuffle is part of a larger computation) will this also prioritize the shuffle code above those other parts of the graph, or will things still operate normally?

mrocklin · 2020-04-03T16:24:01Z

dask/order.py

+    for k, v in dsk.items():
+        if istask(v):


I am both pleased and surprised by this

dhirschfeld · 2020-04-05T03:13:56Z

My interest in annotations is to (hopefully) support labelling tasks as being part of a particular job.

I have a dask cluster which can be running multiple jobs simultaneously, each of which may have hundreds of tasks associated with it. My hope is that by taking care to annotate the tasks as being part of a particular job when they're created that I can build a UI showing how many jobs are currently running and how many tasks each have remaining - e.g. something like:

JobA - 73/100
JobB - 97/123
JobC - 53/53

...instead of just 223/276 with no idea which jobs are running or what state they're in.

If you're being fancy you could also enable drilling down to see the status of the tasks associated with a particular job, perhaps by enabling filtering the task view by job/annotation.

madsbk · 2020-04-14T10:49:42Z

I think that this approach has a lot going for it in terms of being a lightweight modification of the current system, but a change like this is low level enough that I'd also like to make it only once,

I think we need to discuss this feature in two parts:

How to implement the task annotation. Do we want a lightweight approach like this PR or a more intrusive approach that introduce Key and Task classes instead of generic tuples: Key and Task classes #2299
What should be the semantic of the annotated information. Do we what a loose approach where components can communicate through annotations using there own protocol or do we want a more strict approach where all tasks are annotated with specific information.

Currently, this PR is an example of a loose approach where rearrange_by_column_tasks() communicate priority information to order.order() using its own semantic e.g. in this PR we only check the priority of first function in a task.

The loose approach makes easy to introduce new annotations like the labeling feature @dhirschfeld and it is easy to manage the overhead.

On the other hand, the strict approach makes it easy to reason about what information is in the annotation and we can decide on the semantic of all annotated information. However, in order to manage the performance overhead we probably have to use Cython, Numba, or some other library to implement the bottlenecks.

lr4d · 2020-10-09T11:39:20Z

@madsbk are you still working on this? I think this functionality would enable a lot of further improvement to be build on top of it

mrocklin · 2020-10-09T16:12:31Z

@lr4d @fjetter if you all are interested in annotations I would love to get your thoughts on #6701

rjzamora reviewed Apr 2, 2020

View reviewed changes

dask/annotation.py Outdated Show resolved Hide resolved

madsbk mentioned this pull request Apr 2, 2020

[HACK] Ordering to priorities "shuffle-split" #6051

Closed

TomAugspurger reviewed Apr 3, 2020

View reviewed changes

mrocklin reviewed Apr 3, 2020

View reviewed changes

madsbk added 5 commits April 24, 2020 14:57

Added annotation.py

016e7ea

order(): now checks annotations for priorities

1970f9e

rearrange_by_column_tasks(): priorities getitem()

f54dc19

reformat: flake8

84e1f88

Using functools.update_wrapper()

887104d

madsbk force-pushed the task_annotation branch from 5b7e1f6 to 887104d Compare April 24, 2020 12:57

beckernick mentioned this pull request Apr 30, 2020

[WIP] New Dask Scheduler for cuDF rapidsai/cudf#5053

Closed

Merge branch 'master' of github.com:dask/dask into task_annotation

62fb910

sjperkins mentioned this pull request May 14, 2020

Function task annotations dask/distributed#3796

Draft

madsbk closed this Dec 18, 2020

Uh oh!

Conversation

madsbk commented Apr 2, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

dhirschfeld commented Apr 2, 2020

Uh oh!

sjperkins commented Apr 3, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

madsbk Apr 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mrocklin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

madsbk Apr 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dhirschfeld commented Apr 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

madsbk commented Apr 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lr4d commented Oct 9, 2020

Uh oh!

mrocklin commented Oct 9, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

madsbk commented Apr 2, 2020 •

edited

Loading

madsbk Apr 3, 2020 •

edited

Loading

madsbk Apr 4, 2020 •

edited

Loading

dhirschfeld commented Apr 5, 2020 •

edited

Loading

madsbk commented Apr 14, 2020 •

edited

Loading