Conversation

@fjetter (Member) commented Jul 24, 2024

This is an early version that will close #9969

It introduces a new Task class (name is subject to change) and a couple of other related subclasses that should replace the tuple as a representation of runnable tasks.

The benefits of this are outlined in #9969 and are primarily focused on reducing overhead during serialization and parsing of results. An important result is also that we can trivially cache functions (and arguments, if we wish) to avoid problems like dask/distributed#8767, where users erroneously provide expensive-to-pickle functions (which also happens frequently in our own code and/or downstream projects like xarray).

This approach allows us to convert the legacy dsk graph to the new representation with full backwards compatibility. Old graphs can be migrated and new ones written directly using this new representation which will ultimately reduce overhead.
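
To make the shape of the change concrete, here is a minimal, self-contained sketch with hypothetical stand-in classes (the real classes live in this PR, and their exact signatures may differ):

class TaskRef:
    """Reference to another task's output, identified by key."""
    def __init__(self, key):
        self.key = key

class Task:
    """Runnable node: a key, a callable, and already-parsed arguments."""
    def __init__(self, key, func, /, *args, **kwargs):
        self.key = key
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __call__(self, results):
        # Resolve references to previously computed keys, then run.
        args = [results[a.key] if isinstance(a, TaskRef) else a
                for a in self.args]
        return self.func(*args, **self.kwargs)

def add(x, y):
    return x + y

# Legacy: runnable tasks are encoded tuples; bare strings double as keys.
legacy_dsk = {"a": 1, "b": (add, "a", 10)}

# New: tasks are objects, and references to other keys are explicit.
new_dsk = {"a": 1, "b": Task("b", add, TaskRef("a"), 10)}
assert new_dsk["b"]({"a": 1}) == 11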

I will follow up with measurements shortly.

Sibling PR in distributed dask/distributed#8797

@github-actions bot (Contributor) commented Jul 24, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

13 files ±0    13 suites ±0    3h 9m 31s ⏱️ +7m 47s
13 222 tests +39    12 166 ✅ +39    1 056 💤 ±0    0 ❌ ±0
137 910 runs +507    118 882 ✅ +497    19 028 💤 +10    0 ❌ ±0

Results for commit 8aed17b. ± Comparison against base commit a6d0bdc.

♻️ This comment has been updated with latest results.


_T = TypeVar("_T", bound="BaseTask")


class WrappedKey:
Member Author:

This had to move from distributed since we'd otherwise get horrible circular imports

return typ(*args, **kwargs)


def convert_old_style_task(k, arg, all_keys, only_refs) -> BaseTask:
Member Author:

this is very similar to unpack_remotedata

return new_dsk


class KeyRef:
Member Author:

TODO: Maybe this should inherit from WrappedKey or be replaced by it entirely.

Member:

For me, calling this class TaskRef would make more sense since it does reference a Task via its key.

Suggested change
class KeyRef:
class TaskRef:

Member Author:

I'm actually leaning towards nuking this class entirely in favor of WrappedKey (keeping that for backwards compat) but I'm happy to have TaskRef as an alias

return True


_auto = object()
Member Author:

todo: unused

@phofl (Collaborator) left a comment

gave this a first look and looks good!

"literal-string" ~ Literal("key", "literal-string")
Keys, Aliases and KeyRefs
Collaborator:

How would I rename a result that was previously stored under a different key?

{"key": KeyRef("old-key")}?

Member Author:

yes, this

@hendrikmakait hendrikmakait self-requested a review July 30, 2024 14:52
@fjetter (Member Author) commented Jul 31, 2024

In case anybody reviews: I just had a conversation with Hendrik, and here are a couple of expected concerns that were raised and discussed:

  • Sequence and Dict of tasks are relatively useless, and we may be better off killing them entirely in favor of an ordinary task that encodes the list/dict construction (see the sketch below this list). During development this was a (kind of) helpful construct, but I wouldn't expect it to be used by any user. The user API is pretty much only Task and KeyRef
  • LiteralTask is by itself not very useful, but it maintains an abstraction where all values of a low-level graph obey the same callable structure, which makes it quite natural to recurse into. Again, this wouldn't be used by end users (and likely not even devs outside of the parsing code). (The literal renders the normalization I'm talking about in order: remove data task graph normalization #11263 irrelevant)
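
For the first bullet, a rough sketch of the two representations for a literal list of tasks (hypothetical, minimal stand-ins, not the PR's actual classes):

class Task:
    def __init__(self, key, func, /, *args):
        self.key, self.func, self.args = key, func, args

class SequenceOfTasks:
    def __init__(self, key, tasks):
        self.key, self.tasks = key, tasks

# Option A: a dedicated container node holding sub-tasks.
node_a = SequenceOfTasks(
    "k", [Task("t1", str.upper, "a"), Task("t2", str.upper, "b")]
)

# Option B: an ordinary Task whose callable builds the list.
node_b = Task(
    "k", lambda *parts: list(parts),
    Task("t1", str.upper, "a"), Task("t2", str.upper, "b"),
)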

return super().__sizeof__() + sizeof(self.tasks)


class SequenceOfTasks(BaseTask):
Member Author:

Note that I opted to keep this after all. It is not something end users (or even developers) really have to touch but it makes internals of the Task class much smoother. I'm not married to it, though, and could see it being removed down the line again. It would just require a lot more iterating over things, checking for types, etc.

Collaborator:

Personally, I am fine adding some "boilerplate" structures if this makes things easier to understand. I like that these classes are different; it makes the things you have to consider in a single case easier

Member:

I'm not necessarily against these classes, I mostly dislike their name which suggests less magic than they actually perform. This might be rather something like a Sequence{Consolidation|Conflation|Combination|Comprehension|Reduction}Task

Member Author:

I don't feel like they are doing much magic, and I'm not sure what kind of surprises are hidden here.
I don't find that the names you suggested help with any of this ambiguity.

Member:

I'm happy not to do anything here; this can always be adjusted later should it remain a source of confusion.

Member Author:

There is an open question about the public API surface.

I'm inclined to make this module private for now but export something like Task or WrappedKey at the top level; I don't have a strong opinion here.

Collaborator:

+1

@fjetter (Member Author) left a comment

Right now, the Task signature is key: Key, func: Callable, args: tuple, kwargs: dict but it is also possible to make it key: Key, func: Callable, /, *args: Any, **kwargs: Any which would make it more natural to write a task, e.g. Task(key, func, arg1, arg2, kwarg1='foo').

I chose not to go down this path because it makes internals a little more complex (namely, we'd have to either make SequenceOfTasks iterable or remove it entirely). The additional complexity is OK if we want this API change.
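
For illustration, a sketch of the two signatures under discussion (hypothetical stand-in classes, not the PR's implementation):

class TaskContainers:
    # key: Key, func: Callable, args: tuple, kwargs: dict
    def __init__(self, key, func, args=(), kwargs=None):
        self.key, self.func = key, func
        self.args, self.kwargs = args, kwargs or {}

class TaskSplat:
    # key: Key, func: Callable, /, *args: Any, **kwargs: Any
    def __init__(self, key, func, /, *args, **kwargs):
        self.key, self.func = key, func
        self.args, self.kwargs = args, kwargs

def concat(x, y, sep=" "):
    return f"{x}{sep}{y}"

t1 = TaskContainers("k", concat, args=("a", "b"), kwargs={"sep": "-"})
t2 = TaskSplat("k", concat, "a", "b", sep="-")  # reads like the call itself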

@phofl (Collaborator) left a comment

This change is only hooked into distributed? I.e., if we get an old-style graph, then the sync scheduler will still execute the old-style graph?

Wouldn't we want to do https://github.com/dask/distributed/pull/8797/files#r1731514684 client side to reduce the upload size?

len(vals) == 1
and k not in (keys or ())
and k in dsk
and not isinstance(dsk[k], BaseTask)
Collaborator:

I like that this is explicit now

return Task(k, identity, (func, *new_args))
return Task(key=k, func=func, args=tuple(new_args))
try:
if isinstance(arg, (bytes, int, float, str, tuple)) and arg in all_keys:
Collaborator:

a key can be bytes??? or am I missing something here?

@fjetter (Member Author) commented Aug 27, 2024

Theoretically, yes. This is a remnant of Python 2 compat, where str was of type bytes and not unicode.

We should change this, but I never put in the time to clean everything up. FWIW, these two things would need to change:

dask/dask/core.py

Lines 202 to 217 in 2fbe18b

def iskey(key: object) -> bool:
    """Return True if the given object is a potential dask key; False otherwise.

    The definition of a key in a Dask graph is any str, bytes, int, float, or tuple
    thereof.

    See Also
    --------
    ishashable
    validate_key
    dask.typing.Key
    """
    typ = type(key)
    if typ is tuple:
        return all(iskey(i) for i in cast(tuple, key))
    return typ in {bytes, int, float, str}

Key: TypeAlias = Union[str, bytes, int, float, tuple["Key", ...]]

(and whatever pops up with mypy then).

I haven't used iskey here for perf reasons. The function dispatch is actually noticeable here.

Collaborator:

Thanks!

Collaborator:

+1


isinstance(t, Alias)
and t.key not in keys
and t.key != k
and t.key in dsk
Collaborator:

t.key not being in dsk means that we access a persisted result for example?

Member Author:

yes

Collaborator:

Can you add a brief comment for this as well? It took me a while to come to that conclusion.

and t.key not in keys
and t.key != k
and t.key in dsk
and len(dependents[t.key]) == 1
Collaborator:

This is protection if t.key is actually an input for a non-trivial task?

We would also catch a key that is an alias for 2 different results?

Member Author:

This behavior is tested in test_resolve_aliases

dsk = {
    "bar": "foo",
    "foo": (func, "a", "b"),
    "baz": "bar",
    "foo2": (func, "bar", "c"),
}

will convert to

{
    'bar': Task('foo'),
    'baz': Alias('bar'),
    'foo2': Task('foo2')
}

i.e. bar points to foo, and foo is only used by bar. Therefore, bar will just inherit the task. (Now that I put this here, I notice that we should likely change the key of that Task object as well.)

However, bar is being used in baz and foo2 so inlining it would require us to compute it twice.

Collaborator:

Can you add a short comment? It took me a while to remember what exactly this was supposed to catch.

@fjetter (Member Author) commented Aug 27, 2024

Wouldn't we want to do https://github.com/dask/distributed/pull/8797/files#r1731514684 client side to reduce the upload size?

That would force materialization client side as well... which may be ok... I have to think about this a little more.

It may be the case that we're currently always materializing things. Arrays are materialized because low-level fusion is still enabled, and dataframes are materialized because they are using dask-expr. I guess the only cases where this is not true are some trivial things like Client.map.

@phofl (Collaborator) commented Aug 27, 2024

That would force materialization client side as well... which may be ok... I have to think about this a little more.

Yeah, I think maybe we only do this if the graph is materialized already? Not sure if this would add a lot of complexity, though.

]
def inlinable(key, task):
if (
not isinstance(task, BaseTask)
Member:

Are Tasks never inlineable? Is this because we inline elsewhere?

Member Author:

this is because legacy and new don't play well with each other.

The convert_legacy_* functions can handle a mix of tasks, but other layers cannot. (So far, only distributed-only graphs come with Task objects, so the conversion in distributed alone is sufficient.) The very next step here should be a similar change in dask/dask to allow internals to assume we're working in the new system.

TL;DR: for now, legacy low-level optimization is/should be disabled whenever a BaseTask is encountered.


if (
isinstance(t, Alias)
and t.key not in keys
and t.key != k
Member:

What would actually happen in this self-referential case?

Member Author:

At the very least, dask.order raises a "cycles found" exception, because strictly speaking a graph with this kind of self-reference is not a DAG.

We had similar logic in the distributed scheduler before, so without this kind of filtering nothing will work. It's just a question of where we do this.


@fjetter (Member Author) commented Aug 27, 2024

I added a commit that changes the signature to Task(key, func, /, *args, **kwargs), which sometimes makes internals a little more awkward but is much more intuitive to use, I think.

__weakref__: Any = None
__slots__ = tuple(__annotations__)

def __init__(self, key: KeyRef | KeyType):
Member:

The Alias class feels slightly off. It's the only BaseTask class that does not return the key under which it is (or at least should be) registered in dsk; instead, it returns the aliased key. I think the intention is to make it simple to skip the Alias, but we have to implement code for optimizing Aliases away nonetheless, since users could also just explicitly use the key from dsk. This might be a pain we're facing because the existing API forces us into allowing Aliases. A more explicit variant would be something like Alias(old, new)

Member Author:

An earlier version distinguished those, and I very often had Alias(key, key) objects, which felt a little redundant, but I think I can make it work. It also struck me as weird to allow the key attribute to be different.

Alias(key, ref=None) seems appropriate

Member Author:

I refactored this in 60df5d9

I used target for the variable name instead of ref or new (since new is not always accurate, sometimes this is a genuine reference, and ref is already taken)

Member Author:

This also makes the resolve_aliases code (the function name is not ideal) a little easier to read imo.

@hendrikmakait (Member) commented Aug 28, 2024

Sounds good, target definitely works. I wanted to avoid something ambiguous like Alias(key, alias) where it's unclear what the original name is.
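
For reference, a minimal sketch of the refactored shape (hypothetical stand-in; the actual class lives in this PR's task-spec module):

class Alias:
    # `key` is the name this node is registered under in dsk;
    # `target` is the key whose result it re-exports. If omitted,
    # the alias is self-naming.
    def __init__(self, key, target=None):
        self.key = key
        self.target = target if target is not None else key

# "baz" re-exports the result stored under "bar":
dsk = {"baz": Alias("baz", target="bar")}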

Comment on lines 388 to 369
_func_cache: MutableMapping = {}
_func_cache_reverse: MutableMapping = {}
Member Author:

Note this cache is unbounded. I didn't want to use an LRU (also because the actually usable class for this is in distributed, and the functools cache, etc. is not good). Also, it's unclear how large the LRU should be.

Ideally, this would be a cache that works on bytes size so we could use zict, etc. for this, but I didn't want to deal with that complexity just yet.
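
A rough sketch of the caching idea (assumed shape with hypothetical helper names; the PR's actual cache may differ):

import pickle

_func_cache: dict = {}          # pickled payload -> callable
_func_cache_reverse: dict = {}  # callable -> pickled payload

def dumps_function(func):
    # Serialize func once; later tasks using the same function reuse
    # the cached bytes instead of pickling it again.
    try:
        return _func_cache_reverse[func]
    except KeyError:
        payload = pickle.dumps(func)
        _func_cache[payload] = func
        _func_cache_reverse[func] = payload
        return payload

def loads_function(payload):
    # Deserialize once per distinct payload.
    try:
        return _func_cache[payload]
    except KeyError:
        func = pickle.loads(payload)
        _func_cache[payload] = func
        return func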

# 1. The target key is not in the keys set. The keys set is what the
# user is requesting and by collapsing we'd no longer be able to
# return that result.
# 2. The target key is in fact part of dsk. If it isnt' this could
Member:

Suggested change
# 2. The target key is in fact part of dsk. If it isnt' this could
# 2. The target key is in fact part of dsk. If it isn't this could

@fjetter (Member Author) commented Aug 29, 2024

FYI: progress is a little hampered. I'm running into the wildest, pretty much unrelated, problems. Just to name a couple:

  • There was a thing in our tokenization code
  • Something in dask-expr is generating corrupt code
  • Some scheduler state machine corruption
  • ...

I'm doing my best to isolate those, but they are often a little hard to reproduce, and I have to cross-check whether it is a problem on main as well.

dask/base.py Outdated
Comment on lines 1098 to 1100
else:
seen = seen.copy()
tok = _seen.set(seen)
Member Author:

So, I'm running into an issue on dask/distributed in the test test_threadsafe_get. The test launches 30 threads that concurrently hammer the scheduler with the same computation request over and over again, i.e. the tokens should always be identical and the result should in all likelihood even be reused between all the threads.
However, I'm running into problems, and this appears to fix it. I haven't managed to isolate it (I already worked with RLocks in various places, which didn't help). The problem seems to be a counting issue in the seen dictionary, but I couldn't reduce it or write a test for it.

I'm slightly concerned about this fix since we call this function quite a lot, and copying a non-empty set can be expensive (haven't done any measurements yet)

@fjetter (Member Author) commented Aug 30, 2024

I removed this again. A workaround for this is to use tokenize directly instead of normalize_token for the GraphNode classes. This way, the likelihood of having those nested self-referencing constructs is much smaller / they don't show up. It's also better from a performance perspective since normalize_token can generate ridiculously deeply nested structures.

@fjetter fjetter force-pushed the task_spec_class branch 4 times, most recently from 527e583 to 2d52549 Compare September 5, 2024 10:26
@hendrikmakait (Member) left a comment

Thanks, @fjetter! This looks overall good to me. I have a few non-blocking comments regarding consistent wording and adding issues for remaining TODOs in the code.

return self.value

def __repr__(self):
return f"Literal({self.key}, type={self.typ}, {self.value})"
Member:

Suggested change
return f"Literal({self.key}, type={self.typ}, {self.value})"
return f"DataNode({self.key}, type={self.typ}, {self.value})"

""" Task specification for dask
This module contains the task specification for dask. It is used to represent
runnable and non-runnable (i.e. literals) tasks in a dask graph.
Member:

Suggested change
runnable and non-runnable (i.e. literals) tasks in a dask graph.
runnable (task) and non-runnable (data) nodes in a dask graph.

{"a": func("b")} ~ DictOfTasks("key", {"a": Task("a", func, "b")})
"literal-string" ~ Literal("key", "literal-string")
Member:

Suggested change
"literal-string" ~ Literal("key", "literal-string")
"literal-string" ~ DataNode("key", "literal-string")

def inline(self, dsk) -> GraphNode:
raise NotImplementedError("Not implemented")

def propagate_literal(self) -> GraphNode:
Member:

Should we stay consistent and rename this to something like propagate_data? There are a few other places where the word literal is being used; we may want to check for consistency there as well.

(no_new_edges or height < max_depth_new_edges)
and (
not isinstance(dsk[parent], GraphNode)
# TODO: substitute can be implemented with GraphNode.inline
Member:

Should we add an issue for this?


@fjetter (Member Author) commented Sep 5, 2024

FYI I'm currently working on removing the Sequence/Dict classes. I suspect there could be a perf gain. I'll report back once it is stable. Either way, we can move forward with this: without any further changes in distributed or in this repo, this change alone is pretty harmless.

@fjetter (Member Author) commented Sep 6, 2024

I have a follow-up ready that addresses all of the above review comments and removes the sequence/dict tasks. To keep the review simple, I'll move forward with merging this PR (it will not affect anything yet).



Development

Successfully merging this pull request may close these issues.

Abandon encoded tuples as task definition in dsk graphs
