Description
There are a variety of reasons to annotate tasks, including resources like GPUs, memory constraints, retries, worker restrictions, and so on. There are some ways to specify annotations separately from tasks, such as with compute(..., retries=...), but this ends up being awkward.
There have been multiple requests to annotate the tasks themselves, which would make it a bit easier to track annotations and apply them at the point of graph creation. There are at least two issues about this, #3783 and #6054, and an implementation at #6217.
Unfortunately this is hard because our task type, tuple, isn't well set up for extension. Changing to a Task type is possible, but has some performance implications, and would be a large change at the core of the project, and so would need to be done with some care.
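To make the extension problem concrete, here is a minimal sketch in plain Python (no Dask import). It assumes the usual convention that a task is a tuple of a callable followed by its arguments. The `try_annotate` helper and the `Task` wrapper class are hypothetical names used only for illustration; they are not part of Dask and not the design proposed in the linked issues.

```python
from operator import add

# A graph in the tuple convention: keys map to literal values or to
# task tuples of the form (callable, *args).
graph = {"x": 1, "y": (add, "x", 2)}


def try_annotate(task, **annotations):
    """Try to attach annotations as attributes; return True on success."""
    try:
        for name, value in annotations.items():
            setattr(task, name, value)
    except AttributeError:
        # Plain tuples have no place to store extra attributes.
        return False
    return True


class Task:
    """Hypothetical wrapper type that has room for annotations."""

    def __init__(self, func, *args, annotations=None):
        self.func = func
        self.args = args
        self.annotations = dict(annotations or {})


tuple_ok = try_annotate(graph["y"], retries=2)
wrapped = Task(add, "x", 2, annotations={"retries": 2})
```

The tuple rejects the annotation, while the wrapper accepts it. The cost is that `wrapped` is no longer a tuple, so any code that recognizes tasks by checking for tuples would have to change, which is why switching task types is a large change.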
Annotated Layers
An alternative approach would be to annotate high level graph layers, which are easy to modify and are in flux right now, so their design is still easy to change. Layers may also side-step some of the performance concerns.
This also has some limitations, but I think that most people asking for this feature might be ok with layer-based annotations.
Current work with layers
We're currently working to include all graph layers in Layer(Mapping) subclasses, and communicate these layers directly to the scheduler. This gives us a nice conduit of potentially richer information. These will be applied universally across all major Dask collections maintained within the dask/dask repository.
API
I'm going to suggest that we recommend using context managers for annotations like the following:
```python
import dask
import dask.array as da

x = da.ones(10)
y = da.ones(10)
with dask.annotate(priority=1, retries=2):
    z = x + y  # layers created inside the block pick up the annotations
```

The Layer.__init__ method would look at some global state for annotations, and apply those onto the layer on construction. Any layer made within the context block would be affected.
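The mechanism described here (a context manager that sets global state, and a layer constructor that reads it) can be sketched in plain Python without Dask. Everything below is a hypothetical illustration rather than the actual implementation: the module-level `_current_annotations` variable, this stand-in `annotate` function, and the toy `Layer` class are all assumptions. Merging nested contexts so that inner values override outer ones is also an assumption; the proposal does not specify it.

```python
from collections.abc import Mapping
from contextlib import contextmanager

# Hypothetical global state holding the annotations currently in effect.
_current_annotations = {}


@contextmanager
def annotate(**annotations):
    """Stand-in for the proposed context manager (not Dask's own code)."""
    global _current_annotations
    previous = _current_annotations
    # Assumption: nested blocks merge, with inner values taking precedence.
    _current_annotations = {**previous, **annotations}
    try:
        yield
    finally:
        # Restore the outer state even if the block raised.
        _current_annotations = previous


class Layer(Mapping):
    """Toy layer that snapshots the active annotations on construction."""

    def __init__(self, tasks):
        self._tasks = dict(tasks)
        self.annotations = dict(_current_annotations)

    def __getitem__(self, key):
        return self._tasks[key]

    def __iter__(self):
        return iter(self._tasks)

    def __len__(self):
        return len(self._tasks)


outside = Layer({"a": 1})
with annotate(priority=1, retries=2):
    inside = Layer({"b": 2})
    with annotate(retries=5):
        nested = Layer({"c": 3})
after = Layer({"d": 4})
```

Because each layer copies the global state when it is constructed, only layers built inside the block carry the annotations, and leaving the block restores the previous state. A real implementation would also need to decide how this state behaves across threads, which this sketch ignores.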
Limitations
I think that it's not yet clear what we would do with Delayed. Delayed does currently use HighLevelGraphs, but we're a bit sensitive here on performance grounds, just because there would be a separate layer per task, and overheads might creep up a little here.