-
-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
I think this issue has some relation to dask/distributed#2127, where @kkraus14 wants to run certain tasks on CPU/GPU workers. I've also wanted to run tasks on specific workers, or require resources to be exclusive for certain tasks.
Currently, these task dependencies must be specified as additional arguments to compute/persist etc. rather than at the point of actual construction -- embedding resource/worker dependencies in the graph is not currently possible.
To support this, how about adding a TaskAnnotation type? This can be a namedtuple, itself containing nested tuples representing key-value pairs. e.g.
annot = TaskAnnotation(an=(('resource', ('GPU': '1'), ('worker', 'alice')))dask array graphs tend to have the following structure:
dsk = {
(tsk_name, 0) : (fn, arg1, arg2, ..., argn),
(tsk_name, 1) : (fn, arg1, arg2, ..., argn),
}How about embedding annotations within value tuples?
dsk = {
(tsk_name, 0) : (fn, arg1, arg2, ..., argn, annotation1),
(tsk_name, 1) : (fn, arg1, arg2, ..., argn, annotation2),
}If the scheduler discovers an annotation in the tuple, it could remove it from the argument list and attempt to satisfy the requested constraints. In the above example, annotations are placed at the end of the tuple, but the location could be arbitrary and multiple annotations are possible. Alternatively, it might be better to put them at the start.
I realise the above example is somewhat specific to dask arrays (I'm not too familiar with the dataframe and bag collections) so there may be issues I'm not seeing.
One problem I can immediately identify would be modifying existing graph construction functions to support the above annotations (atop/top support is probably the first place to look).