I think it would be worth adding optional lightweight classes to represent keys and tasks in a dask graph. These would complement the existing dask.core.quote for literals.
This would allow for much clearer intent when creating dask graphs, and better error messages when things go wrong (e.g., for #2298), because dask could know unambiguously what an object is intended to represent without needing to guess about what it is. For example, if a key is not found, dask could raise an error instead of using it as a literal.
These could be simple tuple subclasses, e.g.,
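```python
class Key(tuple):
    __slots__ = ()

    def __new__(cls, *args):
        # Use cls rather than Key so that subclasses construct correctly
        return tuple.__new__(cls, args)

    def __repr__(self):
        contents = repr(tuple(self))
        if len(self) == 1:
            # Drop the trailing comma of a 1-tuple: ('x',) -> ('x')
            contents = contents[:-len(',)')] + ')'
        return 'Key{}'.format(contents)
```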
The Task class could automatically handle **kwargs in the proper fashion, e.g., Task(pd.read_csv, filename, sep='\t').
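As a rough sketch of what that could look like: the `(func, args, kwargs)` storage layout, the `func`/`args`/`kwargs` properties, and the `run` helper below are all assumptions for illustration, not an existing dask API.

```python
class Task(tuple):
    """Hypothetical task wrapper stored as (func, args, sorted kwargs items)."""
    __slots__ = ()

    def __new__(cls, func, *args, **kwargs):
        if not callable(func):
            raise TypeError('Task expects a callable, got {!r}'.format(func))
        # Store kwargs as a sorted tuple of items so the task stays a plain
        # tuple and keyword order does not affect equality.
        return tuple.__new__(cls, (func, args, tuple(sorted(kwargs.items()))))

    @property
    def func(self):
        return self[0]

    @property
    def args(self):
        return self[1]

    @property
    def kwargs(self):
        return dict(self[2])

    def __repr__(self):
        parts = [getattr(self.func, '__name__', repr(self.func))]
        parts.extend(repr(a) for a in self.args)
        parts.extend('{}={!r}'.format(k, v) for k, v in self[2])
        return 'Task({})'.format(', '.join(parts))

    def run(self):
        # Call the function with the stored arguments as-is (no key lookup).
        return self.func(*self.args, **self.kwargs)
```

With this layout, keyword arguments travel with the task instead of needing a wrapper such as `functools.partial` or an explicit `(apply, func, args, kwargs)` tuple.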
This is more verbose than using Python builtins, but not onerously so. E.g., adapting the "Custom Graphs" example from the docs:
```python
from dask import Task, Key  # the API proposed in this issue
...
dsk = {'load-1': Task(load, 'myfile.a.data'),
       'load-2': Task(load, 'myfile.b.data'),
       'load-3': Task(load, 'myfile.c.data'),
       'clean-1': Task(clean, Key('load-1')),
       'clean-2': Task(clean, Key('load-2')),
       'clean-3': Task(clean, Key('load-3')),
       'analyze': Task(analyze, [Key('clean-%d' % i) for i in [1, 2, 3]]),
       'store': Task(store, Key('analyze'))}
```

Possibly, we would want a "strict evaluation" mode that requires all tasks and keys to be wrapped in the appropriate classes, and switches the default interpretation for everything else to be a literal. Think of this as "strong typing" for dask.
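To make the strict interpretation concrete, here is a minimal, unoptimized sketch of how such an evaluator might behave. Everything in it is hypothetical: `strict_eval` is not a dask function, the `Key`/`Task` classes are pared-down stand-ins for the ones proposed above, and I am assuming that `Key('x')` refers to the graph key `'x'`, that `Key('x', 0)` refers to `('x', 0)`, and that lists are still traversed.

```python
class Key(tuple):
    """Pared-down stand-in for the proposed Key class."""
    __slots__ = ()

    def __new__(cls, *args):
        return tuple.__new__(cls, args)


class Task(tuple):
    """Pared-down stand-in for the proposed Task class."""
    __slots__ = ()

    def __new__(cls, func, *args, **kwargs):
        return tuple.__new__(cls, (func, args, tuple(sorted(kwargs.items()))))


def _graph_key(key):
    # Assumption: Key('x') names 'x', while Key('x', 0) names ('x', 0).
    return key[0] if len(key) == 1 else tuple(key)


def strict_eval(dsk, expr, cache=None):
    """Evaluate expr against dsk; only Key and Task are interpreted."""
    cache = {} if cache is None else cache
    if isinstance(expr, Key):
        name = _graph_key(expr)
        if name not in dsk:
            # A missing key is an error, never a fallback to a literal.
            raise KeyError('Key {!r} not found in graph'.format(name))
        if name not in cache:
            cache[name] = strict_eval(dsk, dsk[name], cache)
        return cache[name]
    if isinstance(expr, Task):
        func, args, kwargs = expr
        args = [strict_eval(dsk, a, cache) for a in args]
        kwargs = {k: strict_eval(dsk, v, cache) for k, v in kwargs}
        return func(*args, **kwargs)
    if isinstance(expr, list):
        return [strict_eval(dsk, e, cache) for e in expr]
    # Everything else, including plain strings and tuples, is a literal.
    return expr
```

Under these rules `Task(len, 'a')` measures the string `'a'` even if the graph has a key named `'a'`, while `Task(len, Key('a'))` looks the key up and raises `KeyError` if it is missing.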
I think this would be really valuable for library code, such as the existing dask collections.