Skip to content

Layer Annotations#6767

Merged
mrocklin merged 15 commits intodask:masterfrom
sjperkins:layer-annotations
Nov 5, 2020
Merged

Layer Annotations#6767
mrocklin merged 15 commits intodask:masterfrom
sjperkins:layer-annotations

Conversation

@sjperkins
Copy link
Member

  • Tests added / passed
  • Passes black dask / flake8 dask

@sjperkins sjperkins marked this pull request as draft October 26, 2020 17:28
@sjperkins sjperkins mentioned this pull request Oct 26, 2020
@mrocklin
Copy link
Member

mrocklin commented Oct 26, 2020 via email

@sjperkins
Copy link
Member Author

@mrocklin I got delayed while preparing this: https://www.radiocamera.io/seminars/dask ;-)

Looking forward to plugging this into distributed once dask/distributed#4140 is merged.

@sjperkins
Copy link
Member Author

/cc @madsbk

* Pass Mapping key views into LayerAnnotations
* Convert LayerAnnotations sets to keys during pickling
* Warn on MapLayerAnnotations, due to remote execution issues.
@sjperkins sjperkins marked this pull request as ready for review October 27, 2020 19:51
@sjperkins
Copy link
Member Author

I'd say this PR is ready for review.

@mrocklin
Copy link
Member

Thanks for starting this @sjperkins

My initial reaction is that this seems somewhat verbose. I wonder if there are ways that we can simplify the implementation and the UI.

For the stack, I wonder if it would make sense to reuse the existing dask.config functionality

@contextmanager
def annotate(**kwargs):
    with dask.config.set({"annotations": kwargs}):
        yield

I suspect that this doesn't work directly due to nesting. It looks like it might work though if we change around keys a bit. Here is a brief example

In [5]: with dask.config.set({"a.b": 1}):
   ...:     with dask.config.set({"a.c": 2}):
   ...:         print(dask.config.get("a"))
   ...: 
   ...: 
{'b': 1, 'c': 2}

In [6]: with dask.config.set({"a": {"b": 1}}):
   ...:     with dask.config.set({"a": {"c": 2}}):
   ...:         print(dask.config.get("a"))
   ...: 
   ...: 
{'c': 2}

To avoid the three classes of annotations, I'm inclined to just leave the values attached to the Layer as-is, and not worry too much about making a taxonomy just yet. This seems premature.

Also, regarding tests, I would love to see tests only use user facing API. This both helps to communicate intent to other devs, and also lets future devs modify the implementation and still get the benefit of the existing test suite. To that end, I would expect tests to look a lot simpler, maybe something like the following:

def test_annotations():
    with dask.annotate(a=1):
        x = da.ones(10)

        with dask.annotate(b=toolz.second):
            y = x + 1

    assert y.__dask_graph__().layers[x.name].annotations == {"a": 1}
    assert y.__dask_graph__().layers[y.name].annotations == {"a": 1, "b": toolz.second}

In writing this test I'm actually a little sad about having to go into the implementation details of the HLG. I'll be more excited about testing this when I can bring up the distributed scheduler and use priorities and things. But at least I'm not constructing layers by hand or dealing with custom Annotation subclasses.

Thoughts on any of this? I tend to be more allergic to building systems than most developers, so I might be overreacting here.

@sjperkins
Copy link
Member Author

I'm happy to use dask.config.

Thanks for starting this @sjperkins

My initial reaction is that this seems somewhat verbose. I wonder if there are ways that we can simplify the implementation and the UI.

Also, regarding tests, I would love to see tests only use user facing API. This both helps to communicate intent to other devs, and also lets future devs modify the implementation and still get the benefit of the existing test suite. To that end, I would expect tests to look a lot simpler, maybe something like the following:

In writing this test I'm actually a little sad about having to go into the implementation details of the HLG. I'll be more excited about testing this when I can bring up the distributed scheduler and use priorities and things. But at least I'm not constructing layers by hand or dealing with custom Annotation subclasses.

Thoughts on any of this? I tend to be more allergic to building systems than most developers, so I might be overreacting here.

In general I agree with you, and I've reworked the tests to use the high-level Arrays user interfaces. The user would normally never see these objects.

To avoid the three classes of annotations, I'm inclined to just leave the values attached to the Layer as-is, and not worry too much about making a taxonomy just yet. This seems premature.

Are you suggesting something like: Layer.annotations == {"a": 1, "b": func}

The reason I've refrained from allowing callables within annotations is due to remote execution security concerns as per these comments: #6701 (comment) and #6701 (comment). This is actually one of the main reasons for the LayerAnnotation classes: To disallow this while attempting to provide some sort of compression of annotations for transfer to the scheduler.

Would the expansion of annotations be more appropriate on the Layer class itself? I'm happy to experiment.

@mrocklin
Copy link
Member

mrocklin commented Nov 4, 2020

Are you suggesting something like: Layer.annotations == {"a": 1, "b": func}

Yes.

The reason I've refrained from allowing callables within annotations is due to remote execution security concerns as per these comments: #6701 (comment) and #6701 (comment). This is actually one of the main reasons for the LayerAnnotation classes: To disallow this while attempting to provide some sort of compression of annotations for transfer to the scheduler.

I think that the system that consumes the collections and sends along the annotations can handle this for us before it gets to the scheduler.

Code-wise I think that Dask core should not be prescriptive about how we represent annnotations. Instead, I think that we will handle this in Client._graph_to_futures or something similar. My guess is that that method will take these annotations and convert them all to either single elements or dictionaries. So we would call the function in the dask.distributed codebase, but not here.

@sjperkins
Copy link
Member Author

Are you suggesting something like: Layer.annotations == {"a": 1, "b": func}

Yes.

The reason I've refrained from allowing callables within annotations is due to remote execution security concerns as per these comments: #6701 (comment) and #6701 (comment). This is actually one of the main reasons for the LayerAnnotation classes: To disallow this while attempting to provide some sort of compression of annotations for transfer to the scheduler.

I think that the system that consumes the collections and sends along the annotations can handle this for us before it gets to the scheduler.

Code-wise I think that Dask core should not be prescriptive about how we represent annnotations. Instead, I think that we will handle this in Client._graph_to_futures or something similar. My guess is that that method will take these annotations and convert them all to either single elements or dictionaries. So we would call the function in the dask.distributed codebase, but not here.

OK, that makes things easier, I'll make those changes.

dask/core.py Outdated


@contextmanager
def annotate(**kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we could us a docstring here. My guess is that this will be the most commonly read piece of information about annotations. https://docs.dask.org/en/latest/develop.html#docstrings

In particular, I suspect that it will be useful to include a couple of common examples, perhaps with special values from dask.distributed like priority and retries`

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I recommend that we move this to base.py. I think that core.py is mostly based around direct graph manipulation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might also allow us to do from . import config, which might allow us to avoid the underscore namespacing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to base.py

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some documentation. More could be added when HLG transmission to the scheduler is implemented, especially in relation to SchedulerPlugins. It would probably be good to mention the current annotation taxonomy understood by the distributed scheduler (priority, retry, worker, resource).

I've refrained at this point because this isn't yet functional. What do you think?


alayer = A.__dask_graph__().layers[A.name]
assert alayer.annotations == annotation
assert dask.config.get("annotations", None) is None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to verify that a layer created outside of the context manager doesn't have annotations

Suggested change
assert dask.config.get("annotations", None) is None
assert dask.config.get("annotations", None) is None
b = A + 1
assert not b.__dask_graph__().layers[b.name].annotations

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this produces a Blockwise object for the layer (not a BasicLayer)

B = A + 1
assert isinstance(B.__dask_graph__().layers[B.name], Blockwise)

The BasicLayer does live in Blockwise's _cached_dict attribute, but this feels like digging into the implementation a bit too far?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a test_no_annotations test case instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would hope that all Layer classes support annotations, not just BasicLayer

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move annotation attachment to Layer.__init__

Comment on lines +255 to +256
annotations = config.get("annotations", None)
self.annotations = None if annotations is None else annotations.copy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should move to a new Layer.__init__ function and all layer subclasses should call super? cc also @madsbk

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, thoughts on this? It's a purely aesthetic change and not very important.

Suggested change
annotations = config.get("annotations", None)
self.annotations = None if annotations is None else annotations.copy()
self.annotations = copy.copy(config.get("annotations", None))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should move to a new Layer.__init__ function and all layer subclasses should call super? cc also @madsbk

That is fine with me.

Copy link
Member

@mrocklin mrocklin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I made a couple of tiny suggestions on the docstring, but everything here seems both simple and powerful to me.

--------

All tasks within array A should have priority 100 and be retried 3 times
on failure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
on failure.
on failure.

dask/base.py Outdated
All tasks within array A should have priority 100 and be retried 3 times
on failure.
>>> with dask.annotate(priority=100, retries=3):
>>> A = da.ones((10000, 10000))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
>>> A = da.ones((10000, 10000))
... A = da.ones((10000, 10000))

dask/base.py Outdated

>>> nblocks = (10, 10)
>>> with dask.annotate(priority=lambda k: k[1]*nblocks[1] + k[2]):
>>> A = da.ones((1000, 1000), chunks=(100, 100))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
>>> A = da.ones((1000, 1000), chunks=(100, 100))
... A = da.ones((1000, 1000), chunks=(100, 100))

@sjperkins
Copy link
Member Author

This looks good to me. I made a couple of tiny suggestions on the docstring, but everything here seems both simple and powerful to me.

Thanks for the review @mrocklin . I've fixed the docstrings

@dhirschfeld
Copy link

Just a by-the-by, if you use IPython, the doctest_mode can be very handy for creating examples for copy/pasting:

In [1]: %doctest_mode
Exception reporting mode: Plain
Doctest mode is: ON
>>> for i in range(5):
...     print(i)
...
0
1
2
3
4
>>> %doctest_mode
Exception reporting mode: Context
Doctest mode is: OFF

In [4]:

@mrocklin
Copy link
Member

mrocklin commented Nov 4, 2020

Ooh, that is nice.

@mrocklin
Copy link
Member

mrocklin commented Nov 5, 2020

This looks great to me. Merging. Thanks for your continued work on this @sjperkins ! It's nice to get something in after all of that effort :)

For my part, I'm glad that we were able to do this this cleanly :)

@mrocklin mrocklin merged commit 7968e85 into dask:master Nov 5, 2020
@sjperkins sjperkins deleted the layer-annotations branch November 5, 2020 06:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants