Skip to content

Map on HighLevelGraph Layers#6689

Merged
jrbourbeau merged 14 commits intodask:masterfrom
madsbk:hlg_map_on_layers
Oct 13, 2020
Merged

Map on HighLevelGraph Layers#6689
jrbourbeau merged 14 commits intodask:masterfrom
madsbk:hlg_map_on_layers

Conversation

@madsbk
Copy link
Contributor

@madsbk madsbk commented Sep 30, 2020

This PR introduces two HighLevelGraph methods:

    def map_basic_layers(
        self, func: Callable[[BasicLayer], Mapping]
    ) -> "HighLevelGraph":
        """Map `func` on each basic layer and returns a new high level graph.
        `func` should take a BasicLayer as input and return a new Mapping as output
        and **cannot** change the dependencies between Layers.
        If `func` returns a non-BasicLayer type, it will be wrapped in a `BasicLayer`
        object automatically.

        Parameters
        ----------
        func : callable
            The function to call on each BasicLayer

        Returns
        -------
        hlg : HighLevelGraph
            A high level graph containing the transformed BasicLayers and the other
            Layers untouched
        """

    def map_tasks(self, func: Callable[[Iterable], Iterable]) -> "HighLevelGraph":
        """Map `func` on all tasks and returns a new high level graph.
        `func` should take an iterable of the tasks as input and return a new
        iterable as output and **cannot** change the dependencies between Layers.

        Warning
        -------
        A layer is allowed to ignore the map on tasks that are part of its internals.
        For instance, Blockwise will only invoke `func` on the input literals.

        Parameters
        ----------
        func : callable
            The function to call on tasks

        Returns
        -------
        hlg : HighLevelGraph
            A high level graph containing the transformed tasks
        """

The motivation is to avoid materialization of high level graphs in Distributed: dask/distributed#4119

  • Tests added / passed
  • Passes black dask / flake8 dask

@madsbk madsbk marked this pull request as ready for review September 30, 2020 18:48
@madsbk madsbk changed the title Map on HighLevelGraph Layers [REVIEW] Map on HighLevelGraph Layers Oct 1, 2020
@mrocklin
Copy link
Member

mrocklin commented Oct 2, 2020

In principle this seems fine to me. At first I was thinking that maybe we should make a map_layers method instead, or make these private at first. I think that I'm inclined instead to mark layers generally as experimental and go ahead with things as they are.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work here @madsbk! Overall this looks good, I've left a few comments below

layers[k] = layer
else:
layers[k] = v
return HighLevelGraph(layers, self.dependencies)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also pass self.key_dependencies through here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In map_basic_layers() we cannot assume that key dependencies doesn't changes, func is allowed to modify both keys and tasks.
As an optimization, we could allow the user to provide a new key_dependencies but let's save that for another PR.

Comment on lines +103 to +106
if use_layer_map_task:
# Overwrite Blockwise.map_tasks with Layer.map_tasks
blockwise_layer = list(dsk.layers.values())[1]
blockwise_layer.map_tasks = partial(Layer.map_tasks, blockwise_layer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little confused as to what we're trying to test here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the description to more descriptive, hope it makes more sense?

    if use_layer_map_task:
        # In order to test the default map_tasks() implementation on a Blockwise Layer,
        # we overwrite Blockwise.map_tasks with Layer.map_tasks
        blockwise_layer = list(dsk.layers.values())[1]
        blockwise_layer.map_tasks = partial(Layer.map_tasks, blockwise_layer)

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @madsbk! Overall this looks good to me. I've left one small comment about a simplification we can make and another about a potential API change


return HighLevelGraph(ret_layers, ret_dependencies, ret_key_deps)

def map_basic_layers(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API question: What do you think about having a more generic map_layers method where the func being mapped is responsible for checking the layer type? This would be to prevent us from needing to add map_*_layers methods as the number of layer types grows.

Apologies for not bringing this point up earlier in the process

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My sense here is that this probably makes sense, but that it also probably doesn't need to be handled right now. The API choices here don't preclude us adding intermediate methods like map_layers later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, we can revisit this point later on as needed 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API question: What do you think about having a more generic map_layers method where the func being mapped is responsible for checking the layer type? This would be to prevent us from needing to add map_*_layers methods as the number of layer types grows.

Make sense, my thinking behind map_basic_layers() was to have a map for the layers that are materialized low level graphs. And if the user wants to apply something on all layers, it is always possible to access the layers of a HLG directly.

But I agree, let's wait a bit before we settle on the API.

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for all your work here @madsbk! Planning to merge this once CI passes

@jrbourbeau jrbourbeau changed the title [REVIEW] Map on HighLevelGraph Layers Map on HighLevelGraph Layers Oct 13, 2020
@jrbourbeau jrbourbeau merged commit 6c71a42 into dask:master Oct 13, 2020
kumarprabhu1988 pushed a commit to kumarprabhu1988/dask that referenced this pull request Oct 29, 2020
@madsbk madsbk deleted the hlg_map_on_layers branch February 16, 2021 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants