Skip to content

Track HLG layer name in Delayed#8452

Merged
jsignell merged 10 commits intodask:mainfrom
gjoseph92:delayed/track-layer-name
Dec 20, 2021
Merged

Track HLG layer name in Delayed#8452
jsignell merged 10 commits intodask:mainfrom
gjoseph92:delayed/track-layer-name

Conversation

@gjoseph92
Copy link
Collaborator

As discussed in #8173, it's necessary for Delayed and bag.Item objects to be able to keep track of the HLG layer name they're referencing. When constructed with to_delayed(optimize_graph=False), this layer name will not be the same as the key (array-abcde vs ('array-abcde', 0)).

# NOTE: Layer only used by `Item.from_delayed`, to handle Delayed objects created by other collections.
# e.g.: Item.from_delayed(da.ones(1).to_delayed()[0])
# See Delayed.__init__
self._layer = layer or key
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._layer = layer or key
if layer:
self._layer = layer
elif isinstance(key, tuple):
self._layer = key[0]
else:
self._layer = key

dsk = self.__dask_optimize__(dsk, keys)
return [Delayed(k, dsk) for k in keys]
layer = None
return [Delayed(k, dsk, layer=layer) for k in keys]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this coherent with dask.array above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually feel like setting the layer at all here is a bit unnecessary, since the graphs are always low-level anyway, but sure.

self._length = length

# NOTE: Layer is used by `to_delayed` in other collections, but not in normal Delayed use
self._layer = layer or key
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._layer = layer or key
if layer:
self._layer = layer
elif isinstance(key, tuple):
self._layer = key[0]
else:
self._layer = key

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intentionally didn't do this because I'd rather be explicit about passing the layer name when it's needed, and not make any assumptions about key structure in Delayed. The self._layer not in dsk.layers check below should catch any misuse. Since __init__ should only be used internally, not directly by users, this seemed reasonable to me.

Copy link
Collaborator

@crusaderky crusaderky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are failing

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test_daily_stock_deprecated failure is unrelated to the changes here. xref #8477 which will remove that test altogether.

gjoseph92 added a commit to gjoseph92/dask that referenced this pull request Dec 13, 2021
In dask#8452 I realized that an incorrect pattern had emerged from dask#6510 of including
```python
    if not isinstance(dsk, HighLevelGraph):
        dsk = HighLevelGraph.from_collections(id(dsk), dsk, dependencies=())
```
in optimization functions. Specifically, `id(dsk)` is incorrect as the layer name here. The layer name must match the `.name` of the resulting collection that gets created by `__dask_postpersist__()`, otherwise `__dask_layers__()` on the optimized collection will be wrong. Since `optimize` doesn't know about collections and isn't passed a layer name, the only reasonable thing to do here is to error when given a low-level graph.
This is safe to do for Arrays and DataFrames, since their constructors convert any low-level graphs to HLGs.

This PR doesn't really fix anything—the code path removed should be unused—but it eliminates a confusing pattern that has already wandered its way into other places dask#8316 (comment).
@gjoseph92
Copy link
Collaborator Author

Failing tests now seem to be only #8477, #8480, Windows CI running out of memory spawning subprocesses in test_set_index_consistent_divisions and test_reuse_pool.

@jsignell jsignell merged commit fa15d0c into dask:main Dec 20, 2021
aeisenbarth pushed a commit to aeisenbarth/dask that referenced this pull request Jan 6, 2022
As discussed in dask#8173, it's necessary for `Delayed` and `bag.Item` objects to be able to keep track of the HLG layer name they're referencing. When constructed with `to_delayed(optimize_graph=False)`, this layer name will not be the same as the key (`array-abcde` vs `('array-abcde', 0)`).

Co-authored-by: crusaderky <crusaderky@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

to_delayed(optimize_graph=False) objects often have incorrect __dask_layers__

4 participants