Preserve HighLevelGraphs in DataFrame.from_delayed #8174
jrbourbeau merged 2 commits into dask:main
Nice - Thanks @gjoseph92!
Extra thanks!!! I believe this closes #7851
Thanks Gabe! 😄 cc @quasiben (for awareness)
- dsk = merge(df.dask for df in dfs)
+ dsk = {}
I am wondering about this change in the context of issue #8292.
I think this is okay, given that we're now including the graphs in dfs further down in HighLevelGraph.from_collections.
Right, that's the question. Are we missing things that would have been here before? Chris' issue suggests the answer is maybe.
I have run into strange problems before with graphs that mix array/dataframe. (e.g., #7545). Could that be at issue here?
> Are we missing things that would have been here before?
I don't think we are, unless HighLevelGraph.from_collections(name, {}, dfs) does not ultimately produce the same graph as merge(df.dask for df in dfs). My gut feeling here is that there's a HighLevelGraph bug that we simply were sidestepping before by doing the low-level merge.
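The equivalence in question can be checked on a small case. The following is a minimal sketch, not the code from this PR: it uses two `dask.delayed` objects as stand-ins for `dfs`, and the layer name `"combined"` and the helper `inc` are hypothetical names chosen for illustration. It compares only the key sets of the two graphs, assuming a dask version in which delayed objects carry a HighLevelGraph.

```python
import dask
from dask.highlevelgraph import HighLevelGraph
from toolz import merge


@dask.delayed
def inc(x):
    # Hypothetical helper used only to build two small graphs.
    return x + 1


# Two independent collections standing in for the `dfs` inputs.
parts = [inc(1), inc(2)]

# Old approach: materialize every input graph and merge them as plain dicts.
low_level = merge(dict(p.__dask_graph__()) for p in parts)

# New approach: start from an empty layer and pass the inputs as
# dependencies, so their existing layers are carried over, not flattened.
# "combined" is a hypothetical layer name chosen for this sketch.
hlg = HighLevelGraph.from_collections("combined", {}, dependencies=parts)

# Compare the materialized key sets of both graphs.
same_keys = set(hlg.keys()) == set(low_level.keys())
print(same_keys)
print(list(hlg.layers))
```

If the key sets differ for some combination of inputs (for example, graphs mixing array and dataframe layers), that would point at the HighLevelGraph side rather than at the low-level merge.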
We're currently materializing all HighLevelGraphs of the inputs and merging them as plain dicts. This is inefficient, and it loses the potential for HLG optimization when round-tripping from DataFrame -> delayed -> DataFrame.
cc @rjzamora
black dask / flake8 dask / isort dask