Skip to content

Preserve HighLevelGraphs in DataFrame.from_delayed#8174

Merged
jrbourbeau merged 2 commits intodask:mainfrom
gjoseph92:dataframe-from-delayed-preserve-hlg
Sep 24, 2021
Merged

Preserve HighLevelGraphs in DataFrame.from_delayed#8174
jrbourbeau merged 2 commits intodask:mainfrom
gjoseph92:dataframe-from-delayed-preserve-hlg

Conversation

@gjoseph92
Copy link
Collaborator

We're currently materializing all HighLevelGraphs of the inputs and merging them as plain dicts. This is both inefficient, and loses the potential for HLG optimization when roundtripping from DataFrame -> delayed -> DataFrame.

cc @rjzamora

  • Tests added / passed
  • Passes black dask / flake8 dask / isort dask

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gjoseph92!

@jrbourbeau jrbourbeau merged commit 9460bc4 into dask:main Sep 24, 2021
@rjzamora
Copy link
Member

Nice - Thanks @gjoseph92 !

@chrisroat
Copy link
Contributor

Extra thanks!!! I believe this closes #7851

@jakirkham
Copy link
Member

Thanks Gabe! 😄

cc @quasiben (for awareness)

Comment on lines -614 to +615
dsk = merge(df.dask for df in dfs)
dsk = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am wondering about this change in the context of issue ( #8292 )

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right that's the question. Are we missing things that would have been here before? Chris' issue suggests this is a maybe

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have run into strange problems before with graphs that mix array/dataframe. (e.g., #7545). Could that be at issue here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we missing things that would have been here before?

I don't think we are, unless HighLevelGraph.from_collections(name, {}, dfs) does not ultimately produce the same graph as merge(df.dask for df in dfs). My gut feeling here is that there's a HighLevelGraph bug that we simply were sidestepping before by doing the low-level merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants