
Use map_partitions (Blockwise) in to_parquet #8487

Merged
rjzamora merged 27 commits into dask:main from rjzamora:dataframe-io-layer-write
Feb 2, 2022

Conversation

@rjzamora
Member

@rjzamora rjzamora commented Dec 14, 2021

This PR adds a new DataFrameOutputLayer and uses it to move the data-writing component of to_parquet into a proper HighLevelGraph Layer class [EDIT: This PR now uses map_partitions (with partition_info supplied by a new BlockIndex(BlockwiseDep) class) in lieu of a new DataFrameOutputLayer]. Note that this PR required the column-projection changes from #8453, because the current behavior in main is actually a bug.

With these changes, the result of ddf.visualize(optimize_graph=True) (from this reproducer) is now:

[Screenshot (2021-12-14): fused task graph from ddf.visualize(optimize_graph=True)]

(The "read-parquet", "rename", "reset-index", and "to-parquet" tasks are all fused into "to-parquet")
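The blockwise-write pattern described above can be sketched with a toy model. All names here (ToyBlockIndex, toy_map_partitions) are hypothetical stand-ins for dask's map_partitions / BlockIndex(BlockwiseDep) machinery, not the real API:

```python
class ToyBlockIndex:
    """Stand-in for a BlockwiseDep that yields each block's index per task."""

    def __init__(self, npartitions):
        self.npartitions = npartitions

    def __getitem__(self, idx):
        # dask's BlockIndex produces the block coordinates as a tuple
        return (idx,)


def toy_map_partitions(func, partitions, *args):
    """Apply ``func`` to each partition, expanding ToyBlockIndex args per block."""
    results = []
    for i, part in enumerate(partitions):
        expanded = [a[i] if isinstance(a, ToyBlockIndex) else a for a in args]
        results.append(func(part, *expanded))
    return results


def write_partition(df_part, block_index):
    # A real implementation would write df_part to storage; here we just
    # return the per-partition file name it would use
    return f"part.{block_index[0]}.parquet"


parts = [["row0"], ["row1"], ["row2"]]
filenames = toy_map_partitions(write_partition, parts, ToyBlockIndex(len(parts)))
```

Because the write step is now an ordinary blockwise function call, upstream blockwise tasks can be fused with it, which is what produces the single fused "to-parquet" tasks in the graph above.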

@rjzamora rjzamora changed the title from "Use DataFrameOutputLayer in to_parquet" to "Use map_partitions (Blockwise) in to_parquet" Dec 20, 2021
@rjzamora rjzamora added the highlevelgraph (Issues relating to HighLevelGraphs.) label Jan 19, 2022
@rjzamora
Member Author

Is anyone interested in reviewing this? :D (cc @jrbourbeau, @jsignell, @ian-r-rose, @gjoseph92)

@gjoseph92 gjoseph92 self-requested a review January 19, 2022 23:48
@ian-r-rose ian-r-rose self-requested a review January 19, 2022 23:57
kwargs_pass,
),
token="to-parquet-"
+ tokenize(
Collaborator

What if you implement __dask_tokenize__ on ToParquetFunctionWrapper? Then I'd imagine the tokenization logic already in map_partitions would just work.

Member Author

If we don't pass in token= explicitly, then map_partitions will produce a layer name of the form: f"{funcname(func)}-{tokenize(func, meta, *args, **kwargs)}". Therefore, we would also need to modify the ToParquetFunctionWrapper name to be "to-parquet". Is it worthwhile to add these definitions to ToParquetFunctionWrapper when we can just establish the name here?

Collaborator

Wait, the token= argument to map_partitions is very much mis-named. It should really be called name:

dask/dask/dataframe/core.py

Lines 6028 to 6037 in a504115

name = kwargs.pop("token", None)
parent_meta = kwargs.pop("parent_meta", None)

assert callable(func)
if name is not None:
    token = tokenize(meta, *args, **kwargs)
else:
    name = funcname(func)
    token = tokenize(func, meta, *args, **kwargs)
name = f"{name}-{token}"

With what you have here, you'll get two tokens on the final name (map_partitions will append one automatically). I think just token="to-parquet" and a __dask_tokenize__ method on ToParquetFunctionWrapper (and BlockwiseDep) is what you want here.
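The double-token problem can be reproduced with a toy version of the naming logic quoted above. Here ``toy_tokenize`` and ``layer_name`` are illustrative stand-ins for dask.base.tokenize and the map_partitions naming code, not the real functions:

```python
import hashlib


def toy_tokenize(*args):
    # Deterministic stand-in for dask.base.tokenize
    return hashlib.md5(repr(args).encode()).hexdigest()[:8]


def layer_name(func_name, args, token=None):
    # Mirrors the quoted map_partitions logic: a trailing hash token is
    # ALWAYS appended, whether or not token= was supplied
    name = token if token is not None else func_name
    return f"{name}-{toy_tokenize(*args)}"


h = toy_tokenize("x")

# Passing a pre-tokenized string duplicates the hash suffix...
double = layer_name("write", ("x",), token="to-parquet-" + h)
# ...whereas a plain prefix yields a single hash suffix
single = layer_name("write", ("x",), token="to-parquet")
```

With the pre-tokenized argument, ``double`` carries two hash segments while ``single`` carries one, which is the duplication being pointed out here.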

Member Author

Wait, the token= argument to map_partitions is very much mis-named. It should really be called name

Agree

With what you have here, you'll get two tokens on the final name

Agree - I didn't realize this before, and I certainly don't like this.

I think just token="to-parquet" and a __dask_tokenize__ method on ToParquetFunctionWrapper (and BlockwiseDep) is what you want here.

If we specify token="to-parquet", then I don't think ToParquetFunctionWrapper.__dask_tokenize__ will ever be used. In order to avoid making changes in map_partitions, I vote that we just define the __repr__ of ToParquetFunctionWrapper to be "to-parquet", and define __dask_tokenize__ (as you suggested).
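The proposal above (a "to-parquet" __repr__ plus __dask_tokenize__) might look roughly like the following. This is a hypothetical, simplified sketch — the attributes are illustrative and this is not the actual ToParquetFunctionWrapper definition:

```python
class ToParquetFunctionWrapper:
    """Illustrative callable wrapper for the per-partition parquet write."""

    def __init__(self, engine, path, fs, kwargs):
        self.engine = engine
        self.path = path
        self.fs = fs
        self.kwargs = kwargs

    def __repr__(self):
        # dask.utils.funcname falls back to str(func) for objects without
        # __name__, so this yields the "to-parquet" layer label
        return "to-parquet"

    def __dask_tokenize__(self):
        # Deterministic state consumed by dask.base.tokenize
        return (self.engine, self.path, self.kwargs)

    def __call__(self, df, block_index):
        # A real implementation would write ``df`` to a partition file
        raise NotImplementedError


wrapper = ToParquetFunctionWrapper(
    "pyarrow", "out.parquet", None, {"compression": "snappy"}
)
```

With these two methods defined, map_partitions' default naming logic would produce a "to-parquet-<token>" layer name without any explicit token= argument.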

@gjoseph92 gjoseph92 left a comment (Collaborator)

Tiny nits, but I think this is ready, right? Just blocked by #8453?

@rjzamora
Member Author

I think this is ready, right? Just blocked by #8453?

Thanks for all the wonderful reviewing @gjoseph92 ! I hadn't really considered #8453 to be a blocker, since the primary purpose of that PR (in my mind) is the creation_info feature. However, I now see that that PR includes changes that we have not really "agreed on," but that have been copied into this PR. Perhaps we should just agree on the pertinent changes here:

  1. Does it make sense for DataFrameLayer to still exist now that it is only acting as a label?
  2. Do we really need to raise a FutureWarning when the "old" behavior of project_columns is encountered?

My interest in leaving DataFrameLayer around for now (1) is that I do intend to use it to store necessary information for multi-layer column projection (and eventually predicate pushdown). For example, we may use that class to define a base required_columns method designed to return the input columns required to produce a specific set of output columns (for that specific Layer). This would be similar to project_columns, but we would be returning a set of column names, rather than a new Layer.

Regarding (2): I have no problem with removing the deprecation warning and just letting the old behavior fail. However, I seem to remember @ian-r-rose telling me that there is at least one DataFrameIOLayer definition in downstream code.
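The required_columns idea described above can be sketched as follows. The API is entirely hypothetical — class and method names here are illustrative, not dask's actual layer classes:

```python
class ToyDataFrameLayer:
    """Illustrative base class for layers that support column projection."""

    def required_columns(self, output_columns):
        # Default: assume a 1:1 column mapping (elementwise layers)
        return set(output_columns)


class ToyRenameLayer(ToyDataFrameLayer):
    """Example layer that renames columns via ``mapping`` (old -> new)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def required_columns(self, output_columns):
        # Map each requested output column back to the input column
        # needed to produce it (unmapped columns pass through unchanged)
        inverse = {new: old for old, new in self.mapping.items()}
        return {inverse.get(col, col) for col in output_columns}


layer = ToyRenameLayer({"a": "x"})
needed = layer.required_columns({"x", "b"})
```

Unlike project_columns, which returns a new Layer, this returns only the set of input column names, which is what would let a multi-layer optimization walk column requirements backward through the graph.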

@gjoseph92
Collaborator

Personally, I'd prefer merging #8453 as a separate PR just to keep the git history cleaner. Nice to be able to refer back to the PR that introduced a change and figure out the intention for it.

@rjzamora
Member Author

@martindurant - Do you know what's going on with the test_parquet.py::test_timestamp96 CI failures? Is this maybe being caused by this PR (or my recent changes)?

- @staticmethod
- def write_metadata(parts, fmd, fs, path, append=False, **kwargs):
+ @classmethod
+ def write_metadata(cls, parts, meta, fs, path, append=False, **kwargs):
Collaborator

Why the change? cls appears unused

Member Author

Yeah - It was just defined as a classmethod in utils.py, so I changed this to agree - but I guess it could be static if cls isn't used.

@martindurant
Member

Do you know what's going on with the test_parquet.py::test_timestamp96

pandas 1.4.0... #8626

@gjoseph92 gjoseph92 left a comment (Collaborator)

@rjzamora nice work, this seems good to me and will be a big improvement for users.

I did just realize we should probably open issues for converting the other to_* methods to use blockwise similarly.

@rjzamora
Member Author

Thanks for the reviews @gjoseph92 !

Just to avoid confusion - I'll probably wait to merge this until after the fastparquet tests are passing on main (and here).

@gjoseph92
Collaborator

@rjzamora looks like tons of parquet tests failed only on windows 3.7. I'd assume this is all flakiness (workers couldn't start or something?), but it's slightly concerning to me that when that happened, the computation succeeded and just returned an unexpected result. I wonder if there's a race condition or something in those tests, and whether we could make them more resilient?

@rjzamora
Member Author

looks like tons of parquet tests failed only on windows 3.7. I'd assume this is all flakiness (workers couldn't start or something?), but it's slightly concerning to me that when that happened, the computation succeeded and just returned an unexpected result. I wonder if there's a race condition or something in those tests, and whether we could make them more resilient?

Yeah - I haven't looked carefully through the windows 3.7 failures since many of those tests have been failing on all PRs for a while, but it does make sense to investigate ways to avoid flakiness and especially "silent" failures.

@gjoseph92 & @jrbourbeau - Is there any concern with merging this particular PR (given that the CI failures are the same here as elsewhere)?

@jsignell
Member

jsignell commented Feb 2, 2022

I am not concerned about merging this PR with failures that match those on main.

@rjzamora
Member Author

rjzamora commented Feb 2, 2022

I am not concerned about merging this PR with failures that match those on main.

Sounds good - I don't think anything is likely to change here, so I'll get it out of the way.
(It will also be nice to get task fusion in to_parquet!)

@rjzamora rjzamora merged commit d98c1dd into dask:main Feb 2, 2022
@rjzamora rjzamora deleted the dataframe-io-layer-write branch February 2, 2022 22:07

Labels

dataframe, highlevelgraph (Issues relating to HighLevelGraphs.), io, parquet


Development

Successfully merging this pull request may close these issues.

Suboptimal graph structure when read-writing a parquet

5 participants