Avoid reading _metadata on every worker by rjzamora · Pull Request #6017 · dask/dask

rjzamora · 2020-03-16T23:19:25Z

~~Closes~~ Partially addresses #5842

For FastParquetEngine, we are currently re-reading the "_metadata" file for every partition in many cases. This PR will avoid doing so whenever possible.
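To illustrate the cost being described, here is a minimal sketch that counts metadata reads under two strategies. Everything in it (`FakeMetadataStore`, `read_partition_rereading`, `read_partition_from_part`) is a hypothetical stand-in, not dask or fastparquet code; it only models the idea of parsing the shared metadata once and handing each task the descriptor it needs.

```python
# Hypothetical stand-ins: none of these names come from dask or fastparquet.
class FakeMetadataStore:
    """Counts how many times the shared "_metadata" content is parsed."""

    def __init__(self, row_groups):
        self._row_groups = row_groups
        self.reads = 0

    def read_metadata(self):
        self.reads += 1
        return list(self._row_groups)


def read_partition_rereading(store, index):
    # Each task re-parses the shared metadata before picking its row group.
    return store.read_metadata()[index]


def read_partition_from_part(part):
    # Each task receives only the descriptor it needs; no metadata read.
    return part


row_groups = [{"path": f"part.{i}.parquet", "num_rows": 100} for i in range(4)]

# Strategy A: one metadata read per partition.
store_a = FakeMetadataStore(row_groups)
result_a = [read_partition_rereading(store_a, i) for i in range(4)]

# Strategy B: one metadata read in total, done while building the task list.
store_b = FakeMetadataStore(row_groups)
parts = store_b.read_metadata()
result_b = [read_partition_from_part(p) for p in parts]

print("reads with re-reading:", store_a.reads)
print("reads with parts passed in:", store_b.reads)
```

Both strategies return the same descriptors; only the number of metadata reads differs.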

Tests added / passed
Passes black dask / flake8 dask


rjzamora · 2020-03-17T03:33:07Z

@ig248 - Note that this PR addresses some of your suggestions from #5842 (although the current changes seem most helpful for non-partitioned datasets, since the "_metadata" file should no longer be re-read at all).

mrocklin · 2020-03-17T15:36:05Z

cc @martindurant if you have a moment to review

martindurant · 2020-03-17T15:37:58Z

Is this still WIP, or are you ready for me to have a look, @rjzamora ?

rjzamora · 2020-03-17T15:44:27Z

Thanks @martindurant! I am not planning to add anything here today, so a review at your convenience is certainly appreciated :)

martindurant

Glad to see it passing! Do we have benchmarks?

Unfortunately, covering every case means regaining some of the complexity that the original refactor alleviated :|

martindurant · 2020-03-17T19:25:36Z

dask/dataframe/io/parquet/fastparquet.py

        # Create `parts`
        # This is a list of row-group-descriptor dicts, or file-paths
        # if we have a list of files and gather_statistics=False
+        base_path = (base_path or "") + fs.sep


If we're doing direct path manipulations, we should maybe start an HTTP server and test against it
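As a starting point for such a test, the sketch below shows how the concatenation in the diff behaves for a few base paths. `join_base` is a hypothetical helper, and `sep` stands in for `fs.sep`, assumed here to be `"/"`; how the PR uses `base_path` afterwards is not shown in this excerpt.

```python
def join_base(base_path, name, sep="/"):
    # Mirrors `(base_path or "") + fs.sep` from the diff, then appends a file name.
    prefix = (base_path or "") + sep
    return prefix + name


# A few inputs a test might want to cover, including a URL-style base.
for base in (None, "", "data", "data/", "http://host/data"):
    print(repr(base), "->", join_base(base, "part.0.parquet"))
```

An empty or missing base yields a path that starts with the separator, and a base that already ends with the separator yields a doubled one, so both cases seem worth asserting on explicitly.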

martindurant · 2020-03-17T19:26:48Z

dask/dataframe/io/parquet/fastparquet.py

+                # a "_metadata" file for the worker to read.
+                # Therefore, we need to pass the pf object in
+                # the task graph
+                pf_deps = pf


There was some code to strip down the pf instance to avoid thrift serialisation costs. Is that still happening?

martindurant · 2020-03-17T19:27:52Z

dask/dataframe/io/parquet/fastparquet.py

        for i, piece in enumerate(partsin):
-            if pf and not fast_metadata:
-                for col in piece.columns:
-                    col.meta_data.statistics = None


I think this is what I was referring to above, making the pf instance smaller

rjzamora · 2020-03-17

Ah - Right! That should certainly stay in.
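For context on why the stripping matters, here is a rough sketch of the effect on serialized size. The dataclasses are hypothetical stand-ins for the thrift objects (only the attribute names `columns`, `meta_data`, `statistics`, and `encoding_stats` are taken from the diff), and `pickle` is used as an assumed proxy for whatever serialization the task graph goes through.

```python
import pickle
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical stand-ins for the row-group metadata objects.
@dataclass
class ColumnMeta:
    statistics: Optional[dict] = None
    encoding_stats: Optional[list] = None


@dataclass
class ColumnChunk:
    meta_data: ColumnMeta


@dataclass
class RowGroup:
    columns: List[ColumnChunk] = field(default_factory=list)


def make_row_groups(n_groups=10, n_cols=5):
    # Build row groups whose columns all carry made-up statistics.
    return [
        RowGroup(
            columns=[
                ColumnChunk(
                    ColumnMeta(
                        statistics={"min": 0, "max": 10**6, "null_count": 0},
                        encoding_stats=[("PLAIN", 1)],
                    )
                )
                for _ in range(n_cols)
            ]
        )
        for _ in range(n_groups)
    ]


def strip_statistics(row_groups):
    # Same loop shape as the removed lines in the diff: drop per-column stats.
    for piece in row_groups:
        for col in piece.columns:
            col.meta_data.statistics = None
            col.meta_data.encoding_stats = None
    return row_groups


row_groups = make_row_groups()
size_before = len(pickle.dumps(row_groups))
size_after = len(pickle.dumps(strip_statistics(row_groups)))
print("bytes before:", size_before, "bytes after:", size_after)
```

The absolute numbers are meaningless for real parquet metadata; the point is only that clearing the statistics fields shrinks what has to be shipped with each task.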

martindurant · 2020-03-17T19:32:23Z

dask/dataframe/io/parquet/fastparquet.py

-                    col.meta_data.statistics = None
-                    col.meta_data.encoding_stats = None
-            piece_item = i if pf else piece
+            if partitions and fast_metadata:


The three conditions here are a bit hard to parse. Of course, the situation is complicated. Perhaps up front, we should enumerate the cases and label the branches as required:

- a single file
- a directory of files with a _metadata
- a directory of files without _metadata
- stats requested
- stats unnecessary
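One way the enumeration suggested above could look is sketched here. `Layout`, `ReadCase`, and `classify` are hypothetical names, and the detection rules (looking for a `_metadata` basename, treating a single path as a single file) are simplifying assumptions rather than the engine's actual logic.

```python
import posixpath
from enum import Enum
from typing import List, NamedTuple


class Layout(Enum):
    SINGLE_FILE = "a single file"
    DIR_WITH_METADATA = "a directory of files with a _metadata"
    DIR_WITHOUT_METADATA = "a directory of files without _metadata"


class ReadCase(NamedTuple):
    layout: Layout
    stats_requested: bool


def classify(paths: List[str], gather_statistics: bool) -> ReadCase:
    """Label the situation once, up front, so later branches can name it."""
    names = [posixpath.basename(p) for p in paths]
    if "_metadata" in names:
        layout = Layout.DIR_WITH_METADATA
    elif len(paths) == 1:
        layout = Layout.SINGLE_FILE
    else:
        layout = Layout.DIR_WITHOUT_METADATA
    return ReadCase(layout, bool(gather_statistics))


case = classify(["data/_metadata", "data/part.0.parquet"], gather_statistics=False)
print(case.layout.name, case.stats_requested)
```

Branches can then test `case.layout` and `case.stats_requested` by name instead of combining several booleans inline.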

martindurant · 2020-03-20T13:26:42Z

Failure is a RuntimeWarning on py38 in test_cov (array?) - perhaps a new compiler version of numpy? In other words, unrelated to this PR.

rjzamora · 2020-03-20T14:04:46Z

> Glad to see it passing! Do we have benchmarks?

Just a note: I will revisit this soonish, but I did not see significant performance improvements from these changes when I last checked. For the partitioned dataset case, we no longer spend much time in _determine_pf_parts, but we still spend a lot of time creating pf for each partition.

martindurant · 2020-04-07T19:00:50Z

Can you please merge from master? I believe things should pass now.

TomAugspurger · 2020-04-22T14:18:15Z

@martindurant did you want to give this another look, or was #6017 (comment) saying this was good?

rjzamora added 5 commits March 16, 2020 12:39

avoid reading _metadata every time (unless working with paritioned da…

156bc5c

…taset)

adding some comments

3b4a778

avoid calls to _analyze_paths and _determine_pf_parts when specific f…

2e410bc

…ile path is known

cleanup some if logic

4adc007

minor style change

a72e546

rjzamora changed the title ~~[WIP] Avoid reading _metadata on every worker~~ Avoid reading _metadata on every worker Mar 17, 2020

martindurant reviewed Mar 17, 2020


rjzamora added 2 commits March 17, 2020 15:11

start addressing code review -- more to do

c8caa98

name cleanup

5482f29

martindurant mentioned this pull request Mar 26, 2020

Fix bugs in _metadata creation and filtering in parquet ArrowEngine #6023

Merged

2 tasks

Merge remote-tracking branch 'upstream/master' into fix-5842

c499a9f

martindurant merged commit 633de3f into dask:master Apr 22, 2020

rjzamora deleted the fix-5842 branch April 22, 2020 14:37

GenevieveBuckley mentioned this pull request Oct 13, 2021

_metadata is re-read for every partition when using fastparquet engine #5842

Closed
