
Avoid reading _metadata on every worker#6017

Merged
martindurant merged 8 commits into dask:master from rjzamora:fix-5842
Apr 22, 2020

Conversation

@rjzamora
Member

@rjzamora rjzamora commented Mar 16, 2020

Partially addresses #5842

For FastParquetEngine, we are currently re-reading the "_metadata" file for every partition in many cases. This PR will avoid doing so whenever possible.

  • Tests added / passed
  • Passes black dask / flake8 dask
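
The idea described above can be sketched as a caching pattern: parse the footer metadata once on the client and embed the parsed object in every partition's task, rather than having each worker re-open and re-parse "_metadata". This is a minimal pure-Python sketch of the pattern only; the function names (`parse_metadata`, `build_tasks_new`) and the JSON stand-in for the thrift footer are illustrative, not dask's actual API:

```python
import json


def parse_metadata(raw: bytes) -> dict:
    # Stand-in for the expensive footer parse (thrift, in real parquet).
    return json.loads(raw)


def build_tasks_old(raw: bytes, n_parts: int):
    # Old behaviour: every partition's task re-parses the raw footer.
    return [(parse_metadata(raw), i) for i in range(n_parts)]


def build_tasks_new(raw: bytes, n_parts: int):
    # New behaviour: parse once, then share the parsed object across
    # all tasks so no worker has to touch "_metadata" again.
    meta = parse_metadata(raw)
    return [(meta, i) for i in range(n_parts)]


raw = json.dumps({"row_groups": [0, 1, 2]}).encode()
tasks = build_tasks_new(raw, 3)
```

With `build_tasks_old`, the parse cost scales with the number of partitions; with `build_tasks_new`, it is paid once and the tasks all reference the same object.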

@rjzamora
Member Author

@ig248 - Note that this PR addresses some of your suggestions from #5842 (although the current changes seem most helpful for non-partitioned datasets, since the "_metadata" file should no longer be re-read at all).

@mrocklin
Member

cc @martindurant if you have a moment to review

@martindurant
Member

Is this still WIP, or are you ready for me to have a look, @rjzamora ?

@rjzamora rjzamora changed the title [WIP] Avoid reading _metadata on every worker Avoid reading _metadata on every worker Mar 17, 2020
@rjzamora
Member Author

rjzamora commented Mar 17, 2020

Thanks @martindurant! I am not planning to add anything here today, so a review at your convenience is certainly appreciated :)

Member

@martindurant martindurant left a comment


Glad to see it passing! Do we have benchmarks?

Unfortunately, covering every case means regaining some of the complexity that the original refactor alleviated :|

# Create `parts`
# This is a list of row-group-descriptor dicts, or file-paths
# if we have a list of files and gather_statistics=False
base_path = (base_path or "") + fs.sep
Member


If we're doing direct path manipulations, we should maybe start a HTTP server and test against it

# a "_metadata" file for the worker to read.
# Therefore, we need to pass the pf object in
# the task graph
pf_deps = pf
Member


There was some code to strip down the pf instance to avoid thrift serialisation costs. Is that still happening?

for i, piece in enumerate(partsin):
    if pf and not fast_metadata:
        for col in piece.columns:
            col.meta_data.statistics = None
Member


I think this is what I was referring to above, making the pf instance smaller

Member Author


Ah - Right! That should certainly stay in.
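
The stripping being discussed can be sketched as follows. This is an illustrative model only: `ColumnMeta` and `RowGroup` are hypothetical stand-ins for fastparquet's thrift metadata objects, and the point is just that nulling out per-column statistics before the metadata object is embedded in the task graph shrinks its serialized size:

```python
import pickle


class ColumnMeta:
    def __init__(self):
        # Per-column statistics and encoding stats can be large;
        # workers reading row groups do not need them.
        self.statistics = {"min": 0, "max": 10**6, "null_count": 0}
        self.encoding_stats = [("PLAIN", 1000)] * 50


class RowGroup:
    def __init__(self):
        self.columns = [ColumnMeta() for _ in range(8)]


def strip_stats(row_groups):
    # Mirrors the loop in the diff: null out statistics before the
    # metadata object is shipped with every task.
    for rg in row_groups:
        for col in rg.columns:
            col.statistics = None
            col.encoding_stats = None
    return row_groups


rgs = [RowGroup() for _ in range(4)]
full = len(pickle.dumps(rgs))
stripped = len(pickle.dumps(strip_stats(rgs)))
```

The stripped payload pickles smaller, which is the serialisation saving the review comment is after.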

col.meta_data.statistics = None
col.meta_data.encoding_stats = None
piece_item = i if pf else piece
if partitions and fast_metadata:
Member


The three conditions here are a bit hard to parse. Of course, the situation is complicated. Perhaps up front, we should enumerate the cases and label the branches as required:

  • a single file
  • a directory of files with a _metadata
  • a directory of files without _metadata
    • stats requested
    • stats unnecessary
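
The labeling being suggested might look like the sketch below: one function that dispatches on the dataset layout, with each branch commented against the enumerated cases. The function name and arguments are hypothetical, not dask's actual signature:

```python
def plan_parts(paths, has_metadata_file, gather_statistics):
    """Dispatch on dataset layout, one labeled branch per case."""
    if len(paths) == 1:
        # Case 1: a single file -- read its own footer directly.
        return "single-file"
    if has_metadata_file:
        # Case 2: a directory of files with a _metadata -- parse it
        # once and reuse it for every partition.
        return "global-metadata"
    if gather_statistics:
        # Case 3a: no _metadata, stats requested -- must open every
        # file footer to collect statistics.
        return "per-file-footers"
    # Case 3b: no _metadata, stats unnecessary -- a plain file list
    # is enough; no footers are read up front.
    return "paths-only"
```

Naming the branches this way keeps the three-condition logic readable even as the cases multiply.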

@martindurant
Member

Failure is a RuntimeWarning on py38 in test_cov (array?) - perhaps a newly compiled version of numpy? In other words, unrelated to this PR.

@rjzamora
Member Author

> Glad to see it passing! Do we have benchmarks?

Just a note: I will revisit this soonish, but I did not see significant performance improvements from these changes when I last checked. For the partitioned dataset case, we no longer spend much time in _determine_pf_parts, but we still spend a lot of time creating pf for each partition.

@martindurant
Member

Can you please merge from master? I believe things should pass now.

@TomAugspurger
Member

@martindurant did you want to give this another look, or was #6017 (comment) saying this was good?
