Avoid reading _metadata on every worker#6017
Conversation
|
@ig248 - Note that this PR addresses some of your suggestions from #5842 (although the current changes seem most helpful for non-partitioned datasets, since the |
|
cc @martindurant if you have a moment to review |
|
Is this still WIP, or are you ready for me to have a look, @rjzamora ? |
|
Thanks @martindurant! I am not planning to add anything here today, so a review at your convenience is certainly appreciated :) |
martindurant
left a comment
There was a problem hiding this comment.
Glad to see it passing! Do we have benchmarks?
Unfortunately, covering every case means regaining some of the complexity that the original refactor alleviated :|
| # Create `parts` | ||
| # This is a list of row-group-descriptor dicts, or file-paths | ||
| # if we have a list of files and gather_statistics=False | ||
| base_path = (base_path or "") + fs.sep |
There was a problem hiding this comment.
If we're doing direct path manipulations, we should maybe start a HTTP server and test against it
| # a "_metadata" file for the worker to read. | ||
| # Therefore, we need to pass the pf object in | ||
| # the task graph | ||
| pf_deps = pf |
There was a problem hiding this comment.
There was some code to strip down the pf instance to avoid thrift serialisation costs. Is that still happening?
| for i, piece in enumerate(partsin): | ||
| if pf and not fast_metadata: | ||
| for col in piece.columns: | ||
| col.meta_data.statistics = None |
There was a problem hiding this comment.
I think this is what I was referring to above, making the pf instance smaller
There was a problem hiding this comment.
Ah - Right! That should certainly stay in.
| col.meta_data.statistics = None | ||
| col.meta_data.encoding_stats = None | ||
| piece_item = i if pf else piece | ||
| if partitions and fast_metadata: |
There was a problem hiding this comment.
The three conditions here are a bit hard to parse. Of course, the situation is complicated. Perhaps up front, we should enumerate the cases and label the branches as required:
- a single file
- a directory of files with a _metadata
- a directory of files without _metadata
- stats requested
- stats unnecessary
|
Failure is a RuntimeWarning on py38 in test_cov (array?) - perhaps a new compiler version of numpy? In other words, unrelated to this PR. |
Just a note: I will revisit this soonish, but I did not see significant performance improvements from these changes when I last checked. For the partitioned dataset case, we no longer spend much time in |
|
Can you please merge from master? I believe things should pass now. |
|
@martindurant did you want to give this another look, or was #6017 (comment) saying this was good? |
ClosesPartially addresses #5842For
FastParquetEngine, we are currently re-reading the "_metadata" file for every partition in many cases. This PR will avoid doing so whenever possible.black dask/flake8 dask