A bit of parquet-refactoring progress #12
Conversation
BTW, doing some cleanup now. Many of the failing tests for pyarrow-only are just a result of the "simplification" of the PR/proposal in PR#4336. For example, pyarrow and fastparquet were originally doing slightly different things to preserve and restore an index through a round trip. If we assume that a correct …
    # Read from each individual piece (quite possibly slow)
    row_groups = [
        piece.get_metadata(
            lambda fn: pq.ParquetFile(fs.open(fn, mode="rb"))
I think that the fs.open bit is here to help support file systems that are different than the local Posix system. This might be something like S3, GCS, or HDFS.
This might no longer be necessary, I don't know, but I suspect that there is some Parquet + S3 test already in the test suite to help verify.
Right, we shouldn't need this anymore, because the open_file_func is set by the ParquetDataset constructor (in here). However, since this change was just to avoid a deprecation warning, we can easily revert if needed.
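To show the pattern being discussed in isolation, here is a purely illustrative sketch (RemoteLikeFS and read_magic are made-up names, not part of dask or pyarrow): a reader that accepts an open-file callable works the same against local Posix paths or remote stores like S3, GCS, or HDFS, because those filesystem objects all expose an open(path, mode) method returning a file-like object.

```python
import os
import tempfile

class RemoteLikeFS:
    """Minimal stand-in for a filesystem object such as s3fs, gcsfs, or hdfs3.

    Those libraries expose an ``open(path, mode)`` method returning a
    file-like object, which is why code passes ``fs.open`` through
    instead of relying on the builtin ``open``.
    """

    def open(self, path, mode="rb"):
        return open(path, mode)

def read_magic(path, open_with):
    # The reader only needs a file-like object; injecting ``open_with``
    # makes the same code work for local paths, S3, GCS, or HDFS.
    with open_with(path, mode="rb") as f:
        return f.read(4)

fs = RemoteLikeFS()
path = os.path.join(tempfile.mkdtemp(), "example.parquet")
with open(path, "wb") as f:
    f.write(b"PAR1")  # parquet files begin with the magic bytes PAR1

assert read_magic(path, fs.open) == b"PAR1"
```

The design point is dependency injection of the open call; swapping RemoteLikeFS for an s3fs or gcsfs instance requires no change to the reader.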
dask/dataframe/io/parquet/core.py
    storage_options=None,
    engine="auto",
    gather_statistics=True,
    infer_divisions=False,
I think that my original intent was to remove this keyword, and replace it with gather_statistics
Oops - I had meant to remove this.
I have worked through many of the tests. A few notes:
@mrocklin @martindurant Please let me know if we want/need the behavior of …
| @staticmethod | ||
| def read_metadata( | ||
| fs, fs_token, paths, categories=None, index=None, gather_statistics=None |
I'm curious about the change to drop the fs_token argument. Was it no longer necessary?
I'm not sure if it should be used, but I removed it because it wasn't being used (and I had come across a recommendation to remove it in an earlier code review).
    ddf.to_parquet(fn, write_index=index, engine=write_engine)
    read_df = dd.read_parquet(fn, engine=read_engine)
    if index:
        read_df = dd.read_parquet(fn, index='a', engine=read_engine)
index is a bool here, so index=index should fail (maybe I am misunderstanding?)
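To make the bool-vs-column-name distinction concrete without a parquet round trip, here is a hypothetical pandas-only sketch (no dask, no file I/O; the variable names are made up): a boolean flag is the right type for deciding whether to preserve the index on write, but restoring it on read requires the column name, which is why the test spells out index='a' instead of forwarding the flag.

```python
import pandas as pd

# index is a boolean flag meaning "preserve the index through the round trip"
index = True

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# "Write" side: a bool is the right type (compare write_index=index above).
stored = df.set_index("a") if index else df.reset_index(drop=True)

# "Read" side: restoring needs a column *name*, not a bool, so the test
# spells out index='a' rather than forwarding the flag as index=index.
restored = stored.reset_index()
if index:
    restored = restored.set_index("a")

assert list(restored.index) == [1, 2, 3]
```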
Merging in. @rjzamora I'm also giving you commit rights to my fork so you should be able to push directly in the future.
Minor refactoring changes to leverage recent work in arrow.parquet (i.e. ARROW-1983 / PR#4405).
Status: 91 tests passing, 36 failing (only 16 failing for the pyarrow engine)
flake8 dask