A bit of parquet-refactoring progress#12

Merged
mrocklin merged 8 commits into mrocklin:parquet-refactor from rjzamora:pq-metadata
Jun 14, 2019
Conversation

@rjzamora
Collaborator

Minor refactoring changes to leverage recent work in arrow.parquet (i.e. ARROW-1983/PR#4405).

Status: 91 tests passing, 36 failing (only 16 failing for pyarrow engine)

  • Tests added / passed
  • Passes flake8 dask

@rjzamora
Collaborator Author

BTW, doing some cleanup now. Many of the failing tests for pyarrow-only are just a result of the "simplification" of the PR/proposal in PR#4336.

For example, pyarrow and fastparquet were originally doing slightly different things to preserve and restore an index through a round trip. If we assume that a correct index argument must be supplied to the read_parquet call (no auto-detection), then some of the failing tests can be removed, etc.

@mrocklin mrocklin left a comment
Owner

Thanks @rjzamora ! A few small comments.

# Read from each individual piece (quite possibly slow)
row_groups = [
    piece.get_metadata(
        lambda fn: pq.ParquetFile(fs.open(fn, mode="rb"))
Owner

I think that the fs.open bit is here to help support file systems that are different than the local Posix system. This might be something like S3, GCS, or HDFS.
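As a rough sketch of why the fs.open indirection matters: with an fsspec-style filesystem object, the same open() interface covers local disk and remote stores alike (the local backend stands in for S3/GCS/HDFS here, and the file name is made up):

```python
import fsspec

# "file" is the local backend; swapping in "s3", "gcs", or "hdfs" (with
# appropriate credentials) gives the same open() interface that the
# lambda above relies on.
fs = fsspec.filesystem("file")

with fs.open("fsspec_demo.bin", mode="wb") as f:
    f.write(b"PAR1")  # Parquet files begin with these magic bytes

with fs.open("fsspec_demo.bin", mode="rb") as f:
    data = f.read()
```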

Owner

This might no longer be necessary, I don't know, but I suspect that there is some Parquet + S3 test already in the test suite to help verify.

Collaborator Author

Right, we shouldn't need this anymore, because the open_file_func is set by the ParquetDataset constructor (in here). However, since this change was just to avoid a deprecation warning, we can easily revert if needed.

storage_options=None,
engine="auto",
gather_statistics=True,
infer_divisions=False,
Owner

I think that my original intent was to remove this keyword, and replace it with gather_statistics=

Collaborator Author

Oops - I had meant to remove this.

@rjzamora
Collaborator Author

I have worked through many of the tests for the pyarrow engine, and currently have only 4 tests failing. Here is a grep for FAILED in the pytest output:

100:dask/dataframe/io/tests/test_parquet.py::test_categories[pyarrow] FAILED
102:dask/dataframe/io/tests/test_parquet.py::test_timestamp_index[pyarrow] FAILED
119:dask/dataframe/io/tests/test_parquet.py::test_columns_name[pyarrow] FAILED
160:dask/dataframe/io/tests/test_parquet.py::test_datasets_timeseries FAILED
368:= 4 failed, 78 passed, 65 skipped, 4 xfailed, 1 xpassed, 17 warnings in 7.88 seconds =

A few notes:

  • The test_timestamp_index and test_datasets_timeseries tests will pass if you use something like assert_eq(ddf1.compute(), ddf2.compute()).
  • This line of test_categories is failing, because cats_set has a different ordering than the truth value.
  • test_columns_name is failing because df.columns.name is not making it through the round trip.
  • I am skipping the tests for append=True
  • I got many of the other tests to pass just by specifying the index when reading. Also, in cases where the dataframe's index has no name before the write, it is given the name 'index' by default (meaning that the default index name will be 'index' instead of None after it is read back in). This PR largely assumes that the index will be preserved as a column in to_parquet, so it is up to the user to specify the correct index to read_parquet (i.e. I am not trying to do anything fancy to make the index preservation invisible to the user).

@mrocklin @martindurant Please let me know if we want/need the behavior of to_parquet and read_parquet to be more consistent before and after the refactor. I am mostly interested to know how appropriate/inappropriate it is to significantly reduce the reader's ability to automatically detect the index.


@staticmethod
def read_metadata(
    fs, fs_token, paths, categories=None, index=None, gather_statistics=None
Owner

I'm curious about the change to drop the fs_token argument. Was it no longer necessary?

Collaborator Author

I'm not sure if it should be used, but I removed it because it wasn't being used (and I had come across a recommendation to remove it in an earlier code review).

ddf.to_parquet(fn, write_index=index, engine=write_engine)
read_df = dd.read_parquet(fn, engine=read_engine)
if index:
    read_df = dd.read_parquet(fn, index='a', engine=read_engine)
Owner

Maybe just index=index?

Collaborator Author

index is a bool here, so index=index should fail (maybe I am misunderstanding?)

@mrocklin mrocklin merged commit 4ba4ebf into mrocklin:parquet-refactor Jun 14, 2019
@mrocklin
Owner

Merging in. @rjzamora I'm also giving you commit rights to my fork so you should be able to push directly in the future.

@rjzamora rjzamora deleted the pq-metadata branch June 14, 2019 22:24