A bit of parquet-refactoring progress#12

Merged
mrocklin merged 8 commits into mrocklin:parquet-refactor from rjzamora:pq-metadata
Jun 14, 2019
Conversation

@rjzamora
Collaborator

Minor refactoring changes to leverage recent work in arrow.parquet (i.e. ARROW-1983/PR#4405).

Status: 91 tests passing, 36 failing (only 16 failing for pyarrow engine)

  • Tests added / passed
  • Passes flake8 dask

@rjzamora
Collaborator Author

BTW, doing some cleanup now. Many of the failing tests for pyarrow-only are just a result of the "simplification" of the PR/proposal in PR#4336.

For example, pyarrow and fastparquet were originally doing slightly different things to preserve and restore an index through a round trip. If we assume that a correct index argument must be supplied to the read_parquet call (no auto-detection), then some of the failing tests can be removed, etc.

@mrocklin mrocklin left a comment
Owner

Thanks @rjzamora ! A few small comments.

# Read from each individual piece (quite possibly slow)
row_groups = [
    piece.get_metadata(
        lambda fn: pq.ParquetFile(fs.open(fn, mode="rb"))
Owner

I think that the fs.open bit is here to help support file systems that are different than the local Posix system. This might be something like S3, GCS, or HDFS.
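As a rough sketch of why the fs.open indirection matters: with an fsspec-style filesystem object, the same open() interface covers local disk and remote stores alike (the local backend stands in for S3/GCS/HDFS here, and the file name is made up):

```python
import fsspec

# "file" is the local backend; swapping in "s3", "gcs", or "hdfs" (with
# appropriate credentials) gives the same open() interface that the
# lambda above relies on.
fs = fsspec.filesystem("file")

with fs.open("fsspec_demo.bin", mode="wb") as f:
    f.write(b"PAR1")  # Parquet files begin with these magic bytes

with fs.open("fsspec_demo.bin", mode="rb") as f:
    data = f.read()
```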

Owner

This might no longer be necessary, I don't know, but I suspect that there is some Parquet + S3 test already in the test suite to help verify.

Collaborator Author

Right, we shouldn't need this anymore, because the open_file_func is set by the ParquetDataset constructor (in here). However, since this change was just to avoid a deprecation warning, we can easily revert if needed.

storage_options=None,
engine="auto",
gather_statistics=True,
infer_divisions=False,
Owner

I think that my original intent was to remove this keyword, and replace it with gather_statistics=

Collaborator Author

Oops - I had meant to remove this.

@rjzamora
Collaborator Author

I have worked through many of the tests for the pyarrow engine, and currently have only 4 tests failing. Here is a grep for FAILED in the pytest output:

100:dask/dataframe/io/tests/test_parquet.py::test_categories[pyarrow] FAILED
102:dask/dataframe/io/tests/test_parquet.py::test_timestamp_index[pyarrow] FAILED
119:dask/dataframe/io/tests/test_parquet.py::test_columns_name[pyarrow] FAILED
160:dask/dataframe/io/tests/test_parquet.py::test_datasets_timeseries FAILED
368:= 4 failed, 78 passed, 65 skipped, 4 xfailed, 1 xpassed, 17 warnings in 7.88 seconds =

A few notes:

  • The test_timestamp_index and test_datasets_timeseries tests will pass if you use something like assert_eq(ddf1.compute(), ddf2.compute()).
  • This line of test_categories is failing, because cats_set has a different ordering than the truth value.
  • test_columns_name is failing because df.columns.name is not making it through the round trip.
  • I am skipping the tests for append=True
  • I got many of the other tests to pass just by specifying the index when reading. Also, in cases where the dataframe's index has no name before the write, it is given the name 'index' by default (meaning that the default index name will be 'index' instead of None after it is read back in). This PR largely assumes that the index will be preserved as a column in to_parquet, so it is up to the user to specify the correct index to read_parquet (i.e. I am not trying to do anything fancy to make the index preservation invisible to the user).

@mrocklin @martindurant Please let me know if we want/need the behavior of to_parquet and read_parquet to be more consistent before and after the refactor. I am mostly interested to know how appropriate/inappropriate it is to significantly reduce the reader's ability to automatically detect the index.


@staticmethod
def read_metadata(
    fs, fs_token, paths, categories=None, index=None, gather_statistics=None
Owner

I'm curious about the change to drop the fs_token argument. Was it no longer necessary?

Collaborator Author

I'm not sure if it should be used, but I removed it because it wasn't being used (and I had come across a recommendation to remove it in an earlier code review).

ddf.to_parquet(fn, write_index=index, engine=write_engine)
read_df = dd.read_parquet(fn, engine=read_engine)
if index:
    read_df = dd.read_parquet(fn, index='a', engine=read_engine)
Owner

Maybe just index=index?

Collaborator Author

index is a bool here, so index=index should fail (maybe I am misunderstanding?)

@mrocklin mrocklin merged commit 4ba4ebf into mrocklin:parquet-refactor Jun 14, 2019
@mrocklin
Owner

Merging in. @rjzamora I'm also giving you commit rights to my fork so you should be able to push directly in the future.

@rjzamora rjzamora deleted the pq-metadata branch June 14, 2019 22:24