
Compat for pyarrow 0.8.0 #2973

Merged
TomAugspurger merged 33 commits into dask:master from TomAugspurger:pyarrow-compat-2
Dec 20, 2017

Conversation

@TomAugspurger
Member

Closes #2901

I've tested locally against pyarrow 0.7.1 and pyarrow master + apache/arrow#1397. There's one kind of failure with pyarrow master: fastparquet cannot read files created by pyarrow when

  1. It's written as a ParquetDataset
  2. There are at least 129 rows.

I'm still trying to figure out what's going on, but fastparquet fails with a long exception, the crux of which is


    def readString(self):
>       return binary_to_str(self.readBinary())

../../../miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/protocol/TProtocol.py:184:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

bin_val = b'\x00\x00\x00\x00\x00\x00\xf0?'

    def binary_to_str(bin_val):
>       return bin_val.decode('utf8')
E       UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 6: invalid continuation byte
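Incidentally, the byte string that thrift fails to utf-8 decode parses cleanly as a little-endian float64, which is consistent with the column statistics (min/max) holding raw plain-encoded numeric values rather than UTF-8 strings. A quick check:

```python
import struct

# The bytes from the failing decode, interpreted as a little-endian float64.
bin_val = b'\x00\x00\x00\x00\x00\x00\xf0?'
(value,) = struct.unpack('<d', bin_val)
print(value)  # -> 1.0
```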

We'll need to wait for apache/arrow#1397 to be merged and a conda package pushed, then I'll update our CI to test against pyarrow master.

cc @xhochy, @cpcloud

@TomAugspurger
Member Author

Here's a reprex (reproducible example) for the pyarrow Dataset -> fastparquet bug:

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import tempfile
import fastparquet as fp

df = pd.DataFrame({"A": range(129)})
t = pa.Table.from_pandas(df)
tmpdir = tempfile.mkdtemp()  # TemporaryDirectory().name is fragile: the directory is removed once the object is finalized

pq.write_to_dataset(t, tmpdir)

fp.ParquetFile(os.path.join(tmpdir,  os.listdir(tmpdir)[0])).to_pandas()

Fails with:

Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 96, in __init__
    with open_with(fn2, 'rb') as f:
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/util.py", line 44, in default_open
    return open(f, mode)
NotADirectoryError: [Errno 20] Not a directory: '/var/folders/hz/f43khqfn7b1b1g8z_z6y3bsw0000gp/T/tmpiplwgazx/d77558b180f74f889f53aa8ead7d8c58.parquet/_metadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 119, in _parse_header
    fmd = read_thrift(f, parquet_thrift.FileMetaData)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/thrift_structures.py", line 24, in read_thrift
    obj.read(pin)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1899, in read
    _elem53.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1742, in read
    _elem33.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1656, in read
    self.meta_data.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1487, in read
    self.statistics.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 298, in read
    iprot.skip(ftype)
  File "/Users/taugspurger/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/protocol/TProtocol.py", line 208, in skip
    self.readString()
  File "/Users/taugspurger/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/protocol/TProtocol.py", line 184, in readString
    return binary_to_str(self.readBinary())
  File "/Users/taugspurger/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/compat.py", line 37, in binary_to_str
    return bin_val.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 14, in <module>
    fp.ParquetFile(os.path.join(tmpdir,  os.listdir(tmpdir)[0])).to_pandas()
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 102, in __init__
    self._parse_header(f, verify)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 122, in _parse_header
    self.fn)
fastparquet.util.ParquetException: Metadata parse failed: /var/folders/hz/f43khqfn7b1b1g8z_z6y3bsw0000gp/T/tmpiplwgazx/d77558b180f74f889f53aa8ead7d8c58.parquet

Review comments on the diff:

    # For PyArrow < 0.8, any fastparquet. This relies on the facts that
    # 1. Those versions used the real index name as the index storage name
    #    iff it's an index level. Though this is a fragile assumption for
    #    other systems... though I'd like to avoid relying on that.
    if not index_names:

Member

except for index names of None? (not sure how this is handled in dask)

    column_names = [real_name for (storage_name, real_name) in pairs
                    if real_name == storage_name]

Member

Can't you do

    column_names = [real_name for (storage_name, real_name) in pairs
                    if storage_name not in index_storage_names]

that might give the same, but seems more logical to do

Member Author

I think that fails if there are duplicates between the column names and index storage names, e.g. a column named '__index_level_0__' (which probably isn't an issue in practice, but it'd be nice to avoid that).
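To see the failure mode concretely, here's a contrived sketch (the pairs and names are made up for illustration; in a real file pyarrow would disambiguate the clash, but it shows why the two filters can differ):

```python
# Hypothetical (storage_name, real_name) pairs: an ordinary column "A",
# a data column literally named "__index_level_0__", and an unnamed
# index level serialized under that same storage name.
pairs = [
    ("A", "A"),
    ("__index_level_0__", "__index_level_0__"),
    ("__index_level_0__", None),
]
index_storage_names = ["__index_level_0__"]

# Filtering on storage-name membership also drops the real data column:
by_membership = [real for (storage, real) in pairs
                 if storage not in index_storage_names]
print(by_membership)  # -> ['A']

# Comparing real and storage names keeps it:
by_equality = [real for (storage, real) in pairs if real == storage]
print(by_equality)  # -> ['A', '__index_level_0__']
```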

Member Author

I'm writing tests for all the edge cases now.

I wonder if this should just live in pandas.

@TomAugspurger
Member Author

Going to have to put this on hold for a day or two. I'll probably break things up into smaller PRs since this is broadening in scope.

@TomAugspurger
Member Author

Coming back to this now. It should be ready for review if anyone has a chance.

@TomAugspurger
Member Author

Hmm this passed, but I'm getting a failure locally with pyarrow 0.7.1 when the index name is None. Working on it.

@TomAugspurger
Member Author

I have to step out for a few hours. If this is causing delays on other PRs I'd recommend xfailing for now and I'll remove those xfails when I pick it up again tonight or tomorrow.

@TomAugspurger
Member Author

My last push added some historical files that we can test against. Not sure what people's thoughts on that are, but I think it'd be nice to ensure we can continue to read those.

There should be some additional failures. Will finish this up tomorrow.

@jcrist
Member

jcrist commented Dec 19, 2017

Not sure what people's thoughts on that are, but I think it'd be nice to ensure we can continue to read those.

I'm generally against adding binary files (and ipython notebooks and other hard-to-diff things) to library git repos. If these files are necessary for thorough tests (as they may be) could you instead create a new repo in the dask org, and have the tests also download those files for testing? Could be as simple as a python package with data in it that is installed as well on travis (might make use of the package_data kwarg in setuptools):

pip install git+https://github.com/dask/dask_test_data.git

Having the data stored in an external repo will help prevent git bloat in this repo, and should hopefully not be too bad to manage.
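A minimal sketch of what such a data-only package's setup.py could look like (the package name, layout, and file names here are hypothetical, mirroring the suggestion above):

```python
# setup.py for a hypothetical dask_test_data repo laid out as:
#
#   dask_test_data/
#       __init__.py
#       data/
#           pyarrow-0.7.1.parquet
#           fastparquet-0.1.3.parquet
#   setup.py
from setuptools import setup

setup(
    name="dask_test_data",
    version="0.1.0",
    packages=["dask_test_data"],
    # package_data ships the parquet fixtures alongside the module, so
    # tests can locate them via os.path.dirname(dask_test_data.__file__)
    package_data={"dask_test_data": ["data/*.parquet"]},
)
```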

@TomAugspurger
Member Author

TomAugspurger commented Dec 19, 2017

I split the binary files off into https://github.com/dask/parquet-integration, though I'm already thinking that could be structured better. Maybe as a followup.

Edit: OK, I'm not really happy with how I've done the historical tests. I've removed them for now and will do a followup.

@TomAugspurger
Member Author

CI passed on 367fd42. Just pushed a couple style cleanups.

@mrocklin
Member

We should either merge this if it's ready or xfail the PyArrow test suite. I won't personally have time to review this for at least the next day. @TomAugspurger the decision on what to do here is probably on you.

@TomAugspurger
Member Author

I won't have a whole lot of time over the next week either.

I think it's an improvement in its current state and would recommend merging it, and I can work to refactor things in followup PRs.

@TomAugspurger
Member Author

I'm going to go through once more now, and merge if things look OK.

@TomAugspurger
Member Author

TomAugspurger commented Dec 20, 2017

Quick summary of what I think are the remaining known issues:

  1. fastparquet can't read files written by dask via pyarrow. I think this is the recent common_metadata vs. metadata change. Should be an easy fix in fastparquet (tracked in "Read datasets written by dask and pyarrow", fastparquet#266).
  2. PyArrow can't read categoricals written by fastparquet (there's already an open JIRA for this I think).

@TomAugspurger
Member Author

Bombs away. I'll hopefully get to the integration testing within 2 weeks.

@TomAugspurger TomAugspurger merged commit 1fef002 into dask:master Dec 20, 2017
@TomAugspurger TomAugspurger deleted the pyarrow-compat-2 branch December 20, 2017 21:19