
Compat for pyarrow 0.8.0 #2973

Merged
TomAugspurger merged 33 commits into dask:master from TomAugspurger:pyarrow-compat-2
Dec 20, 2017

Conversation

@TomAugspurger
Member

Closes #2901

I've tested locally against pyarrow 0.7.1 and pyarrow master + apache/arrow#1397. There's one kind of failure with pyarrow master: fastparquet cannot read files created by pyarrow when

  1. It's written as a ParquetDataset
  2. There are at least 129 rows.

I'm still trying to figure out what's going on, but fastparquet fails with a long exception, the crux of which is


    def readString(self):
>       return binary_to_str(self.readBinary())

../../../miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/protocol/TProtocol.py:184:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

bin_val = b'\x00\x00\x00\x00\x00\x00\xf0?'

    def binary_to_str(bin_val):
>       return bin_val.decode('utf8')
E       UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0 in position 6: invalid continuation byte
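Incidentally, the byte string that thrift fails to utf-8 decode parses cleanly as a little-endian float64, which is consistent with the column statistics (min/max) holding raw plain-encoded numeric values rather than UTF-8 strings. A quick check:

```python
import struct

# The bytes from the failing decode, interpreted as a little-endian float64.
bin_val = b'\x00\x00\x00\x00\x00\x00\xf0?'
(value,) = struct.unpack('<d', bin_val)
print(value)  # -> 1.0
```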

We'll need to wait for apache/arrow#1397 to be merged and a conda package pushed, then I'll update our CI to test against pyarrow master.

cc @xhochy, @cpcloud

@TomAugspurger
Member Author

Here's a reprex (reproducible example) for the pyarrow Dataset -> fastparquet bug:

import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import tempfile
import fastparquet as fp

df = pd.DataFrame({"A": range(129)})
t = pa.Table.from_pandas(df)
tmpdir = tempfile.mkdtemp()  # TemporaryDirectory().name is fragile: the directory is removed once the object is finalized

pq.write_to_dataset(t, tmpdir)

fp.ParquetFile(os.path.join(tmpdir,  os.listdir(tmpdir)[0])).to_pandas()

Fails with:

Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 96, in __init__
    with open_with(fn2, 'rb') as f:
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/util.py", line 44, in default_open
    return open(f, mode)
NotADirectoryError: [Errno 20] Not a directory: '/var/folders/hz/f43khqfn7b1b1g8z_z6y3bsw0000gp/T/tmpiplwgazx/d77558b180f74f889f53aa8ead7d8c58.parquet/_metadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 119, in _parse_header
    fmd = read_thrift(f, parquet_thrift.FileMetaData)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/thrift_structures.py", line 24, in read_thrift
    obj.read(pin)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1899, in read
    _elem53.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1742, in read
    _elem33.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1656, in read
    self.meta_data.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1487, in read
    self.statistics.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 298, in read
    iprot.skip(ftype)
  File "/Users/taugspurger/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/protocol/TProtocol.py", line 208, in skip
    self.readString()
  File "/Users/taugspurger/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/protocol/TProtocol.py", line 184, in readString
    return binary_to_str(self.readBinary())
  File "/Users/taugspurger/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/compat.py", line 37, in binary_to_str
    return bin_val.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 14, in <module>
    fp.ParquetFile(os.path.join(tmpdir,  os.listdir(tmpdir)[0])).to_pandas()
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 102, in __init__
    self._parse_header(f, verify)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 122, in _parse_header
    self.fn)
fastparquet.util.ParquetException: Metadata parse failed: /var/folders/hz/f43khqfn7b1b1g8z_z6y3bsw0000gp/T/tmpiplwgazx/d77558b180f74f889f53aa8ead7d8c58.parquet

Review comments on the diff:

    # For PyArrow < 0.8, any fastparquet. This relies on the facts that
    # 1. Those versions used the real index name as the index storage name
    #    iff it's an index level. Though this is a fragile assumption for
    #    other systems... though I'd like to avoid relying on that.
    if not index_names:

Member

except for index names of None? (not sure how this is handled in dask)

    column_names = [real_name for (storage_name, real_name) in pairs
                    if real_name == storage_name]

Member

Can't you do

    column_names = [real_name for (storage_name, real_name) in pairs
                    if storage_name not in index_storage_names]

that might give the same, but seems more logical to do

Member Author

I think that fails if there are duplicates between the column names and index storage names, e.g. a column named '__index_level_0__' (which probably isn't an issue in practice, but it'd be nice to avoid that).
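To see the failure mode concretely, here's a contrived sketch (the pairs and names are made up for illustration; in a real file pyarrow would disambiguate the clash, but it shows why the two filters can differ):

```python
# Hypothetical (storage_name, real_name) pairs: an ordinary column "A",
# a data column literally named "__index_level_0__", and an unnamed
# index level serialized under that same storage name.
pairs = [
    ("A", "A"),
    ("__index_level_0__", "__index_level_0__"),
    ("__index_level_0__", None),
]
index_storage_names = ["__index_level_0__"]

# Filtering on storage-name membership also drops the real data column:
by_membership = [real for (storage, real) in pairs
                 if storage not in index_storage_names]
print(by_membership)  # -> ['A']

# Comparing real and storage names keeps it:
by_equality = [real for (storage, real) in pairs if real == storage]
print(by_equality)  # -> ['A', '__index_level_0__']
```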

Member Author

I'm writing tests for all the edge cases now.

I wonder if this should just live in pandas.

@TomAugspurger
Member Author

Going to have to put this on hold for a day or two. I'll probably break things up into smaller PRs since this is broadening in scope.

@TomAugspurger
Member Author

Coming back to this now. It should be ready for review if anyone has a chance.

@TomAugspurger
Member Author

Hmm this passed, but I'm getting a failure locally with pyarrow 0.7.1 when the index name is None. Working on it.

@TomAugspurger
Member Author

I have to step out for a few hours. If this is causing delays on other PRs I'd recommend xfailing for now and I'll remove those xfails when I pick it up again tonight or tomorrow.

@TomAugspurger
Member Author

My last push added some historical files that we can test against. Not sure what people's thoughts on that are, but I think it'd be nice to ensure we can continue to read those.

There should be some additional failures. Will finish this up tomorrow.

@jcrist
Member

jcrist commented Dec 19, 2017

Not sure what people's thoughts on that are, but I think it'd be nice to ensure we can continue to read those.

I'm generally against adding binary files (and ipython notebooks and other hard-to-diff things) to library git repos. If these files are necessary for thorough tests (as they may be) could you instead create a new repo in the dask org, and have the tests also download those files for testing? Could be as simple as a python package with data in it that is installed as well on travis (might make use of the package_data kwarg in setuptools):

pip install git+https://github.com/dask/dask_test_data.git

Having the data stored in an external repo will help prevent git bloat in this repo, and should hopefully not be too bad to manage.
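A minimal sketch of what such a data-only package's setup.py could look like (the package name, layout, and file names here are hypothetical, mirroring the suggestion above):

```python
# setup.py for a hypothetical dask_test_data repo laid out as:
#
#   dask_test_data/
#       __init__.py
#       data/
#           pyarrow-0.7.1.parquet
#           fastparquet-0.1.3.parquet
#   setup.py
from setuptools import setup

setup(
    name="dask_test_data",
    version="0.1.0",
    packages=["dask_test_data"],
    # package_data ships the parquet fixtures alongside the module, so
    # tests can locate them via os.path.dirname(dask_test_data.__file__)
    package_data={"dask_test_data": ["data/*.parquet"]},
)
```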

@TomAugspurger
Member Author

TomAugspurger commented Dec 19, 2017

I split the binary files off into https://github.com/dask/parquet-integration, though I'm already thinking that could be structured better. Maybe as a followup.

Edit: OK, I'm not really happy with how I've done the historical tests. I've removed them for now and will do a followup.

@TomAugspurger
Member Author

CI passed on 367fd42. Just pushed a couple style cleanups.

@mrocklin
Member

We should either merge this if it's ready or xfail the PyArrow test suite. I won't personally have time to review this for at least the next day. @TomAugspurger the decision on what to do here is probably on you.

@TomAugspurger
Member Author

I won't have a whole lot of time over the next week either.

I think it's an improvement in its current state and would recommend merging it, and I can work to refactor things in followup PRs.

@TomAugspurger
Member Author

I'm going to go through once more now, and merge if things look OK.

@TomAugspurger
Member Author

TomAugspurger commented Dec 20, 2017

Quick summary of what I think are the remaining known issues:

  1. fastparquet can't read files written by dask via pyarrow. I think this is the recent common_metadata vs. metadata change. Should be an easy fix in fastparquet (tracked in "Read datasets written by dask and pyarrow", fastparquet#266).
  2. PyArrow can't read categoricals written by fastparquet (there's already an open JIRA for this I think).

@TomAugspurger
Member Author

Bombs away. I'll hopefully get to the integration testing within 2 weeks.

@TomAugspurger TomAugspurger merged commit 1fef002 into dask:master Dec 20, 2017
@TomAugspurger TomAugspurger deleted the pyarrow-compat-2 branch December 20, 2017 21:19