Conversation
Here's a reprex for the pyarrow Dataset -> fastparquet bug:

```python
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import tempfile
import fastparquet as fp

df = pd.DataFrame({"A": range(129)})
t = pa.Table.from_pandas(df)
tmpdir = tempfile.TemporaryDirectory().name
pq.write_to_dataset(t, tmpdir)
fp.ParquetFile(os.path.join(tmpdir, os.listdir(tmpdir)[0])).to_pandas()
```

Fails with:

```
Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 96, in __init__
    with open_with(fn2, 'rb') as f:
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/util.py", line 44, in default_open
    return open(f, mode)
NotADirectoryError: [Errno 20] Not a directory: '/var/folders/hz/f43khqfn7b1b1g8z_z6y3bsw0000gp/T/tmpiplwgazx/d77558b180f74f889f53aa8ead7d8c58.parquet/_metadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 119, in _parse_header
    fmd = read_thrift(f, parquet_thrift.FileMetaData)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/thrift_structures.py", line 24, in read_thrift
    obj.read(pin)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1899, in read
    _elem53.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1742, in read
    _elem33.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1656, in read
    self.meta_data.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 1487, in read
    self.statistics.read(iprot)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/parquet_thrift/parquet/ttypes.py", line 298, in read
    iprot.skip(ftype)
  File "/Users/taugspurger/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/protocol/TProtocol.py", line 208, in skip
    self.readString()
  File "/Users/taugspurger/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/protocol/TProtocol.py", line 184, in readString
    return binary_to_str(self.readBinary())
  File "/Users/taugspurger/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/thrift/compat.py", line 37, in binary_to_str
    return bin_val.decode('utf8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "bug.py", line 14, in <module>
    fp.ParquetFile(os.path.join(tmpdir, os.listdir(tmpdir)[0])).to_pandas()
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 102, in __init__
    self._parse_header(f, verify)
  File "/Users/taugspurger/sandbox/repos/fastparquet/fastparquet/api.py", line 122, in _parse_header
    self.fn)
fastparquet.util.ParquetException: Metadata parse failed: /var/folders/hz/f43khqfn7b1b1g8z_z6y3bsw0000gp/T/tmpiplwgazx/d77558b180f74f889f53aa8ead7d8c58.parquet
```
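The first exception in that chain is easy to reproduce on its own with just the standard library: fastparquet first treats the path as a dataset directory and tries to open `<path>/_metadata` inside it, but here the entry is a plain `.parquet` file. A stdlib-only sketch of that mechanism (the filename is made up):

```python
import errno
import os
import tempfile

# A plain file standing in for the '<uuid>.parquet' entry written by
# write_to_dataset's output directory.
tmpdir = tempfile.mkdtemp()
part = os.path.join(tmpdir, "part-0.parquet")
with open(part, "wb") as f:
    f.write(b"PAR1")

# Opening '<file>/_metadata' treats the file as a directory, which the
# OS rejects with ENOTDIR (errno 20) -- the first error in the traceback.
caught = None
try:
    open(os.path.join(part, "_metadata"), "rb")
except NotADirectoryError as e:
    caught = e

print(caught.errno == errno.ENOTDIR)  # True
```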
Review thread on `dask/dataframe/io/parquet.py`:

```python
# though I'd like to avoid relying on that.
if not index_names:
    # For PyArrow < 0.8, any fastparquet. This relies on the facts that
    # 1. Those versions used the real index name as the index storage name
```

> except for index names of None? (not sure how this is handled in dask)
`dask/dataframe/io/parquet.py` (outdated):

```python
# iff it's an index level. Though this is a fragile assumption for
# other systems...
column_names = [real_name for (storage_name, real_name) in pairs
                if real_name == storage_name]
```
Can't you do

```python
column_names = [real_name for (storage_name, real_name) in pairs
                if storage_name not in index_storage_names]
```

That might give the same result, but it seems more logical to me.
I think that fails if there are duplicates between the column names and the index storage names, i.e. a column named `'__index_level_0__'` (which probably isn't an issue in practice, but it'd be nice to avoid that).
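A small stdlib-only sketch of that hazard (the `pairs` data and names here are made up for illustration): when a column is literally named `'__index_level_0__'` and that name also appears among the index storage names, the membership-based filter drops the real column, while the equality check keeps it.

```python
# (storage_name, real_name) pairs, hypothetical:
pairs = [
    ("A", "A"),                                  # ordinary column
    ("__index_level_0__", "__index_level_0__"),  # column that happens to use the reserved-looking name
    ("__index_level_1__", "idx"),                # genuine index level: storage name != real name
]
index_storage_names = ["__index_level_0__", "__index_level_1__"]

# Membership filter: also drops the real column named '__index_level_0__'.
by_membership = [real for (storage, real) in pairs
                 if storage not in index_storage_names]

# Equality filter: keeps exactly the entries whose storage name matches
# the real name, i.e. plain columns.
by_equality = [real for (storage, real) in pairs
               if storage == real]

print(by_membership)  # ['A']
print(by_equality)    # ['A', '__index_level_0__']
```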
I'm writing tests for all the edge cases now.

I wonder if this should just live in pandas.
---

Going to have to put this on hold for a day or two. I'll probably break things up into smaller PRs since this is broadening in scope.
---

Coming back to this now. It should be ready for review if anyone has a chance.
---

Hmm, this passed, but I'm getting a failure locally with pyarrow 0.7.1 when the index name is None. Working on it.
---

I have to step out for a few hours. If this is causing delays on other PRs, I'd recommend xfailing for now and I'll remove those xfails when I pick it up again tonight or tomorrow.
---

My last push added some historical files that we can test against. Not sure what people's thoughts on that are, but I think it'd be nice to ensure we can continue to read those. There should be some additional failures. Will finish this up tomorrow.

---

I'm generally against adding binary files (and IPython notebooks and other hard-to-diff things) to library git repos. If these files are necessary for thorough tests (as they may be), could you instead create a new repo in the dask org, and have the tests also download those files for testing? It could be as simple as a Python package with data in it that is installed as well on travis (might make use of the …). Having the data stored in an external repo will help prevent git bloat in this repo, and should hopefully not be too bad to manage.
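One way to wire that up, sketched with hypothetical names (`fetch_test_file` and the URL layout are placeholders, not an agreed design): download each named file from the external data repo on first use and cache it locally so CI only pays the cost once.

```python
import os
import tempfile
import urllib.request

def fetch_test_file(name, base_url, cache_dir=None):
    """Download `name` from `base_url` on first use, caching it locally.

    Hypothetical helper: in tests this could back a pytest fixture that
    hands back paths into the external data repository.
    """
    if cache_dir is None:
        cache_dir = os.path.join(tempfile.gettempdir(), "dask-test-data")
    os.makedirs(cache_dir, exist_ok=True)
    local = os.path.join(cache_dir, name)
    if not os.path.exists(local):
        urllib.request.urlretrieve("%s/%s" % (base_url, name), local)
    return local
```

A test might then call something like `fetch_test_file("pyarrow-0.7.1-simple.parquet", BASE_URL)` (names hypothetical) and skip or xfail if the download is unavailable.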
*(force-pushed from d9a4213 to d60c20e)*
---

I split the binary files off into https://github.com/dask/parquet-integration, though I'm already thinking that could be structured better. Maybe as a followup.

Edit: OK, I'm not really happy with how I've done the historical tests. I've removed them for now and will do a followup.
---

CI passed on 367fd42. Just pushed a couple of style cleanups.
---

We should either merge this if it's ready or xfail the PyArrow test suite. I won't personally have time to review this for at least the next day. @TomAugspurger the decision on what to do here is probably on you.
---

I won't have a whole lot of time over the next week either. I think it's an improvement in its current state and would recommend merging it, and I can work to refactor things in followup PRs.
---

I'm going to go through once more now, and merge if things look OK.
---

Quick summary of what I think are the remaining known issues:

---

Bombs away. I'll hopefully get to the integration testing within 2 weeks.
---

Closes #2901

I've tested locally against pyarrow 0.7.1 and pyarrow master + apache/arrow#1397. There's one type of failure with pyarrow master: fastparquet cannot read files created by pyarrow as a `ParquetDataset`. I'm still trying to figure out what's going on, but fastparquet fails with a long exception, the crux of which is the `UnicodeDecodeError` in the reprex traceback above.
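My best guess at that crux, from reading the traceback (an assumption, not a confirmed diagnosis): with `range(129)` the column's max statistic is 128, parquet min/max statistics are stored as raw little-endian bytes, and 128 encodes with a leading `0x80` byte, which is not a valid UTF-8 start byte. Decoding that statistic as a thrift string then fails exactly like the traceback. A stdlib-only sketch:

```python
# Hypothesis: the max statistic for range(129) is the raw little-endian
# encoding of 128, and fastparquet/thrift tries to decode those bytes as
# a UTF-8 "string".
max_stat = (128).to_bytes(8, "little", signed=True)
print(max_stat[0])  # 128, i.e. the 0x80 byte from the error message

caught = None
try:
    max_stat.decode("utf-8")
except UnicodeDecodeError as e:
    caught = e

print(caught.reason, caught.start)  # invalid start byte 0
```

Note that values below 128 are valid ASCII bytes, which would explain why a smaller frame doesn't trip this.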
We'll need to wait for apache/arrow#1397 to be merged and a conda package pushed; then I'll update our CI to test against pyarrow master.

cc @xhochy, @cpcloud