Closed
Dask 0.16.1, Fastparquet 0.1.3
This seems to be a dask problem, as replicating the same steps with fastparquet alone doesn't raise any exception. Appending to a hive-scheme parquet folder breaks something, and subsequent appends fail. I'm not sure what's special about this use case (perhaps the DatetimeIndex?). Replication code and the exception are below.
from datetime import datetime
import dask.dataframe as dd
import fastparquet
import numpy as np
import pandas as pd

# Initial write
d_range = pd.date_range(start='2017-01-01', end='2017-01-02', freq='1h')
df = pd.DataFrame(np.random.randn(len(d_range), 1), index=d_range, columns=list('A'))
ddf = dd.from_pandas(df, npartitions=1)
fastparquet.write('test_parq_fp', df, file_scheme='hive')
ddf.to_parquet('test_parq_dask')

# First append -- succeeds in both
d_range = pd.date_range(start='2017-01-02', end='2017-01-03', freq='1h')
df = pd.DataFrame(np.random.randn(len(d_range), 1), index=d_range, columns=list('A'))
ddf = dd.from_pandas(df, npartitions=1)
fastparquet.write('test_parq_fp', df, append=True, file_scheme='hive')
ddf.to_parquet('test_parq_dask', append=True)

# Second append -- fastparquet succeeds, dask raises
d_range = pd.date_range(start='2017-01-04', end='2017-01-05', freq='1h')
df = pd.DataFrame(np.random.randn(len(d_range), 1), index=d_range, columns=list('A'))
ddf = dd.from_pandas(df, npartitions=1)
fastparquet.write('test_parq_fp', df, append=True, file_scheme='hive')
ddf.to_parquet('test_parq_dask', append=True)

Traceback (most recent call last):
  File "break_parquet.py", line 26, in <module>
    ddf.to_parquet('test_parq_dask', append=True)
  File "/Users/shughes/miniconda3/lib/python3.6/site-packages/dask/dataframe/core.py", line 1010, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/Users/shughes/miniconda3/lib/python3.6/site-packages/dask/dataframe/io/parquet.py", line 840, in to_parquet
    storage_options=storage_options, **kwargs)
  File "/Users/shughes/miniconda3/lib/python3.6/site-packages/dask/dataframe/io/parquet.py", line 420, in _write_fastparquet
    old_end = minmax[index_cols[0]]['max'][-1]
KeyError: 'index'
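For context on the failing line: before appending, dask looks up the stored max of the index column in the existing dataset's row-group statistics (minmax[index_cols[0]]['max'][-1]) to verify the new data starts after the old data ends. The KeyError suggests the statistics dict no longer contains an entry under the expected index column name after the first append. Below is a minimal stand-alone sketch of that check (check_append, stats, and new_start are hypothetical names, not dask's actual internals), showing how a missing statistics key produces exactly this kind of KeyError:

```python
from datetime import datetime

def check_append(minmax, index_cols, new_start):
    # Mirror of the failing line: look up the last row group's stored
    # max for the index column. If statistics for that column are
    # missing from the dict, this lookup itself raises KeyError.
    old_end = minmax[index_cols[0]]['max'][-1]
    if new_start < old_end:
        raise ValueError('new data overlaps existing data')
    return True

# Statistics as dask expects them: keyed by the index column name.
stats = {'index': {'max': [datetime(2017, 1, 3)]}}
check_append(stats, ['index'], datetime(2017, 1, 4))  # passes

# If the key is absent (as apparently happens after the first dask
# append), the lookup fails with KeyError before any range check runs:
# check_append({}, ['index'], datetime(2017, 1, 4))  -> KeyError: 'index'
```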