Appending to parquet file seems to break future appending #3098

@shughes-uk

Dask 0.16.1, Fastparquet 0.1.3

This seems to be a Dask problem: running the same sequence of appends directly through fastparquet doesn't raise any exception.

Appending to a hive-scheme parquet folder with `to_parquet(..., append=True)` seems to break something, and a later append to the same folder fails. I'm not sure what's special about this use case (perhaps the DatetimeIndex?). Replication code and the exception are below.

from datetime import datetime

import dask.dataframe as dd
import fastparquet
import numpy as np
import pandas as pd


d_range = pd.date_range(start='2017-01-01', end='2017-01-02', freq='1h')
df = pd.DataFrame(np.random.randn(len(d_range), 1), index=d_range, columns=list('A'))
ddf = dd.from_pandas(df, npartitions=1)
fastparquet.write('test_parq_fp', df, file_scheme='hive')
ddf.to_parquet('test_parq_dask')

d_range = pd.date_range(start='2017-01-02', end='2017-01-03', freq='1h')
df = pd.DataFrame(np.random.randn(len(d_range), 1), index=d_range, columns=list('A'))
ddf = dd.from_pandas(df, npartitions=1)
fastparquet.write('test_parq_fp', df, append=True, file_scheme='hive')
ddf.to_parquet('test_parq_dask', append=True)

d_range = pd.date_range(start='2017-01-04', end='2017-01-05', freq='1h')
df = pd.DataFrame(np.random.randn(len(d_range), 1), index=d_range, columns=list('A'))
ddf = dd.from_pandas(df, npartitions=1)
fastparquet.write('test_parq_fp', df, append=True, file_scheme='hive')
ddf.to_parquet('test_parq_dask', append=True)

Traceback (most recent call last):
  File "break_parquet.py", line 26, in <module>
    ddf.to_parquet('test_parq_dask', append=True)
  File "/Users/shughes/miniconda3/lib/python3.6/site-packages/dask/dataframe/core.py", line 1010, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/Users/shughes/miniconda3/lib/python3.6/site-packages/dask/dataframe/io/parquet.py", line 840, in to_parquet
    storage_options=storage_options, **kwargs)
  File "/Users/shughes/miniconda3/lib/python3.6/site-packages/dask/dataframe/io/parquet.py", line 420, in _write_fastparquet
    old_end = minmax[index_cols[0]]['max'][-1]
KeyError: 'index'
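
One thing that may be worth checking in the replication script, offered as a guess rather than a confirmed cause: `pd.date_range` includes both endpoints, so the first two chunks both contain `2017-01-02 00:00`. If the append path only keeps min/max statistics for the index when each chunk's maximum is strictly below the next chunk's minimum (an assumption about the internals, not something the traceback proves), the shared timestamp could explain why `'index'` is missing from `minmax`. The sketch below uses only the standard library; `hourly_range` and `strictly_separated` are hypothetical helpers written for illustration, not Dask or fastparquet functions.

```python
from datetime import datetime, timedelta


def hourly_range(start, end):
    """Hourly timestamps from start to end, both inclusive.

    Mirrors what pd.date_range(start=..., end=..., freq='1h') produces.
    """
    out = []
    t = start
    while t <= end:
        out.append(t)
        t += timedelta(hours=1)
    return out


# The three index ranges written by the replication script.
first = hourly_range(datetime(2017, 1, 1), datetime(2017, 1, 2))
second = hourly_range(datetime(2017, 1, 2), datetime(2017, 1, 3))
third = hourly_range(datetime(2017, 1, 4), datetime(2017, 1, 5))

# Per-chunk (min, max) of the index, similar to per-row-group statistics.
stats = [(chunk[0], chunk[-1]) for chunk in (first, second, third)]


def strictly_separated(chunk_stats):
    """Hypothetical check: each chunk's max is strictly below the next min."""
    return all(
        prev_max < next_min
        for (_, prev_max), (next_min, _) in zip(chunk_stats, chunk_stats[1:])
    )


# Timestamps present in both the first and the second chunk.
shared = set(first) & set(second)

print("shared timestamps:", sorted(shared))
print("chunks 1-2 strictly separated:", strictly_separated(stats[:2]))
print("chunks 2-3 strictly separated:", strictly_separated(stats[1:]))
```

If this is the cause, starting the second range one hour later (so that no timestamp repeats) should let the third write succeed; that is untested here and would need to be confirmed against the Dask version in the report.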
