read_parquet with filters and arrow engine returns index in the columns #6277

@TomAugspurger

Description

What happened:

When using dd.read_parquet with filters, the original index ends up in the columns

What you expected to happen:

The original index should remain the index (not show up as an `index` column), and only the rows matching the filter should be returned.

Minimal Complete Verifiable Example:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import dask.datasets
import shutil

N = 1_000
P = 4
b = np.arange(P).repeat(N // P)

df = pd.DataFrame({
    "A": np.random.random(size=N),
    "B": pd.Categorical(b)
})
ddf = dd.from_pandas(df, P)

shutil.rmtree("parquet-filters.parquet", ignore_errors=True)
ddf.to_parquet("parquet-filters.parquet", partition_on="B", engine="pyarrow")

dd.read_parquet("parquet-filters.parquet", engine="pyarrow", filters=[[("B", "=", 1)]]).compute()
Out[1]:
            A  index  B
0    0.410034    250  1
1    0.313085    251  1
2    0.848964    252  1
3    0.910770    253  1
4    0.922110    254  1
..        ...    ... ..
245  0.028301    495  1
246  0.392158    496  1
247  0.759204    497  1
248  0.122554    498  1
249  0.339143    499  1

[250 rows x 3 columns]
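Until this is fixed, one workaround for the pyarrow path is to restore the index after reading. The snippet below is a minimal sketch using plain pandas, with a hand-built frame standing in for the buggy `read_parquet` output (the column name `index` matches what the reproducer above shows):

```python
import pandas as pd

# Simulated output of the buggy read: the original index
# has been reset into an "index" column.
df = pd.DataFrame({
    "A": [0.410034, 0.313085],
    "index": [250, 251],
    "B": pd.Categorical([1, 1]),
})

# Workaround sketch: move the stray column back into the index.
fixed = df.set_index("index")
print(fixed.columns.tolist())  # ['A', 'B']
```

After `set_index`, the frame has the expected shape: `A` and `B` as columns, the original row labels as the index.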

Anything else we need to know?:

Haven't debugged yet, but with the fastparquet engine the behavior is inverted: the index stays in the index, yet the output is not filtered:

ddf.to_parquet("parquet-filters.parquet", partition_on="B", engine="fastparquet")

dd.read_parquet("parquet-filters.parquet", engine="fastparquet", filters=[[("B", "=", 1)]]).compute()

              A  B
index
0      0.304616  0
1      0.626666  0
2      0.986920  0
3      0.585114  0
4      0.663830  0
...         ... ..
995    0.999620  3
996    0.223592  3
997    0.796024  3
998    0.063832  3
999    0.621195  3

[1000 rows x 2 columns]
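Since the fastparquet path ignores the filter entirely, one way to sanity-check the expected result is to apply the predicate manually after reading. `apply_filters` below is a hypothetical helper, not a dask API; it implements the DNF semantics that the `filters` argument uses (outer list = OR of conjunctions, inner list = AND of `(column, op, value)` terms):

```python
import operator
import pandas as pd

# Map dask-style filter operators to pandas comparisons.
OPS = {
    "=": operator.eq, "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge,
}

def apply_filters(df, filters):
    """Apply a DNF filter list like [[("B", "=", 1)]] to a DataFrame."""
    mask = pd.Series(False, index=df.index)
    for conjunction in filters:          # OR over inner lists
        term = pd.Series(True, index=df.index)
        for col, op, value in conjunction:  # AND within each inner list
            term &= OPS[op](df[col], value)
        mask |= term
    return df[mask]

df = pd.DataFrame({
    "A": [0.304616, 0.626666, 0.986920, 0.621195],
    "B": pd.Categorical([0, 1, 1, 3]),
})
print(apply_filters(df, [[("B", "=", 1)]]))  # keeps only the B == 1 rows
```

This is only a reference for what the filtered output should look like; the fix itself belongs in the engine's row-group/partition filtering.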

Environment:

  • Dask version: master
  • Python version: 3.7
