Closed
What happened:
When using `dd.read_parquet` with `filters`, the original index ends up in the columns.
What you expected to happen:
The result should contain only the rows matching the filter, and the original index should remain the index rather than appearing as a column.
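For reference, plain pandas boolean filtering shows the expected semantics: only matching rows are returned and the original index labels are preserved (a sketch with toy data mirroring the reproducer below):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the reproducer: partition column B over a range index.
df = pd.DataFrame({
    "A": np.arange(8) / 8,
    "B": pd.Categorical(np.arange(4).repeat(2)),
})

# Filtering in pandas keeps the original index as the index and drops
# non-matching rows -- the behavior expected from a filtered read.
out = df[df["B"] == 1]
assert list(out.index) == [2, 3]   # original index labels preserved
assert (out["B"] == 1).all()       # only matching rows remain
```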
Minimal Complete Verifiable Example:
```python
import shutil

import numpy as np
import pandas as pd
import dask.dataframe as dd

N = 1_000
P = 4
b = np.arange(P).repeat(N // P)
df = pd.DataFrame({
    "A": np.random.random(size=N),
    "B": pd.Categorical(b),
})
ddf = dd.from_pandas(df, P)

shutil.rmtree("parquet-filters.parquet", ignore_errors=True)
ddf.to_parquet("parquet-filters.parquet", partition_on="B", engine="pyarrow")

dd.read_parquet("parquet-filters.parquet", engine="pyarrow", filters=[[("B", "=", 1)]]).compute()
```

```
Out[1]:
            A  index  B
0    0.410034    250  1
1    0.313085    251  1
2    0.848964    252  1
3    0.910770    253  1
4    0.922110    254  1
..        ...    ... ..
245  0.028301    495  1
246  0.392158    496  1
247  0.759204    497  1
248  0.122554    498  1
249  0.339143    499  1

[250 rows x 3 columns]
```

Anything else we need to know?:
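For context on the `filters` argument in the reproducer: it is interpreted in disjunctive normal form, i.e. the outer list is OR-ed together and each inner list is AND-ed. A pure-Python sketch of those semantics (the helper is illustrative only, not dask's actual implementation):

```python
import operator

# Map the comparison strings accepted by filters= onto Python operators.
OPS = {"=": operator.eq, "==": operator.eq, "!=": operator.ne,
       "<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge}

def row_matches(row, filters):
    # OR over the outer list, AND over each inner (column, op, value) list.
    return any(all(OPS[op](row[col], val) for col, op, val in conjunction)
               for conjunction in filters)

rows = [{"A": 0.41, "B": 1}, {"A": 0.30, "B": 0}, {"A": 0.84, "B": 1}]
matched = [r for r in rows if row_matches(r, [[("B", "=", 1)]])]
# matched keeps only the two rows where B == 1
```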
Haven't debugged yet, but for the fastparquet engine the index stays in the index, yet the output is not filtered:

```python
ddf.to_parquet("parquet-filters.parquet", partition_on="B", engine="fastparquet")

dd.read_parquet("parquet-filters.parquet", engine="fastparquet", filters=[[("B", "=", 1)]]).compute()
```
```
              A  B
index
0      0.304616  0
1      0.626666  0
2      0.986920  0
3      0.585114  0
4      0.663830  0
...         ... ..
995    0.999620  3
996    0.223592  3
997    0.796024  3
998    0.063832  3
999    0.621195  3

[1000 rows x 2 columns]
```

Environment:
- Dask version: master
- Python version: 3.7