Closed
Description
What happened: When setting a new index on a Dask DataFrame with the set_index method, some rows are dropped in certain cases where the DataFrame contains empty partitions. This only seems to happen when the argument sorted=True is used.
What you expected to happen: set_index should not drop rows.
Minimal Complete Verifiable Example:
import pandas as pd
import dask.dataframe as dd

# Create dataframe with empty partitions
data1 = dd.from_pandas(
    pd.DataFrame(
        index=[0, 1, 2],
        data={
            "datA": ["A1", "A2", "A3"],
            "datB": ["B1", "B2", "B3"],
            "datC": ["C1", "C2", "C3"],
            "new_index": [1, 2, 3],
        },
    ),
    npartitions=1,
)
data_dummy = dd.from_pandas(  # These data will be removed to create empty partitions
    pd.DataFrame(
        index=[9999],
        data={
            "datA": ["xxx"],
            "datB": ["xxx"],
            "datC": ["xxx"],
            "new_index": [9999],
        },
    ),
    npartitions=1,
)
data2 = dd.from_pandas(
    pd.DataFrame(
        index=[3],
        data={
            "datA": ["A4"],
            "datB": ["B4"],
            "datC": ["C4"],
            "new_index": [4],
        },
    ),
    npartitions=1,
)
ddf = dd.concat([data1, data_dummy, data2, data_dummy])
ddf = ddf[ddf["datA"] != "xxx"]

# Set index, the last row gets dropped in the result
print(ddf.set_index("new_index", sorted=True).compute())

The test DataFrame ddf looks like:
  datA datB datC  new_index
0   A1   B1   C1          1
1   A2   B2   C2          2
2   A3   B3   C3          3
3   A4   B4   C4          4
with partitions:
- Partition 0: indices 0, 1 and 2
- Partition 1: empty
- Partition 2: index 3
- Partition 3: empty
After set_index, the resultant DataFrame is:
          datA datB datC
new_index
1           A1   B1   C1
2           A2   B2   C2
3           A3   B3   C3
where the last row has been dropped.
This does not happen if sorted=True is not used:
print(ddf.set_index("new_index").compute())

          datA datB datC
new_index
1           A1   B1   C1
2           A2   B2   C2
3           A3   B3   C3
4           A4   B4   C4
Environment:
- Dask version: 2022.01.0
- Python version: 3.9.7
- Operating System: Ubuntu 16.04.6 LTS
- Install method (conda, pip, source): pip