p2p shuffled pandas data takes more memory #10326
Status: Closed (dask/distributed#7879)
Labels: bug (Something is broken), needs triage (Needs a response from a contributor)
Description
I observe that after I call set_index on the uber-lyft data with p2p, my dataset takes up more memory than before. When I use tasks, it doesn't. cc @hendrikmakait
Reproducible (but not minimal) example:

```python
import dask
from dask.distributed import wait
import dask.dataframe as dd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
)
print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 100

df = df.set_index("request_datetime", shuffle="p2p").persist()
print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 200
```

If you try with shuffle="tasks" it doesn't expand that much. I haven't tried this without Arrow.
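For anyone without S3 access who wants to see what the report is measuring: the comparison above boils down to summing `memory_usage(deep=True)` before and after `set_index`. A minimal local sketch with plain pandas (synthetic columns standing in for the uber-lyft data, no dask cluster involved) shows the baseline expectation that a set_index alone should not inflate deep memory usage, which is what makes the p2p-only growth surprising:

```python
import pandas as pd

# Hypothetical small frame standing in for the uber-lyft data.
df = pd.DataFrame({
    "request_datetime": pd.date_range("2023-01-01", periods=1000, freq="min"),
    "ride_id": [f"id-{i}" for i in range(1000)],  # object-dtype strings
})

# Same measurement as the reproducer: total deep memory in bytes.
before = int(df.memory_usage(deep=True).sum())

# A plain pandas set_index just promotes the column to the index;
# deep memory usage should stay roughly flat (it can even shrink
# slightly, since the old RangeIndex goes away).
after = int(df.set_index("request_datetime").memory_usage(deep=True).sum())

print(before, after)
```

With a dask DataFrame the same `memory_usage(deep=True).sum()` call returns a lazy scalar, hence the `.compute()` in the reproducer above.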