p2p shuffled pandas data takes more memory #10326

@mrocklin

Description

I observe that after calling set_index on the uber-lyft data with the p2p shuffle, my dataset takes up more memory than before. When I use the tasks shuffle, it doesn't. cc @hendrikmakait

Reproducible (but not minimal) example:

import dask
from dask.distributed import wait
import dask.dataframe as dd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
)

print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 100

df = df.set_index("request_datetime", shuffle="p2p").persist()

print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 200

If you try with shuffle="tasks" it doesn't expand that much. I haven't tried this without Arrow.

Metadata


    Labels: bug (Something is broken), needs triage (Needs a response from a contributor)
