p2p shuffled pandas data takes more memory #10326

@mrocklin

Description

I observe that after calling set_index on the uber-lyft data with the p2p shuffle, my dataset takes up more memory than before. When I use the tasks shuffle, it doesn't. cc @hendrikmakait

Reproducible (but not minimal) example:

import dask
from dask.distributed import wait
import dask.dataframe as dd

dask.config.set({"dataframe.convert-string": True})  # use PyArrow strings by default

df = dd.read_parquet(
    "s3://coiled-datasets/uber-lyft-tlc/",
)

print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 100

df = df.set_index("request_datetime", shuffle="p2p").persist()

print(df.memory_usage(deep=True).sum().compute() / 1e9)  # about 200

If you try with shuffle="tasks" it doesn't expand that much. I haven't tried this without Arrow.

Metadata


    Labels: bug (Something is broken), needs triage (Needs a response from a contributor)
