Do not force repartition of dataframes that are already aligned in align_partitions #9922

@epizut

Hi,

Following up on the Discourse thread "why align_partitions() use force=True?":

When passing several Dask dataframes to map_partitions(), it looks like the underlying align_partitions() forces a repartition of every dataframe, even if only a single one needs it.

This generates unnecessary graph nodes that slow down the packing, sending, unpacking, and scheduling stages.

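import dask
import dask.dataframe
import dask.datasets

dds = [dask.datasets.timeseries() for _ in range(5)]
# Only the last dataframe's partitioning differs
dds[-1] = dds[-1].repartition(npartitions=1)

def func(*args):
    # Return the first partition unchanged; the other inputs are ignored
    return args[0]

dd = dask.dataframe.map_partitions(func, *dds)
dd.dask  # inspect the resulting task graph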
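
To make the request concrete, here is a minimal, dask-free sketch of the check I have in mind. It is not Dask's implementation: the helper names `common_divisions` and `needs_repartition` are hypothetical, and it assumes the alignment target is the sorted union of all input divisions. Plain integers stand in for the timestamp divisions of the example above.

```python
from typing import Hashable, List, Sequence, Tuple


def common_divisions(all_divisions: Sequence[Sequence[Hashable]]) -> Tuple:
    """Sorted union of every input's divisions (assumed alignment target)."""
    return tuple(sorted(set().union(*all_divisions)))


def needs_repartition(all_divisions: Sequence[Sequence[Hashable]]) -> List[bool]:
    """One flag per input: True only if its divisions differ from the target."""
    target = common_divisions(all_divisions)
    return [tuple(divisions) != target for divisions in all_divisions]


# Four inputs share the same divisions; the last one was collapsed
# into a single partition, as in the reproducer above.
aligned = (0, 10, 20, 30)
single = (0, 30)

flags = needs_repartition([aligned, aligned, aligned, aligned, single])
print(flags)  # [False, False, False, False, True]
```

With such a check, only the input flagged `True` would be repartitioned, and the four already-aligned inputs would add no extra nodes to the graph.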
