read_parquet graph transfer grows more than linearly with number of partitions #5321

@birdsarah

Description

dask 2.3.0

My calls to to_parquet were never starting: after waiting a long time I still didn't see any tasks appear in the "progress" bar, which, to my untrained eye, means the graph is taking a long time to build.

I copy-pasted the to_parquet code from dask/dataframe/io/parquet/core.py to build my own parts:

https://github.com/dask/dask/blob/master/dask/dataframe/io/parquet/core.py#L458-L481

    # write parts
    dwrite = delayed(engine.write_partition)
    parts = [
        dwrite(
            d,
            path,
            fs,
            filename,
            partition_on,
            write_metadata_file,
            fmd=meta,
            index_cols=index_cols,
            **kwargs_pass
        )
        for d, filename in zip(df.to_delayed(), filenames)
    ]

    # single task to complete
    out = delayed(lambda x: None)(parts)
    if write_metadata_file:
        out = delayed(engine.write_metadata)(
            parts, meta, fs, path, append=append, compression=compression
        )

I then manually submitted these in chunks via client.compute.
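A minimal sketch of that chunked submission. The `chunks` helper and the batch size of 100 are my own assumptions, not from the issue; `parts` is the list of delayed writes built above.

```python
from itertools import islice

def chunks(seq, size):
    """Yield successive batches of at most `size` items from seq."""
    it = iter(seq)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical usage with a dask.distributed Client:
# from distributed import Client
# client = Client()
# futures = [client.compute(batch) for batch in chunks(parts, 100)]
```

Submitting in batches keeps each graph handed to the scheduler small, which is what makes the per-batch graph-construction time observable in the first place.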

If I do 10 parts, my tasks appear after 27 s. If I do 100 parts, they appear after 4.5 minutes (270 s). My dataset has 4000 partitions, so at this rate it will be 180 minutes before my task graph even appears and starts executing.
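The 180-minute figure is a straight linear extrapolation from the numbers reported above:

```python
# Observed: building the graph for 100 parts takes ~270 s,
# i.e. ~2.7 s of graph-construction time per partition.
secs_per_part = 270 / 100

# Linear extrapolation to all 4000 partitions, in minutes:
eta_min = 4000 * 270 / 100 / 60
print(eta_min)  # -> 180.0
```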

For comparison, the full execution takes ~2 hours on dask 0.2.1.
