
A long time before jobs start running when reading parquet files #4701

@nmerket

Description

I am using dask-yarn on Amazon EMR almost exactly as described in the documentation, including your handy bootstrap script (thanks).


I have ~16,000 parquet files (~110 GB on disk) that I'm trying to load into a dask dataframe from s3. The operation

import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/path/to/parquet-files')

hangs for longer than I have the patience to wait (more than an hour). The workers never light up in the dask dashboard.

So, to get some insight, I wrote a function that loads a single parquet file into a pandas dataframe and returns its memory usage. I then randomly selected 100 of the files and submitted that function as dask.delayed tasks, i.e.

dask.compute(*[dask.delayed(get_memory_usage_of_parquet_file)(p) for p in subset_of_100_parquet_s3_paths])

This takes about 3 minutes to run. About 5 seconds of that is the actual work being done by the workers. Each file takes up about 70-80 MB in memory, so it's not a memory overrun. (That's why I was checking this.)
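For reference, the check above looks roughly like the following sketch. The loader here is a stand-in so the example is self-contained; in my case `get_memory_usage_of_parquet_file` calls `pandas.read_parquet` on an S3 path, and the path names are illustrative, not real:

```python
import dask

# Stand-in for the real helper, which reads one parquet file from S3 into
# pandas and returns df.memory_usage(deep=True).sum(). Here we just pretend
# each file occupies ~75 MB in memory so the sketch runs anywhere.
def get_memory_usage_of_parquet_file(path):
    return 75 * 2**20

# Illustrative placeholders for the 100 randomly sampled S3 paths.
subset_of_100_parquet_s3_paths = [
    'file-%03d.parquet' % i for i in range(100)
]

# One delayed task per file; dask.compute runs them in parallel and
# returns a tuple of the per-file memory sizes.
tasks = [dask.delayed(get_memory_usage_of_parquet_file)(p)
         for p in subset_of_100_parquet_s3_paths]
sizes = dask.compute(*tasks)
```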

What am I missing? Is there some configuration change that will make this more performant?
