
A long time before jobs start running when reading parquet files #4701

@nmerket

Description

I am using dask-yarn on Amazon EMR almost exactly as described in the documentation, including your handy bootstrap script (thanks).


I have ~16,000 parquet files (~110 GB on disk) that I'm trying to load into a dask dataframe from s3. The operation

import dask.dataframe as dd
df = dd.read_parquet('s3://bucket/path/to/parquet-files')

hangs for longer than I have the patience to wait (more than an hour). The workers never light up in the dask dashboard.

So, to get some insight, I wrote a function that loads a single parquet file into a pandas dataframe and returns its memory usage. I then randomly selected 100 of the files and submitted that function as dask.delayed tasks, i.e.

dask.compute(*[dask.delayed(get_memory_usage_of_parquet_file)(p) for p in subset_of_100_parquet_s3_paths])

This takes about 3 minutes to run. About 5 seconds of that is the actual work being done by the workers. Each file takes up about 70-80 MB in memory, so it's not a memory overrun. (That's why I was checking this.)
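For reference, the check above looks roughly like the following sketch. The loader here is a stand-in so the example is self-contained; in my case `get_memory_usage_of_parquet_file` calls `pandas.read_parquet` on an S3 path, and the path names are illustrative, not real:

```python
import dask

# Stand-in for the real helper, which reads one parquet file from S3 into
# pandas and returns df.memory_usage(deep=True).sum(). Here we just pretend
# each file occupies ~75 MB in memory so the sketch runs anywhere.
def get_memory_usage_of_parquet_file(path):
    return 75 * 2**20

# Illustrative placeholders for the 100 randomly sampled S3 paths.
subset_of_100_parquet_s3_paths = [
    'file-%03d.parquet' % i for i in range(100)
]

# One delayed task per file; dask.compute runs them in parallel and
# returns a tuple of the per-file memory sizes.
tasks = [dask.delayed(get_memory_usage_of_parquet_file)(p)
         for p in subset_of_100_parquet_s3_paths]
sizes = dask.compute(*tasks)
```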

What am I missing? Is there some configuration change that will make this more performant?
