I am using dask-yarn on Amazon EMR almost exactly as described in the documentation, including your handy bootstrap script (thanks).
I have ~16,000 Parquet files (~110 GB on disk) that I'm trying to load into a Dask dataframe from S3. The operation

```python
import dask.dataframe as dd

df = dd.read_parquet('s3://bucket/path/to/parquet-files')
```

hangs for longer than I have the patience to wait (more than an hour), and the workers never light up in the Dask dashboard.
So, to get some insight, I wrote a function that loads a single Parquet file into a pandas dataframe and returns its memory usage. I then randomly selected 100 of the files and submitted that function as a `dask.delayed` call for each one, i.e.

```python
import dask

tasks = [dask.delayed(get_memory_usage_of_parquet_file)(path)
         for path in subset_of_100_parquet_s3_paths]
results = dask.compute(*tasks)
```

This takes about 3 minutes to run, of which only about 5 seconds is actual work done by the workers. Each file takes up about 70-80 MB in memory, so it's not a memory overrun (that's why I was checking this).
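To put those numbers in perspective: if 100 files take ~3 minutes of wall time but only ~5 seconds of worker time, the remaining ~175 seconds is per-file overhead (S3 metadata reads, task scheduling, and so on). A back-of-envelope extrapolation, using only the figures reported above, suggests why the full read appears to hang:

```python
# Back-of-envelope extrapolation from the numbers reported above.
sample_files = 100
sample_wall_s = 180.0   # ~3 minutes of wall time for the 100-file sample
sample_work_s = 5.0     # ~5 seconds of actual worker time

# ~1.75 s of overhead per file
overhead_per_file_s = (sample_wall_s - sample_work_s) / sample_files

total_files = 16_000
estimated_total_s = total_files * overhead_per_file_s  # ~28,000 s
print(f"~{overhead_per_file_s:.2f} s overhead per file")
print(f"~{estimated_total_s / 3600:.1f} hours projected for all files")
```

At roughly 1.75 s of overhead per file, ~16,000 files project to nearly 8 hours, which is consistent with the read "hanging" for over an hour with no worker activity.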
What am I missing? Is there some configuration change that will make this more performant?
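One way to see why fewer, larger files (or partitions) might help is a toy cost model, purely illustrative and with made-up parameter values, in which total runtime is per-task overhead plus work parallelized across workers:

```python
def estimated_runtime_s(n_tasks, per_task_overhead_s, total_work_s, n_workers):
    """Toy model: per-task overhead scales with task count,
    while the real work parallelizes across the workers."""
    parallel_work_s = total_work_s / min(n_tasks, n_workers)
    return n_tasks * per_task_overhead_s + parallel_work_s

# Same total work under two partitioning choices (illustrative numbers only):
many_small = estimated_runtime_s(16_000, per_task_overhead_s=1.75,
                                 total_work_s=800, n_workers=32)
fewer_big = estimated_runtime_s(160, per_task_overhead_s=1.75,
                                total_work_s=800, n_workers=32)
print(many_small, fewer_big)  # overhead dominates in the many-small case
```

Under this model the work term is identical in both cases; only the overhead term changes, so consolidating many small files into fewer large ones cuts the projected runtime by roughly the ratio of task counts.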
