Reducing read_metadata output size in pyarrow/parquet #5391

mrocklin merged 7 commits into dask:master
Conversation
@rjzamora if you're willing to humor me, can you try one of these benchmarks also with the distributed scheduler? To do this you would add the following lines of code sometime before you run the benchmark:

```python
from dask.distributed import Client

client = Client()
```

I would expect this to accentuate the slowdown of moving large graphs around.
@birdsarah, this might help out some of your workloads (if you're still interested in engaging here). I suspect that this might have been slowing you down on many-partition workloads.
Absolutely - sorry, should have done this the first time around :)
OLD ~365 partitions / NEW ~365 partitions (click to expand):

```
/Users/rzamora/.local/lib/python3.7/site-packages/distributed/worker.py:2794: UserWarning: Large object of size 2.14 MB detected in task graph: (...)
    future = client.submit(func, big_data)     # bad
    big_future = client.scatter(big_data)      # good
    future = client.submit(func, big_future)   # good
  % (format_bytes(len(b)), s))
```

(The rest of the captured output was many repetitions of "During handling of the above exception, another exception occurred: Traceback (most recent call last):", omitted here.)
Woot! Nice result.
Thoughts on testing? Maybe we could test that the size of the serialized graph is small, like the following:

```python
df = ...  # some dataframe with lots of columns
df.to_parquet(tmppath)
df = dd.read_parquet(tmppath)
assert len(pickle.dumps(df.__dask_graph__())) < 10000  # or some other suitable number
```
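To make the intent of that bound concrete: a dask graph is essentially a dict of tasks, and embedding a large metadata object in every task inflates its pickled size. A stdlib-only sketch of the effect (the `heavy_graph`/`light_graph` names and sizes are illustrative, not taken from this PR):

```python
import pickle

# Stand-in for parsed Parquet metadata (~100 kB) that used to ride along
# in every task; the fix stores only a cheap path string per task instead.
big_metadata = b"x" * 100_000
heavy_graph = {("read", i): ("read_partition", big_metadata) for i in range(10)}
light_graph = {("read", i): ("read_partition", f"part.{i}.parquet") for i in range(10)}

heavy_size = len(pickle.dumps(heavy_graph))
light_size = len(pickle.dumps(light_graph))

# The lightweight graph stays well under the kind of bound suggested above.
assert light_size < 10_000 < heavy_size
```

A test along these lines catches regressions where heavyweight objects sneak back into the graph, without depending on exact serialization sizes.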
Update: I revised the fastparquet engine to avoid passing around row-group and …
@rjzamora I'm testing out this PR. The failing test makes me nervous. Can you contextualize?
Well, turns out I can only offer quick feedback as I can't actually use this branch due to s3fs problems, but I think this solves my large-partitions-with-metadata problem. Without this branch, on 2.4.0 I'm seeing "large item in task graph" and an unending wait for the job to start. With it, work starts in a timely manner. Small piece of feedback: reading is still slower, though. In this particular case I was reading 33 parquet files with hundreds to thousands of partitions, repartitioning to a smaller number, and writing out. The read of 33 separate days plus the concat step takes ~40s on dask 2.1.0 and ~1min 30s on dask 2.4.0 (sped up from ~1min 40s to ~1min 25s moving from dask 2.4.0 to this branch, but that might be a coincidence and not this branch). Not the end of the world, and with potential other performance savings it's liveable with; just an observation.
@birdsarah - Are you testing this PR with the fastparquet or pyarrow engine? Since the original issue was for pyarrow, and the fastparquet changes are causing CI failures, I was planning to revert the fastparquet changes and revisit them later (if needed). If the fastparquet changes are useful to you, I will do my best to address the CI problem (and perhaps the performance problem) here.
I tested on fastparquet, and that's what I'm using for all my data. Unfortunately, pyarrow did not support reading in the complex datatypes at the start of my pipeline, so I'm fastparquet all the way. Do not rush on my account. The fsspec/s3fs problems mean that anything after dask 2.1.0 is completely unusable for me. Last night I was hoping that fsspec/s3fs#237 would have solved the s3fs problems, but it hasn't. But without this PR, dask after 2.1.0 is also unusable for me when a data source has metadata but also a large number of partitions. Ironically, if there isn't metadata, which was my problem before, 2.4.0 is now speedy - thanks to …
I take it back @rjzamora: do rush on my account!! I will shortly be done writing large volumes of data to s3, which means I should be able to use the latest dask for analysis and, therefore, upgrade. Let me know what / when you'd like me to test.
@birdsarah - I believe the CI problem has been addressed. Unfortunately, I am not sure how to test the performance regression from 2.1.0 to this branch. I am personally finding a degradation in performance when I move to 2.1.0 -- can you provide a simple test case? Note that I just tried:

```python
ddf = dask.datasets.timeseries(
    start="2000-01-01",
    end="2000-12-31",
    freq="1S",
    partition_freq="1D",
    seed=42,
    id_lam=30,
)
```

Reading this dataset changed from ~1:06 to ~1:50 when I swapped out this branch for 2.1.0.
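For anyone reproducing that comparison, a minimal stdlib-only timing harness might look like the following (the `timed` helper and the placeholder workload are illustrative; the real benchmark would wrap a `dd.read_parquet(...).compute()` call):

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    # Record wall-clock seconds for the enclosed block under `label`.
    start = time.perf_counter()
    yield
    results[label] = time.perf_counter() - start


results = {}
with timed("read benchmark", results):
    total = sum(range(1_000_000))  # stand-in for ddf = dd.read_parquet(...); ddf.compute()
print(f"read benchmark: {results['read benchmark']:.2f}s")
```

Running the same harness once per environment (old dask vs. this branch) gives directly comparable wall-clock numbers.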
No, sorry - I can only report what I see on my cluster when switching between environments. There are so many moving pieces between s3fs=0.2.0 with dask=2.1.0 (env a) and the latest s3fs and dask (env b) that providing a rigorous isolated test case is something I simply don't have time for, especially given the limited value. I appreciate you trying to find one. The fact that we are seeing performance differences in opposite directions doesn't entirely surprise me, and these differences don't appear to be deal breakers (as compared to the fact that dask 2.4.0 can't read large datasets with _metadata right now).
Woot :D
@martindurant - Any chance you can take a look at this?
TomAugspurger left a comment:
This looks generally nice. Will give @martindurant a chance to look.
@martindurant last chance :) Merging this afternoon if there are no further comments.
On a quick glance, it looks very reasonable. I left two comments.
Thanks for reviewing @martindurant! Hopefully both of your suggestions have been addressed now.
I didn't look for long, but I think things are good.
Merging this in. Thanks @rjzamora for fixing this, and @birdsarah and @martindurant for review.
This is a possible fix for the large-dask-graph problem raised in #5357. We avoid passing around a `pq.ParquetDataset` "piece", and instead pass around the `path` and `partition_keys` members of each piece. This dramatically reduces the amount of metadata stored in the task graph. Since this solution does require the metadata to be parsed by each task in `read_partition`, I am including the results of a simple benchmark (performed on my local machine):
- OLD ~365 partitions: (output collapsed)
- NEW ~365 partitions: (output collapsed)
- OLD ~36k partitions: (output collapsed)
- NEW ~36k partitions: (output collapsed)
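The core idea — ship only the cheap identifying members of each piece rather than the whole object — can be sketched with plain Python. The `Piece` class, sizes, and paths below are illustrative stand-ins, not pyarrow's actual objects:

```python
import pickle


class Piece:
    """Stand-in for a dataset piece that drags parsed metadata along with it."""

    def __init__(self, path, partition_keys):
        self.path = path
        self.partition_keys = partition_keys
        self.metadata = b"m" * 50_000  # stand-in for per-piece row-group metadata


pieces = [Piece(f"part.{i}.parquet", [("year", 2000)]) for i in range(8)]

# Before: the whole piece object (metadata included) went into every task.
before = pickle.dumps([(p,) for p in pieces])

# After: only the cheap identifying members travel; the worker re-opens the
# dataset and reconstructs the piece inside read_partition.
after = pickle.dumps([(p.path, p.partition_keys) for p in pieces])

assert len(after) < len(before)
```

The trade-off is exactly the one noted above: the graph shrinks dramatically, but each `read_partition` task must re-parse the metadata it needs.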
- Passes `black dask` / `flake8 dask`