dask 2.3.0
My calls to to_parquet never appeared to start: after waiting a long time I still didn't see the tasks show up in the "progress" bar, which, to my untrained eye, suggests the graph is taking a long time to build.
I copy-pasted the to_parquet code from dask/dataframe/io/parquet/core.py to build my own parts:
https://github.com/dask/dask/blob/master/dask/dataframe/io/parquet/core.py#L458-L481
```python
# write parts
dwrite = delayed(engine.write_partition)
parts = [
    dwrite(
        d,
        path,
        fs,
        filename,
        partition_on,
        write_metadata_file,
        fmd=meta,
        index_cols=index_cols,
        **kwargs_pass
    )
    for d, filename in zip(df.to_delayed(), filenames)
]

# single task to complete
out = delayed(lambda x: None)(parts)
if write_metadata_file:
    out = delayed(engine.write_metadata)(
        parts, meta, fs, path, append=append, compression=compression
    )
```
I then manually submitted these parts in chunks via client.compute.
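The chunked submission looks roughly like this; a minimal sketch, where the `chunked` helper and `chunk_size` are mine (not from dask), and the `client.compute` call from `dask.distributed` is shown only as a comment:

```python
# Hypothetical helper to split the list of delayed write tasks
# into fixed-size chunks before handing them to the scheduler.
def chunked(seq, size):
    """Yield successive slices of `seq` of length `size`."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Usage against the `parts` list built above (assumes a running
# dask.distributed Client):
#
# for chunk in chunked(parts, 100):
#     futures = client.compute(chunk)  # tasks take minutes to appear
```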
With 10 parts, my tasks appear after 27 s. With 100 parts, they appear after 4.5 minutes (270 s). My dataset has 4000 partitions, so at this rate it will be 180 minutes before my task graph even appears and starts executing.
For comparison, the full execution takes ~2 hours on dask 0.2.1.
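The extrapolation above is linear: both measurements work out to about 2.7 s of per-part overhead before any task appears:

```python
# Observed: 10 parts -> 27 s, 100 parts -> 270 s, i.e. ~2.7 s per part.
seconds_per_part = 270 / 100          # 2.7 s/part
n_partitions = 4000                   # size of the full dataset
total_seconds = seconds_per_part * n_partitions
total_minutes = total_seconds / 60    # 180.0 minutes before tasks appear
```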