Description
Over here we identified a case where writing a dataframe with `to_parquet(compute=True)` was much slower (~10x) than calling `to_parquet(compute=False)` and then calling `compute()` on the resulting scalar.
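For anyone who wants to reproduce the comparison, here is a minimal timing sketch of the two write paths described above. The helper names (`best_time`, `compare_write_paths`) and the output paths are hypothetical and not part of dask. The sketch only assumes that `ddf` is a dask dataframe whose `to_parquet` accepts a `compute` keyword and returns an object with a `compute()` method when `compute=False`.

```python
import time


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls of `fn`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_write_paths(ddf, eager_path, lazy_path, repeats=3):
    """Time both ways of writing `ddf` to parquet.

    `ddf` is assumed to be a dask dataframe. Each repeat rewrites the same
    output directory, so use throwaway paths.
    """
    # Path 1: let to_parquet trigger the computation itself.
    eager = best_time(
        lambda: ddf.to_parquet(eager_path, compute=True), repeats
    )
    # Path 2: build the lazy result first, then compute it explicitly.
    lazy = best_time(
        lambda: ddf.to_parquet(lazy_path, compute=False).compute(), repeats
    )
    return {"compute=True": eager, "compute=False + compute()": lazy}
```

Calling `compare_write_paths(ddf, "eager.parquet", "lazy.parquet")` on a reasonably large dataframe should show whether the two paths still diverge. Taking the minimum over several repeats reduces noise from filesystem caching and scheduler warm-up.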
Those two code paths diverged more than they needed to, with one using `dask.base.compute_as_if_collection` and the other using a `dd.Scalar` directly. In #8982 we consolidated them to just use `Scalar`, and this seemingly fixed the issue. However, it's still concerning that `compute_as_if_collection` performed so poorly, since it is used in a number of places throughout the codebase. It could be that some optimizations are not surviving that code path.
Opening this issue to track follow-up investigations to #8982.