Investigate compute_as_if_collection for performance issues #8991

@ian-r-rose

Over here we identified a case where writing a dataframe with to_parquet(compute=True) was much slower (~10x) than calling to_parquet(compute=False) and then calling compute() on the resulting scalar.
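For anyone wanting to reproduce the comparison, here is a rough sketch of timing the two paths. It assumes dask and a parquet engine (e.g. pyarrow) are installed. The helper names (`time_call`, `write_eager`, `write_deferred`, `compare`) and the output directory layout are hypothetical and not taken from the original report, and the demo dataframe is unlikely to show the same ~10x gap.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of fn()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def write_eager(ddf, path):
    # Path 1: to_parquet triggers the computation itself.
    ddf.to_parquet(path, compute=True)


def write_deferred(ddf, path):
    # Path 2: build the write lazily, then compute the returned object.
    ddf.to_parquet(path, compute=False).compute()


def compare(base_dir):
    # Imports are local so the helpers above can be defined without dask installed.
    import os
    import dask.datasets

    ddf = dask.datasets.timeseries()  # small demo dataframe, not the original data
    eager = time_call(
        lambda: write_eager(ddf, os.path.join(base_dir, "eager")), repeats=1
    )
    deferred = time_call(
        lambda: write_deferred(ddf, os.path.join(base_dir, "deferred")), repeats=1
    )
    return eager, deferred
```

Calling `compare(some_empty_directory)` returns the two timings as a tuple. On a dask version that includes #8982 the two numbers would be expected to be close.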

Those two code paths were more different than they needed to be, with one using dask.base.compute_as_if_collection and the other using a dd.Scalar directly. In #8982 we consolidated them to just use Scalar, and this seemingly fixed the issue. However, it's still concerning that compute_as_if_collection had such poor performance, since it is used in a number of places throughout the codebase. It could be that some optimizations are not surviving the process.
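As a possible starting point for the follow-up, this sketch runs the same graph through both entry points on a small dask array. It assumes `compute_as_if_collection` accepts `(cls, dsk, keys)` and returns results in the same nesting as `keys`. `flatten` and `compare_paths` are hypothetical helper names, and a dask array is used only to keep the example small; the original report involved a dataframe write.

```python
def flatten(nested):
    """Flatten arbitrarily nested lists/tuples into one flat list."""
    out = []
    for item in nested:
        if isinstance(item, (list, tuple)):
            out.extend(flatten(item))
        else:
            out.append(item)
    return out


def compare_paths():
    # Local imports so that flatten() is usable without dask installed.
    import dask.array as da
    from dask.base import compute_as_if_collection

    x = da.arange(10, chunks=5) + 1

    # Path A: the collection's own compute().
    direct = x.compute().tolist()

    # Path B: hand the raw graph and keys to compute_as_if_collection,
    # which returns one result per key (here, one numpy array per chunk).
    per_chunk = compute_as_if_collection(
        da.Array, x.__dask_graph__(), x.__dask_keys__()
    )
    via_helper = [v for chunk in flatten(per_chunk) for v in chunk.tolist()]

    return direct, via_helper
```

Both lists should hold the same values. The interesting part for this issue would be wrapping each path in a timer or profiler on a larger graph to see where they diverge.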

Opening this issue to track follow-up investigations to #8982.
