Simplify `to_parquet` compute path by jcrist · Pull Request #8982 · dask/dask

jcrist · 2022-04-26T15:26:04Z

Previously `to_parquet` would either create a new `Scalar` or construct and compute a graph directly. Due to slight differences between these code paths a user could see different scheduling and performance behavior between:

    df.to_parquet()
    df.to_parquet(compute=False).compute()

To fix this, we remove the branch and make these equivalent statements.
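For illustration, here is a minimal pure-Python sketch of the single-path pattern described above. It is not the dask implementation: `LazyWrite`, `write_rows`, and `to_sink` are hypothetical names invented for this example, and the only assumption carried over from the description is that the eager call is defined in terms of the lazy one.

```python
class LazyWrite:
    """Hypothetical stand-in for the lazy object returned when compute=False."""

    def __init__(self, func, *args):
        self._func = func
        self._args = args

    def compute(self):
        # Run the deferred work and return its result.
        return self._func(*self._args)


def write_rows(rows, sink):
    # Toy "write" step: append every row to an in-memory sink.
    sink.extend(rows)
    return len(rows)


def to_sink(rows, sink, compute=True):
    """Toy writer with a single compute path (hypothetical, not dask code).

    The lazy object is always built first. compute=True simply calls
    .compute() on it, so both call styles execute the same code.
    """
    lazy = LazyWrite(write_rows, rows, sink)
    if compute:
        return lazy.compute()
    return lazy


eager_sink, lazy_sink = [], []
eager_result = to_sink([1, 2, 3], eager_sink)
lazy_result = to_sink([1, 2, 3], lazy_sink, compute=False).compute()
```

Because the eager call just delegates to `.compute()` on the lazy object, the two call styles cannot drift apart in this sketch, which is the property the description is after.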

ian-r-rose

Thanks @jcrist.

To be honest, I'd be in favor of removing the immediate option altogether, though that would need a deprecation cycle.

jcrist · 2022-04-26T15:43:03Z

> To be honest, I'd be in favor of removing the immediate option altogether

Why? Most of our writing functions compute immediately, so if we changed things here we'd want to change them everywhere else as well. Keeping this as is seems fine to me, and changing it would be a major breaking change.

jrbourbeau

Thanks @jcrist -- nice find

> Due to slight differences between these code paths a user could see different scheduling and performance behavior between

Could you elaborate a bit more on these differences? It's still not clear to me why this change fixes the issue (though I'm sure you're correct and it does). Are graph optimizations not being properly applied? Does this mean `compute_as_if_collection` isn't behaving as expected?

ian-r-rose · 2022-04-26T15:47:27Z

> Most of our writing functions compute immediately

Yes, and I kind of wish they didn't: I routinely trigger an immediate compute by accident. That's a bigger discussion than this change, though.

jcrist · 2022-04-26T15:56:34Z

> Could you elaborate a bit more on these differences? It's still not clear to me why this change fixes the issue

To be honest, I didn't delve into what the difference was. I just noticed that `compute_as_if_collection` wasn't necessary here, and since the `.compute()` version performed better, this change seemed good and sufficient. `compute_as_if_collection` is similar but not exactly equivalent to `compute`, and I'm not sure what exactly caused the difference. Happy to debug further if needed.

jrbourbeau · 2022-04-27T21:38:10Z

I was able to confirm that this change fixes the performance issue we were seeing (thank you @jcrist 🎉). That said, it's still not clear to me why `compute_as_if_collection` was performing so poorly. Super glad we've fixed things here in `to_parquet`, as it's a heavily used API. However, we use `compute_as_if_collection` in other places, and it'd be good to know whether we're running into similar performance issues there too.

github-actions bot added dataframe io labels Apr 26, 2022

jcrist mentioned this pull request Apr 26, 2022

read_parquet + shuffle + to_parquet performance observation coiled/benchmarks#79

Closed

ian-r-rose approved these changes Apr 26, 2022

jrbourbeau reviewed Apr 26, 2022

jcrist merged commit 2340887 into dask:main Apr 26, 2022

jcrist deleted the to-parquet-compute branch April 26, 2022 17:47

jrbourbeau mentioned this pull request Apr 26, 2022

Confirm upstream to_parquet performance fix coiled/benchmarks#93

Merged

jcrist self-assigned this Apr 27, 2022

ian-r-rose mentioned this pull request Apr 27, 2022

Investigate compute_as_if_collection for performance issues #8991

Closed

bryanwweber added the parquet label Apr 29, 2022

jcrist merged 1 commit into dask:main from jcrist:to-parquet-compute
