Skip to content

Simplify to_parquet compute path#8982

Merged
jcrist merged 1 commit intodask:mainfrom
jcrist:to-parquet-compute
Apr 26, 2022
Merged

Simplify to_parquet compute path#8982
jcrist merged 1 commit intodask:mainfrom
jcrist:to-parquet-compute

Conversation

@jcrist
Copy link
Copy Markdown
Member

@jcrist jcrist commented Apr 26, 2022

Previously to_parquet would either create a new Scalar or construct
and compute a graph directly. Due to slight differences between these
code paths a user could see different scheduling and performance
behavior between:

df.to_parquet()
df.to_parquet(compute=False).compute()

To fix this, we remove the branch and make these equivalent statements.

Previously `to_parquet` would either create a new `Scalar` or construct
and compute a graph directly. Due to slight differences between these
code paths a user could see different scheduling and performance
behavior between:

```
df.to_parquet()
df.to_parquet(compute=False).compute()
```

To fix this, we remove the branch and make these equivalent statements.
Copy link
Copy Markdown
Collaborator

@ian-r-rose ian-r-rose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jcrist.

To be honest, I'd be in favor of removing the immediate option all-together, though that would need a deprecation cycle.

@jcrist
Copy link
Copy Markdown
Member Author

jcrist commented Apr 26, 2022

To be honest, I'd be in favor of removing the immediate option all-together

Why? Most of our writing functions compute immediately (so if we changed things here we'd want to change them everywhere else as well), keeping this as is seems fine to me and changing it would be a major breaking change.

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jcrist -- nice find

Due to slight differences between these
code paths a user could see different scheduling and performance
behavior between

Could you elaborate a bit more on these differences? It's still not clear to me why this change fixes the issue (though I'm sure you're correct and it does). Are graph optimizations not being properly applied? Does this mean compute_as_if_collection isn't behaving as expected?

@ian-r-rose
Copy link
Copy Markdown
Collaborator

Most of our writing functions compute immediately

Yes, I kind of wish they didn't, I routinely accidentally trigger immediate compute. It's a bigger discussion than this change here though.

@jcrist
Copy link
Copy Markdown
Member Author

jcrist commented Apr 26, 2022

Could you elaborate a bit more on these differences? It's still not clear to me why this change fixes the issue

To be honest I didn't delve into what the difference was, just noticed that compute_as_if_collection wasn't necessary here, and since the .compute() version performed better this change seemed good and sufficient. compute_as_if_collection is similar but not exactly equal to compute, but I'm not sure what caused the difference exactly. Happy to debug further if needed.

@jcrist jcrist merged commit 2340887 into dask:main Apr 26, 2022
@jcrist jcrist deleted the to-parquet-compute branch April 26, 2022 17:47
@jcrist jcrist self-assigned this Apr 27, 2022
@jrbourbeau
Copy link
Copy Markdown
Member

I was able to confirm that this fix indeed fixes the performance issue we were seeing (thank you @jcrist 🎉). That said, it's still not clear to me why compute_as_if_collection was leading to things going so poorly. Super glad we've fixed things here in to_parquet as it's a highly used API. That said, we use compute_as_if_collection in other places. It'd be good to know if we're running into similar performance issues there too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants