Conversation
Previously `to_parquet` would either create a new `Scalar` or construct and compute a graph directly. Due to slight differences between these code paths a user could see different scheduling and performance behavior between: ``` df.to_parquet() df.to_parquet(compute=False).compute() ``` To fix this, we remove the branch and make these equivalent statements.
ian-r-rose
left a comment
There was a problem hiding this comment.
Thanks @jcrist.
To be honest, I'd be in favor of removing the immediate option all-together, though that would need a deprecation cycle.
Why? Most of our writing functions compute immediately (so if we changed things here we'd want to change them everywhere else as well), keeping this as is seems fine to me and changing it would be a major breaking change. |
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks @jcrist -- nice find
Due to slight differences between these
code paths a user could see different scheduling and performance
behavior between
Could you elaborate a bit more on these differences? It's still not clear to me why this change fixes the issue (though I'm sure you're correct and it does). Are graph optimizations not being properly applied? Does this mean compute_as_if_collection isn't behaving as expected?
Yes, I kind of wish they didn't, I routinely accidentally trigger immediate compute. It's a bigger discussion than this change here though. |
To be honest I didn't delve into what the difference was, just noticed that |
|
I was able to confirm that this fix indeed fixes the performance issue we were seeing (thank you @jcrist 🎉). That said, it's still not clear to me why |
Previously
to_parquetwould either create a newScalaror constructand compute a graph directly. Due to slight differences between these
code paths a user could see different scheduling and performance
behavior between:
To fix this, we remove the branch and make these equivalent statements.