Do not return delayed object from to_zarr#7738
Conversation
The change to wrap zarr.create in dask.delayed was originally made by me in #5797. I did this when I was still new to dask, and when it made sense (to me) to defer any file operations, including the initial metadata creation, until run time. After having worked with the dask ecosystem for a while, I see that this is not the typical paradign for I/O. A call to to_zarr should immediately create the zarr metadata, so as not to surprise others.
jsignell
left a comment
There was a problem hiding this comment.
Can you describe why this behavior is preferable to returning a delayed object? Is there an issue that this adresses?
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks for the PR @chrisroat! Similar to @jsignell, I'm also wondering what motivated this change. It looks like this is reverting an earlier PR you authored #5797
|
I updated the PR with the text fo the commit -- sorry, about that. Basically, the current behavior runs contrary to the general form of I/O in dask. At graph construction time, the metadata is typically written -- e.g. the .zarray file is written. This seems to be what happens throughout the dask ecosystem with other output formats or with other dask types. By using delayed, the metadata isn't written until runtime -- i.e., it is done as a task. There are pros/cons to either approach, and when I initially wrote this it helped me avoid a ton of I/O at graph construction time (think deleting terabytes if overwrite is True). But over time, I noticed that it runs contrary to everything else. For example, if the location is unwritable (no permissions, or an object already exists at the location and overwrite is False), the check isn't done until runtime.... which may be a long way into a calculation. |
|
Thanks for the explanation @chrisroat! Do you anticipate running into the same issues that prompted you to make the original change in #5797? If so would it maybe work to add some documentation containing a workaround? |
|
I think the issues that arose would be something that people should/are already familiar with, since this is just reverting behavior to be in line with the rest of dask. For example, if one uses I haven't though about the possible work-around, but one could wrap the to_zarr call in a delayed object, and that might have the same effect. |
|
That works for me! I am merging now |
The change to wrap zarr.create in dask.delayed was originally made by me in #5797. I did this when I was still new to dask, and when it made sense (to me) to defer any file operations, including the initial metadata creation, until run time.
After having worked with the dask ecosystem for a while, I see that this is not the typical paradign for I/O. A call to to_zarr should immediately create the zarr metadata, so as not to surprise others.
black dask/flake8 dask/isort dask