Skip to content

Do not return delayed object from to_zarr#7738

Merged
jsignell merged 1 commit intodask:mainfrom
deisseroth-lab:chrisroat-no-zarr-delayed
Jun 3, 2021
Merged

Do not return delayed object from to_zarr#7738
jsignell merged 1 commit intodask:mainfrom
deisseroth-lab:chrisroat-no-zarr-delayed

Conversation

@chrisroat
Copy link
Contributor

@chrisroat chrisroat commented Jun 1, 2021

The change to wrap zarr.create in dask.delayed was originally made by me in #5797. I did this when I was still new to dask, and when it made sense (to me) to defer any file operations, including the initial metadata creation, until run time.

After having worked with the dask ecosystem for a while, I see that this is not the typical paradign for I/O. A call to to_zarr should immediately create the zarr metadata, so as not to surprise others.

  • Tests added / passed
  • Passes black dask / flake8 dask / isort dask

The change to wrap zarr.create in dask.delayed was originally made by me in #5797.  I did this when I was still new to dask, and when it made sense (to me) to defer any file operations, including the initial metadata creation, until run time.

After having worked with the dask ecosystem for a while, I see that this is not the typical paradign for I/O.  A call to to_zarr should immediately create the zarr metadata, so as not to surprise others.
@github-actions github-actions bot added the array label Jun 1, 2021
@chrisroat chrisroat marked this pull request as ready for review June 1, 2021 08:25
Copy link
Member

@jsignell jsignell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe why this behavior is preferable to returning a delayed object? Is there an issue that this adresses?

Copy link
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @chrisroat! Similar to @jsignell, I'm also wondering what motivated this change. It looks like this is reverting an earlier PR you authored #5797

@chrisroat
Copy link
Contributor Author

I updated the PR with the text fo the commit -- sorry, about that.

Basically, the current behavior runs contrary to the general form of I/O in dask. At graph construction time, the metadata is typically written -- e.g. the .zarray file is written. This seems to be what happens throughout the dask ecosystem with other output formats or with other dask types. By using delayed, the metadata isn't written until runtime -- i.e., it is done as a task.

There are pros/cons to either approach, and when I initially wrote this it helped me avoid a ton of I/O at graph construction time (think deleting terabytes if overwrite is True). But over time, I noticed that it runs contrary to everything else. For example, if the location is unwritable (no permissions, or an object already exists at the location and overwrite is False), the check isn't done until runtime.... which may be a long way into a calculation.

@jsignell
Copy link
Member

jsignell commented Jun 3, 2021

Thanks for the explanation @chrisroat! Do you anticipate running into the same issues that prompted you to make the original change in #5797? If so would it maybe work to add some documentation containing a workaround?

@chrisroat
Copy link
Contributor Author

I think the issues that arose would be something that people should/are already familiar with, since this is just reverting behavior to be in line with the rest of dask. For example, if one uses compute=False with dask.dataframe to_parquet.

I haven't though about the possible work-around, but one could wrap the to_zarr call in a delayed object, and that might have the same effect.

@jsignell
Copy link
Member

jsignell commented Jun 3, 2021

That works for me! I am merging now

@jsignell jsignell merged commit 498d639 into dask:main Jun 3, 2021
@chrisroat chrisroat deleted the chrisroat-no-zarr-delayed branch December 31, 2021 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants