Skip to content

Delay creating metadata in to_zarr#5797

Merged
TomAugspurger merged 4 commits intodask:masterfrom
chrisroat:zarr_delay_metadata
Jan 20, 2020
Merged

Delay creating metadata in to_zarr#5797
TomAugspurger merged 4 commits intodask:masterfrom
chrisroat:zarr_delay_metadata

Conversation

@chrisroat
Copy link
Contributor

  • Tests added / passed
  • Passes black dask / flake8 dask

@chrisroat
Copy link
Contributor Author

Upon further reflection, it might be appropriate to always create the metadata in a delayed object. It's not clear to me that there should be immediate effects from the code that are outside of a running task. Would love some direction from the maintainers on whether to update this PR to always wrap array creation in a delayed object regardless of the value of compute.

Requesting a look over from @martindurant, who originally include this method in #3460

@TomAugspurger
Copy link
Member

Is there an issue with context for this change? Just seung-lab/cloud-volume#309?

@martindurant
Copy link
Member

This seems like a reasonable thing, although the code isn't too obvious, so would like some comments where creating the delayed call.

Is it possible that the metadata is created after the first chunk is ready, so zarr is unable to write that chunk?

Note that CI is failing because you haven't run black.

@chrisroat
Copy link
Contributor Author

Apologies for not running black. I thought I had, and only the svg file (not part of this PR) had changed. Added the cleanup.

@TomAugspurger There is no issue. I simply ran into a problem (opening 2k GCS files takes tens of minutes serially on one machine) and am attempting to solve it. The problem occurs for both zarr and cloudvolume, which is why I have 2 similar PRs open.

@martindurant Yes, the race condition is one thing I'm worried about. Also, I worry that it won't work with some combination of distributed/processes/threads.

I also updated comments. One substantive change in behavior is that the code no longer fails fast if overwrite=True and an array already exists. While potentially annoying, I believe that is the right solution and a good trade-off when opening many files.

Here are the differences in the task graphs with and without the delayed create task. I am fairly new to dask, so not postive how to read them. But it looks like the create call is done in parallel and the storage is dependent on create being done first. Thoughts?

Code:

import dask.array as da
task = da.zeros(2, chunks=1).to_zarr('/tmp/zarr_array', compute=False, overwrite=True)
task.visualize()

Old code:
download-1

New code:
download

@martindurant
Copy link
Member

That shows that the store tasks (which do the writing) depend on the create task having run, so all should be good, no race.

@chrisroat
Copy link
Contributor Author

My follow-up question is whether the create task should always be delayed. Is there any harm in that, since it is basically an invisible intermediate task?

@martindurant
Copy link
Member

Without checking the detail of da.store, I am not certain, but it seems reasonable to always delay.

@chrisroat
Copy link
Contributor Author

OK, I went ahead and made it always delayed.

Let me know what you think, and if we should solicit other opinions.

@chrisroat chrisroat changed the title Delay creating metadata in to_zarr(compute=False) Delay creating metadata in to_zarr Jan 17, 2020
@martindurant
Copy link
Member

Good for me, I think. @TomAugspurger , any thoughts?

@TomAugspurger
Copy link
Member

Yes, the race condition is one thing I'm worried about. Also, I worry that it won't work with some combination of distributed/processes/threads.

I'm also concerned about two threads trying to do creation stuff. Is Zarr threadsafe in this way?

@martindurant
Copy link
Member

Where are two threads creating, did I miss something?

@TomAugspurger
Copy link
Member

Mmm, perhaps I'm the one that's mistaken. The difference is that on master, we run zarr.create in the client when the graph is created. Now we run it on a worker when the graph is executed? But the actual ordering is still the same since later tasks depend on the delayed create?

@martindurant
Copy link
Member

Right, that's how it should now work.

@TomAugspurger
Copy link
Member

TomAugspurger commented Jan 20, 2020

OK, thanks for clarifying.

Thanks @chrisroat!

@TomAugspurger TomAugspurger merged commit a27f111 into dask:master Jan 20, 2020
jsignell pushed a commit that referenced this pull request Jun 3, 2021
The change to wrap zarr.create in dask.delayed was originally made by me in #5797.  I did this when I was still new to dask, and when it made sense (to me) to defer any file operations, including the initial metadata creation, until run time.

After having worked with the dask ecosystem for a while, I see that this is not the typical paradign for I/O.  A call to to_zarr should immediately create the zarr metadata, so as not to surprise others.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants