Delay creating metadata in to_zarr#5797
Delay creating metadata in to_zarr#5797TomAugspurger merged 4 commits intodask:masterfrom chrisroat:zarr_delay_metadata
Conversation
|
Upon further reflection, it might be appropriate to always create the metadata in a delayed object. It's not clear to me that there should be immediate effects from the code that are outside of a running task. Would love some direction from the maintainers on whether to update this PR to always wrap array creation in a delayed object regardless of the value of Requesting a look over from @martindurant, who originally include this method in #3460 |
|
Is there an issue with context for this change? Just seung-lab/cloud-volume#309? |
|
This seems like a reasonable thing, although the code isn't too obvious, so would like some comments where creating the delayed call. Is it possible that the metadata is created after the first chunk is ready, so zarr is unable to write that chunk? Note that CI is failing because you haven't run |
|
Apologies for not running black. I thought I had, and only the svg file (not part of this PR) had changed. Added the cleanup. @TomAugspurger There is no issue. I simply ran into a problem (opening 2k GCS files takes tens of minutes serially on one machine) and am attempting to solve it. The problem occurs for both zarr and cloudvolume, which is why I have 2 similar PRs open. @martindurant Yes, the race condition is one thing I'm worried about. Also, I worry that it won't work with some combination of distributed/processes/threads. I also updated comments. One substantive change in behavior is that the code no longer fails fast if Here are the differences in the task graphs with and without the delayed create task. I am fairly new to dask, so not postive how to read them. But it looks like the create call is done in parallel and the storage is dependent on create being done first. Thoughts? Code: |
|
That shows that the store tasks (which do the writing) depend on the create task having run, so all should be good, no race. |
|
My follow-up question is whether the create task should always be delayed. Is there any harm in that, since it is basically an invisible intermediate task? |
|
Without checking the detail of da.store, I am not certain, but it seems reasonable to always delay. |
|
OK, I went ahead and made it always delayed. Let me know what you think, and if we should solicit other opinions. |
to_zarr(compute=False)|
Good for me, I think. @TomAugspurger , any thoughts? |
I'm also concerned about two threads trying to do creation stuff. Is Zarr threadsafe in this way? |
|
Where are two threads creating, did I miss something? |
|
Mmm, perhaps I'm the one that's mistaken. The difference is that on master, we run |
|
Right, that's how it should now work. |
|
OK, thanks for clarifying. Thanks @chrisroat! |
The change to wrap zarr.create in dask.delayed was originally made by me in #5797. I did this when I was still new to dask, and when it made sense (to me) to defer any file operations, including the initial metadata creation, until run time. After having worked with the dask ecosystem for a while, I see that this is not the typical paradign for I/O. A call to to_zarr should immediately create the zarr metadata, so as not to surprise others.


black dask/flake8 dask