Delay creating metadata in dask.to_cloudvolume #309
Merged
william-silversmith merged 2 commits into seung-lab:master from chrisroat:delay_metadata on Jan 20, 2020
Contributor
Author
I've updated this PR to always "delay" the initial creation of info and provenance metadata. In practice, all this does is create a dependency such that the metadata is written before the data. It simply moves metadata creation from dask graph-construction time to compute time. For immediate computation, this should have no visible effect. For delayed computation, this moves the creation to a worker. I think it is ready to go. The Travis failure seems unrelated.
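To illustrate the dependency described above, here is a minimal sketch using `dask.delayed`. It is not the code from this PR: `write_metadata`, `write_chunk`, and the layer path are hypothetical stand-ins that only record the order of calls, assuming that passing one delayed object as an argument to another is enough to express "metadata before data".

```python
import dask

written = []  # records the order in which the stand-in writes happen


def write_metadata(path):
    # Hypothetical stand-in for committing info/provenance for a layer.
    written.append(("metadata", path))
    return path


def write_chunk(path, chunk_id):
    # Hypothetical stand-in for uploading one block of data to the layer.
    written.append(("chunk", chunk_id))
    return chunk_id


# Graph construction only: nothing is executed on these lines.
meta = dask.delayed(write_metadata)("file:///tmp/example-layer")
# Each chunk task takes `meta` as an argument, so it depends on the metadata task.
chunks = [dask.delayed(write_chunk)(meta, i) for i in range(3)]

# No side effects have happened yet.
assert written == []

# Compute time: the metadata task runs once, then the chunk tasks.
results = dask.compute(*chunks)
```

Because every chunk task consumes the output of the single metadata task, the scheduler has to run the metadata write first, and it runs wherever the scheduler places it rather than on the machine that built the graph. If the chunks are never computed, the metadata is never written either.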
Contributor
lgtm, tests pass when running with the appropriate credentials
I was seeing a huge slowdown when opening many volumes for writing on a cloud filesystem. All the metadata files were being written from my machine instead of being distributed across workers.
In addition, when `compute=False`, it may not be the intention of the caller to ever write out data. The PR delays the writing of `info` or `provenance` until compute time.

For these reasons, I think we can delay writing the metadata. It's an open question to me if we should always wrap in `dask.delayed`, regardless of the value of `compute`. I believe this may be better, and am trying to open a discussion with the dask maintainers on a similar PR in their repo - dask/dask#5797