Add name kwarg to from_zarr (#1)#4663
Conversation
|
For some context, currently the import zarr
import dask
import dask.array as da
a = zarr.array([1, 2, 3])
b = zarr.array([4, 5, 6])
a_dsk = da.from_zarr(a)
b_dsk = da.from_zarr(b)
print(a_dsk.name)
print(b_dsk.name)
print(dask.base.tokenize(a_dsk) == dask.base.tokenize(b_dsk))produces the following output on the current The goal of this PR is to provide a fix such that |
jrbourbeau
left a comment
There was a problem hiding this comment.
Thank you for the PR @mpeaton! Could you add a test to dask/array/tests/test_array_core.py to demonstrate that the changes here provide the desired behavior for from_zarr
|
cc @jakirkham as you may be interested based on your discussion in zarr-developers/zarr-python#202 |
|
I have run such a test locally. The capability is innate to the from_array module, so in the original pass-through design, a test was not needed. The bug fix required was to allow proper unique hashing. Here is the test code I have been using. I suppose it would be possible to built a separate pytest to test "again" the tokenizer, though I would think it to be redundant at present. With regards to how Dask handles creating unique hashes of large datasets in general, my intuition would be that this implies a bigger problem. For the sake of the immediate customer requirement, it seems low risk to utilize the proposed strategy for the bug fix, and devise a new strategy in a performance enhancement. At any rate, here is my current (external) unit test: |
|
I agree that def test_from_zarr_unique_name():
a = zarr.array([1, 2, 3])
b = zarr.array([4, 5, 6])
assert da.from_zarr(a).name != da.from_zarr(b).namewould act as a safeguard to ensure that this bug doesn't pop up again in the future. In addition, since this PR adds the new def test_from_zarr_name():
a = zarr.array([1, 2, 3])
assert da.from_zarr(a, name='foo').name == 'foo' |
|
You may want to rebase on master to take those image and arrow xfails out of the diff |
|
Why is appveryor having trouble with asciitree? I am running conda 4.6.8 locally, and the dependency of zarr on asciitree and fasteners is handled appropriately. I am new to appveyor, what is going on here? ModuleNotFoundError: No module named 'asciitree' |
jrbourbeau
left a comment
There was a problem hiding this comment.
Thanks for adding a test @mpeaton! I've included a few comments
I think we can avoid adding zarr to the Appveyor CI in this PR. While I'm in favor of this addition generally (thank you @mpeaton for noticing zarr is not being installed in the Appveyor environment), since it doesn't seem to be as simple as adding conda install zarr we can open up a separate issue for this.
Also there are a couple of minor code linting issues https://travis-ci.org/dask/dask/jobs/516317003#L3142-L3145
jrbourbeau
left a comment
There was a problem hiding this comment.
Looks like other commits are showing up here, you may want to rebase on master to take the parquet fix out of the diff.
Since we're adding the name keyword to from_zarr, could you add a test to ensure that when name is given, the output dask array has the specified name. Something like the following should work:
def test_from_zarr_name():
zarr = pytest.importorskip('zarr')
a = zarr.array([1, 2, 3])
assert da.from_zarr(a, name='foo').name == 'foo'|
The test failures showing up on Travis CI are related to a recent bokeh update and have been fixed in #4680. Other than updating the |
|
Checking in here. Is there anything left to do on this PR? |
|
There is one lingering (minor) comment about modifying the |
mpeaton
left a comment
There was a problem hiding this comment.
name: str, optional
The key name used for the array. Defaults to the string 'from-zarr-' prepended to a hash of x.
|
Thanks, I like your suggestion @mpeaton. One very minor comment, since there is no input parameter Also, just as a note, currently the output from Line 642 in a2ac894 |
|
@jrbourbeau I see. Would it be concise, yet accurate enough to replace x with url? |
Thinking about this again, perhaps the most accurate, yet concise, description would be a general statement about hashing the input (since we're Lines 134 to 135 in 91ce8c8 A similar statement would work here too. "An optional key name for the array. Defaults to hashing the input." Apologies for going back and forth on this point.
Data should be loaded lazily |
* Add name kwarg to from_zarr * added default from_zarr naming convention per jrbourbeau * flake8 fix
Co-Authored-By: mpeaton <mpeaton@users.noreply.github.com>
jrbourbeau
left a comment
There was a problem hiding this comment.
LGTM, merging this afternoon if there are no additional comments
|
Thanks @mpeaton! I noticed this was your first code contribution to this repository, welcome! |
* Add name kwarg to from_zarr (dask#1) * Add name kwarg to from_zarr * added default from_zarr naming convention per jrbourbeau * flake8 fix * Added test for from_zarr name hashing * added name test * update from_zarr docstring * grammer curl Co-Authored-By: mpeaton <mpeaton@users.noreply.github.com> * python-snappy
Add name kwarg to from_zarr
added default from_zarr naming convention per jrbourbeau
flake8 fix
flake8 dask