Conversation
A simplistic approach to handling zarr. Adding a note that this will fail if the chunking is not regular. Comments and opinions welcome.
mrocklin
left a comment
Generally seems sensible. A few small comments. Looking forward to seeing where this goes.
dask/array/core.py
Outdated
return nonzero(self)


def to_zarr(self, *args, **kwargs):
    to_zarr(self, *args, **kwargs)
The to_zarr function should probably return a delayed object if compute=False. This should pass it through
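A minimal sketch of the suggested pass-through, with a stand-in for the module-level function (all names below are hypothetical, not the PR's final code):

```python
def module_to_zarr(arr, *args, compute=True, **kwargs):
    # Stand-in for the module-level to_zarr: pretend it hands back a
    # delayed token when compute=False, and None after an eager store.
    return "delayed-token" if not compute else None


class Array:
    def to_zarr(self, *args, **kwargs):
        # Return (rather than discard) the result, so a Delayed object
        # produced with compute=False reaches the caller.
        return module_to_zarr(self, *args, **kwargs)
```

The only change from the quoted diff is the `return`, which is what lets `compute=False` round-trip a delayed object.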
dask/array/core.py
Outdated
    z = zarr.open_array(mapper, mode='r', **kwargs)
else:
    z = zarr.open_group(mapper, mode='r', **kwargs)[component]
return from_array(z, z.chunks, name='zarr-%s' % url)
Do we need to think about locks or Zarr synchronizers here? Or is it best to delegate that through kwargs?
I don't know what to do about that. If the chunks are guaranteed to exactly match, then there should not be any contention.
Not aware of any issues reading from Zarr in parallel (chunks matching or not).
Nope, no synchronization needed when reading.
    from hdfs3.mapping import HDFSMap
    return HDFSMap(fs, path)
else:
    raise ValueError('No mapper for protocol "%s"' % fs.protocol)
If only there was some sort of standard file system abstraction library that would have utility functions like good error messages when importing ...
z = zarr.open_group(mapper, mode=mode).create_dataset(
    component, shape=arr.shape, chunks=chunks, dtype=arr.dtype,
    **kwargs)
return store(arr, z, compute=compute, return_stored=return_stored)
Though we might want to provide the option to specify a lock for writing.
If the dask array to be stored has regular chunks, then I think a lock is not needed, because the writes will be aligned with the chunks in the newly created zarr array. In that case could probably pass lock=False to store().
Not sure what to do if dask array has irregular chunks. Storing should still be possible but writes may not be aligned with zarr array chunks. Is it worth doing something like arr = rechunk(arr, z.chunks) just to be sure writes will be aligned?
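The alignment idea can be sketched as follows; `zarr_chunks_for` is a hypothetical helper name, though taking the first chunk along each dimension is the trick the merged code in this thread also uses:

```python
def zarr_chunks_for(arr_chunks):
    """Choose the zarr chunk shape as the first chunk along each dimension
    of the dask array's chunk tuples. If the dask chunks are regular,
    every dask chunk then lands exactly on one zarr chunk, so writes are
    aligned and no lock is needed (lock=False can be passed to store)."""
    return tuple(c[0] for c in arr_chunks)
```

With irregular dask chunks, the options discussed above would be rechunking first (so writes align) or falling back to a lock.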
It sounds like we should rechunk before calling store? Presumably it's better to rechunk beforehand.
Actually, it might be nice to centralize that logic within store itself with an optional chunks= parameter. Presumably we might choose to align dask.array chunks so that they align with, but are possibly larger than, the given chunks.
I think that's exactly what I suggested in the chunks issue. Here we should apply rechunk as given, including the option of not rechunking, and error if the chunks are not regular.
Agreed using rechunk is much better than locking (that is what we do).
As to how irregular chunks should be dealt with, there is some discussion in issue ( #3302 ) about how to do this. Would be curious to hear your thoughts over there, @alimanfoo. 😉
A simple first pass solution may be just to raise if the chunks are irregular and request the user make them uniform.
Edit: Oops, sorry to duplicate @martindurant. Your post showed up after I posted this in the diff view. 😞
Thanks for doing this @martindurant. This looks great! :) Expect @alimanfoo would be interested in looking at this. ;)
@jakirkham is also trying to maintain a list of optional dependencies in #3456. Do we know the minimum version of Zarr for which this will work?
Would encourage requiring 2.2.0 to start unless there is some reason to start with an older version. There is enough of a difference between 2.1.x and 2.2.0 that it would probably just be easier to start with a recent requirement. Am pretty sure that is what xarray is using as well. Though @rabernat and @jhamman would know more.
dask/array/core.py
Outdated
----------
url: str
    Location of the data. This can include a protocol specifier like s3://
    for remote data.
I wonder if it's worth allowing this first argument to either be a string (in which case interpreted as url) or a mapping. Then you'd get full generality to use any type of store, including zip files, lmdb, etc.
A suitable mapping is any instance of collections.MutableMapping?
That sounds correct (relevant doc snippet below). There are some operations that can be optimized by defining some special internal methods, but they are not strictly required.
Note that any object implementing the MutableMapping interface from the collections module in the Python standard library can be used as a Zarr array store, as long as it accepts string (str) keys and bytes values.
ref: http://zarr.readthedocs.io/en/latest/api/storage.html#module-zarr.storage
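For illustration, a plain dict already satisfies that contract (in modern Python the ABC lives in collections.abc; older versions also exposed it from collections directly):

```python
from collections.abc import MutableMapping

# A plain dict is a MutableMapping, so zarr can use it as an in-memory
# store: chunk keys are strings, chunk payloads are raw bytes.
store = {}
store["data/0.0"] = b"\x00" * 16

print(isinstance(store, MutableMapping))  # True
```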
That said, I have seen a lot of work that you have been doing involving URLs and paths on the DataFrame side of the code base (though I am not too familiar with what is going on there). Maybe it's worth mentioning what you had in mind and how that compares to what has already been done with DataFrames.
FWIW I think you could check if the argument is a string, if so interpret as URL, otherwise assume it's a mapping (let duck typing occur naturally by passing through to zarr). But if you want an explicit check, all store classes in zarr do sub-class from MutableMapping, so should be fine too.
I think the first of those two suggestions makes the most sense to me.
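That check-string-else-duck-type dispatch might look like this (`resolve_store` and the `get_mapper` callable are illustrative names, not the PR's API):

```python
def resolve_store(url_or_store, get_mapper):
    """If given a string, treat it as a URL and build a key-value mapper
    for it via get_mapper (a stand-in for dask's protocol-aware helper).
    Anything else is assumed to already be a mapping and is passed through
    untouched, letting duck typing occur naturally inside zarr."""
    if isinstance(url_or_store, str):
        return get_mapper(url_or_store)
    return url_or_store
```

A quick check of both branches: `resolve_store({}, None)` returns the dict unchanged, while `resolve_store("s3://bucket/x", mapper_factory)` goes through the factory.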
dask/array/core.py
Outdated
    Data to store
url: str
    Location of the data. This can include a protocol specifier like s3://
    for remote data.
Again wonder if this could be string or mapping.
dask/array/core.py
Outdated
return Array(dsk, name, chunks, dtype=x.dtype)


def from_zarr(url, component=None, storage_options=None, **kwargs):
It would be nice to have an option to override the chunk size. Generally I have found the chunk size we choose for storage on disk and what we choose for Dask are different. That might not be everyone's case though, so having it default to the on-disk chunks seems sensible. Just the option to override would be good.
Should add that choosing a different chunk size to start is way faster than rechunking afterwards. So having the option is pretty important.
The docstring for the rechunk= parameter suggests that rechunk will allow various chunking schemes. The intent is to have None become the default, and to decide which scheme is best for it.
dask/array/core.py
Outdated
    False
    """
for chunks in chunkset:
    if len(chunks) < 2:
Minor point: From a readability perspective == 1 is a bit clearer.
Is it possible to have an empty array with no chunks?
Good question! If it is an empty array, chunkset is an empty tuple, meaning we don't enter this for-loop. If it is a 0-D array, then chunkset is ((0,),), so chunks would be length 1. For any higher-dimensional array, there would need to be at least one chunk per dimension with some length, meaning each would be at least length 1.
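A sketch of the regularity check being discussed, covering the empty and 0-D cases (`check_regular_chunks` here is an assumed name, not necessarily the PR's implementation):

```python
def check_regular_chunks(chunkset):
    """Return True if every dimension's chunks are uniform, except that
    the final chunk in a dimension may be smaller (a trailing remainder).
    An empty array gives chunkset == (), so the loop body never runs;
    a 0-d array gives ((0,),), where len(chunks) == 1."""
    for chunks in chunkset:
        if len(chunks) == 1:
            continue
        if len(set(chunks[:-1])) > 1:   # interior chunks must all match
            return False
        if chunks[-1] > chunks[0]:      # last chunk may only be smaller
            return False
    return True
```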
dask/array/core.py
Outdated
"""
import zarr
if rechunk is not False:
    arr = arr.rechunk(rechunk)
This seems to allow None here, which would cause an exception. What would we like to do if rechunk is None?
Right, I haven't implemented any of my suggestions into rechunk yet, but presumably we will need to pick a default. If preferred, this could be called "default".
See the example simple rechunker I pushed. This could genuinely be a useful one, but as I commented before, I expect there to be a few of these, and we can discuss which makes the best default.
Agreed that does seem useful.
Should we pick one for None at this stage? Also should we specify strs as valid arguments for rechunk then?
We can pick now, or I can continue to make some more.
Not sure if that belongs in this PR, though. The set of rechunkers ought to have expensive tests. At this point, I pushed it only so that you could see what I had in mind.
Sure. We can save the discussion for another PR. Thanks for sharing.
Perhaps this should happen in a future PR? If so we should probably remove these lines.
IIUC we decided to add this to avoid locking during writes. If I'm misunderstanding, please feel free to correct me. We can certainly consider locking or other options. rechunking seems easiest.
That said, there were a lot of ideas floating around about how best to rechunk in different scenarios and make these available to the user. So wouldn't want us to restrict ourselves early while we are still exploring that.
Personally don't have a strong feelings as to whether we keep this in this PR given that last point. We may want to add a note to the docstring about regular chunks being required if we don't supply this. As we already raise for irregular chunks with a nice error message, we should avoid a lot of problems that could come up.
Happy to defer to others on this.
So I pulled out the rechunk options for now, but implemented passing a mapping as suggested (simple string check).
Anything more suggested here?
Would it be possible to have an optional
Certainly possible, that would be equivalent to
It would be passing the
I hadn't realised that - now I understand your comment from before.
dask/array/core.py
Outdated
Rechunking to be applied to the array before storage, since zarr
requires a regular chunks scheme - passed to ``.rechunk()``.
If False, no rechunk operation is performed;
if the chunks are not regular, an exception is raised.
I think this text now accurately describes the situation: it says nothing about how to rechunk, only that you might need to do it. The exception message is readable, I'd like to think. Rechunking for storage should always be allowed, and any function that passes on to store should allow it.
Removing
docs/source/array-creation.rst
Outdated
The `zarr`_ format is a chunk-wise binary array storage file format, with a good selection
of encoding and compression options. Due to each chunk being stored in a separate file, it
is ideal for parallel access in both reading and writing (for the latter, if the dask array
Generally looks fine to me. Though kind of dealing with a fever ATM, so might not be the best reviewer. Sorry about that. Thanks for working on this @martindurant. Looking forward to using it. 😄
FWIW this is all looking good to me. Only thing I noticed is the overwrite_group parameter, I'm not sure this is quite right. I'll try to follow up with some explanation tomorrow.
dask/array/core.py
Outdated
mode = 'w' if overwrite_group else 'r+'
z = zarr.open_group(mapper, mode=mode).create_dataset(
    component, shape=arr.shape, chunks=chunks, dtype=arr.dtype,
    **kwargs)
I think it would be better to have an overwrite argument into this function, rather than an overwrite_group argument.
Then this whole if component is None: ... else: ... block could be replaced by:
z = zarr.create(shape=arr.shape, chunks=chunks, dtype=arr.dtype, store=mapper, path=component, overwrite=overwrite, **kwargs)
As well as being simpler code, this also would simplify the logic around whether to overwrite existing data, because the behaviour will be the same whether or not the component argument is provided. I.e., with this change, if the user provides overwrite=False, then if an array exists an exception will be raised. Conversely, if user provides overwrite=True, an existing array will be deleted and overwritten.
I didn't realise it could be done as simply as that!
Yeah the open... functions are sometimes a bit of a distraction.
Is it worth deprecating/consolidating them?
Yes worth considering. Raised zarr-developers/zarr-python#264 for discussion.
docs/source/changelog.rst
Outdated
Array
+++++

<<<<<<< HEAD
Think this crept in from a merge conflict resolution.
dask/array/core.py
Outdated
if component is None:
    z = zarr.open_array(mapper, mode='r', **kwargs)
else:
    z = zarr.open_group(mapper, mode='r', **kwargs)[component]
FWIW you could replace the if component is None else block here with:
z = zarr.Array(store=mapper, path=component, read_only=True, **kwargs)
dask/bytes/core.py
Outdated
def get_mapper(fs, path):
    # This is not the right way to do this.
dask/array/core.py
Outdated
If given array already exists, overwrite=False will cause an error,
where overwrite=True will replace the existing data.
compute, return_stored: see ``store()``
kwargs: passed to zarr's open functions, e.g., compression options
...passed to the zarr.create() function...
dask/array/core.py
Outdated
Passed to ``da.from_array``, allows setting the chunks on
initialisation, if the chunking scheme in the on-disc dataset is not
optimal for the calculations to follow.
kwargs: passed to zarr's open functions.
If we simplify below, then maybe update this doc line too. Although we probably don't need kwargs at all; there isn't really anything the user would want to pass, I think, although it doesn't hurt to leave it for future compat.
return from_array(z, chunks, name='zarr-%s' % url)


def to_zarr(arr, url, component=None, storage_options=None,
One more thing I forgot to say. The url argument could default to None. This would have the effect of creating a new in-memory array. The user would have to also provide return_stored=True to be able to receive the new array, so there could be a gotcha with the current default return_stored=False (maybe that should be True here?).
Might be convenient in the case where you want to compute a result without having to bother about putting any data down on disk. I.e., given some Dask array d, would be nice to be able to just do z = d.to_zarr() and get back an in-memory zarr array. In my work there's plenty of cases where I use in-memory zarr arrays, because it's quick and convenient and data are small enough with compression to fit in memory.
Don't mind if you'd rather leave as-is, just a thought.
The url argument could default to None. This would have the effect of creating a new in-memory array
I would think this is stretching the idea of what this method does, and it would be unexpected. A better approach maybe would be #2741 (i.e., a URL explicitly starting with 'memory://'), but the reason that PR is languishing is that it's not clear what should happen in distributed memory.
Would just passing in a Zarr Array instance for url work for that case? If so, maybe this is already possible with a slightly different set of parameters. ;)
FWIW I think it's good as-is. Might get a bit confusing if there is too much flexibility around what the url arg can be. And wouldn't gain much convenience, user would have to set up array themselves. My (lazy) use case is wanting to be able to do z = d.to_zarr(), which is like calling d.compute() but where the result is computed into an in-memory zarr array rather than numpy array. But happy to revisit that later, don't want to hold up this very nice PR :)
Took the liberty of resolving merge conflicts. Hope that is ok.
Sounds like we are happy with this. So going to get it in.
Thanks for working on this @martindurant. Very nice addition. Also thanks everyone for helping review. Will be great to play with this in the next release. :)
Thanks for the rebase and conversation while I was away :)
chunks = [c[0] for c in arr.chunks]
z = zarr.create(shape=arr.shape, chunks=chunks, dtype=arr.dtype,
                store=mapper, path=component, overwrite=overwrite, **kwargs)
return store(arr, z, compute=compute, return_stored=return_stored)
Missed that we weren't actually setting lock=False here. Fixing in PR ( #3607 ).