Support zarr sharding through create_array#12153
Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 9 files ± 0 9 suites ±0 3h 5m 1s ⏱️ +43s Results for commit daf2dcd. ± Comparison against base commit dce3ec5. ♻️ This comment has been updated with latest results. |
|
Zarr v3 is not supported in python 3.10. I will adjust the PR to take it into account. |
dask/array/core.py
Outdated
| arr, | ||
| url, | ||
| component=None, | ||
| shard_factors=None, |
There was a problem hiding this comment.
it might be easier to copy the signature for create_array as much as possible, which would argue for having chunks and shards parameters, both of which could default to "auto".
There was a problem hiding this comment.
I will wait for a bit more feedback, but I would be happy to make the change if everyone agrees.
There was a problem hiding this comment.
Ideally this would somehow match zarr-developers/zarr-python#3574 - if we go with shards_factor here, I would think the same API should be in zarr-python
There was a problem hiding this comment.
it might be easier to copy the signature for create_array as much as possible
This /feels/ right to me too. It's easier for dask to just do that. In Xarray, this escape hatch is the "encoding" kwarg which is passed directly to the storage backed (Zarr in this case). I wonder if a similar zarr_kwargs is better. anything in there gets forwarded to the user.
There was a problem hiding this comment.
if we go with something like zarr_kwargs, we could model zarr_kwargs as a typeddict version of the create_array signature. Sticking with whatever zarr-python is doing under the hood seems like a better approach than creating new parameters.
There was a problem hiding this comment.
sorry had a week of scverse conference inbetween. Picking this up now
|
Also @joshmoore, would be happy to hear thoughts about the API also for exposing the API in ome-zarr-py. |
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
|
I will make some additional changes and resolve the merge conflict, but I am inbetween flights at the moment. Feedback regarding Regarding tests, would it be safe to assume that these are mostly just taken care off already by |
yes |
jacobtomlinson
left a comment
There was a problem hiding this comment.
Overall this looks good to me.
@dcherian @d-v-b can you confirm you are happy with the API design of zarr_kwargs? That seems to be the only open review comment from you left.
@melonora could you take a look at the typing_extensions dependency, I think we need to be more explicit.
cc @TomAugspurger @jakirkham for visibility. You are familiar with zarr's implementation than I am and might have thoughts.
TomAugspurger
left a comment
There was a problem hiding this comment.
Looks good at a glance, thanks. Just a couple small questions.
|
@jacobtomlinson If current API is approved I can still increase coverage if required, though as stated by @d-v-b most of the actual behaviour can be assumed to be tested by zarr-python. If there are any additional comments to address I can do so tomorrow. |
jacobtomlinson
left a comment
There was a problem hiding this comment.
Overall I'm happy with this change. Thanks for iterating so much here.
I'm not worried about the failing checks, upstream is a little broken due to Pandas 3 changes, and coverage is fine for this situation.
It feels like the API design discussion has run it's course here. If @d-v-b or @dcherian have any more feedback it can happen in a follow up PR.
No problem, happy to get this over the line. Thanks for reviewing everyone! |
pre-commit run --all-filesDespite #11778 already being closed, dask did not support writing a sharded zarr array when not passing a url which was already a zarr array. The function
to_zarrused thezarr.createfunction which does not support the sharding API and is also targeted for deprecation. In this PR we switch tozarr.create_arrayinstead when zarr v3 is installed. For backward compatibility the old API call is still available and is equivalent to the old implementation.In
to_zarra regularly chunked array is already enforced so sharding is implemented by a parametershard_factorswhich multiplies the chunksize for each dimension by the corresponding shard factor, e.g. chunksize of (3,3) and shard_factors of (4,4) will result in shards being (12,12). If this results in a partial shard being written a warning is given to the user. Also if the chunk size is the same as the array shape, thenshard_factorsis set to 1 for each dimension if the user provided shard_factors as higher than 1.Additionally I refactored
to_zarrand some other functions to reduce their size. Additionally, for those parts of the codebase I touched, I replacedstorebyzarr_storeas there is a functionstorein the same script.@will-moore, @LucaMarconato, @d-v-b
I would be happy to discuss any implementations. Also, should I throw a
PerformanceWarninginstead when partial shards are written?