Support zarr sharding through create_array by melonora · Pull Request #12153 · dask/dask

melonora · 2025-11-12T15:33:10Z

Closes Support for sharding when storing dask arrays to zarr #11778
Tests added / passed
Passes pre-commit run --all-files

Although #11778 is already closed, dask could not write a sharded zarr array unless the url passed already pointed to an existing zarr array. The function to_zarr used zarr.create, which does not support the sharding API and is also targeted for deprecation. This PR switches to zarr.create_array when zarr v3 is installed. For backward compatibility, the old API call remains available and behaves the same as the old implementation.
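A minimal sketch of the version switch described above, assuming the decision is made on the major version of the installed zarr package. The helper name `pick_zarr_creator` is hypothetical and is not taken from the PR; it only returns the name of the zarr function that would be used, so it runs without zarr installed.

```python
def pick_zarr_creator(zarr_version: str) -> str:
    """Return the name of the zarr creation function to use.

    Hypothetical helper for illustration only: the PR may detect the
    zarr version differently.
    """
    # Take the leading component of a version string such as "3.1.3".
    major = int(zarr_version.split(".")[0])
    # zarr v3 provides create_array, which supports sharding;
    # older versions fall back to the legacy zarr.create call.
    return "create_array" if major >= 3 else "create"


print(pick_zarr_creator("3.1.3"))
print(pick_zarr_creator("2.18.2"))
```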

In to_zarr a regularly chunked array is already enforced, so sharding is implemented through a shard_factors parameter, which multiplies the chunksize in each dimension by the corresponding shard factor. For example, a chunksize of (3, 3) with shard_factors of (4, 4) results in shards of (12, 12). If this results in a partial shard being written, a warning is given to the user. Also, if the chunk size is the same as the array shape, shard_factors is set to 1 for each dimension, even if the user provided values higher than 1.
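The arithmetic above can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: the function name `shard_shape` is hypothetical, and a "partial shard" is assumed here to mean that an array dimension is not an exact multiple of the shard size.

```python
import warnings


def shard_shape(array_shape, chunksize, shard_factors):
    """Compute the shard shape from a regular chunksize and shard factors.

    Hypothetical helper; names and the partial-shard check are assumptions.
    """
    if not (len(array_shape) == len(chunksize) == len(shard_factors)):
        raise ValueError("all arguments must have one entry per dimension")

    # A single chunk covering the whole array cannot be grouped further,
    # so the factors are reset to 1 in every dimension.
    if tuple(chunksize) == tuple(array_shape):
        shard_factors = (1,) * len(chunksize)

    # Each shard spans `factor` chunks along its dimension.
    shards = tuple(c * f for c, f in zip(chunksize, shard_factors))

    # Assumed definition of a partial shard: the array does not divide
    # evenly into shards along at least one dimension.
    if any(dim % s != 0 for dim, s in zip(array_shape, shards)):
        warnings.warn(
            f"shards {shards} do not evenly divide array shape {tuple(array_shape)}; "
            "partial shards will be written",
            UserWarning,
            stacklevel=2,
        )
    return shards


print(shard_shape((24, 24), (3, 3), (4, 4)))  # (12, 12)
```

With an array shape of (30, 30) and the same chunksize and factors, the sketch still returns (12, 12) but emits the warning, because 30 is not a multiple of 12.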

I also refactored to_zarr and some other functions to reduce their size. In the parts of the codebase I touched, I renamed store to zarr_store, as there is a function named store in the same module.

@will-moore, @LucaMarconato, @d-v-b

I would be happy to discuss any of the implementation choices. Also, should I raise a PerformanceWarning instead when partial shards are written?

github-actions · 2025-11-12T16:10:09Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

9 files ±0 | 9 suites ±0 | 3h 5m 1s ⏱️ +43s
18 159 tests +3 | 16 944 ✅ +5 | 1 215 💤 ±0 | 0 ❌ -2
162 568 runs +27 | 150 559 ✅ +21 | 12 009 💤 +8 | 0 ❌ -2

Results for commit daf2dcd. ± Comparison against base commit dce3ec5.

♻️ This comment has been updated with latest results.

melonora · 2025-11-12T17:12:04Z

Zarr v3 is not supported on Python 3.10. I will adjust the PR to take this into account.

d-v-b · 2025-11-14T09:13:09Z

dask/array/core.py

    arr,
    url,
    component=None,
+    shard_factors=None,


@dcherian thoughts on this API?

it might be easier to copy the signature for create_array as much as possible, which would argue for having chunks and shards parameters, both of which could default to "auto".

I will wait for a bit more feedback, but I would be happy to make the change if everyone agrees.

Ideally this would somehow match zarr-developers/zarr-python#3574. If we go with shard_factors here, I would think the same API should be in zarr-python.

> it might be easier to copy the signature for create_array as much as possible

This /feels/ right to me too. It's easier for dask to just do that. In Xarray, this escape hatch is the "encoding" kwarg, which is passed directly to the storage backend (Zarr in this case). I wonder if a similar zarr_kwargs is better. Anything in there gets forwarded to the user.

If we go with something like zarr_kwargs, we could model zarr_kwargs as a TypedDict version of the create_array signature. Sticking with whatever zarr-python is doing under the hood seems like a better approach than creating new parameters.
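A rough sketch of what a TypedDict-based zarr_kwargs could look like. The fields shown are an assumed subset of the zarr.create_array parameters chosen for illustration; the actual ZarrKwargs definition in the PR may differ, and the merge at the end is only a hypothetical example of forwarding the dictionary.

```python
from typing import Any, Optional, Tuple, TypedDict, Union


class ZarrKwargs(TypedDict, total=False):
    """Illustrative subset of create_array keyword arguments (assumed)."""

    # total=False makes every key optional, like keyword arguments.
    chunks: Union[Tuple[int, ...], str]
    shards: Optional[Union[Tuple[int, ...], str]]
    compressors: Any
    fill_value: Any
    dimension_names: Tuple[str, ...]


# A caller supplies only the options they care about.
zarr_kwargs: ZarrKwargs = {"chunks": (3, 3), "shards": (12, 12)}

# Hypothetical forwarding step: values derived from the dask array are
# combined with the user-supplied options before calling create_array.
create_array_kwargs = {"shape": (24, 24), "dtype": "f8", **zarr_kwargs}
print(create_array_kwargs)
```

Because a TypedDict is a plain dict at runtime, it can be unpacked with `**` into the underlying call while still giving type checkers the set of accepted keys.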

Sorry, I had a week of scverse conference in between. Picking this up now.

melonora · 2025-11-14T10:31:09Z

Also @joshmoore, I would be happy to hear your thoughts about the API, also with a view to exposing it in ome-zarr-py.

Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>

melonora · 2025-11-28T00:41:50Z

I will make some additional changes and resolve the merge conflict, but I am in between flights at the moment. Feedback regarding zarr_kwargs would be appreciated. I still need to check the arguments of the old zarr.create, as there could now be unexpected arguments.

Regarding tests, would it be safe to assume that these are mostly just taken care of already by zarr-python?

d-v-b · 2025-12-02T12:42:02Z

> Regarding tests, would it be safe to assume that these are mostly just taken care of already by zarr-python?

yes

melonora · 2025-12-02T23:47:11Z

@d-v-b @dcherian If you could please have a look. The failing tests do not appear to be related to the PR.
Regarding ZarrKwargs, I did not add a docstring but instead refer to the zarr.create_array documentation.
Thanks in advance for reviewing.

jacobtomlinson

Overall this looks good to me.

@dcherian @d-v-b can you confirm you are happy with the API design of zarr_kwargs? That seems to be the only open review comment from you left.

@melonora could you take a look at the typing_extensions dependency? I think we need to be more explicit.

cc @TomAugspurger @jakirkham for visibility. You are more familiar with zarr's implementation than I am and might have thoughts.

TomAugspurger

Looks good at a glance, thanks. Just a couple small questions.

melonora · 2025-12-09T22:54:23Z

@jacobtomlinson If the current API is approved, I can still increase coverage if required, though as stated by @d-v-b most of the actual behaviour can be assumed to be tested by zarr-python. If there are any additional comments to address, I can do so tomorrow.

jacobtomlinson

Overall I'm happy with this change. Thanks for iterating so much here.

I'm not worried about the failing checks, upstream is a little broken due to Pandas 3 changes, and coverage is fine for this situation.

It feels like the API design discussion has run its course here. If @d-v-b or @dcherian have any more feedback it can happen in a follow-up PR.

melonora · 2025-12-10T10:14:49Z

> Overall I'm happy with this change. Thanks for iterating so much here.
>
> I'm not worried about the failing checks, upstream is a little broken due to Pandas 3 changes, and coverage is fine for this situation.
>
> It feels like the API design discussion has run its course here. If @d-v-b or @dcherian have any more feedback it can happen in a follow-up PR.

No problem, happy to get this over the line. Thanks for reviewing everyone!

melonora added 2 commits November 12, 2025 15:58

support creating sharded zarr array

e3640df

ensure shard factor of 1 if chunksize == array.shape

27d402b

melonora changed the title ~~Create array~~ Support dask sharding through create_array Nov 12, 2025

jacobtomlinson requested a review from dcherian November 13, 2025 11:24

make compatible with zarr v2

7d760af