Skip to content

Support zarr sharding through create_array#12153

Merged
jacobtomlinson merged 17 commits intodask:mainfrom
melonora:create_array
Dec 10, 2025
Merged

Support zarr sharding through create_array#12153
jacobtomlinson merged 17 commits intodask:mainfrom
melonora:create_array

Conversation

@melonora
Copy link
Copy Markdown
Contributor

@melonora melonora commented Nov 12, 2025

Despite #11778 already being closed, dask did not support writing a sharded zarr array when not passing a url which was already a zarr array. The function to_zarr used the zarr.create function which does not support the sharding API and is also targeted for deprecation. In this PR we switch to zarr.create_array instead when zarr v3 is installed. For backward compatibility the old API call is still available and is equivalent to the old implementation.

In to_zarr a regularly chunked array is already enforced so sharding is implemented by a parameter shard_factors which multiplies the chunksize for each dimension by the corresponding shard factor, e.g. chunksize of (3,3) and shard_factors of (4,4) will result in shards being (12,12). If this results in a partial shard being written a warning is given to the user. Also if the chunk size is the same as the array shape, then shard_factors is set to 1 for each dimension if the user provided shard_factors as higher than 1.

Additionally I refactored to_zarr and some other functions to reduce their size. Additionally, for those parts of the codebase I touched, I replaced store by zarr_store as there is a function store in the same script.

@will-moore, @LucaMarconato, @d-v-b

I would be happy to discuss any implementations. Also, should I throw a PerformanceWarning instead when partial shards are written?

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Nov 12, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      9 files  ± 0        9 suites  ±0   3h 5m 1s ⏱️ +43s
 18 159 tests + 3   16 944 ✅ + 5   1 215 💤 ±0  0 ❌  - 2 
162 568 runs  +27  150 559 ✅ +21  12 009 💤 +8  0 ❌  - 2 

Results for commit daf2dcd. ± Comparison against base commit dce3ec5.

♻️ This comment has been updated with latest results.

@melonora melonora changed the title Create array Support dask sharding through create_array Nov 12, 2025
@melonora
Copy link
Copy Markdown
Contributor Author

Zarr v3 is not supported in python 3.10. I will adjust the PR to take it into account.

arr,
url,
component=None,
shard_factors=None,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcherian thoughts on this API?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be easier to copy the signature for create_array as much as possible, which would argue for having chunks and shards parameters, both of which could default to "auto".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will wait for a bit more feedback, but I would be happy to make the change if everyone agrees.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally this would somehow match zarr-developers/zarr-python#3574 - if we go with shards_factor here, I would think the same API should be in zarr-python

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be easier to copy the signature for create_array as much as possible

This /feels/ right to me too. It's easier for dask to just do that. In Xarray, this escape hatch is the "encoding" kwarg which is passed directly to the storage backed (Zarr in this case). I wonder if a similar zarr_kwargs is better. anything in there gets forwarded to the user.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we go with something like zarr_kwargs, we could model zarr_kwargs as a typeddict version of the create_array signature. Sticking with whatever zarr-python is doing under the hood seems like a better approach than creating new parameters.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry had a week of scverse conference inbetween. Picking this up now

@melonora
Copy link
Copy Markdown
Contributor Author

Also @joshmoore, would be happy to hear thoughts about the API also for exposing the API in ome-zarr-py.

melonora and others added 3 commits November 14, 2025 11:35
@melonora
Copy link
Copy Markdown
Contributor Author

I will make some additional changes and resolve the merge conflict, but I am inbetween flights at the moment. Feedback regarding zarr_kwargs would be appreciated. I still need to check the arguments of the old zarr.create as now there could be unexpected arguments.

Regarding tests, would it be safe to assume that these are mostly just taken care off already by zarr-python?

@d-v-b
Copy link
Copy Markdown
Member

d-v-b commented Dec 2, 2025

Regarding tests, would it be safe to assume that these are mostly just taken care off already by zarr-python?

yes

@melonora
Copy link
Copy Markdown
Contributor Author

melonora commented Dec 2, 2025

@d-v-b @dcherian If you could please have a look. The failing tests do not appear to be related to the PR.
Regarding ZarrKwargs, I did not add docstring but instead refer to the zarr.create_array documentation.
Thanks in advance for reviewing.

Copy link
Copy Markdown
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good to me.

@dcherian @d-v-b can you confirm you are happy with the API design of zarr_kwargs? That seems to be the only open review comment from you left.

@melonora could you take a look at the typing_extensions dependency, I think we need to be more explicit.

cc @TomAugspurger @jakirkham for visibility. You are familiar with zarr's implementation than I am and might have thoughts.

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good at a glance, thanks. Just a couple small questions.

@melonora
Copy link
Copy Markdown
Contributor Author

melonora commented Dec 9, 2025

@jacobtomlinson If current API is approved I can still increase coverage if required, though as stated by @d-v-b most of the actual behaviour can be assumed to be tested by zarr-python. If there are any additional comments to address I can do so tomorrow.

Copy link
Copy Markdown
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I'm happy with this change. Thanks for iterating so much here.

I'm not worried about the failing checks, upstream is a little broken due to Pandas 3 changes, and coverage is fine for this situation.

It feels like the API design discussion has run it's course here. If @d-v-b or @dcherian have any more feedback it can happen in a follow up PR.

@melonora
Copy link
Copy Markdown
Contributor Author

Overall I'm happy with this change. Thanks for iterating so much here.

I'm not worried about the failing checks, upstream is a little broken due to Pandas 3 changes, and coverage is fine for this situation.

It feels like the API design discussion has run it's course here. If @d-v-b or @dcherian have any more feedback it can happen in a follow up PR.

No problem, happy to get this over the line. Thanks for reviewing everyone!

@jacobtomlinson jacobtomlinson merged commit 068be28 into dask:main Dec 10, 2025
22 of 24 checks passed
@jacobtomlinson jacobtomlinson changed the title Support dask sharding through create_array Support zarr sharding through create_array Dec 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for sharding when storing dask arrays to zarr

6 participants