Skip to content

use shard shape when available in to_zarr#12105

Merged
jacobtomlinson merged 2 commits intodask:mainfrom
d-v-b:fix/use-shards-in-to_zarr
Oct 22, 2025
Merged

use shard shape when available in to_zarr#12105
jacobtomlinson merged 2 commits intodask:mainfrom
d-v-b:fix/use-shards-in-to_zarr

Conversation

@d-v-b
Copy link
Copy Markdown
Member

@d-v-b d-v-b commented Oct 21, 2025

This PR alters the behavior of to_zarr to use the shards attribute, when available, as the dask chunk shape when rechunking in to_zarr.

to_zarr currently rechunks to the chunk shape of the zarr array. Attempting to concurrently write chunks for a sharded array will lead to data loss and is not a safe default. For zarr v3 arrays that use sharding, the shards attribute and not the chunks is a the safe default, hence the changes in this PR.

@d-v-b
Copy link
Copy Markdown
Member Author

d-v-b commented Oct 21, 2025

cc @TomAugspurger

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Oct 21, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      9 files  ±0        9 suites  ±0   3h 9m 11s ⏱️ - 2m 19s
 18 126 tests +1   16 911 ✅ +1   1 215 💤 ±0  0 ❌ ±0 
162 362 runs  +9  150 283 ✅ +6  12 079 💤 +3  0 ❌ ±0 

Results for commit 2ea63bb. ± Comparison against base commit 287f149.

♻️ This comment has been updated with latest results.

@TomAugspurger
Copy link
Copy Markdown
Member

TomAugspurger commented Oct 21, 2025

Thanks. I think it's worth documenting this behavior, since this increases the number of tasks and rechunk can be an expensive operation (though in this case, rechunking ought not be too bad since IIUC the chunks from shards should all be contained with the original chunks).

@d-v-b
Copy link
Copy Markdown
Member Author

d-v-b commented Oct 21, 2025

since this increases the number of tasks

Does it? We are rechunking today to the chunk shape given by the chunks attribute of the zarr array. How should I estimate the additional tasks generated by rechunking to a larger shape (the shards attribute)? For dask arrays with larger chunks, the change in this PR seems like it might reduce the number of tasks.

@TomAugspurger
Copy link
Copy Markdown
Member

Nevermind. I think I just got the relationship between chunks and shards backwards (again) :/

Copy link
Copy Markdown
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @d-v-b and thanks for the review @TomAugspurger

@jacobtomlinson jacobtomlinson merged commit d5b0716 into dask:main Oct 22, 2025
23 of 24 checks passed
@d-v-b d-v-b deleted the fix/use-shards-in-to_zarr branch October 22, 2025 11:17
@d-v-b
Copy link
Copy Markdown
Member Author

d-v-b commented Oct 22, 2025

I do think @TomAugspurger is right that we need to document this behavior. Perhaps the best solution would be for dask to link to a dedicated "using zarr with parallel computing" section of the Zarr docs. I don't think we have one written yet.

@d-v-b
Copy link
Copy Markdown
Member Author

d-v-b commented Oct 22, 2025

something for a future PR

@TomAugspurger
Copy link
Copy Markdown
Member

Makes sense, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants