Skip to content

ensure that the shard shape is used as the default chunk shape for sharded Zarr arrays#12104

Merged
dcherian merged 7 commits intodask:mainfrom
d-v-b:fix/better-zarr-chunk-default
Oct 29, 2025
Merged

ensure that the shard shape is used as the default chunk shape for sharded Zarr arrays#12104
dcherian merged 7 commits intodask:mainfrom
d-v-b:fix/better-zarr-chunk-default

Conversation

@d-v-b
Copy link
Copy Markdown
Member

@d-v-b d-v-b commented Oct 20, 2025

Zarr V3 added the shards attribute to Zarr arrays. When the shards attribute is not None it defines the smallest unit of the array that is safe to write concurrently (outside of exceptional circumstances). Thus for these Zarr arrays the default chunk size of a dask array should be the shard shape (or an integer multiple of the shard shape).

This PR changes the behavior of from_array to check for a shards attribute, and if it is present, uses that attribute as the chunks parameter if chunks was previously set to auto. This ensures that Zarr arrays have a default dask chunking that is safe for concurrent writes.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Oct 20, 2025

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

      9 files  ±0        9 suites  ±0   3h 12m 44s ⏱️ - 2m 36s
 18 127 tests +1   16 912 ✅ +1   1 215 💤 ±0  0 ❌ ±0 
162 371 runs  +9  150 285 ✅ +7  12 086 💤 +2  0 ❌ ±0 

Results for commit 8b636de. ± Comparison against base commit 0743e08.

♻️ This comment has been updated with latest results.

@d-v-b
Copy link
Copy Markdown
Member Author

d-v-b commented Oct 21, 2025

cc @TomAugspurger

Copy link
Copy Markdown
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me. Looks like there are some conflicts to be resolved.

and (x.shards is not None)
and chunks == "auto"
):
chunks = x.shards
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question, and we'll want to document this.

and (x.shards is not None)
and chunks == "auto"
):
chunks = x.shards
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we assigning this to chunks rather than previous_chunks? I'd expect us to handle it the same as .chunks on a zarr v2 array.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous_chunks is a suggestion to normalize_chunks, which might be ignored in favor of satisfying other constraints like keeping below a memory target. But for sharded arrays, my judgment is that the shard shape defines a safe default value, which is a bit more strict than just a suggestion. Any multiple of the shard shape would work as well, but I don't think there's way to tell normalize_chunks "only use multiples of this chunk shape"?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't normalize_chunks determine a multiple of preferred_chunks for "auto"

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the time it does, but it prioritizes the memory limit parameter over keeping chunks contiguous:

chunks = normalize_chunks(
    "auto", shape=(300,), previous_chunks=(100,), limit="13B", dtype=np.uint8
)
# chunks are split to fit under the memory limit
assert chunks == ((*((13,) * 23), 1,),)

That being said, I think using previous_chunks = shards is still a better default so I made that change.

Should we guard against the possibility that the chunk memory limit (which I think is fetched from the global config if limit=None) results in splitting shards? I don't think there's currently a way to say "no limit", but we could set the limit to the memory size of a single shard.

@d-v-b
Copy link
Copy Markdown
Member Author

d-v-b commented Oct 29, 2025

Is there anything else we need to do here?

@jacobtomlinson
Copy link
Copy Markdown
Member

I'm happy for @dcherian to hit merge here if he has no further feedback.

@dcherian
Copy link
Copy Markdown
Collaborator

let's do it!

@dcherian dcherian merged commit 06e75c7 into dask:main Oct 29, 2025
23 of 24 checks passed
@d-v-b d-v-b deleted the fix/better-zarr-chunk-default branch October 29, 2025 16:32
@d-v-b
Copy link
Copy Markdown
Member Author

d-v-b commented Oct 29, 2025

thanks @dcherian!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants