Conversation
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

9 files ±0   9 suites ±0   3h 12m 44s ⏱️ -2m 36s

Results for commit 8b636de. ± Comparison against base commit 0743e08.

♻️ This comment has been updated with latest results.
jacobtomlinson
left a comment
This seems reasonable to me. Looks like there are some conflicts to be resolved.
dask/array/core.py (outdated):

```python
        and (x.shards is not None)
        and chunks == "auto"
    ):
        chunks = x.shards
```
@d-v-b There is an equivalent change to Xarray that would be good to make: https://github.com/pydata/xarray/blob/19f2973528df7e423aae85184f7c65bf7de9cb2e/xarray/backends/zarr.py#L858
TomAugspurger
left a comment
One question, and we'll want to document this.
dask/array/core.py (outdated):

```python
        and (x.shards is not None)
        and chunks == "auto"
    ):
        chunks = x.shards
```
Why are we assigning this to chunks rather than previous_chunks? I'd expect us to handle it the same as .chunks on a zarr v2 array.
`previous_chunks` is a suggestion to `normalize_chunks`, which might be ignored in favor of satisfying other constraints, like staying below a memory target. But for sharded arrays, my judgment is that the shard shape defines a safe default value, which is a bit stricter than just a suggestion. Any multiple of the shard shape would work as well, but I don't think there's a way to tell `normalize_chunks` "only use multiples of this chunk shape"?
Doesn't `normalize_chunks` determine a multiple of `previous_chunks` for "auto"?
Most of the time it does, but it prioritizes the memory limit parameter over keeping chunks contiguous:

```python
import numpy as np
from dask.array.core import normalize_chunks

chunks = normalize_chunks(
    "auto", shape=(300,), previous_chunks=(100,), limit="13B", dtype=np.uint8
)
# chunks are split to fit under the memory limit
assert chunks == ((*((13,) * 23), 1),)
```
That being said, I think using `previous_chunks = shards` is still a better default, so I made that change.
Should we guard against the possibility that the chunk memory limit (which I think is fetched from the global config if limit=None) results in splitting shards? I don't think there's currently a way to say "no limit", but we could set the limit to the memory size of a single shard.
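The guard floated above could be sketched roughly as follows. This is an illustration, not the PR's actual code: the shapes and dtype are made up, and the only idea being demonstrated is passing the memory size of one shard as the `limit` so that `normalize_chunks` has no reason to split below the shard shape.

```python
import math

import numpy as np
from dask.array.core import normalize_chunks

# Illustrative values standing in for a sharded Zarr v3 array.
shape = (300,)
shards = (100,)
dtype = np.dtype(np.uint8)

# Memory footprint of a single shard, used as the chunking limit so the
# global config limit cannot force chunks smaller than one shard.
shard_nbytes = math.prod(shards) * dtype.itemsize

chunks = normalize_chunks(
    "auto",
    shape=shape,
    previous_chunks=shards,
    limit=shard_nbytes,
    dtype=dtype,
)
# Every chunk should stay shard-aligned under this limit.
assert all(c % shards[0] == 0 for c in chunks[0])
```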
Is there anything else we need to do here?

I'm happy for @dcherian to hit merge here if he has no further feedback.

Let's do it!

Thanks @dcherian!
Zarr V3 added the `shards` attribute to Zarr arrays. When the `shards` attribute is not `None`, it defines the smallest unit of the array that is safe to write concurrently (outside of exceptional circumstances). Thus for these Zarr arrays the default chunk size of a dask array should be the shard shape (or an integer multiple of the shard shape).

This PR changes the behavior of `from_array` to check for a `shards` attribute, and if it is present, uses that attribute as the `chunks` parameter if `chunks` was previously set to `"auto"`. This ensures that Zarr arrays have a default dask chunking that is safe for concurrent writes.
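The check described above can be distilled into a small standalone sketch. The helper name `default_chunks` and the fake array class are hypothetical, purely for illustration; the real logic lives inside dask's `from_array`:

```python
import numpy as np


def default_chunks(x, chunks):
    # Hypothetical distillation of the check this PR adds: a sharded
    # Zarr v3 array opened with chunks="auto" defaults to its shard
    # shape, the smallest unit that is safe to write concurrently.
    shards = getattr(x, "shards", None)
    if shards is not None and chunks == "auto":
        return shards
    return chunks


class FakeShardedArray:
    # Stand-in for a Zarr v3 array with sharding enabled.
    shards = (100, 100)


# Sharded arrays get shard-shaped default chunks.
assert default_chunks(FakeShardedArray(), "auto") == (100, 100)
# Arrays without a shards attribute are unaffected.
assert default_chunks(np.ones((10, 10)), "auto") == "auto"
```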