Speed up normalize_chunks for common case #10579

Merged: jrbourbeau merged 3 commits into dask:main from martindurant:array_norm_chunks_opt, Oct 25, 2023

Conversation

@martindurant (Member)

In certain xarray zarr/kerchunk reads with many variables, a considerable amount of time is spent in dask.array's normalize_chunks (see snakeviz output):

[Screenshot: snakeviz profile before the change, 2023-10-19 10:51]

For the case that a zarr-like chunksize was given, or a full chunks spec was already given, this is all unnecessary work. This PR skips it for such cases. The same part of the profile following the change looks like:

[Screenshot: snakeviz profile after the change, 2023-10-19 10:55]

@martindurant (Member, author)

bump @dask/dask-dev

@rsignell-usgs (Contributor)

@jrbourbeau this would really help speed up opening the 15TB coastal ocean dataset USGS has been working on. From 20s to 6s! It's headed for the AWS public dataset program!

@jrbourbeau (Member)

Nice, thanks @martindurant -- taking a look now

Comment on lines +3133 to +3134
if nonans or isinstance(sum(sum(_) for _ in chunks), int):
    return tuple(tuple(_) for _ in chunks)
Member

Non-blocking nit: The _ usage here is correct, but looks a little unusual. Any reason we can't do this, like we do elsewhere?

Suggested change:

-if nonans or isinstance(sum(sum(_) for _ in chunks), int):
-    return tuple(tuple(_) for _ in chunks)
+if nonans or isinstance(sum(sum(c) for c in chunks), int):
+    return tuple(tuple(c) for c in chunks)


nonans = None
if chunks and shape is not None:
    nonans = all(isinstance(c, int) for c in chunks)
Member

Is there a reason to use an int check instead of not math.isnan(c)? The nonans name seems slightly out of sync with what the code on this line actually checks.

Member Author

The c in this case can be tuples; we are checking specifically for individual ints.

Also, it's a bit faster, even:

In [9]: l = [1] * 100

In [10]: %timeit all(isinstance(x, int) for x in l)
3.41 µs ± 14.2 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [11]: %timeit all(not math.isnan(x) for x in l)
5.55 µs ± 8.38 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
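The type check also sidesteps a failure mode that the timing alone doesn't show. As a hypothetical illustration (not dask code), here is what happens when the chunks tuple mixes plain ints with per-dimension tuples, which is exactly the mixed input described above:

```python
import math

# chunks as passed in can mix plain ints with per-dimension tuples
chunks = (2, (3, 3, 3))

# the isinstance check simply reports False for the tuple entry
print(all(isinstance(c, int) for c in chunks))  # False

# math.isnan, by contrast, raises on a non-numeric entry
try:
    all(not math.isnan(c) for c in chunks)
except TypeError as exc:
    print("math.isnan rejects tuples:", exc)
```

So beyond being faster, isinstance degrades gracefully on mixed input where math.isnan would raise.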

"Got chunks=%s, shape=%s" % (chunks, shape)
)

if nonans or isinstance(sum(sum(_) for _ in chunks), int):
Member

What case is the or isinstance(sum(sum(_) for _ in chunks), int) part covering that's not already covered by the if nonans portion of this condition?

Member

I'm surprised this double summation isn't slow in the many data variables case

Member Author

It's the nonans check (now renamed allints) which is the very fast path.

However, the sum is still much faster. In the case where there are nans, it would be slightly slower.

In [16]: chunks = ((1, 2, 3, 4), ) * 10

In [18]: %timeit tuple(tuple(int(x) if not math.isnan(x) else np.nan for x in c) for c in chunks)
5.15 µs ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [19]: %timeit isinstance(sum(sum(_) for _ in chunks), int);tuple(tuple(ch) for ch in chunks)
1.34 µs ± 1.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Member Author

> What case is the or isinstance(sum(sum(_) for _ in chunks), int) part covering

This is for chunks like ((2, 2, 2), (3, 3, 3)), i.e., tuple of tuple of only ints.

As you see in the original snakeviz, this function is called three times under from_array; we can speed up inputs like (2, 3) (an original zarr-like chunksize) and ((2, 2, 2), (3, 3, 3)) (already normalised). I haven't dug in to see if there's a way to avoid calling normalize at all, since the code here is enough to make it drop out of my profiling.
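To make the already-normalised fast path concrete, here is a minimal sketch (a hypothetical helper, not the actual dask source) of the early return being discussed. The trick is that summing a tuple-of-tuples of ints yields an int, while any float("nan") makes the sum a float:

```python
def maybe_skip_normalize(chunks):
    # Hypothetical helper illustrating the PR's early return, not dask's API.
    # Input is assumed to be a tuple of per-dimension tuples, e.g.
    # ((2, 2, 2), (3, 3, 3)).  If every entry is an int, the double sum
    # is an int and we can return the normalized form immediately;
    # a nan anywhere makes the sum a float and we fall through.
    if isinstance(sum(sum(c) for c in chunks), int):
        return tuple(tuple(c) for c in chunks)
    return None  # fall through to the full normalize_chunks logic

print(maybe_skip_normalize(((2, 2, 2), (3, 3, 3))))       # ((2, 2, 2), (3, 3, 3))
print(maybe_skip_normalize(((2, float("nan")), (3, 3))))  # None
```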

@jrbourbeau changed the title to "Speed up normalize_chunks for common case" Oct 24, 2023
@martindurant (Member, author)

@jrbourbeau, thanks for the review.

I implemented your suggestions and answered your questions.
There may be a better way to avoid the time spent in this one function, but I didn't want to make changes further up the stack where they might have more consequences. I don't really understand why we have nans here at all, or why they might be something other than np.nan. For the case where we aren't using a chunksize, I don't really know why running this function is necessary at all and wouldn't have thought about it without the profiling.

@martindurant (Member, author)

Removing my stuff and simply using

    return tuple(tuple(ch) for ch in chunks)

on the last line would be much better, but there are two tests specifically passing float("nan") to check the current behaviour.
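For context, a sketch (illustrative, not the actual dask test code) of the behaviour those nan tests pin down: a plain passthrough keeps float entries as floats, while the current path (the coercion expression timed earlier in this thread) converts every non-nan value to int:

```python
import math

chunks = ((2.0, float("nan")),)

# plain passthrough: floats stay floats
passthrough = tuple(tuple(c) for c in chunks)
print(passthrough)   # ((2.0, nan),)

# current behaviour: coerce non-nan values to int, keep the nans
coerced = tuple(
    tuple(int(x) if not math.isnan(x) else float("nan") for x in c)
    for c in chunks
)
print(coerced)       # ((2, nan),)
```

This type coercion is what the simple one-line return would give up, which is presumably what those two tests guard.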

@rsignell

@jrbourbeau are you satisfied with the approach and justifications for the approach or do you think it needs more work?

@jrbourbeau (Member) left a comment:

Thanks @martindurant! This is in.

@jrbourbeau jrbourbeau merged commit 009489f into dask:main Oct 25, 2023