improved the efficiency of matmul() by completely removing concatenation #8423
Can one of the admins verify this patch?
Hi @dask I need some help debugging this PR for the unit test. The bug originates from using a list.

EDIT: This bug has been fixed. Basically, I changed the line `out = chunk.sum(a, dtype=dtype, axis=0)` into:

```python
item_type = type(a[0])
if item_type is np.ndarray:
    out = chunk.sum(a, dtype=dtype, axis=0)
else:
    a = [item.__array__() for item in a]
    out = chunk.sum(a, dtype=dtype, axis=0)
    out = item_type(out)
```

A bit crude, but it passes the tests. Unfortunately, many CI tests are still breaking after GitHub crashed 2 days ago.
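For illustration, here is a standalone sketch of that wrap/unwrap workaround. The `Wrapped` class and `sum_chunks` helper are invented for this example (`Wrapped` stands in for duck-array types like `EncapsulateNDArray` that only expose their data through `__array__`):

```python
import numpy as np

class Wrapped:
    """A tiny mock duck array (a stand-in for types like
    EncapsulateNDArray) that exposes its data via __array__."""
    def __init__(self, arr):
        self.arr = np.asarray(arr)

    def __array__(self, dtype=None):
        return self.arr if dtype is None else self.arr.astype(dtype)

def sum_chunks(a, dtype=None):
    # Mirrors the workaround above: unwrap non-ndarray chunks,
    # sum them element-wise along axis 0, then rewrap the result.
    item_type = type(a[0])
    if item_type is np.ndarray:
        return np.sum(a, dtype=dtype, axis=0)
    arrs = [item.__array__() for item in a]
    return item_type(np.sum(arrs, dtype=dtype, axis=0))

chunks = [Wrapped([[1, 2]]), Wrapped([[3, 4]]), Wrapped([[5, 6]])]
out = sum_chunks(chunks, dtype="int64")
assert type(out) is Wrapped
assert np.asarray(out).tolist() == [[9, 12]]
```

The rewrap via `item_type(out)` is what preserves the chunk's duck-array type across the sum.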
Yay! CI tests are now operational. Thanks @dask admin!

Sorry for the lack of response @ParticularMiner. I'll leave review to @ravwojdyla or maybe @eriknw
ravwojdyla
left a comment
great work @ParticularMiner, see some questions/comments below.
dask/array/routines.py (Outdated):

```python
dtype = getattr(np.zeros(1, dtype=a.dtype).sum(), "dtype", object)

if a.shape[axis] == 1:
    return a.reshape(_shape_minus_axis(a.shape, axis))
```
This reshape could probably be just a squeeze with axis parameter, might be a bit more clear that way?
Aha! You're precisely right. I forgot all about squeeze. I will change that.
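As a quick numpy check of the reviewer's point: when the axis has size 1, a reshape to the shape with that axis dropped and a `squeeze` with an explicit axis agree (shapes here are illustrative, not the PR's actual code):

```python
import numpy as np

a = np.arange(6).reshape(3, 1, 2)  # size-1 axis at position 1
axis = 1

# What _shape_minus_axis + reshape effectively does:
by_reshape = a.reshape(a.shape[:axis] + a.shape[axis + 1:])

# The clearer equivalent: squeeze with an explicit axis.
by_squeeze = np.squeeze(a, axis=axis)

assert by_reshape.shape == by_squeeze.shape == (3, 2)
assert (by_reshape == by_squeeze).all()
```

`np.squeeze(a, axis=...)` also raises if the axis is not size 1, which makes the intent explicit.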
dask/array/routines.py (Outdated):

```python
if a.shape[axis] == 1:
    return a.reshape(_shape_minus_axis(a.shape, axis))

result = reduction(
```
Could probably be just return?
Of course. Sorry for that.
dask/array/routines.py (Outdated):

```python
    return dot(a.conj().ravel(), b.ravel())


def _shape_minus_axis(shape, axis):
```
I wonder if you even need this method and could instead just use squeeze more?
Indeed. I'll be using squeeze instead. Thanks for this.
dask/array/routines.py (Outdated):

```python
item_type = type(a[0])
if item_type is np.ndarray:
    out = chunk.sum(a, dtype=dtype, axis=0)
else:
    a = [item.__array__() for item in a]
    out = chunk.sum(a, dtype=dtype, axis=0)
    out = item_type(out)
```
I'm probably missing something, but why doesn't chunk.sum work here? Could we let numpy delegate to the array types? `__array__` might lead to numpy array materialisation, right, which can be expensive? AFAIR CuPy doesn't support `__array__`.
I agree that there's a problem here. But I'm not sure how to fix it properly.
It turns out that chunk.sum works with the latest version of numpy, so this conditional branch is unnecessary. But the CI tests revealed that earlier versions of numpy raise a `ValueError: setting an array element with a sequence`.
I'll explore further. Thanks.
I forgot to mention — the specific chunk type that raises the above exception is dask.array.tests.test_dispatch.EncapsulateNDArray, defined in dask/array/tests/test_dispatch.py.
I think you need to add squeeze to EncapsulateNDArray; see dask/array/tests/test_dispatch.py, lines 84 to 87 (at a5aecac). If we fix that, what else is missing?
I wanted to do that earlier but hesitated because I got the impression the creators wanted the EncapsulateNDArray interface to remain minimal, as mentioned in the docstring for WrappedArray in dask/array/tests/test_dispatch.py:

```python
class WrappedArray(np.lib.mixins.NDArrayOperatorsMixin):
    """
    Another mock duck array class (like EncapsulateNDArray), but
    designed to be above Dask in the type casting hierarchy (that is,
    WrappedArray wraps Dask Array) and be even more minimal in API.
    Tests that Dask defers properly to upcast types.
    """
```

Perhaps you feel that's not critical?
It is my belief that if we go forward with using squeeze in matmul (which is arguably a pretty common op), adding squeeze to that test WrappedArray is "justified".
True. So should I add it and revert to using numpy.sum instead of numpy.add with functools.reduce? Is there anyone who'd be unhappy with that?
I'm +1, I would be interested to see the impact there, also to the other comment: #8423 (comment)
Using … But unfortunately … Besides, for the array type … So I reluctantly have to revert to using … I however retain usage of …
I have now replaced the following code:

```python
item_type = type(a[0])
if item_type is np.ndarray:
    out = chunk.sum(a, dtype=dtype, axis=0)
else:
    a = [item.__array__() for item in a]
    out = chunk.sum(a, dtype=dtype, axis=0)
    out = item_type(out)
```

with this:

```python
xp = np
if is_cupy_type(a[0]):
    import cupy

    xp = cupy
out = reduce(partial(xp.add, dtype=dtype), a)
```

where `reduce` is imported from the package `functools`. The latter code snippet seems to give a slight performance edge over the former, making it more than 25% faster (compared to 20% by the former) than the original.
dask/array/routines.py (Outdated):

```python
else:
    if a.shape == (0,):
        return a
```
What is this case? And does it need to be a separate branch?
This branch turns out to be necessary, as _chunk_sum is often called with an empty array as its a parameter in order to compute metadata — a peculiarity of dask.array.reduction.
@ParticularMiner if you remove that branch and let it go into the other branch, eventually into the reduce/add, what happens? How does it fail?
I get `IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed` at the reduce/add line when computing the metadata. Would you like the full traceback log?
Here's the error log from one test:
tracebacklog.txt
The problem is that the error occurs in dask outside any of the code that I wrote in this PR.
@ParticularMiner thanks! Oh, so this error happens only for a specific case, `[(7,), (7,), (), ()]`, which will reduce to a scalar; are you sure there isn't something going on here? 🤔
Oh no, it's not the only case. I just uploaded the error for one case. There are at least 15 other cases where the error occurs for that unit test, test_matmul().
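A minimal sketch of the guard under discussion, using plain numpy and functools.reduce (illustrative only; the function body is a simplification, and dask's actual failure inside reduction manifests differently, as the IndexError above shows):

```python
from functools import partial, reduce
import numpy as np

def chunk_sum(a, dtype=None):
    # dask.array.reduction probes the chunk function with an empty
    # array to infer metadata, so pass that probe through untouched
    # instead of attempting a reduce over zero chunks.
    if a.shape == (0,):
        return a
    return reduce(partial(np.add, dtype=dtype), a)

meta_probe = np.empty((0,))
assert chunk_sum(meta_probe) is meta_probe  # metadata path: passthrough

summed = chunk_sum(np.ones((3, 2, 2)), dtype=np.float64)
assert (summed == 3.0).all()
```

Without the guard, the empty metadata probe would fall through to the reduce/add path, which cannot handle it.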
dask/array/routines.py (Outdated):

```python
if is_cupy_type(a[0]):
    import cupy

    xp = cupy
```
Why is this needed? Wouldn't numpy handle the dispatch to cupy below?
Honestly, I do not know. But the exact same construct is being used in _matmul() (which you allowed to remain), which is why I copied it over here.
@ParticularMiner that part actually came after my change, via #7563. @quasiben, we would appreciate your input here: when is that is_cupy_type needed?
Sorry I assumed it was yours @ravwojdyla
is_cupy_type is normally needed when we need to support the non-NumPy-API part of CuPy, such as cupyx.scipy, which is why matmul requires that check. I think the reduction here may not need the same, or at least I can't think of a need for it right now. I would then suggest just replacing all this with np.add, and it should work. If gpuCI tests pass, we should be good with respect to what we know is supported today.
@quasiben please feel free to correct me.
Sorry @pentschev , I didn't see your comment above until just now. Nevertheless, I've already taken steps that agree with your comment. Thanks!
dask/array/routines.py (Outdated):

```python
if keepdims:
    return out
else:
    return out.reshape(_shape_minus_axis(out.shape, axis[0]))
```
We can probably make the squeeze work (see the other comment), but otherwise, if this method is used only once, it's probably not worth a completely new method.
I agree. Let me know what you think after you read my response about adding a squeeze method to EncapsulateNDArray.
Thanks for your detailed review on this @ravwojdyla! I really appreciate that. It's getting to midnight here, so I'll likely respond to any more messages tomorrow.
You will find that I added the … I guess the only outstanding issue is the one regarding the … Thanks.
@ParticularMiner perfect, thank you. I suspect that …
|
Sure @ravwojdyla. Thanks for the great suggestions.
It seems @jakirkham might also be able to address this question: As Lines 335 to 347 in 4324780 dask.array.matmul in #7563) whenever one needs to use the cupy implementation of some numpy function like numpy.add or numpy.sum?Is there a reason it would be less convenient to use a more centralized approach, like register the particular cupy function implementation in a "lookup registry", as has been done, for example, with concatenate for various chunk types including cupy itself?
Lines 113 to 129 in 4324780 |
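To make the "lookup registry" idea concrete, here is a hypothetical minimal type-based dispatch registry (not dask's actual API; the `Dispatcher` class and `add_lookup` name are invented for this sketch), similar in spirit to registering per-chunk-type implementations of concatenate:

```python
import numpy as np

class Dispatcher:
    """A minimal type-based lookup registry: implementations are
    registered per chunk type and resolved via the MRO at call time."""
    def __init__(self):
        self._lookup = {}

    def register(self, typ, func):
        self._lookup[typ] = func
        return func

    def __call__(self, arg, *args, **kwargs):
        for typ in type(arg).__mro__:
            if typ in self._lookup:
                return self._lookup[typ](arg, *args, **kwargs)
        raise TypeError(f"no registered implementation for {type(arg).__name__}")

# Register the numpy implementation for ndarray chunks; a cupy-backed
# project could likewise register cupy.add for cupy.ndarray.
add_lookup = Dispatcher()
add_lookup.register(np.ndarray, np.add)

result = add_lookup(np.ones(3), np.ones(3))
assert result.tolist() == [2.0, 2.0, 2.0]
```

With such a registry, call sites would not need explicit `is_cupy_type` branches; the chunk type itself selects the implementation.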
ravwojdyla left a comment:

Approving*, great work @ParticularMiner

\* pending resolution of the question about the need for the cupy check (#8423 (comment)). @ParticularMiner that said, if you prefer to merge sooner and open an issue to discuss that branch/question etc., that sounds good to me too.

FYI @jsignell (and thanks for the ping).
That's fine, @ravwojdyla. It's not urgent to merge now.

@ravwojdyla thank you so much for reviewing this work and thank you @ParticularMiner for engaging! I'll broaden the ping on #8423 (comment) a bit to see if anyone on @dask/gpu knows if that branch is still necessary.
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
@ravwojdyla @jsignell

I believe the observed test failures for Python 3.7 on Windows are not related to this PR. (Correct me if I'm wrong.)

I agree that the failing tests are unrelated. They are also showing up on main.

Thanks for this work! The end diff looks super clean :)
Passes `pre-commit run --all-files`

This PR avoids concatenation in both `blockwise()` and `reduction()` within the implementation of `matmul()`. Previously, `matmul()` avoided concatenation only in `blockwise()`.

Due to the special nature of the preceding blockwise contraction, all chunk operands of a sum during the reduction phase are expected to have exactly the same shape, with a size of 1 along the reduction axis. This makes plain element-wise addition of the chunks possible. Furthermore, the summation result can simply be reshaped to drop the reduction axis when requested. In principle, both of these operations (element-wise addition and reshaping) are performance boosters compared to the default concatenation-mediated reduction, which often involves data rearrangement.

The code of `matmul()` has also been shortened, making it a bit more readable. This is a direct consequence of ensuring that the position of the new axis introduced in the output of the chunk function `_matmul()` (the first argument to `blockwise()`) is consistent with its expected position in the output of `blockwise()` (the position of the contraction axis).

Performance is comparable to the original `matmul()` implementation in all cases except one, in which this new implementation is approximately 20% faster, namely the case where all dimensions are chunked. This outcome is consistent with the total absence of concatenation overhead in the new version. Performance results can be found in this Jupyter notebook.

See discussion with @ravwojdyla at #7000 (comment) for the motivation of this PR.
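As a small self-contained check of the idea in the description, element-wise addition followed by a squeeze gives the same result as a concatenation-mediated reduction when all chunks share a shape with size 1 along the reduction axis (a toy numpy example, not the PR's code):

```python
from functools import reduce
import numpy as np

# Chunks as produced by the blockwise contraction: identical shapes
# with size 1 along the reduction axis (axis=1 here).
chunks = [np.random.default_rng(i).random((2, 1, 3)) for i in range(4)]

# Default concatenation-mediated reduction:
via_concat = np.concatenate(chunks, axis=1).sum(axis=1)

# Concatenation-free reduction: element-wise addition, then squeeze
# away the size-1 reduction axis.
via_add = np.squeeze(reduce(np.add, chunks), axis=1)

assert via_concat.shape == via_add.shape == (2, 3)
assert np.allclose(via_concat, via_add)
```

The concatenation path has to allocate and fill a `(2, 4, 3)` intermediate before summing; the add-and-squeeze path never materialises it.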