improved the efficiency of matmul() by completely removing concatenation#8423

Merged
jsignell merged 7 commits into dask:main from ParticularMiner:matmul on Dec 14, 2021

Conversation

@ParticularMiner
Contributor

@ParticularMiner ParticularMiner commented Nov 27, 2021

This PR avoids concatenation in both blockwise() and reduction() within the implementation of matmul(). Previously, matmul() avoided concatenation only in blockwise().

Due to the special nature of the preceding blockwise contraction, all chunk-operands of a sum during the reduction phase are expected to have exactly the same shape, with a size of 1 along the reduction axis. This makes simple element-wise addition of the chunks possible. Furthermore, the summation result need only be reshaped to drop the reduction axis when requested. In principle, both of these operations (element-wise addition and reshaping) are performance boosters compared to the default concatenation-mediated reduction, which often involves data rearrangement.
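The reduction strategy described above can be sketched with plain numpy (illustrative only; the chunk values and shapes here are hypothetical):

```python
import numpy as np

# Chunk-operands produced by the blockwise contraction: all share the
# same shape, with size 1 along the reduction axis (here axis=1).
chunks = [np.full((2, 1, 3), v) for v in (1.0, 2.0, 3.0)]

# Element-wise addition is valid because every chunk has the same shape.
total = chunks[0] + chunks[1] + chunks[2]

# Dropping the size-1 reduction axis is a cheap reshape; unlike
# concatenation, no data rearrangement is required.
result = total.reshape(2, 3)
```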

The code of matmul() has also been shortened, making it a bit more readable. This is a direct consequence of ensuring that the position of the new-axis introduced in the output of chunk-function _matmul() [the first argument to blockwise()] is consistent with its expected position in the output of blockwise() (the position of the contraction-axis).

Performance is comparable to the original matmul() implementation in all cases except one, namely the case where all dimensions are chunked, in which this new implementation is approximately 20% faster. This outcome is consistent with the total absence of concatenation overhead in the new version. Performance results can be found in this Jupyter notebook.

See discussion with @ravwojdyla at #7000 (comment) for the motivation of this PR.

@GPUtester
Collaborator

Can one of the admins verify this patch?

@github-actions github-actions bot added the array label Nov 27, 2021
@ParticularMiner ParticularMiner force-pushed the matmul branch 4 times, most recently from 7b1f63f to 6a1516d on November 28, 2021 08:16
@ParticularMiner
Contributor Author

Attention: @dask, admins

Many CI tests are broken (but not because of this PR), as seen here (#8423) and elsewhere (#8424).

@ParticularMiner
Contributor Author

ParticularMiner commented Nov 28, 2021

Hi @dask

I need some help debugging this PR for the unit test dask/array/tests/test_array_function.py::test_unregistered_func[<lambda>6], which passes with numpy version 1.21.4 but not with versions ≤ 1.19.5.

The bug originates from passing a list a of arrays of type dask.array.tests.test_dispatch.EncapsulateNDArray to numpy.sum, which is implemented by numpy.add.reduce(). I have not been able to determine what changed in numpy.add.reduce() between versions 1.19.5 and 1.21.4 that eliminated this bug. The error thrown is

ValueError: setting an array element with a sequence.

EDIT:

This bug has been fixed. Basically, I changed the line

        out = chunk.sum(a, dtype=dtype, axis=0)

in _chunk_sum() to

        item_type = type(a[0])
        if item_type is np.ndarray:
            # plain numpy arrays: sum the list of chunks directly
            out = chunk.sum(a, dtype=dtype, axis=0)
        else:
            # duck arrays: materialise to numpy, sum, then re-wrap in the
            # original chunk type
            a = [item.__array__() for item in a]
            out = chunk.sum(a, dtype=dtype, axis=0)
            out = item_type(out)

A bit crude. But it passes the tests.

Unfortunately, many CI tests are still breaking after GitHub crashed 2 days ago.

@ParticularMiner
Contributor Author

Yay! CI tests are now operational. Thanks @dask admin!

@jsignell
Member

Sorry for the lack of response @ParticularMiner. I'll leave the review to @ravwojdyla or maybe @eriknw.

Contributor

@ravwojdyla ravwojdyla left a comment

great work @ParticularMiner, see some questions/comments below.

dtype = getattr(np.zeros(1, dtype=a.dtype).sum(), "dtype", object)

if a.shape[axis] == 1:
    return a.reshape(_shape_minus_axis(a.shape, axis))
Contributor

This reshape could probably be just a squeeze with an axis parameter; it might be a bit clearer that way?
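For an array with a size-1 axis, the two spellings are equivalent; squeeze simply names the axis being removed instead of inferring it from the target shape (a hypothetical example):

```python
import numpy as np

a = np.arange(6).reshape(2, 1, 3)  # size-1 axis at position 1

via_reshape = a.reshape(2, 3)    # infers intent from the target shape
via_squeeze = a.squeeze(axis=1)  # names the axis being dropped

assert np.array_equal(via_reshape, via_squeeze)
```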

Contributor Author

Aha! You're precisely right. I forgot all about squeeze. I will change that.

if a.shape[axis] == 1:
    return a.reshape(_shape_minus_axis(a.shape, axis))

result = reduction(
Contributor

Could probably be just return?

Contributor Author

Of course. Sorry for that.

return dot(a.conj().ravel(), b.ravel())


def _shape_minus_axis(shape, axis):
Contributor

I wonder if you even need this method and could instead just use squeeze more?

Contributor Author

Indeed. I'll be using squeeze instead. Thanks for this.

Comment on lines +352 to +358
item_type = type(a[0])
if item_type is np.ndarray:
    out = chunk.sum(a, dtype=dtype, axis=0)
else:
    a = [item.__array__() for item in a]
    out = chunk.sum(a, dtype=dtype, axis=0)
    out = item_type(out)
Contributor

I'm probably missing something, but why doesn't chunk.sum work here? Could we let numpy delegate to the array types? __array__ might lead to numpy array materialisation, right, which can be expensive? AFAIR CuPy doesn't support __array__.

Contributor Author

I agree that there's a problem here. But I'm not sure how to fix it properly.

It turns out that chunk.sum works for the latest version of numpy, in which case this conditional branch is unnecessary. But the CI tests revealed that earlier versions of numpy raise ValueError: setting an array element with a sequence.

I'll explore further. Thanks.

Contributor Author

@ParticularMiner ParticularMiner Dec 2, 2021

I forgot to mention: the specific chunk type that raises the above exception is dask.array.tests.test_dispatch.EncapsulateNDArray, defined in dask/array/tests/test_dispatch.py.

Contributor

I think you need to add squeeze to EncapsulateNDArray, see

astype = wrap("astype")
sum = wrap("sum")
prod = wrap("prod")
reshape = wrap("reshape")

If we fix that what else is missing?

Contributor Author

@ParticularMiner ParticularMiner Dec 2, 2021

I wanted to do that earlier but hesitated because I got the impression the creators wanted the EncapsulateNDArray interface to remain minimal as mentioned in the docstring for WrappedArray in dask/dask/array/tests/test_dispatch.py:

class WrappedArray(np.lib.mixins.NDArrayOperatorsMixin):
    """
    Another mock duck array class (like EncapsulateNDArray), but
    designed to be above Dask in the type casting hierarchy (that is,
    WrappedArray wraps Dask Array) and be even more minimal in API.
    Tests that Dask defers properly to upcast types.
    """

Perhaps you feel that's not critical?

Contributor

It is my belief that if we go forward with using squeeze in matmul (which arguably is a pretty common op), adding squeeze to that test WrappedArray is "justified".

Contributor Author

True. So should I add it and revert to using numpy.sum instead of numpy.add with functools.reduce? Is there anyone who'd be unhappy with that?

Contributor

I'm +1, I would be interested to see the impact there, also to the other comment: #8423 (comment)

@ParticularMiner
Contributor Author

ParticularMiner commented Dec 2, 2021

@ravwojdyla

Using squeeze() in _chunk_sum() would have been my preference.

But unfortunately squeeze() raised an error for numpy arrays whose shapes contained zeros, such as (0, 0, 0). This is because squeeze() requires the selected axis to have size exactly 1.

Besides, for the array type dask.array.tests.test_dispatch.EncapsulateNDArray (defined in dask/array/tests/test_dispatch.py), squeeze() raises AttributeError: 'EncapsulateNDArray' object has no attribute 'squeeze'.

So I reluctantly have to revert to using reshape() instead.

I retain the use of squeeze() everywhere else, however.
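The difference in behaviour for zero-size arrays can be reproduced directly (a minimal sketch):

```python
import numpy as np

a = np.empty((0, 0, 0))

# squeeze insists that the selected axis have size exactly 1 ...
try:
    a.squeeze(axis=0)  # axis 0 has size 0, not 1
    squeeze_failed = False
except ValueError:
    squeeze_failed = True

# ... whereas reshape only requires the total element count to match.
b = a.reshape((0, 0))
```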

@ParticularMiner
Contributor Author

@ravwojdyla

I have now replaced the following code in _chunk_sum():

        item_type = type(a[0])
        if item_type is np.ndarray:
            out = chunk.sum(a, dtype=dtype, axis=0)
        else:
            a = [item.__array__() for item in a]
            out = chunk.sum(a, dtype=dtype, axis=0)
            out = item_type(out)

with this

        xp = np

        if is_cupy_type(a[0]):
            import cupy

            xp = cupy

        # pairwise element-wise addition of the equally-shaped chunks
        out = reduce(partial(xp.add, dtype=dtype), a)

where reduce is imported from the functools package. Fortunately, all CI tests pass with these changes. Moreover, just like in _matmul(), I explicitly test for the existence of a cupy type.

The latter code snippet seems to give a slight performance edge over the former: it is more than 25% faster (compared to 20% for the former) than the original matmul in the case where all dimensions are chunked.
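The pairwise-reduce approach can be checked in isolation with plain numpy (a sketch; the chunk shapes are hypothetical):

```python
from functools import partial, reduce

import numpy as np

# Four equally-shaped chunk-operands, size 1 along the contraction axis.
chunks = [np.ones((2, 1, 3)) for _ in range(4)]

# Pairwise np.add lets each operand's own type drive dispatch (via
# __array_ufunc__), instead of np.sum's list-of-arrays path.
out = reduce(partial(np.add, dtype=np.float64), chunks)
```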

Comment on lines +360 to +356
else:
    if a.shape == (0,):
        return a
Contributor

What is this case? And does it need to be a separate branch?

Contributor Author

This branch turns out to be necessary, as _chunk_sum is often called with an empty array as the parameter a in order to compute meta data, a peculiarity of dask.array.reduction.
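The guard can be illustrated with a stripped-down stand-in for _chunk_sum (the name and the np.add.reduce fallback are illustrative, not the PR's exact code):

```python
import numpy as np

def _chunk_sum_sketch(a, axis=None, dtype=None, keepdims=False):
    # dask.array.reduction probes the chunk function with an empty
    # array to compute meta data; pass that probe straight through.
    if not isinstance(a, list) and a.shape == (0,):
        return a
    # normal path: element-wise sum of the equally-shaped chunks
    return np.add.reduce(a, axis=0, dtype=dtype)

meta = _chunk_sum_sketch(np.empty((0,), dtype=np.float64))
out = _chunk_sum_sketch([np.ones((2, 1)) for _ in range(3)])
```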

Contributor

@ParticularMiner if you remove that branch and let it go into the other branch, eventually into the reduce/add, what happens? How does it fail?

Contributor Author

I get an IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
at the reduce/add line when computing the meta data. Would you like the full traceback log?

Contributor Author

Here's the error log from one test:
tracebacklog.txt

Contributor Author

The problem is that the error occurs in dask outside any of the code that I wrote in this PR.

Contributor

@ParticularMiner thanks. Oh, so this error happens only for a specific case, [(7,), (7,), (), ()], which will reduce to a scalar. Are you sure there isn't something going on here? 🤔

Contributor Author

Oh no, it's not the only case. I just uploaded the error for one case. There are at least 15 other cases where the error occurs for that unit test, test_matmul().

Comment on lines +354 to +351
if is_cupy_type(a[0]):
    import cupy

    xp = cupy
Contributor

Why is this needed? Wouldn't numpy handle the dispatch to cupy below?

Contributor Author

Honestly, I do not know. But the exact same construct is being used in _matmul() (which you allowed to remain), which is why I copied it over here.

Contributor

@ParticularMiner that part actually came after my change, via #7563. @quasiben, we would appreciate your input here: when is that is_cupy_type needed?

Contributor Author

Sorry I assumed it was yours @ravwojdyla

Member

is_cupy_type is normally needed when we have to support the non-NumPy API part of CuPy, such as cupyx.scipy; this is why matmul requires that check. I think the reduction here may not need the same, or at least I can't think of a need for it right now. I would then suggest just replacing all this with np.add, and it should work. If gpuCI tests pass, we should be good with respect to what we know is supported today.

@quasiben please feel free to correct me.

Contributor Author

@ParticularMiner ParticularMiner Dec 14, 2021

Sorry @pentschev, I didn't see your comment above until just now. Nevertheless, I've already taken steps that agree with your comment. Thanks!

if keepdims:
    return out
else:
    return out.reshape(_shape_minus_axis(out.shape, axis[0]))
Contributor

We can probably make the squeeze work (see the other comment), but otherwise, if this method is used only once, it's probably not worth a complete new method.

Contributor Author

I agree. Let me know what you think after you read my response about adding a squeeze method to EncapsulateNDArray.

@ParticularMiner
Contributor Author

Thanks for your detailed review on this, @ravwojdyla! I really appreciate it.

It's getting to midnight here, so I'll likely respond to any more messages tomorrow.

@ParticularMiner
Contributor Author

@ravwojdyla

You will find that I added the squeeze() method to EncapsulateNDArray and discarded _shape_minus_axis together with one conditional branch of _chunk_sum. Everything seems to work well with these changes, and performance seems even somewhat faster than before.

I guess the only outstanding issue is the one regarding the cupy type array.

Thanks.

@ravwojdyla
Contributor

@ParticularMiner perfect, thank you. I suspect that cupy branch is not needed, but there's a chance I don't fully understand why it would be there, maybe let's give @quasiben some time to respond to the comment in #8423 (comment)

@ParticularMiner
Contributor Author

Sure @ravwojdyla. Thanks for the great suggestions.

I suspect that cupy branch is not needed, but there's a chance I don't fully understand why it would be there, maybe let's give @quasiben some time to respond to the comment in #8423 (comment)

It seems @jakirkham might also be able to address this question:

As dask.array is right now, is it necessary to test for the cupy chunk type (as @quasiben introduced:

dask/dask/array/routines.py

Lines 335 to 347 in 4324780

def _matmul(a, b):
    xp = np

    if is_cupy_type(a):
        import cupy

        xp = cupy

    chunk = xp.matmul(a, b)
    # Since we have performed the contraction via matmul
    # but blockwise expects all dimensions back, we need
    # to add one dummy dimension back
    return chunk[..., xp.newaxis]
for dask.array.matmul in #7563) whenever one needs to use the cupy implementation of some numpy function like numpy.add or numpy.sum?
Is there a reason it would be less convenient to use a more centralized approach, like register the particular cupy function implementation in a "lookup registry", as has been done, for example, with concatenate for various chunk types including cupy itself?

dask/dask/array/backends.py

Lines 113 to 129 in 4324780

@tensordot_lookup.register_lazy("cupy")
@concatenate_lookup.register_lazy("cupy")
def register_cupy():
    import cupy

    from dask.array.dispatch import percentile_lookup

    concatenate_lookup.register(cupy.ndarray, cupy.concatenate)
    tensordot_lookup.register(cupy.ndarray, cupy.tensordot)
    percentile_lookup.register(cupy.ndarray, percentile)


@einsum_lookup.register(cupy.ndarray)
def _cupy_einsum(*args, **kwargs):
    # NB: cupy does not accept `order` or `casting` kwargs - ignore
    kwargs.pop("casting", None)
    kwargs.pop("order", None)
    return cupy.einsum(*args, **kwargs)

Contributor

@ravwojdyla ravwojdyla left a comment

Approving*, great work @ParticularMiner

* pending resolution of the question about the need for cupy check ( #8423 (comment)). @ParticularMiner that said if you prefer to merge sooner, and open an issue to discuss that branch/question etc, that sounds good to me too.

FYI @jsignell (and thanks for the ping).

@ParticularMiner
Contributor Author

That's fine, @ravwojdyla . It's not urgent to merge now.

  • pending resolution of the question about the need for cupy check ( #8423 (comment)). @ParticularMiner that said if you prefer to merge sooner, and open an issue to discuss that branch/question etc, that sounds good to me too.

@jsignell
Member

@ravwojdyla thank you so much for reviewing this work and thank you @ParticularMiner for engaging! I'll broaden the ping on #8423 (comment) a bit to see if anyone on @dask/gpu knows if that branch is still necessary.

@ParticularMiner
Contributor Author

@ravwojdyla
Thanks again for clearing things up!

@jsignell
It appears all outstanding issues here have been addressed, and this PR now awaits your merge.
Many thanks for your effective oversight!

@ParticularMiner
Contributor Author

I believe the observed test-failures for Python 3.7 on Windows are not related to this PR. (Correct me if I'm wrong.)

@jsignell
Member

I agree that the failing tests are unrelated. They are also showing up on main.

@jsignell jsignell merged commit 33b2886 into dask:main Dec 14, 2021
@jsignell
Member

Thanks for this work! The end diff looks super clean :)
