This is actually a bug reported back in #1695 that unfortunately went unnoticed.
In examples/stream/map_reduce.py, a list of streams was created for executing cupy.matmul() in parallel, which is backed by a ReductionKernel in this case:

    cdef _tensordot_core_mul_sum = ReductionKernel(
        'S x, T y', 'U out',
        'static_cast<U>(x) * static_cast<U>(y)',
        'a + b', 'out = a', '0', '_tensordot_core_mul_sum')

Source: cupy/cupy/core/core.pyx, lines 2513 to 2516 in 1af22f5

However, inspecting the implementation I found that ReductionKernel only accepts an explicit stream argument; it does not pick up any current stream:

    stream = kwargs.pop('stream', None)

Source: cupy/cupy/core/reduction.pxi, line 396 in 3271860

In other words, the example was misleading: the streams were never used, so all executions were serialized. This can be verified with nvprof + nvvp (see the region circled in red in the profiler screenshot).
The same bug also appears in ElementwiseKernel:

    stream = kwargs.pop('stream', None)

Source: cupy/cupy/core/_kernel.pyx, line 537 in 1af22f5

In my opinion, ElementwiseKernel and ReductionKernel should honor the current stream: if no stream argument is given explicitly, they should fall back to the current stream pointer. This matters because many CuPy functions, such as cupy.matmul(), offer no way to pass in a stream. RawKernel is a different case, since no CuPy core functionality relies on it. The FFT module already takes a similar approach; see #2362.
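The proposed fallback could look roughly like the following. This is a minimal sketch in plain Python, not a patch: `resolve_stream` and `get_current_stream` are hypothetical names, and plain strings stand in for stream objects. How the kernels would actually obtain the current stream pointer is left open here.

```python
def resolve_stream(kwargs, get_current_stream):
    """Return the explicit `stream` keyword if given, else the current stream.

    `get_current_stream` is a hypothetical callable standing in for however
    the library looks up the current stream.
    """
    stream = kwargs.pop("stream", None)
    if stream is None:
        # No explicit argument: fall back instead of using the default stream.
        stream = get_current_stream()
    return stream


def current():
    # Stand-in for the current-stream lookup (assumption).
    return "current-stream"


print(resolve_stream({"stream": "explicit-stream"}, current))  # explicit-stream
print(resolve_stream({}, current))  # current-stream
```

An explicit argument still takes precedence, so existing callers that pass `stream=` would see no change in behavior under this scheme.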