Fix bug in CUB + support complex numbers using CUB #2508
leofang wants to merge 4 commits into cupy:master

Conversation
The test script I used is basically taken from #2085 (comment) with minor changes:

```python
import cupy as cp
from contextlib import contextmanager

@contextmanager
def sync_time(name):
    start = cp.cuda.Event()
    end = cp.cuda.Event()
    start.record()
    # start.synchronize()
    yield
    end.record()
    end.synchronize()
    t = cp.cuda.get_elapsed_time(start, end)
    print("{} : {} ms".format(name, t))

for dtype in (cp.complex64, cp.complex128):
    for func in ('sum', 'max', 'min'):
        print("testing", str(dtype), "+", func, "...")
        x = cp.random.normal(size=(400, 32, 28, 28)).astype(dtype) \
            + 1j * cp.random.normal(size=(400, 32, 28, 28)).astype(dtype)
        y = None
        with sync_time("cupy"):
            for i in range(1000):
                y = getattr(x, func)()
        x = cp.asnumpy(x)  # move to CPU
        z = None
        with sync_time("numpy"):
            for i in range(1000):
                z = getattr(x, func)()
        # print(str(dtype), func, y, z)
        assert cp.allclose(y, z)
```

For CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 600.7633056640625 ms
numpy : 8711.4794921875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 1971.037353515625 ms
numpy : 21387.251953125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 1957.8563232421875 ms
numpy : 21416.21484375 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1171.1514892578125 ms
numpy : 11070.486328125 ms
testing <class 'numpy.complex128'> + max ...
cupy : 2360.289794921875 ms
numpy : 22052.818359375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 2368.4697265625 ms
numpy : 22123.5 ms

For CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 9115.7255859375 ms
numpy : 8392.7216796875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 12982.4951171875 ms
numpy : 21374.0 ms
testing <class 'numpy.complex64'> + min ...
cupy : 12394.900390625 ms
numpy : 21423.166015625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 10122.890625 ms
numpy : 11128.6943359375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 13595.556640625 ms
numpy : 22030.51953125 ms
testing <class 'numpy.complex128'> + min ...
cupy : 13326.609375 ms
numpy : 22058.287109375 ms

Testing environment:
$ conda list | grep mkl
blas 1.0 mkl defaults
mkl 2019.1 144 defaults
mkl-service 1.1.2 py36he904b0f_5 defaults
mkl_fft 1.0.10 py36ha843d7b_0 defaults
mkl_random 1.0.2 py36hd81dba3_0 defaults
$ python -c "import numpy; numpy.__config__.show()"
mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
toslunar left a comment:
The enhancement of `reduce_sum` looks good to me after the minor fix.

My concern on `reduce_(min|max)` is that the performance may depend on the input data. Could you compare the performance for edge cases, e.g. a large constant array `cupy.full((400, 32, 28, 28), 1+2j)`?
```python
# (numpy/numpy#2004): test real part first, then imaginary
out_re = _reduce_min(x.real, out if out is None else out.real)
idx = (x.real == out_re).nonzero()
if len(idx) == x.ndim:
```
The condition looks always true. Is the following intended?

```python
y = x[idx]
if len(y) == 1:
```
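For reference, `ndarray.nonzero()` returns a tuple with one index array per dimension, so `len(idx) == x.ndim` holds no matter how many elements match; the count of matches is the length of each array inside the tuple. A short NumPy check (NumPy used here as a stand-in for CuPy, whose `nonzero` has the same semantics):

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

# No elements satisfy the condition, yet the result is still a 3-tuple.
idx = (x > 100).nonzero()
print(len(idx))     # 3 == x.ndim
print(len(idx[0]))  # 0 -> this is where the match count lives

# Every element satisfies the condition: still a 3-tuple, arrays of length 24.
idx = (x >= 0).nonzero()
print(len(idx))     # 3
print(len(idx[0]))  # 24
```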
I think your comment above is a different facet of your question:

> My concern on `reduce_(min|max)` is that the performance may depend on input data. Could you compare the performance for edge cases, e.g. a large constant array `cupy.full((400, 32, 28, 28), 1+2j)`?

To answer them, first note that `len(idx) == x.ndim` is not always True: `len(idx)` is proportional to the number of elements whose real part is equal to `out_re`. So, in the worst-case scenario you suggested with a large constant array, `len(idx) == N * x.ndim` (where N is the total number of elements), and then we move on to compare the imaginary part of the entire array.

The if statement here handles the best case (a single hit). In a less ideal case (a few hits in the real part), the search space is greatly reduced in the next call of `reduce_(min|max)`.
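The two-pass strategy under discussion can be sketched in NumPy (the helper name `lex_min` is hypothetical; the actual PR operates on CuPy arrays with CUB-backed reductions):

```python
import numpy as np

def lex_min(x):
    # Pass 1: reduce over the real parts only.
    re_min = x.real.min()
    # Candidates whose real part attains the minimum.
    ties = x[x.real == re_min]
    if ties.size == 1:
        # Best case: a single hit, no second pass needed.
        return ties[0]
    # Pass 2: reduce the imaginary parts over the (usually much smaller) tie set.
    return ties[ties.imag.argmin()]

a = np.array([3+4j, 1+9j, 1+2j, 5+0j])
print(lex_min(a))             # (1+2j)
print(lex_min(a) == a.min())  # True: matches NumPy's lexicographic ordering
```

For a constant array every element ties in pass 1, so pass 2 scans the full array again, which is the worst case discussed above.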
So, yes, in the worst case we search through the N-element array twice, and then we rely solely on CUB's clever optimization. With the input array `x` replaced by `x = cp.full((400, 32, 28, 28), 1+2j, dtype=dtype)`, we can see that while it takes longer on the GPU (though CUB still outperforms the naive ufunc), the NumPy performance is even worse:
For CUB_DISABLED=0:
testing <class 'numpy.complex64'> + sum ...
cupy : 601.90087890625 ms
numpy : 11250.6796875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 4847.3271484375 ms
numpy : 31283.07421875 ms
testing <class 'numpy.complex64'> + min ...
cupy : 4812.888671875 ms
numpy : 27644.96484375 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1170.836181640625 ms
numpy : 11166.1201171875 ms
testing <class 'numpy.complex128'> + max ...
cupy : 5515.779296875 ms
numpy : 28532.884765625 ms
testing <class 'numpy.complex128'> + min ...
cupy : 5516.4833984375 ms
numpy : 28526.55078125 ms

For CUB_DISABLED=1:
testing <class 'numpy.complex64'> + sum ...
cupy : 9120.3193359375 ms
numpy : 8410.7587890625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 13439.9365234375 ms
numpy : 26741.98828125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 13067.576171875 ms
numpy : 26748.564453125 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 9805.685546875 ms
numpy : 10828.603515625 ms
testing <class 'numpy.complex128'> + max ...
cupy : 14115.16015625 ms
numpy : 27617.724609375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 13888.92578125 ms
numpy : 27661.27734375 ms

I think this is the reason we want to use CUB.
Just FYI: It might be possible to implement native support of complex numbers in CUB by specializing two instances of …
This is a followup of #2090. The bulk of this PR is completed, but I mark it WIP so as to gather early feedback and comments before proceeding to write tests.

This PR does two things:

1. Fixes a bug: without `#include <stdexcept>` in `cupy/cuda/cupy_cub.cu`, I found that there is a build-time error.
2. Adds support of `sum()`/`max()`/`min()` for complex numbers using CUB. This support is compliant with NumPy behavior.

The current implementation is, admittedly, an ugly and not fully optimized hack, for two reasons: (i) CUB does not provide native support for complex numbers (see the CUB source code `cub/util_type.cuh`), and (ii) since the comparison of two complex numbers is ill-defined, NumPy uses a lexical search for `max()` and `min()`, which requires two passes (separately examining the real and imaginary parts) in the worst-case scenario. However, in my tests this hack still offers substantial speedup (see #2508 (comment)), so I think it is worth considering. We can revisit this issue once CUB has better support.
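As a concrete illustration of the NumPy ordering being matched here (real parts compared first, imaginary parts used only to break ties):

```python
import numpy as np

a = np.array([1+5j, 2+1j, 2+3j])
print(a.max())  # (2+3j): largest real part wins; imaginary breaks the 2+1j vs 2+3j tie
print(a.min())  # (1+5j): smallest real part wins despite the large imaginary part
```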
Attn: @anaruse