Fix bug in CUB + support complex numbers using CUB#2508

Closed
leofang wants to merge 4 commits into cupy:master from leofang:cub_complex
Conversation

@leofang
Member

@leofang leofang commented Sep 29, 2019

This is a follow-up to #2090. The bulk of this PR is complete, but I have marked it WIP to gather early feedback and comments before proceeding to write tests.

This PR does two things:

  1. Fix a build-time bug: without #include <stdexcept> in cupy/cuda/cupy_cub.cu, the build fails with the following error:
    building 'cupy.cuda.cub' extension
    gcc -pthread -B /home/leofang/conda_envs/cupy_dev/compiler_compat -Wl,--sysroot=/ -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/home/leofang/.nccl_2.1.15-1+cuda9.1_x86_64/include/ -fPIC -D_FORCE_INLINES=1 -I/usr/local/cuda/include -I/home/leofang/sources/cub-1.8.0/ -I/home/leofang/conda_envs/cupy_dev/include/python3.6m -c cupy/cuda/cub.cpp -o build/temp.linux-x86_64-3.6/cupy/cuda/cub.o
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    NVCC options: ['--generate-code=arch=compute_30,code=compute_30', '--generate-code=arch=compute_50,code=compute_50', '--generate-code=arch=compute_60,code=sm_60', '--generate-code=arch=compute_61,code=sm_61', '--generate-code=arch=compute_70,code=sm_70', '--generate-code=arch=compute_70,code=compute_70', '-O2', '--compiler-options="-fPIC"', '--std=c++11']
    /usr/local/cuda/bin/nvcc -D_FORCE_INLINES=1 -I/usr/local/cuda/include -I/home/leofang/sources/cub-1.8.0/ -I/home/leofang/conda_envs/cupy_dev/include/python3.6m -c cupy/cuda/cupy_cub.cu -o build/temp.linux-x86_64-3.6/cupy/cuda/cupy_cub.o --generate-code=arch=compute_30,code=compute_30 --generate-code=arch=compute_50,code=compute_50 --generate-code=arch=compute_60,code=sm_60 --generate-code=arch=compute_61,code=sm_61 --generate-code=arch=compute_70,code=sm_70 --generate-code=arch=compute_70,code=compute_70 -O2 --compiler-options="-fPIC" --std=c++11
    cupy/cuda/cupy_cub.cu(28): error: namespace "std" has no member "runtime_error"
  2. Support sum()/max()/min() for complex numbers using CUB. This support is compliant with NumPy's behavior.

The current implementation is, admittedly, an ugly and not fully optimized hack, for two reasons: (i) CUB does not provide native support for complex numbers (see the CUB source code cub/util_type.cuh), and (ii) since the comparison of two complex numbers is ill-defined, NumPy falls back to a lexicographic comparison for max() and min(), which requires two passes (examining the real and imaginary parts separately) in the worst-case scenario.

However, in my tests this hack still offers a substantial speedup (see #2508 (comment)), so I think it is worth considering. We can revisit this once CUB has better support.
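For reference, the NumPy behavior this PR aims to match orders complex values lexicographically: real parts are compared first, and imaginary parts break ties. A quick NumPy check illustrates this:

```python
import numpy as np

# NumPy orders complex values lexicographically:
# real parts first, imaginary parts break ties.
x = np.array([1 + 5j, 2 - 3j, 2 + 1j])

# max() picks the largest real part (2), then the larger
# imaginary part among the real-part ties (+1 beats -3).
assert x.max() == 2 + 1j
assert x.min() == 1 + 5j
```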

Attn: @anaruse

@leofang
Member Author

leofang commented Sep 29, 2019

The test script I used is basically taken from #2085 (comment) with minor changes:

import cupy as cp
from contextlib import contextmanager


@contextmanager
def sync_time(name):
    start = cp.cuda.Event()
    end = cp.cuda.Event()
    start.record()
    #start.synchronize()
    yield
    end.record()
    end.synchronize()
    t = cp.cuda.get_elapsed_time(start,end)
    print("{} : {} ms".format(name,t))


for dtype in (cp.complex64, cp.complex128):
    for func in ('sum', 'max', 'min'):
        print("testing", str(dtype), "+", func, "...")
        x = cp.random.normal(size=((400, 32, 28, 28))).astype(dtype) \
            +1j*cp.random.normal(size=((400, 32, 28, 28))).astype(dtype)

        y = None
        with sync_time("cupy"):
            for i in range(1000):
                y = getattr(x, func)()

        x = cp.asnumpy(x) #move to cpu

        z = None
        with sync_time("numpy"):
            for i in range(1000):
                z = getattr(x, func)()

        #print(str(dtype), func, y, z)
        assert cp.allclose(y, z)

For CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 600.7633056640625 ms
numpy : 8711.4794921875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 1971.037353515625 ms
numpy : 21387.251953125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 1957.8563232421875 ms
numpy : 21416.21484375 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1171.1514892578125 ms
numpy : 11070.486328125 ms
testing <class 'numpy.complex128'> + max ...
cupy : 2360.289794921875 ms
numpy : 22052.818359375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 2368.4697265625 ms
numpy : 22123.5 ms

For CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 9115.7255859375 ms
numpy : 8392.7216796875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 12982.4951171875 ms
numpy : 21374.0 ms
testing <class 'numpy.complex64'> + min ...
cupy : 12394.900390625 ms
numpy : 21423.166015625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 10122.890625 ms
numpy : 11128.6943359375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 13595.556640625 ms
numpy : 22030.51953125 ms
testing <class 'numpy.complex128'> + min ...
cupy : 13326.609375 ms
numpy : 22058.287109375 ms

Testing environment:

  • P100
  • Python 3.6.5
  • CuPy v7.0.0b4
  • NumPy 1.15.4:
$ conda list | grep mkl
blas                      1.0                         mkl    defaults
mkl                       2019.1                      144    defaults
mkl-service               1.1.2            py36he904b0f_5    defaults
mkl_fft                   1.0.10           py36ha843d7b_0    defaults
mkl_random                1.0.2            py36hd81dba3_0    defaults

$ python -c "import numpy; numpy.__config__.show()"
mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']

@leofang
Member Author

leofang commented Sep 30, 2019

@niboshi @anaruse I am confused. Was Jenkins configured to test #2090? Are we using the same test suite for CUB? Shouldn't we test both the original backend and the CUB backend, since the latter, once activated, would shadow the former? I didn't see any tailored tests in #2090.

@leofang leofang changed the title [WIP] Fix bug in CUB + support complex numbers using CUB Fix bug in CUB + support complex numbers using CUB Sep 30, 2019
@leofang leofang marked this pull request as ready for review September 30, 2019 02:34
@toslunar toslunar self-assigned this Sep 30, 2019
Member

@toslunar toslunar left a comment

The enhancement of reduce_sum looks good to me after the minor fix.

My concern on reduce_(min|max) is that the performance may depend on input data. Could you compare the performance for edge cases e.g. a large constant array cupy.full((400, 32, 28, 28), 1+2j)?

# (numpy/numpy#2004): test real part first, then imaginary
out_re = _reduce_min(x.real, out if out is None else out.real)
idx = (x.real == out_re).nonzero()
if len(idx) == x.ndim:
Member

This condition appears to always be true. Is the following intended?

y = x[idx]
if len(y) == 1:
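For context, a minimal NumPy check (CuPy's ndarray.nonzero() has the same semantics) shows why the condition always holds: nonzero() returns a tuple with one index array per dimension, so its length equals ndim regardless of how many elements match.

```python
import numpy as np

x = np.zeros((4, 3, 2))
idx = (x == 0).nonzero()

# nonzero() returns one index array per dimension,
# so len(idx) equals x.ndim no matter how many elements match.
assert len(idx) == x.ndim
# The number of matches is the length of each index array instead.
assert len(idx[0]) == x.size
```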

Member Author

I think your comment above is a different facet of your question:

My concern on reduce_(min|max) is that the performance may depend on input data. Could you compare the performance for edge cases e.g. a large constant array cupy.full((400, 32, 28, 28), 1+2j)?

To answer them, first note that len(idx) == x.ndim is not always True. len(idx) is proportional to the number of elements whose real part is equal to out_re. So, in the worst-case scenario you suggested with a large constant array, len(idx) == N * x.ndim (where N is the total number of elements), and then we move on to compare the imaginary part of the entire array.

The if statement here handles the best case (a single hit). In a less ideal case (a few hits in the real part), the search space is greatly reduced in the next call to reduce_(min|max).
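A minimal NumPy sketch of the two-pass idea described above (complex_min and the reduction calls are illustrative stand-ins for the CUB-backed code in this PR, not its actual implementation):

```python
import numpy as np

def complex_min(x):
    # Pass 1: reduce over the real parts (stand-in for the CUB call).
    re_min = x.real.min()
    # Keep only candidates whose real part attained the minimum.
    candidates = x[x.real == re_min]
    if candidates.size == 1:
        return candidates[0]  # best case: a single hit
    # Pass 2: break ties lexicographically on the imaginary parts.
    return re_min + 1j * candidates.imag.min()

x = np.array([2 + 1j, 1 + 4j, 1 - 2j, 3 + 0j])
assert complex_min(x) == 1 - 2j
assert complex_min(x) == x.min()  # matches NumPy's lexicographic min
```

For a constant array, every element survives the first pass, so the second pass scans the whole array again; that is the worst case discussed above.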

So, yes, in the worst case we search through the N-element array twice, and in that case we rely solely on CUB's clever optimization. With the input array x replaced by x = cp.full((400, 32, 28, 28), 1+2j, dtype=dtype), we can see that while it takes longer on the GPU (CUB still outperforms the naive ufunc), the NumPy performance is even worse:

For CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 601.90087890625 ms
numpy : 11250.6796875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 4847.3271484375 ms
numpy : 31283.07421875 ms
testing <class 'numpy.complex64'> + min ...
cupy : 4812.888671875 ms
numpy : 27644.96484375 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1170.836181640625 ms
numpy : 11166.1201171875 ms
testing <class 'numpy.complex128'> + max ...
cupy : 5515.779296875 ms
numpy : 28532.884765625 ms
testing <class 'numpy.complex128'> + min ...
cupy : 5516.4833984375 ms
numpy : 28526.55078125 ms

For CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 9120.3193359375 ms
numpy : 8410.7587890625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 13439.9365234375 ms
numpy : 26741.98828125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 13067.576171875 ms
numpy : 26748.564453125 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 9805.685546875 ms
numpy : 10828.603515625 ms
testing <class 'numpy.complex128'> + max ...
cupy : 14115.16015625 ms
numpy : 27617.724609375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 13888.92578125 ms
numpy : 27661.27734375 ms

I think this is the reason we want to use CUB.

@leofang
Member Author

leofang commented Oct 7, 2019

Just FYI: it might be possible to implement native support for complex numbers in CUB by specializing two instances of NumericTraits<complex<float>> and NumericTraits<complex<double>>. However, until I resolve #2530 I have no way to experiment with this approach locally, so let us stick to this PR for the moment.

@leofang
Member Author

leofang commented Oct 9, 2019

It might be possible to implement native support for complex numbers in CUB by specializing two instances of NumericTraits<complex<float>> and NumericTraits<complex<double>>.

This is achieved in #2538, which is now the favored approach. @toslunar, I am closing this PR to avoid duplicate review effort.

@leofang leofang closed this Oct 13, 2019
@leofang leofang deleted the cub_complex branch October 19, 2019 03:43
@kmaehashi kmaehashi added this to the Closed issues and PRs milestone Oct 25, 2019