Fix bug in CUB + support complex numbers using CUB #2508
leofang wants to merge 4 commits into cupy:master

Conversation
The test script I used is basically taken from #2085 (comment) with minor changes:

```python
import cupy as cp
from contextlib import contextmanager

@contextmanager
def sync_time(name):
    start = cp.cuda.Event()
    end = cp.cuda.Event()
    start.record()
    # start.synchronize()
    yield
    end.record()
    end.synchronize()
    t = cp.cuda.get_elapsed_time(start, end)
    print("{} : {} ms".format(name, t))

for dtype in (cp.complex64, cp.complex128):
    for func in ('sum', 'max', 'min'):
        print("testing", str(dtype), "+", func, "...")
        x = cp.random.normal(size=(400, 32, 28, 28)).astype(dtype) \
            + 1j * cp.random.normal(size=(400, 32, 28, 28)).astype(dtype)
        y = None
        with sync_time("cupy"):
            for i in range(1000):
                y = getattr(x, func)()
        x = cp.asnumpy(x)  # move to CPU
        z = None
        with sync_time("numpy"):
            for i in range(1000):
                z = getattr(x, func)()
        # print(str(dtype), func, y, z)
        assert cp.allclose(y, z)
```

For CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 600.7633056640625 ms
numpy : 8711.4794921875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 1971.037353515625 ms
numpy : 21387.251953125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 1957.8563232421875 ms
numpy : 21416.21484375 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1171.1514892578125 ms
numpy : 11070.486328125 ms
testing <class 'numpy.complex128'> + max ...
cupy : 2360.289794921875 ms
numpy : 22052.818359375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 2368.4697265625 ms
numpy : 22123.5 ms

For CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 9115.7255859375 ms
numpy : 8392.7216796875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 12982.4951171875 ms
numpy : 21374.0 ms
testing <class 'numpy.complex64'> + min ...
cupy : 12394.900390625 ms
numpy : 21423.166015625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 10122.890625 ms
numpy : 11128.6943359375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 13595.556640625 ms
numpy : 22030.51953125 ms
testing <class 'numpy.complex128'> + min ...
cupy : 13326.609375 ms
numpy : 22058.287109375 ms

Testing environment:
$ conda list | grep mkl
blas 1.0 mkl defaults
mkl 2019.1 144 defaults
mkl-service 1.1.2 py36he904b0f_5 defaults
mkl_fft 1.0.10 py36ha843d7b_0 defaults
mkl_random 1.0.2 py36hd81dba3_0 defaults
$ python -c "import numpy; numpy.__config__.show()"
mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
blas_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
blas_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
lapack_mkl_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
lapack_opt_info:
libraries = ['mkl_rt', 'pthread']
library_dirs = ['/home/leofang/conda_envs/cupy_dev/lib']
define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
include_dirs = ['/home/leofang/conda_envs/cupy_dev/include']
toslunar left a comment:
The enhancement of `reduce_sum` looks good to me after the minor fix.

My concern on `reduce_(min|max)` is that the performance may depend on the input data. Could you compare the performance for edge cases, e.g. a large constant array `cupy.full((400, 32, 28, 28), 1+2j)`?
```python
# (numpy/numpy#2004): test real part first, then imaginary
out_re = _reduce_min(x.real, out if out is None else out.real)
idx = (x.real == out_re).nonzero()
if len(idx) == x.ndim:
```
The condition looks always true. Is the following intended?

```python
y = x[idx]
if len(y) == 1:
```
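For reference, `ndarray.nonzero()` returns a tuple with one index array per dimension, so `len(idx) == x.ndim` holds no matter how many elements match; the count of matches is the length of each array inside the tuple. A short NumPy check (NumPy used here as a stand-in for CuPy, whose `nonzero` has the same semantics):

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

# No elements satisfy the condition, yet the result is still a 3-tuple.
idx = (x > 100).nonzero()
print(len(idx))     # 3 == x.ndim
print(len(idx[0]))  # 0 -> this is where the match count lives

# Every element satisfies the condition: still a 3-tuple, arrays of length 24.
idx = (x >= 0).nonzero()
print(len(idx))     # 3
print(len(idx[0]))  # 24
```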
I think your comment above is a different facet of your question:

> My concern on `reduce_(min|max)` is that the performance may depend on input data. Could you compare the performance for edge cases, e.g. a large constant array `cupy.full((400, 32, 28, 28), 1+2j)`?

To answer them, first note that `len(idx) == x.ndim` is not always True: `len(idx)` is proportional to the number of elements whose real part is equal to `out_re`. So, in the worst-case scenario you suggested with a large constant array, `len(idx) == N * x.ndim` (where N is the total number of elements), and then we move on to compare the imaginary part of the entire array.

The if statement here handles the best case (a single hit). In a less ideal case (a few hits in the real part), the search space is greatly reduced in the next call of `reduce_(min|max)`.
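The two-pass strategy under discussion can be sketched in NumPy (the helper name `lex_min` is hypothetical; the actual PR operates on CuPy arrays with CUB-backed reductions):

```python
import numpy as np

def lex_min(x):
    # Pass 1: reduce over the real parts only.
    re_min = x.real.min()
    # Candidates whose real part attains the minimum.
    ties = x[x.real == re_min]
    if ties.size == 1:
        # Best case: a single hit, no second pass needed.
        return ties[0]
    # Pass 2: reduce the imaginary parts over the (usually much smaller) tie set.
    return ties[ties.imag.argmin()]

a = np.array([3+4j, 1+9j, 1+2j, 5+0j])
print(lex_min(a))             # (1+2j)
print(lex_min(a) == a.min())  # True: matches NumPy's lexicographic ordering
```

For a constant array every element ties in pass 1, so pass 2 scans the full array again, which is the worst case discussed above.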
So, yes, in the worst case we search through the N-element array twice, and then we rely solely on CUB's clever optimization. With the input array `x` replaced by `x = cp.full((400, 32, 28, 28), 1+2j, dtype=dtype)`, we can see that while it takes longer on the GPU (though CUB still outperforms the naive ufunc), the NumPy performance is even worse:
For CUB_DISABLED=0:
testing <class 'numpy.complex64'> + sum ...
cupy : 601.90087890625 ms
numpy : 11250.6796875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 4847.3271484375 ms
numpy : 31283.07421875 ms
testing <class 'numpy.complex64'> + min ...
cupy : 4812.888671875 ms
numpy : 27644.96484375 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1170.836181640625 ms
numpy : 11166.1201171875 ms
testing <class 'numpy.complex128'> + max ...
cupy : 5515.779296875 ms
numpy : 28532.884765625 ms
testing <class 'numpy.complex128'> + min ...
cupy : 5516.4833984375 ms
numpy : 28526.55078125 ms

For CUB_DISABLED=1:
testing <class 'numpy.complex64'> + sum ...
cupy : 9120.3193359375 ms
numpy : 8410.7587890625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 13439.9365234375 ms
numpy : 26741.98828125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 13067.576171875 ms
numpy : 26748.564453125 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 9805.685546875 ms
numpy : 10828.603515625 ms
testing <class 'numpy.complex128'> + max ...
cupy : 14115.16015625 ms
numpy : 27617.724609375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 13888.92578125 ms
numpy : 27661.27734375 ms

I think this is the reason we want to use CUB.
Just FYI: It might be possible to implement native support of complex numbers in CUB by specializing two instances of …
This is a followup of #2090. The bulk of this PR is completed, but I mark it WIP so as to gather early feedback and comments before proceeding to write tests.

This PR does two things:

1. Fixes a bug: without `#include <stdexcept>` in `cupy/cuda/cupy_cub.cu`, I found that there is a build-time error.
2. Adds support of `sum()`/`max()`/`min()` for complex numbers using CUB. This support is compliant with NumPy behavior.

The current implementation is, admittedly, an ugly and not fully optimized hack, for two reasons: (i) CUB does not provide native support for complex numbers (see the CUB source code `cub/util_type.cuh`), and (ii) since the comparison of two complex numbers is ill-defined, NumPy uses a lexical search for `max()` and `min()`, which requires two passes (separately examining the real and imaginary parts) in the worst-case scenario. However, in my tests this hack still offers substantial speedup (see #2508 (comment)), so I think it is worth considering. We can revisit this issue once CUB has better support.
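As a concrete illustration of the NumPy ordering being matched here (real parts compared first, imaginary parts used only to break ties):

```python
import numpy as np

a = np.array([1+5j, 2+1j, 2+3j])
print(a.max())  # (2+3j): largest real part wins; imaginary breaks the 2+1j vs 2+3j tie
print(a.min())  # (1+5j): smallest real part wins despite the large imaginary part
```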
Attn: @anaruse