
Fix bug in CUB + **Native support** of complex numbers in CUB #2538

Merged
emcastillo merged 8 commits into cupy:master from leofang:cub_complex2 on Oct 18, 2019

Conversation


@leofang leofang commented Oct 8, 2019

This PR supersedes #2508 (was at commit 34d6ed9). Now complex-number reductions are supported natively in CUB.

Closes #2508. Addresses #2519.

Specifically, to achieve this we need two changes:

  1. Add the __host__ qualifier to basic thrust::complex operations: this also complies with the latest Thrust, but more importantly it is required for change 2 below. (Without this change the program would crash silently, without even throwing a segfault!)
  2. Add two specializations of NumericTraits: see the comment in cupy/cuda/cupy_cub.cu.

This shows a substantial improvement over the previous version (#2508); statistics follow below.

cc: @grlee77 @toslunar @anaruse
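These reductions follow NumPy's semantics for complex arrays. As a minimal sketch of the behavior being accelerated, using NumPy as a CPU stand-in (CuPy exposes the same array API; on GPU, these are the calls this PR routes through CUB):

```python
import numpy as np

# NumPy as a CPU stand-in; cupy arrays expose the identical sum/max/min
# API, and on GPU these reductions now dispatch to CUB.
a = np.arange(6).astype(np.complex64)
a = a + 1j * a                 # 0+0j, 1+1j, ..., 5+5j

total = a.sum()      # real and imaginary parts are summed independently
largest = a.max()    # lexicographic order: real part first, then imaginary
smallest = a.min()

print(total, largest, smallest)
```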


leofang commented Oct 8, 2019

btw, thanks to @emcastillo for helping me resolve #2530!


leofang commented Oct 8, 2019

The statistics were collected using the script from #2508 (comment). The test environment has a Tesla K40, NumPy 1.17.2, and Python 3.7.2.

First, for the best-case scenario (a random-number array; see the discussion in #2508 (comment)): With CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 933.5860595703125 ms
numpy : 7430.88525390625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 959.6272583007812 ms
numpy : 19503.220703125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 954.2632446289062 ms
numpy : 19516.017578125 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1895.904052734375 ms
numpy : 9782.234375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 1869.08740234375 ms
numpy : 20967.619140625 ms
testing <class 'numpy.complex128'> + min ...
cupy : 1853.9271240234375 ms
numpy : 20883.990234375 ms

With CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 11691.6845703125 ms
numpy : 7313.32177734375 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20772.88671875 ms
numpy : 20053.484375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 19223.205078125 ms
numpy : 19607.2890625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13719.2109375 ms
numpy : 9506.220703125 ms
testing <class 'numpy.complex128'> + max ...
cupy : 21803.724609375 ms
numpy : 20842.4140625 ms
testing <class 'numpy.complex128'> + min ...
cupy : 20630.646484375 ms
numpy : 20842.595703125 ms

Next, for the worst-case scenario (a large constant array): With CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 927.1622924804688 ms
numpy : 7389.314453125 ms
testing <class 'numpy.complex64'> + max ...
cupy : 945.6170654296875 ms
numpy : 22317.068359375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 944.6876831054688 ms
numpy : 22350.94921875 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1873.5682373046875 ms
numpy : 9709.2421875 ms
testing <class 'numpy.complex128'> + max ...
cupy : 1838.181884765625 ms
numpy : 23334.5859375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 1837.8828125 ms
numpy : 23280.82421875 ms

With CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 11355.7841796875 ms
numpy : 7215.248046875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20950.353515625 ms
numpy : 22390.94140625 ms
testing <class 'numpy.complex64'> + min ...
cupy : 20206.13671875 ms
numpy : 22422.734375 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13527.4794921875 ms
numpy : 9529.99609375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 22085.865234375 ms
numpy : 23412.68359375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 21582.56640625 ms
numpy : 23378.427734375 ms

Note that in addition to the tremendous speedup, the double-complex cases take about twice as long as the single-complex ones, showing that there is no Python overhead and that the runtime is computation-dominated.

If there's interest I could re-run the tests at commit 34d6ed9 in this test environment. (A bit lazy now.)
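The script from #2508 is linked rather than reproduced here; as a rough illustration only, the timing loop has this shape (the `bench` helper and its defaults are hypothetical; NumPy stands in for the CPU baseline, and the CuPy side would pass `cupy` instead, adding a device synchronization around the timers):

```python
import time
import numpy as np

def bench(xp, dtype, op, size=1_000_000, repeat=5):
    # Hypothetical helper sketching the benchmark pattern (the real script
    # is in #2508). `xp` is the array module: numpy here, cupy on GPU
    # (where cupy.cuda.Stream.null.synchronize() would bracket the timers).
    a = (xp.random.random(size) + 1j * xp.random.random(size)).astype(dtype)
    getattr(a, op)()                       # warm-up call
    t0 = time.perf_counter()
    for _ in range(repeat):
        result = getattr(a, op)()
    ms = (time.perf_counter() - t0) / repeat * 1000.0
    return result, ms                      # reduction result, ms per call

result, ms = bench(np, np.complex64, 'sum')
print(f"numpy complex64 sum: {ms:.3f} ms")
```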


leofang commented Oct 9, 2019

For comparison, the performance of the old Python implementation #2508 (commit 34d6ed9) in the same test environment is given below.

Best case + CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 2003.8343505859375 ms
numpy : 7260.60888671875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 7911.07177734375 ms
numpy : 20359.08984375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 6198.251953125 ms
numpy : 19549.482421875 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 3602.372802734375 ms
numpy : 9495.134765625 ms
testing <class 'numpy.complex128'> + max ...
cupy : 7724.533203125 ms
numpy : 20779.3203125 ms
testing <class 'numpy.complex128'> + min ...
cupy : 7231.24951171875 ms
numpy : 20690.900390625 ms

CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 11630.3623046875 ms
numpy : 7203.65625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20630.8359375 ms
numpy : 19476.669921875 ms
testing <class 'numpy.complex64'> + min ...
cupy : 19205.529296875 ms
numpy : 19557.978515625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13742.28125 ms
numpy : 9495.130859375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 21872.9921875 ms
numpy : 20706.802734375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 20595.66796875 ms
numpy : 20680.96875 ms

Worst case + CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 1723.978515625 ms
numpy : 7220.60400390625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 14753.4794921875 ms
numpy : 22732.345703125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 14762.6923828125 ms
numpy : 22321.556640625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 3386.323974609375 ms
numpy : 9486.8037109375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 16633.728515625 ms
numpy : 23227.875 ms
testing <class 'numpy.complex128'> + min ...
cupy : 16633.47265625 ms
numpy : 23345.580078125 ms

CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 11335.9609375 ms
numpy : 7238.40966796875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20947.6640625 ms
numpy : 22308.912109375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 20203.267578125 ms
numpy : 22354.462890625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13486.8017578125 ms
numpy : 9504.3115234375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 22084.146484375 ms
numpy : 23450.060546875 ms
testing <class 'numpy.complex128'> + min ...
cupy : 21582.712890625 ms
numpy : 23377.8828125 ms

In short, the native CUB version achieves a huge speedup over the half-Python version, especially for min() and max(), since it avoids redundant searches and unnecessary Python overhead.
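As a quick sanity check, the speedup of the native path over the old half-Python version can be computed directly from the best-case complex64 cupy timings quoted above (values rounded to two decimals):

```python
# Best-case complex64 cupy timings quoted above, in ms (rounded).
old = {'sum': 2003.83, 'max': 7911.07, 'min': 6198.25}   # commit 34d6ed9
new = {'sum': 933.59, 'max': 959.63, 'min': 954.26}      # this PR

speedup = {op: old[op] / new[op] for op in old}
for op, s in speedup.items():
    print(f"{op}: {s:.1f}x faster")   # sum ~2.1x, max ~8.2x, min ~6.5x
```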


grlee77 commented Oct 9, 2019

The performance here looks great. There is a potential concern about maintainability given that the CUB reduction code in cub/device/device_reduce.cuh has the following comment:

OutputTupleT initial_value(1, Traits<InputValueT>::Max());   // replace with std::numeric_limits<T>::max() when C++11 support is more prevalent

If a future version of CUB does replace Traits<InputValueT>::Max() (NumericTraits) with std::numeric_limits<T>::max(), it seems that would break the max/min reductions here? I suppose if we bundled CUB, as you asked about in #2519, then we could simply not make that change in the local copy.


leofang commented Oct 9, 2019

> If a future version of CUB does replace Traits<InputValueT>::Max() (NumericTraits) with std::numeric_limits<T>::max() then it seems that would break the max/min reductions here? I suppose if we bundled CUB like you asked about it in #2519, then we could just not make that change in the local copy.

It should be OK even if we don't bundle a copy of CUB. Note that std::numeric_limits<T> is a template, and we can always provide the same specializations as we do here; see below:

> Implementations may provide specializations of std::numeric_limits for implementation-specific types: e.g. GCC provides std::numeric_limits<__int128>. Non-standard libraries may add specializations for library-provided types, e.g. OpenEXR provides std::numeric_limits for a 16-bit floating-point type.

ref: https://en.cppreference.com/w/cpp/types/numeric_limits


leofang commented Oct 9, 2019

A note for myself: the max() and min() reductions are enabled by the NumPy-compliant max() and min() implementations for complex numbers (#1285):

template<typename T> __device__ complex<T> min(complex<T> x, complex<T> y) {
    if (isnan(x)) {
        return y;
    } else if (isnan(y)) {
        return x;
    } else if (x.real() < y.real()) {
        return x;
    } else if (x.real() > y.real()) {
        return y;
    } else if (x.imag() < y.imag()) {
        return x;
    } else {
        return y;
    }
}

template<typename T> __device__ complex<T> max(complex<T> x, complex<T> y) {
    if (isnan(x)) {
        return y;
    } else if (isnan(y)) {
        return x;
    } else if (x.real() < y.real()) {
        return y;
    } else if (x.real() > y.real()) {
        return x;
    } else if (x.imag() < y.imag()) {
        return y;
    } else {
        return x;
    }
}
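In NumPy terms, the device code above orders complex values lexicographically: the real part is compared first, and the imaginary part serves only as a tie-breaker. This matches NumPy's own comparison rule, which is what makes the reductions NumPy-compliant (NaN handling aside). A small NumPy check:

```python
import numpy as np

# NumPy compares complex values lexicographically: real part first,
# imaginary part only to break ties (same ordering as the device code).
a = np.array([1 + 2j, 1 + 3j, 0 + 9j, 2 - 1j])

print(a.max())   # 2-1j: the largest real part wins regardless of imag
print(a.min())   # 0+9j: the smallest real part wins
print(np.array([1 + 2j, 1 + 3j]).max())   # 1+3j: real parts tie, imag decides
```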


@emcastillo emcastillo left a comment


LGTM

@emcastillo

Jenkins & pfnCIs test this please

@emcastillo emcastillo added the cat:enhancement Improvements to existing features label Oct 18, 2019
@emcastillo emcastillo added this to the v7.0.0rc1 milestone Oct 18, 2019

leofang commented Oct 18, 2019

@emcastillo quick question: are the CIs configured to test CUB (that is, are CUB_PATH and CUB_DISABLED=0 set)?

@emcastillo

CIs do not support CUB, cuTENSOR, or other optional libraries right now.


leofang commented Oct 18, 2019

Oh, I see. So for now the CI tests just ensure we don't break things in CUB-/cuTENSOR-related PRs? (By CI I meant Jenkins, btw.)

@emcastillo

Exactly!

@chainer-ci

Jenkins CI test (for commit 86e0b15, target branch master) succeeded!

@emcastillo emcastillo merged commit cb54853 into cupy:master Oct 18, 2019
@leofang leofang deleted the cub_complex2 branch October 18, 2019 06:25

leofang commented Oct 18, 2019

Thanks @emcastillo!
