
Fix bug in CUB + **Native support** of complex numbers in CUB #2538

Merged
emcastillo merged 8 commits into cupy:master from leofang:cub_complex2 on Oct 18, 2019

Conversation


@leofang leofang commented Oct 8, 2019

This PR supersedes #2508 (was at commit 34d6ed9). Now complex-number reductions are supported natively in CUB.

Closes #2508. Addresses #2519.

Specifically, to achieve this we need two changes:

  1. Add the __host__ qualifier to basic thrust::complex operations: this also complies with the latest Thrust, but more importantly it is required for change 2 below. (Without this change the program would crash silently, without even throwing a segfault!)
  2. Add two specializations of NumericTraits: see the comment in cupy/cuda/cupy_cub.cu.

This shows a substantial improvement over the previous version (#2508); statistics follow below.

cc: @grlee77 @toslunar @anaruse
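These reductions follow NumPy's semantics for complex arrays. As a minimal sketch of the behavior being accelerated, using NumPy as a CPU stand-in (CuPy exposes the same array API; on GPU, these are the calls this PR routes through CUB):

```python
import numpy as np

# NumPy as a CPU stand-in; cupy arrays expose the identical sum/max/min
# API, and on GPU these reductions now dispatch to CUB.
a = np.arange(6).astype(np.complex64)
a = a + 1j * a                 # 0+0j, 1+1j, ..., 5+5j

total = a.sum()      # real and imaginary parts are summed independently
largest = a.max()    # lexicographic order: real part first, then imaginary
smallest = a.min()

print(total, largest, smallest)
```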


leofang commented Oct 8, 2019

btw, thanks to @emcastillo for helping me resolve #2530!


leofang commented Oct 8, 2019

The statistics were collected using the script from #2508 (comment). The test environment has a Tesla K40, NumPy 1.17.2, and Python 3.7.2.

First, for the best-case scenario (a random-number array; see the discussion in #2508 (comment)): With CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 933.5860595703125 ms
numpy : 7430.88525390625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 959.6272583007812 ms
numpy : 19503.220703125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 954.2632446289062 ms
numpy : 19516.017578125 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1895.904052734375 ms
numpy : 9782.234375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 1869.08740234375 ms
numpy : 20967.619140625 ms
testing <class 'numpy.complex128'> + min ...
cupy : 1853.9271240234375 ms
numpy : 20883.990234375 ms

With CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 11691.6845703125 ms
numpy : 7313.32177734375 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20772.88671875 ms
numpy : 20053.484375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 19223.205078125 ms
numpy : 19607.2890625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13719.2109375 ms
numpy : 9506.220703125 ms
testing <class 'numpy.complex128'> + max ...
cupy : 21803.724609375 ms
numpy : 20842.4140625 ms
testing <class 'numpy.complex128'> + min ...
cupy : 20630.646484375 ms
numpy : 20842.595703125 ms

Next, for the worst-case scenario (a large constant array): With CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 927.1622924804688 ms
numpy : 7389.314453125 ms
testing <class 'numpy.complex64'> + max ...
cupy : 945.6170654296875 ms
numpy : 22317.068359375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 944.6876831054688 ms
numpy : 22350.94921875 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1873.5682373046875 ms
numpy : 9709.2421875 ms
testing <class 'numpy.complex128'> + max ...
cupy : 1838.181884765625 ms
numpy : 23334.5859375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 1837.8828125 ms
numpy : 23280.82421875 ms

With CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 11355.7841796875 ms
numpy : 7215.248046875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20950.353515625 ms
numpy : 22390.94140625 ms
testing <class 'numpy.complex64'> + min ...
cupy : 20206.13671875 ms
numpy : 22422.734375 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13527.4794921875 ms
numpy : 9529.99609375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 22085.865234375 ms
numpy : 23412.68359375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 21582.56640625 ms
numpy : 23378.427734375 ms

Note that in addition to the tremendous speedup, the double-complex cases take about twice as long as the single-complex ones, showing that there is no Python overhead and that the runtime is computation-dominated.

If there's interest I could re-run the tests at commit 34d6ed9 in this test environment. (A bit lazy now.)
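The script from #2508 is linked rather than reproduced here; as a rough illustration only, the timing loop has this shape (the `bench` helper and its defaults are hypothetical; NumPy stands in for the CPU baseline, and the CuPy side would pass `cupy` instead, adding a device synchronization around the timers):

```python
import time
import numpy as np

def bench(xp, dtype, op, size=1_000_000, repeat=5):
    # Hypothetical helper sketching the benchmark pattern (the real script
    # is in #2508). `xp` is the array module: numpy here, cupy on GPU
    # (where cupy.cuda.Stream.null.synchronize() would bracket the timers).
    a = (xp.random.random(size) + 1j * xp.random.random(size)).astype(dtype)
    getattr(a, op)()                       # warm-up call
    t0 = time.perf_counter()
    for _ in range(repeat):
        result = getattr(a, op)()
    ms = (time.perf_counter() - t0) / repeat * 1000.0
    return result, ms                      # reduction result, ms per call

result, ms = bench(np, np.complex64, 'sum')
print(f"numpy complex64 sum: {ms:.3f} ms")
```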


leofang commented Oct 9, 2019

For comparison, the performance of the old Python implementation #2508 (commit 34d6ed9) in the same test environment is given below.

Best case + CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 2003.8343505859375 ms
numpy : 7260.60888671875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 7911.07177734375 ms
numpy : 20359.08984375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 6198.251953125 ms
numpy : 19549.482421875 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 3602.372802734375 ms
numpy : 9495.134765625 ms
testing <class 'numpy.complex128'> + max ...
cupy : 7724.533203125 ms
numpy : 20779.3203125 ms
testing <class 'numpy.complex128'> + min ...
cupy : 7231.24951171875 ms
numpy : 20690.900390625 ms

CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 11630.3623046875 ms
numpy : 7203.65625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20630.8359375 ms
numpy : 19476.669921875 ms
testing <class 'numpy.complex64'> + min ...
cupy : 19205.529296875 ms
numpy : 19557.978515625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13742.28125 ms
numpy : 9495.130859375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 21872.9921875 ms
numpy : 20706.802734375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 20595.66796875 ms
numpy : 20680.96875 ms

Worst case + CUB_DISABLED=0:

testing <class 'numpy.complex64'> + sum ...
cupy : 1723.978515625 ms
numpy : 7220.60400390625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 14753.4794921875 ms
numpy : 22732.345703125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 14762.6923828125 ms
numpy : 22321.556640625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 3386.323974609375 ms
numpy : 9486.8037109375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 16633.728515625 ms
numpy : 23227.875 ms
testing <class 'numpy.complex128'> + min ...
cupy : 16633.47265625 ms
numpy : 23345.580078125 ms

CUB_DISABLED=1:

testing <class 'numpy.complex64'> + sum ...
cupy : 11335.9609375 ms
numpy : 7238.40966796875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20947.6640625 ms
numpy : 22308.912109375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 20203.267578125 ms
numpy : 22354.462890625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13486.8017578125 ms
numpy : 9504.3115234375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 22084.146484375 ms
numpy : 23450.060546875 ms
testing <class 'numpy.complex128'> + min ...
cupy : 21582.712890625 ms
numpy : 23377.8828125 ms

In short, the native CUB version achieves a huge speedup over the half-Python version, especially for min() and max(), since it avoids redundant searches and unnecessary Python overhead.
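As a quick sanity check, the speedup of the native path over the old half-Python version can be computed directly from the best-case complex64 cupy timings quoted above (values rounded to two decimals):

```python
# Best-case complex64 cupy timings quoted above, in ms (rounded).
old = {'sum': 2003.83, 'max': 7911.07, 'min': 6198.25}   # commit 34d6ed9
new = {'sum': 933.59, 'max': 959.63, 'min': 954.26}      # this PR

speedup = {op: old[op] / new[op] for op in old}
for op, s in speedup.items():
    print(f"{op}: {s:.1f}x faster")   # sum ~2.1x, max ~8.2x, min ~6.5x
```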


grlee77 commented Oct 9, 2019

The performance here looks great. There is a potential concern about maintainability given that the CUB reduction code in cub/device/device_reduce.cuh has the following comment:

OutputTupleT initial_value(1, Traits<InputValueT>::Max());   // replace with std::numeric_limits<T>::max() when C++11 support is more prevalent

If a future version of CUB does replace Traits<InputValueT>::Max() (NumericTraits) with std::numeric_limits<T>::max(), it seems that would break the max/min reductions here? I suppose if we bundled CUB, as you asked about in #2519, then we could simply not make that change in the local copy.


leofang commented Oct 9, 2019

> If a future version of CUB does replace Traits<InputValueT>::Max() (NumericTraits) with std::numeric_limits<T>::max() then it seems that would break the max/min reductions here? I suppose if we bundled CUB like you asked about it in #2519, then we could just not make that change in the local copy.

It should be OK even if we don't bundle a copy of CUB. Note that std::numeric_limits<T> is a template, and we can always provide the same specializations as we do here; see below:

> Implementations may provide specializations of std::numeric_limits for implementation-specific types: e.g. GCC provides std::numeric_limits<__int128>. Non-standard libraries may add specializations for library-provided types, e.g. OpenEXR provides std::numeric_limits for a 16-bit floating-point type.

ref: https://en.cppreference.com/w/cpp/types/numeric_limits


leofang commented Oct 9, 2019

A note for myself: the max() and min() reductions are enabled by the NumPy-compliant max() and min() implementations for complex numbers (#1285):

template<typename T> __device__ complex<T> min(complex<T> x, complex<T> y) {
    if (isnan(x)) {
        return y;
    } else if (isnan(y)) {
        return x;
    } else if (x.real() < y.real()) {
        return x;
    } else if (x.real() > y.real()) {
        return y;
    } else if (x.imag() < y.imag()) {
        return x;
    } else {
        return y;
    }
}

template<typename T> __device__ complex<T> max(complex<T> x, complex<T> y) {
    if (isnan(x)) {
        return y;
    } else if (isnan(y)) {
        return x;
    } else if (x.real() < y.real()) {
        return y;
    } else if (x.real() > y.real()) {
        return x;
    } else if (x.imag() < y.imag()) {
        return y;
    } else {
        return x;
    }
}
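In NumPy terms, the device code above orders complex values lexicographically: the real part is compared first, and the imaginary part serves only as a tie-breaker. This matches NumPy's own comparison rule, which is what makes the reductions NumPy-compliant (NaN handling aside). A small NumPy check:

```python
import numpy as np

# NumPy compares complex values lexicographically: real part first,
# imaginary part only to break ties (same ordering as the device code).
a = np.array([1 + 2j, 1 + 3j, 0 + 9j, 2 - 1j])

print(a.max())   # 2-1j: the largest real part wins regardless of imag
print(a.min())   # 0+9j: the smallest real part wins
print(np.array([1 + 2j, 1 + 3j]).max())   # 1+3j: real parts tie, imag decides
```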


@emcastillo emcastillo left a comment


LGTM

@emcastillo

Jenkins & pfnCIs test this please

@emcastillo emcastillo added the cat:enhancement Improvements to existing features label Oct 18, 2019
@emcastillo emcastillo added this to the v7.0.0rc1 milestone Oct 18, 2019

leofang commented Oct 18, 2019

@emcastillo quick question: are the CIs configured to test CUB (that is, are CUB_PATH and CUB_DISABLED=0 set)?

@emcastillo

CIs do not support CUB, cuTENSOR, or other optional libraries right now.


leofang commented Oct 18, 2019

Oh, I see. So for now the CI tests just ensure we don't break things in CUB-/cuTENSOR-related PRs? (By CI I meant Jenkins, btw.)

@emcastillo

Exactly!

@chainer-ci

Jenkins CI test (for commit 86e0b15, target branch master) succeeded!

@emcastillo emcastillo merged commit cb54853 into cupy:master Oct 18, 2019
@leofang leofang deleted the cub_complex2 branch October 18, 2019 06:25

leofang commented Oct 18, 2019

Thanks @emcastillo!
