Fix bug in CUB + **Native support** of complex numbers in CUB #2538
emcastillo merged 8 commits into cupy:master
Conversation
---
btw, thanks to @emcastillo for helping me resolve #2530!
---
The statistics are taken using the script from #2508 (comment). The test environment has a Tesla K40, NumPy 1.17.2, Python 3.7.2. First, for the best-case scenario (a random-number array; see the discussion in #2508 (comment)):

With CUB enabled:

```
testing <class 'numpy.complex64'> + sum ...
cupy : 933.5860595703125 ms
numpy : 7430.88525390625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 959.6272583007812 ms
numpy : 19503.220703125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 954.2632446289062 ms
numpy : 19516.017578125 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1895.904052734375 ms
numpy : 9782.234375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 1869.08740234375 ms
numpy : 20967.619140625 ms
testing <class 'numpy.complex128'> + min ...
cupy : 1853.9271240234375 ms
numpy : 20883.990234375 ms
```

With CUB disabled:

```
testing <class 'numpy.complex64'> + sum ...
cupy : 11691.6845703125 ms
numpy : 7313.32177734375 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20772.88671875 ms
numpy : 20053.484375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 19223.205078125 ms
numpy : 19607.2890625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13719.2109375 ms
numpy : 9506.220703125 ms
testing <class 'numpy.complex128'> + max ...
cupy : 21803.724609375 ms
numpy : 20842.4140625 ms
testing <class 'numpy.complex128'> + min ...
cupy : 20630.646484375 ms
numpy : 20842.595703125 ms
```

Next, for the worst-case scenario (a large constant array):

With CUB enabled:

```
testing <class 'numpy.complex64'> + sum ...
cupy : 927.1622924804688 ms
numpy : 7389.314453125 ms
testing <class 'numpy.complex64'> + max ...
cupy : 945.6170654296875 ms
numpy : 22317.068359375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 944.6876831054688 ms
numpy : 22350.94921875 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 1873.5682373046875 ms
numpy : 9709.2421875 ms
testing <class 'numpy.complex128'> + max ...
cupy : 1838.181884765625 ms
numpy : 23334.5859375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 1837.8828125 ms
numpy : 23280.82421875 ms
```

With CUB disabled:

```
testing <class 'numpy.complex64'> + sum ...
cupy : 11355.7841796875 ms
numpy : 7215.248046875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20950.353515625 ms
numpy : 22390.94140625 ms
testing <class 'numpy.complex64'> + min ...
cupy : 20206.13671875 ms
numpy : 22422.734375 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13527.4794921875 ms
numpy : 9529.99609375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 22085.865234375 ms
numpy : 23412.68359375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 21582.56640625 ms
numpy : 23378.427734375 ms
```

Note that in addition to the tremendous speedup, the double-complex cases take about twice as long as the single-complex ones, showing that there is no Python overhead and that the time spent is dominated by computation.
---
For comparison, the performance of the old Python implementation #2508 (commit 34d6ed9) in the same test environment is given below.

Best case:

```
testing <class 'numpy.complex64'> + sum ...
cupy : 2003.8343505859375 ms
numpy : 7260.60888671875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 7911.07177734375 ms
numpy : 20359.08984375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 6198.251953125 ms
numpy : 19549.482421875 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 3602.372802734375 ms
numpy : 9495.134765625 ms
testing <class 'numpy.complex128'> + max ...
cupy : 7724.533203125 ms
numpy : 20779.3203125 ms
testing <class 'numpy.complex128'> + min ...
cupy : 7231.24951171875 ms
numpy : 20690.900390625 ms
testing <class 'numpy.complex64'> + sum ...
cupy : 11630.3623046875 ms
numpy : 7203.65625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20630.8359375 ms
numpy : 19476.669921875 ms
testing <class 'numpy.complex64'> + min ...
cupy : 19205.529296875 ms
numpy : 19557.978515625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13742.28125 ms
numpy : 9495.130859375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 21872.9921875 ms
numpy : 20706.802734375 ms
testing <class 'numpy.complex128'> + min ...
cupy : 20595.66796875 ms
numpy : 20680.96875 ms
```

Worst case:

```
testing <class 'numpy.complex64'> + sum ...
cupy : 1723.978515625 ms
numpy : 7220.60400390625 ms
testing <class 'numpy.complex64'> + max ...
cupy : 14753.4794921875 ms
numpy : 22732.345703125 ms
testing <class 'numpy.complex64'> + min ...
cupy : 14762.6923828125 ms
numpy : 22321.556640625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 3386.323974609375 ms
numpy : 9486.8037109375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 16633.728515625 ms
numpy : 23227.875 ms
testing <class 'numpy.complex128'> + min ...
cupy : 16633.47265625 ms
numpy : 23345.580078125 ms
testing <class 'numpy.complex64'> + sum ...
cupy : 11335.9609375 ms
numpy : 7238.40966796875 ms
testing <class 'numpy.complex64'> + max ...
cupy : 20947.6640625 ms
numpy : 22308.912109375 ms
testing <class 'numpy.complex64'> + min ...
cupy : 20203.267578125 ms
numpy : 22354.462890625 ms
testing <class 'numpy.complex128'> + sum ...
cupy : 13486.8017578125 ms
numpy : 9504.3115234375 ms
testing <class 'numpy.complex128'> + max ...
cupy : 22084.146484375 ms
numpy : 23450.060546875 ms
testing <class 'numpy.complex128'> + min ...
cupy : 21582.712890625 ms
numpy : 23377.8828125 ms
```

In short, the native CUB version has a huge speedup over the half-Python version, especially for …
---
The performance here looks great. There is a potential concern about maintainability, given that the CUB reduction code in … If a future version of CUB does replace …
It should be OK even if we don't bundle a copy of CUB. Note that …
---
A note for myself: see `cupy/core/include/cupy/complex.cuh`, lines 50 to 79 (at commit 22d6d0a).
---
Jenkins & pfnCIs test this please
---
@emcastillo quick question: are the CIs configured to test CUB (that is, are …)?
---
CIs do not support CUB, cuTENSOR, or other libraries right now.
---
Oh, I see. So now the CI tests are just for ensuring we don't break things in CUB- / cuTENSOR-related PRs? (By CI I meant Jenkins, btw.)
---
Exactly!
---
Jenkins CI test (for commit 86e0b15, target branch master) succeeded!
---
Thanks @emcastillo!
This PR supersedes #2508 (was at commit 34d6ed9). Now complex-number reductions are supported natively in CUB.
Closes #2508. Addresses #2519.
Specifically, to achieve this we need two changes:
1. Add the `__host__` qualifier to basic `thrust::complex` operations. Note that this also complies with the latest Thrust, but more importantly we need this for item 2 below. (Without this change the program would crash silently, not even throwing a segfault!)
2. Add `NumericalTraits` support for complex types: see the comment in `cupy/cuda/cupy_cub.cu`.

I see a much better improvement over the previous version #2508. Statistics follow below.
cc: @grlee77 @toslunar @anaruse