
Add fp16 support of CUB #2600

Merged
mergify[bot] merged 13 commits into cupy:master from y1r:cub_fp16_support
Feb 18, 2020

Conversation

@y1r
Contributor

y1r commented Nov 4, 2019

Currently, CuPy's CUB binding doesn't support fp16, even though CUB itself does.

I implemented fp16 support in the CUB binding by using the float16 class from carray.cuh.
I am not sure whether this is the right class to use for this implementation, so please give me comments.

@leofang
Member

leofang left a comment


Thanks a lot for filling in this missing support so quickly. I wanted to try it myself, but decided it wasn't my priority.😅

I have three questions about this PR:

  1. As far as I can tell -- correct me if I am wrong -- support for half precision (__half) begins only in recent CUDA versions (since 9.0, I suppose? Not sure why CuPy wants to bundle cuda_fp16.h for 9.2 and above...). If we want to enable this in CUB, we should test whether __CUDACC_VER_MAJOR__ >= 9 and drop support for, say, CUDA 8.0. This needs to be done in both the C++ and the Python code.
  2. For the CUDA versions that support __half, CUB already specializes a NumericTraits for it, so I don't see why we need to include <cupy/carray.cuh> and duplicate the effort. Can you test whether the native CUB support for __half just works?
  3. I am interested in the performance gain for float16. Can you modify the test in #2598 and show the performance comparison for it? (You just need to modify the strings lLfdFD to lLefdFD and bhilBHILfdFD to bhilBHILefdFD there, i.e., adding 'e'. Run it using python -m pytest -rA -v ... to see the full stdout.)

Thanks again.
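The host-side half of the gate in question 1 could be sketched as follows. This is illustrative only: the helper name `cub_fp16_enabled` and the exact threshold are hypothetical, not CuPy's actual dispatcher code; only the CUDA version-encoding convention (major * 1000 + minor * 10, e.g. 9020 for CUDA 9.2) is assumed.

```python
# Hypothetical sketch of a host-side (Python) gate for the CUB fp16 path.
# CUDA reports its version as an integer: major * 1000 + minor * 10,
# so CUDA 9.2 is 9020 and CUDA 8.0 is 8000.
def cub_fp16_enabled(cuda_runtime_version: int) -> bool:
    """Return True if the fp16 CUB path should be dispatched."""
    return cuda_runtime_version >= 9000  # drop CUDA 8.x and earlier

print(cub_fp16_enabled(9020))  # True  (CUDA 9.2)
print(cub_fp16_enabled(8000))  # False (CUDA 8.0)
```

The device-side compute-capability check would still have to live in the C++/CUDA code, which is exactly the split discussed below.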

@y1r
Contributor Author

y1r commented Nov 4, 2019

Thank you for the quick response.

  1. As far as I can tell -- correct me if I am wrong -- the support of half precision (__half) begins only in recent CUDA versions (since 9.0 I suppose? Not sure why CuPy wants to bundle cuda_fp16.h for 9.2 and above...). If we want to enable CUB, we should test if __CUDACC_VER_MAJOR__ >= 9, and drop the support for, say, CUDA 8.0. And this needs to be done in both the C++ and the Python codes.

Half precision has been supported since CUDA 7.5 (Compute Capability 5.3 and above). My concern is that CUDA_VERSION (i.e. __CUDACC_VER_MAJOR__) can be tested in host code (the dtype dispatcher), but the compute capability can only be tested in device code (i.e. inside CUB), so we cannot handle this check with the dtype dispatcher alone. What do you think?

  2. For the CUDA versions that support __half, CUB already specialized a NumericTraits for it, so I don't see why we need to include <cupy/carray.cuh> and to duplicate the effort. Can you test if the native CUB support for __half just works?

Yes, it works in my environment (recent CUDA version and compute capability: V100 SXM2 16GB).

  3. I am interested in the performance gain for float16. Can you modify the test in Add tests for cupy.cuda.cub #2598 and show the performance comparison for it? (You just need to modify the strings lLfdFD to lLefdFD and bhilBHILfdFD to bhilBHILefdFD there, i.e., adding 'e'. Run it using python -m pytest -rA -v ... to see the full stdout.)

My implementation cannot pass the summation part of the TestCUBreduction_param_(0,1,2,3) tests.
min/max work correctly, so the problem is the tolerance.
The performance:
I think fp16 reduction and fp32 reduction have about the same performance.
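A tolerance failure on summation is the expected failure mode for fp16: with only 11 significand bits, a float16 accumulator stops changing once the running sum is large relative to the addends. A minimal NumPy illustration of the numeric effect (this is not the CuPy/CUB code path, just the arithmetic):

```python
import numpy as np

# Accumulate 4096 ones in a float16 scalar. Above 2048 = 2**11, the
# spacing between adjacent float16 values is 2, so 2048 + 1 rounds
# back to 2048 and the sum stalls there.
acc = np.float16(0.0)
for _ in range(4096):
    acc = np.float16(acc + np.float16(1.0))

print(acc)                 # 2048.0, not 4096.0
print(np.float32(1.0) * 4096)  # 4096.0 when accumulated exactly
```

This is why fp16 reductions are often accumulated in a wider type (or tested with a looser tolerance), while min/max are exact in any precision.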

___________________________________________________ TestCUBperformance_param_0.test_cub_argmax_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.53638; CuPy:    0.44403 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int8
CUB:    0.46560; CuPy:    0.43827 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int16
CUB:    0.46755; CuPy:    0.44243 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int32
CUB:    0.46566; CuPy:    0.44173 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int64
CUB:    0.45581; CuPy:    0.44032 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint8
CUB:    0.46182; CuPy:    0.44304 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint16
CUB:    0.49322; CuPy:    0.46234 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint32
CUB:    0.47968; CuPy:    0.43965 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint64
CUB:    0.46576; CuPy:    0.44080 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float16
CUB:    0.45779; CuPy:    0.43766 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float32
CUB:    0.46157; CuPy:    0.43888 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float64
CUB:    0.46051; CuPy:    0.44883 (ms), for 20 runs, shape=(1024,), axis=None, dtype=complex64
CUB:    0.49443; CuPy:    0.47430 (ms), for 20 runs, shape=(1024,), axis=None, dtype=complex128
___________________________________________________ TestCUBperformance_param_0.test_cub_argmin_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.49427; CuPy:    0.45098 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int8
CUB:    0.49088; CuPy:    0.44838 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int16
CUB:    0.47494; CuPy:    0.44950 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int32
CUB:    0.48454; CuPy:    0.45344 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int64
CUB:    0.48560; CuPy:    0.45139 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint8
CUB:    0.47238; CuPy:    0.44330 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint16
CUB:    0.46390; CuPy:    0.45165 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint32
CUB:    0.46810; CuPy:    0.44618 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint64
CUB:    0.46618; CuPy:    0.44614 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float16
CUB:    0.46320; CuPy:    0.44643 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float32
CUB:    0.46490; CuPy:    0.44960 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float64
CUB:    0.46752; CuPy:    0.45152 (ms), for 20 runs, shape=(1024,), axis=None, dtype=complex64
CUB:    0.50115; CuPy:    0.48397 (ms), for 20 runs, shape=(1024,), axis=None, dtype=complex128
____________________________________________________ TestCUBperformance_param_0.test_cub_max_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.42752; CuPy:  182.88009 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int8
CUB:    0.46131; CuPy:  181.48398 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int16
CUB:    0.42496; CuPy:  179.44694 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int32
CUB:    0.46544; CuPy:  187.05462 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int64
CUB:    0.43245; CuPy:  183.22202 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint8
CUB:    0.45005; CuPy:  181.92096 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint16
CUB:    0.44909; CuPy:  181.16878 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint32
CUB:    0.47280; CuPy:  185.60614 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint64
CUB:    0.47437; CuPy:  186.01673 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float16
CUB:    0.47293; CuPy:  183.08029 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float32
CUB:    0.48160; CuPy:  185.79357 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float64
CUB:    0.48534; CuPy:  194.45978 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex64
CUB:    0.49210; CuPy:  198.73137 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex128
____________________________________________________ TestCUBperformance_param_0.test_cub_min_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.46784; CuPy:  181.68195 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int8
CUB:    0.45120; CuPy:  184.16393 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int16
CUB:    0.45786; CuPy:  181.88711 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int32
CUB:    0.47843; CuPy:  187.09404 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int64
CUB:    0.33104; CuPy:  176.07049 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint8
CUB:    0.35632; CuPy:  176.40704 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint16
CUB:    0.35485; CuPy:  173.97667 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint32
CUB:    0.35962; CuPy:  180.57930 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint64
CUB:    0.37786; CuPy:  178.41405 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float16
CUB:    0.38483; CuPy:  174.80364 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float32
CUB:    0.37920; CuPy:  184.11764 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float64
CUB:    0.38256; CuPy:  187.11802 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex64
CUB:    0.38874; CuPy:  191.57011 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex128
____________________________________________________ TestCUBperformance_param_0.test_cub_sum_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.39245; CuPy:  174.21158 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int64
CUB:    0.35926; CuPy:  171.72423 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint64
CUB:    0.38867; CuPy:  149.16666 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float16
CUB:    0.37680; CuPy:  148.32182 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float32
CUB:    0.40358; CuPy:  170.23859 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float64
CUB:    0.38755; CuPy:  174.08566 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex64
CUB:    0.38851; CuPy:  174.96090 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex128
___________________________________________________ TestCUBperformance_param_1.test_cub_argmax_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    3.36477; CuPy:    3.31962 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int8
CUB:    3.19670; CuPy:    3.17776 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int16
CUB:    3.15034; CuPy:    3.14899 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int32
CUB:    3.38701; CuPy:    3.36918 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int64
CUB:    3.28326; CuPy:    3.28314 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint8
CUB:    3.19168; CuPy:    3.19203 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint16
CUB:    3.14339; CuPy:    3.13946 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint32
CUB:    3.37674; CuPy:    3.37280 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint64
CUB:    3.78746; CuPy:    3.78621 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float16
CUB:    3.51754; CuPy:    3.51306 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float32
CUB:    3.72109; CuPy:    3.71888 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float64
CUB:    4.24086; CuPy:    4.22317 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=complex64
CUB:    4.03082; CuPy:    4.01472 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=complex128
___________________________________________________ TestCUBperformance_param_1.test_cub_argmin_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    2.93709; CuPy:    2.89360 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int8
CUB:    2.82720; CuPy:    2.78515 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int16
CUB:    2.75728; CuPy:    2.76499 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int32
CUB:    2.97498; CuPy:    2.95437 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int64
CUB:    2.86819; CuPy:    2.89011 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint8
CUB:    2.80438; CuPy:    2.79872 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint16
CUB:    2.75459; CuPy:    2.74749 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint32
CUB:    2.96234; CuPy:    2.95626 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint64
CUB:    3.30410; CuPy:    3.29382 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float16
CUB:    3.06954; CuPy:    3.07091 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float32
CUB:    3.25411; CuPy:    3.23779 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float64
CUB:    3.68048; CuPy:    3.68320 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=complex64
CUB:    3.99610; CuPy:    3.99715 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=complex128
____________________________________________________ TestCUBperformance_param_1.test_cub_max_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.74717; CuPy:    2.16448 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int8
CUB:    0.43488; CuPy:    2.16003 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int16
CUB:    0.42330; CuPy:    2.17318 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int32
CUB:    0.43594; CuPy:    2.45549 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int64
CUB:    0.42259; CuPy:    2.15184 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint8
CUB:    0.42362; CuPy:    2.12723 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint16
CUB:    0.42509; CuPy:    2.16886 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint32
CUB:    0.43232; CuPy:    2.45277 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint64
CUB:    0.43968; CuPy:    2.93117 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float16
CUB:    0.44861; CuPy:    2.67808 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float32
CUB:    0.44102; CuPy:    3.04653 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float64
CUB:    0.44710; CuPy:    3.81840 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex64
CUB:    0.46397; CuPy:    4.06406 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex128
CUB:    2.38502; CuPy:  197.75564 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int8
CUB:    2.34384; CuPy:  195.23818 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int16
CUB:    2.30403; CuPy:  195.72090 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int32
CUB:    2.27952; CuPy:  195.45110 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int64
CUB:    2.38899; CuPy:  184.30937 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint8
CUB:    2.35277; CuPy:  186.84455 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint16
CUB:    2.35219; CuPy:  185.43959 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint32
CUB:    2.25824; CuPy:  190.19331 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint64
CUB:    2.34480; CuPy:  160.54877 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float16
CUB:    2.22499; CuPy:  188.62803 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float32
CUB:    2.28282; CuPy:  191.49294 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float64
CUB:    2.35168; CuPy:  199.99507 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex64
CUB:    2.13699; CuPy:  205.91600 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex128
____________________________________________________ TestCUBperformance_param_1.test_cub_min_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.74886; CuPy:    2.17565 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int8
CUB:    0.43926; CuPy:    2.14573 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int16
CUB:    0.41606; CuPy:    2.16694 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int32
CUB:    0.44640; CuPy:    2.45328 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int64
CUB:    0.40560; CuPy:    2.14442 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint8
CUB:    0.41987; CuPy:    2.34758 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint16
CUB:    0.41421; CuPy:    2.46688 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint32
CUB:    0.42144; CuPy:    2.79453 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint64
CUB:    0.41859; CuPy:    3.34858 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float16
CUB:    0.42816; CuPy:    3.06381 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float32
CUB:    0.42864; CuPy:    3.49830 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float64
CUB:    0.44739; CuPy:    4.18598 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex64
CUB:    0.50118; CuPy:    4.46157 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex128
CUB:    2.18883; CuPy:  187.48228 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int8
CUB:    2.33629; CuPy:  186.00631 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int16
CUB:    2.67222; CuPy:  194.68614 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int32
CUB:    2.24550; CuPy:  189.38477 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int64
CUB:    2.37840; CuPy:  187.92944 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint8
CUB:    2.37398; CuPy:  187.73958 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint16
CUB:    2.32874; CuPy:  185.54112 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint32
CUB:    2.31229; CuPy:  189.36177 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint64
CUB:    2.69846; CuPy:  196.61235 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float16
CUB:    2.71011; CuPy:  195.82343 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float32
CUB:    2.64083; CuPy:  197.55507 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float64
CUB:    2.54042; CuPy:  208.71427 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex64
CUB:    2.50986; CuPy:  210.69037 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex128
____________________________________________________ TestCUBperformance_param_1.test_cub_sum_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.75827; CuPy:    2.65552 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int64
CUB:    0.54611; CuPy:    2.64861 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint64
CUB:    0.55434; CuPy:    2.42387 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float16
CUB:    0.52058; CuPy:    2.47011 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float32
CUB:    0.53376; CuPy:    2.65155 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float64
CUB:    0.52707; CuPy:    2.65178 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex64
CUB:    0.55286; CuPy:    2.93974 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex128
CUB:    2.41578; CuPy:  189.06758 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int64
CUB:    2.57014; CuPy:  187.65114 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint64
CUB:    2.77382; CuPy:  160.22694 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float16
CUB:    2.72045; CuPy:  158.73507 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float32
CUB:    2.65936; CuPy:  187.65065 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float64
CUB:    2.58074; CuPy:  191.26474 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex64
CUB:    2.47718; CuPy:  190.82137 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex128
___________________________________________________ TestCUBperformance_param_2.test_cub_argmax_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:  144.74288; CuPy:  141.11178 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int8
CUB:  131.39590; CuPy:  131.29507 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int16
CUB:  132.54224; CuPy:  132.50589 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int32
CUB:  140.44058; CuPy:  140.42630 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int64
CUB:  126.29322; CuPy:  126.32611 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint8
CUB:  132.92016; CuPy:  132.77961 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint16
CUB:  132.52570; CuPy:  132.50493 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint32
CUB:  140.42125; CuPy:  140.41130 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint64
CUB:  147.42102; CuPy:  147.37606 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float16
CUB:  141.54406; CuPy:  141.52698 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float32
CUB:  149.32192; CuPy:  149.27712 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float64
CUB:  161.37901; CuPy:  161.32704 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=complex64
CUB:  175.62746; CuPy:  175.64835 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=complex128
___________________________________________________ TestCUBperformance_param_2.test_cub_argmin_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:  131.97466; CuPy:  132.08854 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int8
CUB:  131.31168; CuPy:  131.30867 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int16
CUB:  132.50131; CuPy:  132.48125 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int32
CUB:  140.47920; CuPy:  140.56077 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int64
CUB:  126.29142; CuPy:  126.32806 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint8
CUB:  132.76314; CuPy:  132.79702 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint16
CUB:  132.58889; CuPy:  132.49034 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint32
CUB:  140.29738; CuPy:  140.41062 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint64
CUB:  147.43299; CuPy:  147.43610 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float16
CUB:  141.54595; CuPy:  141.49002 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float32
CUB:  149.38880; CuPy:  149.31869 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float64
CUB:  161.41210; CuPy:  161.38314 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=complex64
CUB:  175.68266; CuPy:  175.64086 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=complex128
____________________________________________________ TestCUBperformance_param_2.test_cub_max_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.79536; CuPy:  109.64730 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int8
CUB:    0.82611; CuPy:  114.18134 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int16
CUB:    1.19994; CuPy:  114.68288 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int32
CUB:    2.17834; CuPy:  124.61891 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int64
CUB:    0.69245; CuPy:  108.88109 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint8
CUB:    0.82675; CuPy:  113.64685 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint16
CUB:    1.19978; CuPy:  114.63072 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint32
CUB:    2.17706; CuPy:  124.69686 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint64
CUB:    0.82605; CuPy:  138.19821 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float16
CUB:    1.22186; CuPy:  132.35994 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float32
CUB:    2.26493; CuPy:  142.84656 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float64
CUB:    2.14358; CuPy:  165.03171 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex64
CUB:    3.76851; CuPy:  176.57674 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex128
CUB:    2.58307; CuPy:    1.78470 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int8
CUB:    2.70896; CuPy:    1.87226 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int16
CUB:    3.05891; CuPy:    1.94067 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int32
CUB:    3.87389; CuPy:    2.39798 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int64
CUB:    2.58166; CuPy:    1.77190 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint8
CUB:    2.69571; CuPy:    1.87139 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint16
CUB:    3.10067; CuPy:    1.94720 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint32
CUB:    3.93286; CuPy:    2.40890 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint64
CUB:    2.75085; CuPy:    2.06138 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float16
CUB:    3.12355; CuPy:    2.06374 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float32
CUB:    3.90243; CuPy:    2.46022 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float64
CUB:    3.94880; CuPy:    2.58288 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex64
CUB:    5.54752; CuPy:    3.75248 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex128
CUB:    4.09427; CuPy:    5.20714 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int8
CUB:    4.07987; CuPy:    5.11590 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int16
CUB:    3.77520; CuPy:    5.10714 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int32
CUB:    3.81616; CuPy:    7.35078 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int64
CUB:    4.01322; CuPy:    5.05654 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint8
CUB:    3.94768; CuPy:    5.07296 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint16
CUB:    3.79437; CuPy:    5.11494 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint32
CUB:    3.90106; CuPy:    7.36154 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint64
CUB:    3.94240; CuPy:    5.87226 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float16
CUB:    3.85715; CuPy:    5.43440 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float32
CUB:    3.86035; CuPy:    7.94909 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float64
CUB:    4.46806; CuPy:    9.52746 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex64
CUB:    5.55808; CuPy:   15.46774 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex128
____________________________________________________ TestCUBperformance_param_2.test_cub_min_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.79475; CuPy:  109.61091 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int8
CUB:    0.82963; CuPy:  114.17046 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int16
CUB:    1.19942; CuPy:  114.63987 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int32
CUB:    2.17830; CuPy:  124.65651 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int64
CUB:    0.68835; CuPy:  108.92672 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint8
CUB:    0.82627; CuPy:  113.65392 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint16
CUB:    1.22131; CuPy:  114.66429 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint32
CUB:    2.17200; CuPy:  124.64541 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint64
CUB:    0.83299; CuPy:  138.14150 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float16
CUB:    1.23210; CuPy:  132.35865 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float32
CUB:    2.26643; CuPy:  142.79750 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float64
CUB:    2.13994; CuPy:  159.51024 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex64
CUB:    3.75984; CuPy:  169.65226 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex128
CUB:    2.57882; CuPy:    1.78784 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int8
CUB:    2.69850; CuPy:    1.87146 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int16
CUB:    3.06496; CuPy:    1.95360 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int32
CUB:    3.88205; CuPy:    2.39389 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int64
CUB:    2.57718; CuPy:    1.76109 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint8
CUB:    2.70675; CuPy:    1.87229 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint16
CUB:    3.06662; CuPy:    1.94438 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint32
CUB:    3.85456; CuPy:    2.39174 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint64
CUB:    2.86250; CuPy:    2.08182 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float16
CUB:    3.20019; CuPy:    2.09178 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float32
CUB:    3.92970; CuPy:    2.49178 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float64
CUB:    3.93955; CuPy:    2.57405 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex64
CUB:    5.44592; CuPy:    3.74381 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex128
CUB:    4.08224; CuPy:    5.20723 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int8
CUB:    4.06861; CuPy:    5.11341 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int16
CUB:    3.79472; CuPy:    5.11277 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int32
CUB:    3.89171; CuPy:    7.37050 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int64
CUB:    4.00422; CuPy:    5.04486 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint8
CUB:    3.95283; CuPy:    5.06582 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint16
CUB:    3.79718; CuPy:    5.12528 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint32
CUB:    3.82822; CuPy:    7.34659 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint64
CUB:    3.92851; CuPy:    5.87062 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float16
CUB:    3.87251; CuPy:    5.43318 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float32
CUB:    3.84627; CuPy:    7.95293 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float64
CUB:    4.56762; CuPy:    9.32144 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex64
CUB:    5.65837; CuPy:   15.50704 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex128
____________________________________________________ TestCUBperformance_param_2.test_cub_sum_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    2.22928; CuPy:  120.01379 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int64
CUB:    2.22182; CuPy:  120.01667 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint64
CUB:    0.86310; CuPy:  110.06707 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float16
CUB:    1.23251; CuPy:  114.50406 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float32
CUB:    2.20778; CuPy:  120.33984 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float64
CUB:    2.02397; CuPy:  120.82045 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex64
CUB:    5.24275; CuPy:  130.40291 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex128
CUB:    3.96653; CuPy:    2.46886 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int64
CUB:    3.98042; CuPy:    2.47629 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint64
CUB:    2.88390; CuPy:    1.92173 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float16
CUB:    3.12998; CuPy:    1.89600 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float32
CUB:    3.90038; CuPy:    2.46147 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float64
CUB:    3.91110; CuPy:    2.46979 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex64
CUB:    5.57315; CuPy:    3.64250 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex128
CUB:    3.84886; CuPy:    4.53485 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int64
CUB:    3.85078; CuPy:    4.53085 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint64
CUB:    3.88851; CuPy:    4.24058 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float16
CUB:    3.71040; CuPy:    4.28352 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float32
CUB:    3.90730; CuPy:    4.43290 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float64
CUB:    3.92058; CuPy:    4.45674 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex64
CUB:    6.23174; CuPy:    4.95997 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex128
___________________________________________________ TestCUBperformance_param_3.test_cub_argmax_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:  527.97558; CuPy:  527.95715 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int8
CUB:  525.55520; CuPy:  525.49603 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int16
CUB:  529.93830; CuPy:  529.75583 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int32
CUB:  562.13997; CuPy:  561.85366 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int64
CUB:  506.65613; CuPy:  506.67002 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint8
CUB:  533.61638; CuPy:  533.39385 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint16
CUB:  529.67911; CuPy:  529.83818 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint32
CUB:  561.93136; CuPy:  562.07837 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint64
CUB:  589.31929; CuPy:  589.33958 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float16
CUB:  566.21408; CuPy:  566.17069 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float32
CUB:  596.60586; CuPy:  596.62045 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float64
CUB:  644.10003; CuPy:  643.99671 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=complex64
CUB:  701.16557; CuPy:  701.17552 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=complex128
___________________________________________________ TestCUBperformance_param_3.test_cub_argmin_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:  528.00950; CuPy:  527.93328 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int8
CUB:  525.51120; CuPy:  525.61997 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int16
CUB:  529.80819; CuPy:  529.69127 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int32
CUB:  561.99734; CuPy:  561.89549 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int64
CUB:  506.76112; CuPy:  506.62019 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint8
CUB:  533.42521; CuPy:  533.53350 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint16
CUB:  529.84861; CuPy:  529.61773 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint32
CUB:  561.93843; CuPy:  561.95257 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint64
CUB:  589.66528; CuPy:  589.50102 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float16
CUB:  566.23200; CuPy:  566.16836 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float32
CUB:  596.59635; CuPy:  596.61255 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float64
CUB:  644.10836; CuPy:  644.16418 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=complex64
CUB:  701.15526; CuPy:  701.25541 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=complex128
____________________________________________________ TestCUBperformance_param_3.test_cub_max_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    1.26438; CuPy:  442.40794 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int8
CUB:    1.96602; CuPy:  459.81024 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int16
CUB:    3.46208; CuPy:  460.51927 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int32
CUB:    6.90336; CuPy:  498.71206 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int64
CUB:    1.24381; CuPy:  439.17524 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint8
CUB:    1.97056; CuPy:  456.38557 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint16
CUB:    3.46784; CuPy:  460.37885 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint32
CUB:    6.90253; CuPy:  498.61936 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint64
CUB:    1.97926; CuPy:  552.34064 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float16
CUB:    3.47296; CuPy:  528.73184 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float32
CUB:    7.05555; CuPy:  570.07392 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float64
CUB:    6.75261; CuPy:  659.06864 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex64
CUB:   13.35094; CuPy:  704.52774 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex128
CUB:   36.11469; CuPy:  134.60166 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int8
CUB:   37.30826; CuPy:  139.34685 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int16
CUB:   22.65994; CuPy:  140.33334 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int32
CUB:   73.78838; CuPy:  147.19219 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int64
CUB:   36.12944; CuPy:  133.76592 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint8
CUB:   37.10637; CuPy:  138.41779 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint16
CUB:   22.70163; CuPy:  140.25296 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint32
CUB:   73.86861; CuPy:  147.29437 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint64
CUB:   39.56518; CuPy:  160.66512 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float16
CUB:   22.66662; CuPy:  156.51443 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float32
CUB:   75.58477; CuPy:  161.60963 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float64
CUB:  105.85056; CuPy:  186.22461 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex64
CUB:  257.67456; CuPy:  204.04099 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex128
CUB:    3.16291; CuPy:    5.38061 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int8
CUB:    3.83200; CuPy:    5.71939 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int16
CUB:    5.34976; CuPy:    5.93520 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int32
CUB:    8.36838; CuPy:    7.89072 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int64
CUB:    3.18112; CuPy:    5.32762 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint8
CUB:    3.81165; CuPy:    5.70621 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint16
CUB:    5.35690; CuPy:    6.37478 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint32
CUB:    8.45866; CuPy:    8.19680 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint64
CUB:    3.92304; CuPy:    7.10093 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float16
CUB:    5.38653; CuPy:    7.01114 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float32
CUB:    8.47747; CuPy:    8.52400 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float64
CUB:    8.57514; CuPy:    9.25642 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex64
CUB:   14.72294; CuPy:   14.28918 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex128
CUB:   10.61638; CuPy:   21.57011 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int8
CUB:   10.65123; CuPy:   21.11382 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int16
CUB:    9.04826; CuPy:   21.07725 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int32
CUB:    8.51146; CuPy:   31.20365 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int64
CUB:   10.38902; CuPy:   20.84662 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint8
CUB:    9.96221; CuPy:   20.87440 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint16
CUB:    9.05613; CuPy:   21.07322 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint32
CUB:    8.51958; CuPy:   31.20442 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint64
CUB:    9.86390; CuPy:   24.57258 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float16
CUB:    9.21978; CuPy:   22.56656 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float32
CUB:    8.58342; CuPy:   33.86915 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float64
CUB:   11.57357; CuPy:   40.93101 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex64
CUB:   15.58432; CuPy:   67.68730 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex128
____________________________________________________ TestCUBperformance_param_3.test_cub_min_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    1.26621; CuPy:  444.85636 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int8
CUB:    1.96550; CuPy:  459.95104 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int16
CUB:    3.47942; CuPy:  460.99185 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int32
CUB:    6.90966; CuPy:  499.00960 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int64
CUB:    1.23322; CuPy:  439.44874 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint8
CUB:    1.96218; CuPy:  456.77056 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint16
CUB:    3.46624; CuPy:  460.78016 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint32
CUB:    6.90621; CuPy:  498.87200 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint64
CUB:    1.97654; CuPy:  553.05798 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float16
CUB:    3.48227; CuPy:  529.49190 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float32
CUB:    7.05603; CuPy:  570.83306 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float64
CUB:    6.74070; CuPy:  636.96307 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex64
CUB:   13.34067; CuPy:  678.18134 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex128
CUB:   36.12374; CuPy:  134.64118 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int8
CUB:   37.29402; CuPy:  139.42710 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int16
CUB:   22.68195; CuPy:  140.33981 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int32
CUB:   73.98429; CuPy:  147.24838 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int64
CUB:   36.25930; CuPy:  133.89366 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint8
CUB:   37.17056; CuPy:  138.46640 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint16
CUB:   22.74227; CuPy:  140.32512 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint32
CUB:   73.84886; CuPy:  147.25933 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint64
CUB:   39.61910; CuPy:  160.72874 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float16
CUB:   22.60314; CuPy:  156.61958 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float32
CUB:   75.53373; CuPy:  161.67779 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float64
CUB:   96.40064; CuPy:  179.52109 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex64
CUB:  238.31469; CuPy:  196.89110 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex128
CUB:    3.11674; CuPy:    5.35654 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int8
CUB:    3.85088; CuPy:    5.73456 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int16
CUB:    5.32435; CuPy:    5.93517 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int32
CUB:    8.36749; CuPy:    7.88176 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int64
CUB:    3.09712; CuPy:    5.31069 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint8
CUB:    3.83658; CuPy:    5.70464 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint16
CUB:    5.46026; CuPy:    5.93856 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint32
CUB:    8.38349; CuPy:    7.89491 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint64
CUB:    3.94064; CuPy:    6.49104 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float16
CUB:    5.37296; CuPy:    6.47002 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float32
CUB:    8.40042; CuPy:    8.15654 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float64
CUB:    8.52947; CuPy:    8.55504 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex64
CUB:   14.65722; CuPy:   13.84944 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex128
CUB:    9.80493; CuPy:   19.23325 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int8
CUB:    9.76560; CuPy:   18.86701 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int16
CUB:    8.52282; CuPy:   18.83930 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int32
CUB:    8.54643; CuPy:   27.71821 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int64
CUB:    9.54755; CuPy:   18.60224 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint8
CUB:    9.21558; CuPy:   18.66512 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint16
CUB:    9.24413; CuPy:   21.67030 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint32
CUB:    8.62368; CuPy:   32.13984 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint64
CUB:   10.03130; CuPy:   25.24800 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float16
CUB:    9.42534; CuPy:   23.17302 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float32
CUB:    8.57414; CuPy:   34.83206 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float64
CUB:   12.11955; CuPy:   41.14906 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex64
CUB:   15.42637; CuPy:   63.56301 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex128
____________________________________________________ TestCUBperformance_param_3.test_cub_sum_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    6.97610; CuPy:  480.08227 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int64
CUB:    6.94179; CuPy:  480.14163 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint64
CUB:    1.99718; CuPy:  447.76698 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float16
CUB:    3.47312; CuPy:  459.71120 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float32
CUB:    6.92179; CuPy:  481.45302 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float64
CUB:    6.77766; CuPy:  483.35837 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex64
CUB:   14.80320; CuPy:  521.10055 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex128
CUB:   70.57578; CuPy:  147.89293 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int64
CUB:   70.54896; CuPy:  147.95360 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint64
CUB:   42.92646; CuPy:  137.99280 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float16
CUB:   21.15168; CuPy:  140.78490 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float32
CUB:   70.59066; CuPy:  145.67962 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float64
CUB:   72.87347; CuPy:  145.48374 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex64
CUB:  266.10330; CuPy:  152.96611 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex128
CUB:    8.38550; CuPy:    7.79805 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int64
CUB:    8.43165; CuPy:    7.81923 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint64
CUB:    3.92800; CuPy:    5.62800 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float16
CUB:    5.38384; CuPy:    5.87661 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float32
CUB:    8.45389; CuPy:    7.76954 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float64
CUB:    8.45251; CuPy:    7.78051 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex64
CUB:   14.53507; CuPy:   13.17850 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex128
CUB:    8.75450; CuPy:   16.49021 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int64
CUB:    8.87261; CuPy:   16.48438 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint64
CUB:    8.94595; CuPy:   15.36790 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float16
CUB:    7.98595; CuPy:   15.48554 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float32
CUB:    8.58381; CuPy:   16.06954 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float64
CUB:    8.83651; CuPy:   16.16109 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex64
CUB:   17.87510; CuPy:   18.14227 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex128

@leofang
Member

leofang commented Nov 4, 2019

Thanks for the quick tests.

> Half precision is supported since CUDA 7.5 (Compute Capability 5.3 and above). My concern is that CUDA_VERSION (i.e. __CUDACC_VER_MAJOR__) can be tested in host code (the dtype dispatcher), but the compute capability can only be tested in device code (i.e. inside CUB), so we cannot handle this check with the dtype dispatcher alone. What do you think?

I think CUB chose __CUDACC_VER_MAJOR__ >= 9 for this reason:

> The new behavior is implemented by using C++ operator overloads defined in the cuda_fp16.h header file. Note that the new types are incompatible with the previous CUDA 8.0 definitions.

(from https://docs.nvidia.com/cuda/archive/9.0/cuda-toolkit-release-notes/index.html). So it is possible that CUB relies on this behavior, but unfortunately I don't have a CUDA 8.0 environment to test this myself.

I think for the dtype dispatcher in cupy_cub.cu, you can just test __CUDACC_VER_MAJOR__ >= 9 for case CUPY_CUB_FLOAT16. In cub.pyx, you can then add a runtime check to decide whether to add numpy.float16 to the supported dtype lists. This should be enough to ensure we don't invoke CUB for float16 on unsupported setups.

> For the CUDA versions that support __half, CUB already specialized a NumericTraits for it, so I don't see why we need to include <cupy/carray.cuh> and to duplicate the effort. Can you test if the native CUB support for __half just works?

> Yes, it works in my environment (recent CUDA version and compute capability).

Oh, that's great! So you can remove the template specializations, get rid of cupy/carray.cuh, and just use __half? Can you push the commit?

> My implementation cannot pass the TestCUBreduction_param_(0,1,2,3) tests for summation.
> min/max work correctly, so the problem is the tolerance.

Yeah, that's possible; don't worry about correctness. It's hard to get float16 sums to agree within tolerance. I was just curious how it performs compared to CuPy's reduction kernel.

> I think fp16 reduction and fp32 reduction have the same performance.

The situation is mixed. I guess it's likely that the test sizes are too small to show the real performance of fp16.
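(For the record, the float16 sum disagreement is easy to reproduce even on the CPU, independently of CUB: float16 has an 11-bit significand, so a naive running sum gets stuck once the accumulator reaches 2048, while accumulating in a wider type does not. A small NumPy illustration, not taken from this PR:)

```python
import numpy as np

# Naive left-to-right accumulation in float16: once the running sum
# reaches 2048, adding 1.0 no longer changes it (2049 is not
# representable in float16, and the tie rounds back to the even
# neighbor, 2048).
acc = np.float16(0.0)
for _ in range(3000):
    acc = np.float16(acc + np.float16(1.0))
print(acc)    # 2048.0, not 3000.0

# Accumulating in float32 gives the exact answer, since 3000 < 2**24.
acc32 = np.float32(0.0)
for _ in range(3000):
    acc32 += np.float32(1.0)
print(acc32)  # 3000.0
```

Different reduction orders or accumulator widths therefore give visibly different float16 sums, which is exactly the kind of tolerance failure seen in the summation tests.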

@y1r
Contributor Author

y1r commented Nov 4, 2019

I've updated this PR based on your comment.
Changes:

  • Use __half instead of float16.
  • Guard the __half code with #if __CUDACC_VER_MAJOR__ >= 9 && (__CUDA_ARCH__ >= 530 || !defined(__CUDA_ARCH__)). Without the __CUDA_ARCH__ check, building CuPy for cc 3.0 fails.
  • Add a runtime check of the compute capability in cub.pyx.
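(The runtime side of this gate can be summarized as follows. This is a hedged sketch, not the PR's actual cub.pyx code; the function name is made up, and the inputs follow the usual conventions: CUDA_VERSION as an integer like 9000 for CUDA 9.0, and the compute capability as a string like '53', which is what cupy.cuda.Device().compute_capability returns.)

```python
# Sketch of the float16 eligibility check: __half arithmetic needs
# CUDA >= 9.0 on the compile-time side and a device of compute
# capability >= 5.3 on the runtime side. Only when both hold should
# numpy.float16 be added to the CUB-supported dtype lists.
def cub_supports_float16(cuda_version, compute_capability):
    return cuda_version >= 9000 and int(compute_capability) >= 53
```

So a cc 3.0 device (compute_capability '30') is rejected at runtime even when the toolkit itself is new enough.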

@y1r
Contributor Author

y1r commented Nov 4, 2019

Please hold off on merging; the current head cannot use CUB. I'll investigate this.

@y1r
Contributor Author

y1r commented Nov 4, 2019

I fixed the issue.

Member

@leofang leofang left a comment


@y1r Thanks for the fast updates and for catching the issue of supported CUDA architectures. I wasn't aware of it (I knew almost nothing about half precision...). I just left a few comments.

leofang added a commit to leofang/cupy that referenced this pull request Nov 4, 2019
Member

@leofang leofang left a comment


LGTM as long as test_cub.py can be passed (apart from the sum failure for float16 of course). @kmaehashi probably can provide you better, more pythonic ways for making the needed changes, so please expect inputs from him. 🙂 Thank you again for working on this @y1r.

@y1r y1r mentioned this pull request Nov 6, 2019


cdef _cub_reduce_dtype_compatible(x_dtype, int op, dtype=None):
dev_id = device.get_device_id()
Member


@y1r I just found that every cupy.ndarray has an attribute device:

>>> import cupy as cp
>>> a = cp.arange(10)
>>> a.device
<CUDA Device 0>
>>> a.device.id
0

...and the underlying MemoryPointer also stores this information:

>>> a.data
<cupy.cuda.memory.MemoryPointer object at 0x7f5b41c9e120>             
>>> a.data.device
<CUDA Device 0>
>>> a.data.device_id
0

Do you think it'd be easier to just use it? It'd involve changing a few function signatures from bottom up, so that the dev id can be passed from a higher level down. Could be a bit ugly.

I am not saying we should do this, just want to share this new finding. Should have known this earlier...😛

@jakirkham
Member

cc @anaruse (who may be interested in this)

@leofang
Member

leofang commented Nov 18, 2019

@y1r Note that after the bug fix #2636 you need to provide template specializations to handle fp16 NaNs. It should be fairly straightforward.

@y1r
Contributor Author

y1r commented Nov 18, 2019

@leofang Thanks for the info! I'll fix.

@y1r
Contributor Author

y1r commented Dec 3, 2019

@leofang I've updated my PR based on your suggestion.
isnan() is not available for __half, so I used __hisnan in device code and a cast to float in host code (I think the performance degradation is small).
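(For reference, both approaches test the same property: a float16 value is NaN exactly when its exponent bits are all ones and its mantissa is nonzero, and casting to float preserves NaN-ness. A bit-level sketch in Python, independent of the PR's C++ code:)

```python
# A float16 bit pattern (as a 16-bit integer) is NaN iff its exponent
# field (bits 10-14, mask 0x7C00) is all ones and its mantissa
# (mask 0x03FF) is nonzero; an all-ones exponent with a zero mantissa
# is +/-inf, not NaN. The sign bit does not matter.
def half_is_nan(bits):
    return (bits & 0x7C00) == 0x7C00 and (bits & 0x03FF) != 0

print(half_is_nan(0x7E00))  # True  (canonical float16 quiet NaN)
print(half_is_nan(0x7C00))  # False (+inf)
```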

@leofang
Member

leofang commented Dec 3, 2019

@grlee77 Given that your recent refactoring work on CUB just went into upstream, could you help review this PR?

@leofang
Member

leofang commented Dec 3, 2019

Given that your recent refactoring work on CUB just went into upstream

Sorry, it was your sparse matrix support that got merged, not the refactoring...

@y1r
Contributor Author

y1r commented Dec 9, 2019

@leofang Refactored my implementation. Could you give me a review again?

Member

@leofang leofang left a comment


LGTM except for one nitpick. Thank you @y1r for the improvements!

Member

@leofang leofang left a comment


LGTM!

@kmaehashi kmaehashi added this to the v8.0.0a1 milestone Jan 28, 2020
@kmaehashi kmaehashi assigned niboshi and unassigned kmaehashi Feb 6, 2020
@niboshi
Member

niboshi commented Feb 10, 2020

Sorry for the delay.
LGTM, but could you resolve the conflict?

@leofang
Member

leofang commented Feb 12, 2020

@y1r If you need help to resolve conflicts, you can add me as a collaborator to your repo.

@niboshi niboshi added the st:awaiting-author Awaiting response from author label Feb 13, 2020
@emcastillo emcastillo modified the milestones: v8.0.0a1, v8.0.0b1 Feb 14, 2020
@y1r
Contributor Author

y1r commented Feb 16, 2020

Thank you for checking my PR. I'll resolve the conflict soon. Sorry for the late response.

@y1r
Contributor Author

y1r commented Feb 17, 2020

@niboshi @leofang I updated my PR. Please review this.

@niboshi niboshi added cat:enhancement Improvements to existing features st:test-and-merge (deprecated) Ready to merge after test pass. and removed st:awaiting-author Awaiting response from author labels Feb 18, 2020
@niboshi
Member

niboshi commented Feb 18, 2020

Thank you!
Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 7bc6d62:

@chainer-ci
Member

Jenkins CI test (for commit 7bc6d62, target branch master) succeeded!
