
Add fp16 support of CUB #2600

Merged
mergify[bot] merged 13 commits into cupy:master from y1r:cub_fp16_support
Feb 18, 2020

Conversation

@y1r
Contributor

y1r commented Nov 4, 2019

Currently, CuPy's CUB binding doesn't support fp16, even though CUB itself does.

I implemented fp16 support in the CUB binding by using the float16 class from carray.cuh.
I am not sure whether this is the right class to use for this implementation, so please give me comments.

@leofang
Member

leofang left a comment


Thanks a lot for filling in this missing support so quickly. I wanted to try it myself, but decided it wasn't my priority.😅

I have three questions about this PR:

  1. As far as I can tell -- correct me if I am wrong -- support for half precision (__half) begins only in recent CUDA versions (since 9.0, I suppose? Not sure why CuPy wants to bundle cuda_fp16.h for 9.2 and above...). If we want to enable this in CUB, we should test whether __CUDACC_VER_MAJOR__ >= 9 and drop support for, say, CUDA 8.0. This needs to be done in both the C++ and the Python code.
  2. For the CUDA versions that support __half, CUB already specializes a NumericTraits for it, so I don't see why we need to include <cupy/carray.cuh> and duplicate the effort. Can you test whether the native CUB support for __half just works?
  3. I am interested in the performance gain for float16. Can you modify the test in #2598 and show the performance comparison for it? (You just need to modify the strings lLfdFD to lLefdFD and bhilBHILfdFD to bhilBHILefdFD there, i.e., adding 'e'. Run it using python -m pytest -rA -v ... to see the full stdout.)

Thanks again.
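The host-side half of the gate in question 1 could be sketched as follows. This is illustrative only: the helper name `cub_fp16_enabled` and the exact threshold are hypothetical, not CuPy's actual dispatcher code; only the CUDA version-encoding convention (major * 1000 + minor * 10, e.g. 9020 for CUDA 9.2) is assumed.

```python
# Hypothetical sketch of a host-side (Python) gate for the CUB fp16 path.
# CUDA reports its version as an integer: major * 1000 + minor * 10,
# so CUDA 9.2 is 9020 and CUDA 8.0 is 8000.
def cub_fp16_enabled(cuda_runtime_version: int) -> bool:
    """Return True if the fp16 CUB path should be dispatched."""
    return cuda_runtime_version >= 9000  # drop CUDA 8.x and earlier

print(cub_fp16_enabled(9020))  # True  (CUDA 9.2)
print(cub_fp16_enabled(8000))  # False (CUDA 8.0)
```

The device-side compute-capability check would still have to live in the C++/CUDA code, which is exactly the split discussed below.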

@y1r
Contributor Author

y1r commented Nov 4, 2019

Thank you for the quick response.

  1. As far as I can tell -- correct me if I am wrong -- the support of half precision (__half) begins only in recent CUDA versions (since 9.0 I suppose? Not sure why CuPy wants to bundle cuda_fp16.h for 9.2 and above...). If we want to enable CUB, we should test if __CUDACC_VER_MAJOR__ >= 9, and drop the support for, say, CUDA 8.0. And this needs to be done in both the C++ and the Python codes.

Half precision has been supported since CUDA 7.5 (Compute Capability 5.3 and above). My concern is that CUDA_VERSION (i.e. __CUDACC_VER_MAJOR__) can be tested in host code (the dtype dispatcher), but the compute capability can only be tested in device code (i.e. inside CUB), so we cannot handle this check with the dtype dispatcher alone. What do you think?

  2. For the CUDA versions that support __half, CUB already specialized a NumericTraits for it, so I don't see why we need to include <cupy/carray.cuh> and to duplicate the effort. Can you test if the native CUB support for __half just works?

Yes, it works in my environment (recent CUDA version and compute capability: V100 SXM2 16GB).

  3. I am interested in the performance gain for float16. Can you modify the test in Add tests for cupy.cuda.cub #2598 and show the performance comparison for it? (You just need to modify the strings lLfdFD to lLefdFD and bhilBHILfdFD to bhilBHILefdFD there, i.e., adding 'e'. Run it using python -m pytest -rA -v ... to see the full stdout.)

My implementation cannot pass the summation part of the TestCUBreduction_param_(0,1,2,3) tests.
min/max work correctly, so the problem is the tolerance.
The performance:
I think fp16 reduction and fp32 reduction have about the same performance.
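A tolerance failure on summation is the expected failure mode for fp16: with only 11 significand bits, a float16 accumulator stops changing once the running sum is large relative to the addends. A minimal NumPy illustration of the numeric effect (this is not the CuPy/CUB code path, just the arithmetic):

```python
import numpy as np

# Accumulate 4096 ones in a float16 scalar. Above 2048 = 2**11, the
# spacing between adjacent float16 values is 2, so 2048 + 1 rounds
# back to 2048 and the sum stalls there.
acc = np.float16(0.0)
for _ in range(4096):
    acc = np.float16(acc + np.float16(1.0))

print(acc)                 # 2048.0, not 4096.0
print(np.float32(1.0) * 4096)  # 4096.0 when accumulated exactly
```

This is why fp16 reductions are often accumulated in a wider type (or tested with a looser tolerance), while min/max are exact in any precision.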

___________________________________________________ TestCUBperformance_param_0.test_cub_argmax_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.53638; CuPy:    0.44403 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int8
CUB:    0.46560; CuPy:    0.43827 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int16
CUB:    0.46755; CuPy:    0.44243 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int32
CUB:    0.46566; CuPy:    0.44173 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int64
CUB:    0.45581; CuPy:    0.44032 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint8
CUB:    0.46182; CuPy:    0.44304 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint16
CUB:    0.49322; CuPy:    0.46234 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint32
CUB:    0.47968; CuPy:    0.43965 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint64
CUB:    0.46576; CuPy:    0.44080 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float16
CUB:    0.45779; CuPy:    0.43766 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float32
CUB:    0.46157; CuPy:    0.43888 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float64
CUB:    0.46051; CuPy:    0.44883 (ms), for 20 runs, shape=(1024,), axis=None, dtype=complex64
CUB:    0.49443; CuPy:    0.47430 (ms), for 20 runs, shape=(1024,), axis=None, dtype=complex128
___________________________________________________ TestCUBperformance_param_0.test_cub_argmin_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.49427; CuPy:    0.45098 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int8
CUB:    0.49088; CuPy:    0.44838 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int16
CUB:    0.47494; CuPy:    0.44950 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int32
CUB:    0.48454; CuPy:    0.45344 (ms), for 20 runs, shape=(1024,), axis=None, dtype=int64
CUB:    0.48560; CuPy:    0.45139 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint8
CUB:    0.47238; CuPy:    0.44330 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint16
CUB:    0.46390; CuPy:    0.45165 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint32
CUB:    0.46810; CuPy:    0.44618 (ms), for 20 runs, shape=(1024,), axis=None, dtype=uint64
CUB:    0.46618; CuPy:    0.44614 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float16
CUB:    0.46320; CuPy:    0.44643 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float32
CUB:    0.46490; CuPy:    0.44960 (ms), for 20 runs, shape=(1024,), axis=None, dtype=float64
CUB:    0.46752; CuPy:    0.45152 (ms), for 20 runs, shape=(1024,), axis=None, dtype=complex64
CUB:    0.50115; CuPy:    0.48397 (ms), for 20 runs, shape=(1024,), axis=None, dtype=complex128
____________________________________________________ TestCUBperformance_param_0.test_cub_max_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.42752; CuPy:  182.88009 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int8
CUB:    0.46131; CuPy:  181.48398 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int16
CUB:    0.42496; CuPy:  179.44694 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int32
CUB:    0.46544; CuPy:  187.05462 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int64
CUB:    0.43245; CuPy:  183.22202 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint8
CUB:    0.45005; CuPy:  181.92096 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint16
CUB:    0.44909; CuPy:  181.16878 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint32
CUB:    0.47280; CuPy:  185.60614 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint64
CUB:    0.47437; CuPy:  186.01673 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float16
CUB:    0.47293; CuPy:  183.08029 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float32
CUB:    0.48160; CuPy:  185.79357 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float64
CUB:    0.48534; CuPy:  194.45978 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex64
CUB:    0.49210; CuPy:  198.73137 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex128
____________________________________________________ TestCUBperformance_param_0.test_cub_min_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.46784; CuPy:  181.68195 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int8
CUB:    0.45120; CuPy:  184.16393 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int16
CUB:    0.45786; CuPy:  181.88711 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int32
CUB:    0.47843; CuPy:  187.09404 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int64
CUB:    0.33104; CuPy:  176.07049 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint8
CUB:    0.35632; CuPy:  176.40704 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint16
CUB:    0.35485; CuPy:  173.97667 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint32
CUB:    0.35962; CuPy:  180.57930 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint64
CUB:    0.37786; CuPy:  178.41405 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float16
CUB:    0.38483; CuPy:  174.80364 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float32
CUB:    0.37920; CuPy:  184.11764 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float64
CUB:    0.38256; CuPy:  187.11802 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex64
CUB:    0.38874; CuPy:  191.57011 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex128
____________________________________________________ TestCUBperformance_param_0.test_cub_sum_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.39245; CuPy:  174.21158 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=int64
CUB:    0.35926; CuPy:  171.72423 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=uint64
CUB:    0.38867; CuPy:  149.16666 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float16
CUB:    0.37680; CuPy:  148.32182 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float32
CUB:    0.40358; CuPy:  170.23859 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=float64
CUB:    0.38755; CuPy:  174.08566 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex64
CUB:    0.38851; CuPy:  174.96090 (ms), for 20 runs, shape=(1024,), axis=(0,), dtype=complex128
___________________________________________________ TestCUBperformance_param_1.test_cub_argmax_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    3.36477; CuPy:    3.31962 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int8
CUB:    3.19670; CuPy:    3.17776 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int16
CUB:    3.15034; CuPy:    3.14899 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int32
CUB:    3.38701; CuPy:    3.36918 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int64
CUB:    3.28326; CuPy:    3.28314 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint8
CUB:    3.19168; CuPy:    3.19203 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint16
CUB:    3.14339; CuPy:    3.13946 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint32
CUB:    3.37674; CuPy:    3.37280 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint64
CUB:    3.78746; CuPy:    3.78621 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float16
CUB:    3.51754; CuPy:    3.51306 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float32
CUB:    3.72109; CuPy:    3.71888 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float64
CUB:    4.24086; CuPy:    4.22317 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=complex64
CUB:    4.03082; CuPy:    4.01472 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=complex128
___________________________________________________ TestCUBperformance_param_1.test_cub_argmin_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    2.93709; CuPy:    2.89360 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int8
CUB:    2.82720; CuPy:    2.78515 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int16
CUB:    2.75728; CuPy:    2.76499 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int32
CUB:    2.97498; CuPy:    2.95437 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=int64
CUB:    2.86819; CuPy:    2.89011 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint8
CUB:    2.80438; CuPy:    2.79872 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint16
CUB:    2.75459; CuPy:    2.74749 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint32
CUB:    2.96234; CuPy:    2.95626 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=uint64
CUB:    3.30410; CuPy:    3.29382 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float16
CUB:    3.06954; CuPy:    3.07091 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float32
CUB:    3.25411; CuPy:    3.23779 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=float64
CUB:    3.68048; CuPy:    3.68320 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=complex64
CUB:    3.99610; CuPy:    3.99715 (ms), for 20 runs, shape=(256, 1024), axis=None, dtype=complex128
____________________________________________________ TestCUBperformance_param_1.test_cub_max_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.74717; CuPy:    2.16448 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int8
CUB:    0.43488; CuPy:    2.16003 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int16
CUB:    0.42330; CuPy:    2.17318 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int32
CUB:    0.43594; CuPy:    2.45549 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int64
CUB:    0.42259; CuPy:    2.15184 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint8
CUB:    0.42362; CuPy:    2.12723 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint16
CUB:    0.42509; CuPy:    2.16886 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint32
CUB:    0.43232; CuPy:    2.45277 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint64
CUB:    0.43968; CuPy:    2.93117 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float16
CUB:    0.44861; CuPy:    2.67808 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float32
CUB:    0.44102; CuPy:    3.04653 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float64
CUB:    0.44710; CuPy:    3.81840 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex64
CUB:    0.46397; CuPy:    4.06406 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex128
CUB:    2.38502; CuPy:  197.75564 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int8
CUB:    2.34384; CuPy:  195.23818 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int16
CUB:    2.30403; CuPy:  195.72090 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int32
CUB:    2.27952; CuPy:  195.45110 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int64
CUB:    2.38899; CuPy:  184.30937 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint8
CUB:    2.35277; CuPy:  186.84455 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint16
CUB:    2.35219; CuPy:  185.43959 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint32
CUB:    2.25824; CuPy:  190.19331 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint64
CUB:    2.34480; CuPy:  160.54877 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float16
CUB:    2.22499; CuPy:  188.62803 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float32
CUB:    2.28282; CuPy:  191.49294 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float64
CUB:    2.35168; CuPy:  199.99507 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex64
CUB:    2.13699; CuPy:  205.91600 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex128
____________________________________________________ TestCUBperformance_param_1.test_cub_min_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.74886; CuPy:    2.17565 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int8
CUB:    0.43926; CuPy:    2.14573 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int16
CUB:    0.41606; CuPy:    2.16694 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int32
CUB:    0.44640; CuPy:    2.45328 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int64
CUB:    0.40560; CuPy:    2.14442 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint8
CUB:    0.41987; CuPy:    2.34758 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint16
CUB:    0.41421; CuPy:    2.46688 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint32
CUB:    0.42144; CuPy:    2.79453 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint64
CUB:    0.41859; CuPy:    3.34858 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float16
CUB:    0.42816; CuPy:    3.06381 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float32
CUB:    0.42864; CuPy:    3.49830 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float64
CUB:    0.44739; CuPy:    4.18598 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex64
CUB:    0.50118; CuPy:    4.46157 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex128
CUB:    2.18883; CuPy:  187.48228 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int8
CUB:    2.33629; CuPy:  186.00631 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int16
CUB:    2.67222; CuPy:  194.68614 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int32
CUB:    2.24550; CuPy:  189.38477 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int64
CUB:    2.37840; CuPy:  187.92944 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint8
CUB:    2.37398; CuPy:  187.73958 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint16
CUB:    2.32874; CuPy:  185.54112 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint32
CUB:    2.31229; CuPy:  189.36177 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint64
CUB:    2.69846; CuPy:  196.61235 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float16
CUB:    2.71011; CuPy:  195.82343 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float32
CUB:    2.64083; CuPy:  197.55507 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float64
CUB:    2.54042; CuPy:  208.71427 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex64
CUB:    2.50986; CuPy:  210.69037 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex128
____________________________________________________ TestCUBperformance_param_1.test_cub_sum_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.75827; CuPy:    2.65552 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=int64
CUB:    0.54611; CuPy:    2.64861 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=uint64
CUB:    0.55434; CuPy:    2.42387 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float16
CUB:    0.52058; CuPy:    2.47011 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float32
CUB:    0.53376; CuPy:    2.65155 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=float64
CUB:    0.52707; CuPy:    2.65178 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex64
CUB:    0.55286; CuPy:    2.93974 (ms), for 20 runs, shape=(256, 1024), axis=(0, 1), dtype=complex128
CUB:    2.41578; CuPy:  189.06758 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=int64
CUB:    2.57014; CuPy:  187.65114 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=uint64
CUB:    2.77382; CuPy:  160.22694 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float16
CUB:    2.72045; CuPy:  158.73507 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float32
CUB:    2.65936; CuPy:  187.65065 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=float64
CUB:    2.58074; CuPy:  191.26474 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex64
CUB:    2.47718; CuPy:  190.82137 (ms), for 20 runs, shape=(256, 1024), axis=(1,), dtype=complex128
___________________________________________________ TestCUBperformance_param_2.test_cub_argmax_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:  144.74288; CuPy:  141.11178 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int8
CUB:  131.39590; CuPy:  131.29507 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int16
CUB:  132.54224; CuPy:  132.50589 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int32
CUB:  140.44058; CuPy:  140.42630 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int64
CUB:  126.29322; CuPy:  126.32611 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint8
CUB:  132.92016; CuPy:  132.77961 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint16
CUB:  132.52570; CuPy:  132.50493 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint32
CUB:  140.42125; CuPy:  140.41130 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint64
CUB:  147.42102; CuPy:  147.37606 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float16
CUB:  141.54406; CuPy:  141.52698 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float32
CUB:  149.32192; CuPy:  149.27712 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float64
CUB:  161.37901; CuPy:  161.32704 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=complex64
CUB:  175.62746; CuPy:  175.64835 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=complex128
___________________________________________________ TestCUBperformance_param_2.test_cub_argmin_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:  131.97466; CuPy:  132.08854 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int8
CUB:  131.31168; CuPy:  131.30867 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int16
CUB:  132.50131; CuPy:  132.48125 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int32
CUB:  140.47920; CuPy:  140.56077 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=int64
CUB:  126.29142; CuPy:  126.32806 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint8
CUB:  132.76314; CuPy:  132.79702 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint16
CUB:  132.58889; CuPy:  132.49034 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint32
CUB:  140.29738; CuPy:  140.41062 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=uint64
CUB:  147.43299; CuPy:  147.43610 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float16
CUB:  141.54595; CuPy:  141.49002 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float32
CUB:  149.38880; CuPy:  149.31869 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=float64
CUB:  161.41210; CuPy:  161.38314 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=complex64
CUB:  175.68266; CuPy:  175.64086 (ms), for 20 runs, shape=(128, 256, 256), axis=None, dtype=complex128
____________________________________________________ TestCUBperformance_param_2.test_cub_max_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.79536; CuPy:  109.64730 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int8
CUB:    0.82611; CuPy:  114.18134 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int16
CUB:    1.19994; CuPy:  114.68288 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int32
CUB:    2.17834; CuPy:  124.61891 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int64
CUB:    0.69245; CuPy:  108.88109 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint8
CUB:    0.82675; CuPy:  113.64685 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint16
CUB:    1.19978; CuPy:  114.63072 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint32
CUB:    2.17706; CuPy:  124.69686 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint64
CUB:    0.82605; CuPy:  138.19821 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float16
CUB:    1.22186; CuPy:  132.35994 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float32
CUB:    2.26493; CuPy:  142.84656 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float64
CUB:    2.14358; CuPy:  165.03171 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex64
CUB:    3.76851; CuPy:  176.57674 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex128
CUB:    2.58307; CuPy:    1.78470 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int8
CUB:    2.70896; CuPy:    1.87226 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int16
CUB:    3.05891; CuPy:    1.94067 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int32
CUB:    3.87389; CuPy:    2.39798 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int64
CUB:    2.58166; CuPy:    1.77190 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint8
CUB:    2.69571; CuPy:    1.87139 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint16
CUB:    3.10067; CuPy:    1.94720 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint32
CUB:    3.93286; CuPy:    2.40890 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint64
CUB:    2.75085; CuPy:    2.06138 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float16
CUB:    3.12355; CuPy:    2.06374 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float32
CUB:    3.90243; CuPy:    2.46022 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float64
CUB:    3.94880; CuPy:    2.58288 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex64
CUB:    5.54752; CuPy:    3.75248 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex128
CUB:    4.09427; CuPy:    5.20714 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int8
CUB:    4.07987; CuPy:    5.11590 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int16
CUB:    3.77520; CuPy:    5.10714 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int32
CUB:    3.81616; CuPy:    7.35078 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int64
CUB:    4.01322; CuPy:    5.05654 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint8
CUB:    3.94768; CuPy:    5.07296 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint16
CUB:    3.79437; CuPy:    5.11494 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint32
CUB:    3.90106; CuPy:    7.36154 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint64
CUB:    3.94240; CuPy:    5.87226 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float16
CUB:    3.85715; CuPy:    5.43440 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float32
CUB:    3.86035; CuPy:    7.94909 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float64
CUB:    4.46806; CuPy:    9.52746 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex64
CUB:    5.55808; CuPy:   15.46774 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex128
____________________________________________________ TestCUBperformance_param_2.test_cub_min_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    0.79475; CuPy:  109.61091 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int8
CUB:    0.82963; CuPy:  114.17046 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int16
CUB:    1.19942; CuPy:  114.63987 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int32
CUB:    2.17830; CuPy:  124.65651 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int64
CUB:    0.68835; CuPy:  108.92672 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint8
CUB:    0.82627; CuPy:  113.65392 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint16
CUB:    1.22131; CuPy:  114.66429 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint32
CUB:    2.17200; CuPy:  124.64541 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint64
CUB:    0.83299; CuPy:  138.14150 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float16
CUB:    1.23210; CuPy:  132.35865 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float32
CUB:    2.26643; CuPy:  142.79750 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float64
CUB:    2.13994; CuPy:  159.51024 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex64
CUB:    3.75984; CuPy:  169.65226 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex128
CUB:    2.57882; CuPy:    1.78784 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int8
CUB:    2.69850; CuPy:    1.87146 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int16
CUB:    3.06496; CuPy:    1.95360 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int32
CUB:    3.88205; CuPy:    2.39389 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int64
CUB:    2.57718; CuPy:    1.76109 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint8
CUB:    2.70675; CuPy:    1.87229 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint16
CUB:    3.06662; CuPy:    1.94438 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint32
CUB:    3.85456; CuPy:    2.39174 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint64
CUB:    2.86250; CuPy:    2.08182 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float16
CUB:    3.20019; CuPy:    2.09178 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float32
CUB:    3.92970; CuPy:    2.49178 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float64
CUB:    3.93955; CuPy:    2.57405 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex64
CUB:    5.44592; CuPy:    3.74381 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex128
CUB:    4.08224; CuPy:    5.20723 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int8
CUB:    4.06861; CuPy:    5.11341 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int16
CUB:    3.79472; CuPy:    5.11277 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int32
CUB:    3.89171; CuPy:    7.37050 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int64
CUB:    4.00422; CuPy:    5.04486 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint8
CUB:    3.95283; CuPy:    5.06582 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint16
CUB:    3.79718; CuPy:    5.12528 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint32
CUB:    3.82822; CuPy:    7.34659 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint64
CUB:    3.92851; CuPy:    5.87062 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float16
CUB:    3.87251; CuPy:    5.43318 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float32
CUB:    3.84627; CuPy:    7.95293 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float64
CUB:    4.56762; CuPy:    9.32144 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex64
CUB:    5.65837; CuPy:   15.50704 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex128
____________________________________________________ TestCUBperformance_param_2.test_cub_sum_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    2.22928; CuPy:  120.01379 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=int64
CUB:    2.22182; CuPy:  120.01667 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=uint64
CUB:    0.86310; CuPy:  110.06707 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float16
CUB:    1.23251; CuPy:  114.50406 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float32
CUB:    2.20778; CuPy:  120.33984 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=float64
CUB:    2.02397; CuPy:  120.82045 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex64
CUB:    5.24275; CuPy:  130.40291 (ms), for 20 runs, shape=(128, 256, 256), axis=(0, 1, 2), dtype=complex128
CUB:    3.96653; CuPy:    2.46886 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=int64
CUB:    3.98042; CuPy:    2.47629 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=uint64
CUB:    2.88390; CuPy:    1.92173 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float16
CUB:    3.12998; CuPy:    1.89600 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float32
CUB:    3.90038; CuPy:    2.46147 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=float64
CUB:    3.91110; CuPy:    2.46979 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex64
CUB:    5.57315; CuPy:    3.64250 (ms), for 20 runs, shape=(128, 256, 256), axis=(1, 2), dtype=complex128
CUB:    3.84886; CuPy:    4.53485 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=int64
CUB:    3.85078; CuPy:    4.53085 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=uint64
CUB:    3.88851; CuPy:    4.24058 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float16
CUB:    3.71040; CuPy:    4.28352 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float32
CUB:    3.90730; CuPy:    4.43290 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=float64
CUB:    3.92058; CuPy:    4.45674 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex64
CUB:    6.23174; CuPy:    4.95997 (ms), for 20 runs, shape=(128, 256, 256), axis=(2,), dtype=complex128
___________________________________________________ TestCUBperformance_param_3.test_cub_argmax_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:  527.97558; CuPy:  527.95715 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int8
CUB:  525.55520; CuPy:  525.49603 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int16
CUB:  529.93830; CuPy:  529.75583 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int32
CUB:  562.13997; CuPy:  561.85366 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int64
CUB:  506.65613; CuPy:  506.67002 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint8
CUB:  533.61638; CuPy:  533.39385 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint16
CUB:  529.67911; CuPy:  529.83818 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint32
CUB:  561.93136; CuPy:  562.07837 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint64
CUB:  589.31929; CuPy:  589.33958 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float16
CUB:  566.21408; CuPy:  566.17069 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float32
CUB:  596.60586; CuPy:  596.62045 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float64
CUB:  644.10003; CuPy:  643.99671 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=complex64
CUB:  701.16557; CuPy:  701.17552 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=complex128
___________________________________________________ TestCUBperformance_param_3.test_cub_argmin_performance ___________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:  528.00950; CuPy:  527.93328 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int8
CUB:  525.51120; CuPy:  525.61997 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int16
CUB:  529.80819; CuPy:  529.69127 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int32
CUB:  561.99734; CuPy:  561.89549 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=int64
CUB:  506.76112; CuPy:  506.62019 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint8
CUB:  533.42521; CuPy:  533.53350 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint16
CUB:  529.84861; CuPy:  529.61773 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint32
CUB:  561.93843; CuPy:  561.95257 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=uint64
CUB:  589.66528; CuPy:  589.50102 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float16
CUB:  566.23200; CuPy:  566.16836 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float32
CUB:  596.59635; CuPy:  596.61255 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=float64
CUB:  644.10836; CuPy:  644.16418 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=complex64
CUB:  701.15526; CuPy:  701.25541 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=None, dtype=complex128
____________________________________________________ TestCUBperformance_param_3.test_cub_max_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    1.26438; CuPy:  442.40794 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int8
CUB:    1.96602; CuPy:  459.81024 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int16
CUB:    3.46208; CuPy:  460.51927 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int32
CUB:    6.90336; CuPy:  498.71206 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int64
CUB:    1.24381; CuPy:  439.17524 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint8
CUB:    1.97056; CuPy:  456.38557 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint16
CUB:    3.46784; CuPy:  460.37885 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint32
CUB:    6.90253; CuPy:  498.61936 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint64
CUB:    1.97926; CuPy:  552.34064 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float16
CUB:    3.47296; CuPy:  528.73184 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float32
CUB:    7.05555; CuPy:  570.07392 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float64
CUB:    6.75261; CuPy:  659.06864 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex64
CUB:   13.35094; CuPy:  704.52774 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex128
CUB:   36.11469; CuPy:  134.60166 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int8
CUB:   37.30826; CuPy:  139.34685 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int16
CUB:   22.65994; CuPy:  140.33334 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int32
CUB:   73.78838; CuPy:  147.19219 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int64
CUB:   36.12944; CuPy:  133.76592 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint8
CUB:   37.10637; CuPy:  138.41779 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint16
CUB:   22.70163; CuPy:  140.25296 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint32
CUB:   73.86861; CuPy:  147.29437 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint64
CUB:   39.56518; CuPy:  160.66512 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float16
CUB:   22.66662; CuPy:  156.51443 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float32
CUB:   75.58477; CuPy:  161.60963 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float64
CUB:  105.85056; CuPy:  186.22461 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex64
CUB:  257.67456; CuPy:  204.04099 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex128
CUB:    3.16291; CuPy:    5.38061 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int8
CUB:    3.83200; CuPy:    5.71939 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int16
CUB:    5.34976; CuPy:    5.93520 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int32
CUB:    8.36838; CuPy:    7.89072 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int64
CUB:    3.18112; CuPy:    5.32762 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint8
CUB:    3.81165; CuPy:    5.70621 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint16
CUB:    5.35690; CuPy:    6.37478 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint32
CUB:    8.45866; CuPy:    8.19680 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint64
CUB:    3.92304; CuPy:    7.10093 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float16
CUB:    5.38653; CuPy:    7.01114 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float32
CUB:    8.47747; CuPy:    8.52400 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float64
CUB:    8.57514; CuPy:    9.25642 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex64
CUB:   14.72294; CuPy:   14.28918 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex128
CUB:   10.61638; CuPy:   21.57011 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int8
CUB:   10.65123; CuPy:   21.11382 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int16
CUB:    9.04826; CuPy:   21.07725 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int32
CUB:    8.51146; CuPy:   31.20365 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int64
CUB:   10.38902; CuPy:   20.84662 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint8
CUB:    9.96221; CuPy:   20.87440 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint16
CUB:    9.05613; CuPy:   21.07322 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint32
CUB:    8.51958; CuPy:   31.20442 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint64
CUB:    9.86390; CuPy:   24.57258 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float16
CUB:    9.21978; CuPy:   22.56656 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float32
CUB:    8.58342; CuPy:   33.86915 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float64
CUB:   11.57357; CuPy:   40.93101 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex64
CUB:   15.58432; CuPy:   67.68730 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex128
____________________________________________________ TestCUBperformance_param_3.test_cub_min_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    1.26621; CuPy:  444.85636 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int8
CUB:    1.96550; CuPy:  459.95104 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int16
CUB:    3.47942; CuPy:  460.99185 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int32
CUB:    6.90966; CuPy:  499.00960 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int64
CUB:    1.23322; CuPy:  439.44874 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint8
CUB:    1.96218; CuPy:  456.77056 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint16
CUB:    3.46624; CuPy:  460.78016 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint32
CUB:    6.90621; CuPy:  498.87200 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint64
CUB:    1.97654; CuPy:  553.05798 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float16
CUB:    3.48227; CuPy:  529.49190 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float32
CUB:    7.05603; CuPy:  570.83306 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float64
CUB:    6.74070; CuPy:  636.96307 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex64
CUB:   13.34067; CuPy:  678.18134 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex128
CUB:   36.12374; CuPy:  134.64118 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int8
CUB:   37.29402; CuPy:  139.42710 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int16
CUB:   22.68195; CuPy:  140.33981 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int32
CUB:   73.98429; CuPy:  147.24838 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int64
CUB:   36.25930; CuPy:  133.89366 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint8
CUB:   37.17056; CuPy:  138.46640 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint16
CUB:   22.74227; CuPy:  140.32512 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint32
CUB:   73.84886; CuPy:  147.25933 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint64
CUB:   39.61910; CuPy:  160.72874 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float16
CUB:   22.60314; CuPy:  156.61958 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float32
CUB:   75.53373; CuPy:  161.67779 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float64
CUB:   96.40064; CuPy:  179.52109 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex64
CUB:  238.31469; CuPy:  196.89110 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex128
CUB:    3.11674; CuPy:    5.35654 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int8
CUB:    3.85088; CuPy:    5.73456 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int16
CUB:    5.32435; CuPy:    5.93517 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int32
CUB:    8.36749; CuPy:    7.88176 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int64
CUB:    3.09712; CuPy:    5.31069 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint8
CUB:    3.83658; CuPy:    5.70464 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint16
CUB:    5.46026; CuPy:    5.93856 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint32
CUB:    8.38349; CuPy:    7.89491 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint64
CUB:    3.94064; CuPy:    6.49104 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float16
CUB:    5.37296; CuPy:    6.47002 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float32
CUB:    8.40042; CuPy:    8.15654 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float64
CUB:    8.52947; CuPy:    8.55504 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex64
CUB:   14.65722; CuPy:   13.84944 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex128
CUB:    9.80493; CuPy:   19.23325 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int8
CUB:    9.76560; CuPy:   18.86701 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int16
CUB:    8.52282; CuPy:   18.83930 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int32
CUB:    8.54643; CuPy:   27.71821 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int64
CUB:    9.54755; CuPy:   18.60224 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint8
CUB:    9.21558; CuPy:   18.66512 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint16
CUB:    9.24413; CuPy:   21.67030 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint32
CUB:    8.62368; CuPy:   32.13984 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint64
CUB:   10.03130; CuPy:   25.24800 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float16
CUB:    9.42534; CuPy:   23.17302 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float32
CUB:    8.57414; CuPy:   34.83206 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float64
CUB:   12.11955; CuPy:   41.14906 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex64
CUB:   15.42637; CuPy:   63.56301 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex128
____________________________________________________ TestCUBperformance_param_3.test_cub_sum_performance _____________________________________________________
-------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------
CUB:    6.97610; CuPy:  480.08227 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=int64
CUB:    6.94179; CuPy:  480.14163 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=uint64
CUB:    1.99718; CuPy:  447.76698 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float16
CUB:    3.47312; CuPy:  459.71120 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float32
CUB:    6.92179; CuPy:  481.45302 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=float64
CUB:    6.77766; CuPy:  483.35837 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex64
CUB:   14.80320; CuPy:  521.10055 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(0, 1, 2, 3), dtype=complex128
CUB:   70.57578; CuPy:  147.89293 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=int64
CUB:   70.54896; CuPy:  147.95360 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=uint64
CUB:   42.92646; CuPy:  137.99280 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float16
CUB:   21.15168; CuPy:  140.78490 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float32
CUB:   70.59066; CuPy:  145.67962 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=float64
CUB:   72.87347; CuPy:  145.48374 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex64
CUB:  266.10330; CuPy:  152.96611 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(1, 2, 3), dtype=complex128
CUB:    8.38550; CuPy:    7.79805 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=int64
CUB:    8.43165; CuPy:    7.81923 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=uint64
CUB:    3.92800; CuPy:    5.62800 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float16
CUB:    5.38384; CuPy:    5.87661 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float32
CUB:    8.45389; CuPy:    7.76954 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=float64
CUB:    8.45251; CuPy:    7.78051 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex64
CUB:   14.53507; CuPy:   13.17850 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(2, 3), dtype=complex128
CUB:    8.75450; CuPy:   16.49021 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=int64
CUB:    8.87261; CuPy:   16.48438 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=uint64
CUB:    8.94595; CuPy:   15.36790 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float16
CUB:    7.98595; CuPy:   15.48554 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float32
CUB:    8.58381; CuPy:   16.06954 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=float64
CUB:    8.83651; CuPy:   16.16109 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex64
CUB:   17.87510; CuPy:   18.14227 (ms), for 20 runs, shape=(4, 128, 256, 256), axis=(3,), dtype=complex128

@leofang
Member

leofang commented Nov 4, 2019

Thanks for the quick tests.

> Half precision is supported since CUDA 7.5 (Compute Capability 5.3 and above). My concern is that CUDA_VERSION (i.e. __CUDACC_VER_MAJOR__) can be tested in host code (the dtype dispatcher), but the compute capability can only be tested in device code (i.e. inside CUB), so we cannot handle this check with the dtype dispatcher alone. What do you think?

I think CUB chose __CUDACC_VER_MAJOR__ >= 9 for this reason:

> The new behavior is implemented by using C++ operator overloads defined in the cuda_fp16.h header file. Note that the new types are incompatible with the previous CUDA 8.0 definitions.

(from https://docs.nvidia.com/cuda/archive/9.0/cuda-toolkit-release-notes/index.html). So it is possible that CUB relies on this behavior, but unfortunately I don't have a CUDA 8.0 environment to test this myself.

I think for the dtype dispatcher in cupy_cub.cu, you can just test __CUDACC_VER_MAJOR__ >= 9 for case CUPY_CUB_FLOAT16. In cub.pyx, you can then add a runtime check to decide whether to add numpy.float16 to the supported dtype lists. This should be enough to ensure we don't invoke CUB for float16 on unsupported setups.

> For the CUDA versions that support __half, CUB already specialized a NumericTraits for it, so I don't see why we need to include <cupy/carray.cuh> and to duplicate the effort. Can you test if the native CUB support for __half just works?

> Yes, it works in my environment (recent CUDA version and compute capability).

Oh, that's great! So you can remove the template specializations, get rid of cupy/carray.cuh, and just use __half? Can you push the commit?

> My implementation cannot pass the TestCUBreduction_param_(0,1,2,3) tests for summation.
> min/max work correctly, so the problem is the tolerance.

Yeah, that's possible; don't worry about correctness. It's hard to get float16 sums to agree within tolerance. I was just curious how it performs compared to CuPy's reduction kernel.

> I think fp16 reduction and fp32 reduction have the same performance.

The situation is mixed. I guess it's likely that the test sizes are too small to show the real performance of fp16.
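(For the record, the float16 sum disagreement is easy to reproduce even on the CPU, independently of CUB: float16 has an 11-bit significand, so a naive running sum gets stuck once the accumulator reaches 2048, while accumulating in a wider type does not. A small NumPy illustration, not taken from this PR:)

```python
import numpy as np

# Naive left-to-right accumulation in float16: once the running sum
# reaches 2048, adding 1.0 no longer changes it (2049 is not
# representable in float16, and the tie rounds back to the even
# neighbor, 2048).
acc = np.float16(0.0)
for _ in range(3000):
    acc = np.float16(acc + np.float16(1.0))
print(acc)    # 2048.0, not 3000.0

# Accumulating in float32 gives the exact answer, since 3000 < 2**24.
acc32 = np.float32(0.0)
for _ in range(3000):
    acc32 += np.float32(1.0)
print(acc32)  # 3000.0
```

Different reduction orders or accumulator widths therefore give visibly different float16 sums, which is exactly the kind of tolerance failure seen in the summation tests.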

@y1r
Contributor Author

y1r commented Nov 4, 2019

I've updated this PR based on your comment.
Changes:

  • Use __half instead of float16.
  • Guard the __half code with #if __CUDACC_VER_MAJOR__ >= 9 && (__CUDA_ARCH__ >= 530 || !defined(__CUDA_ARCH__)). Without the __CUDA_ARCH__ check, building CuPy for cc 3.0 fails.
  • Add a runtime check of the compute capability in cub.pyx.
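(The runtime side of this gate can be summarized as follows. This is a hedged sketch, not the PR's actual cub.pyx code; the function name is made up, and the inputs follow the usual conventions: CUDA_VERSION as an integer like 9000 for CUDA 9.0, and the compute capability as a string like '53', which is what cupy.cuda.Device().compute_capability returns.)

```python
# Sketch of the float16 eligibility check: __half arithmetic needs
# CUDA >= 9.0 on the compile-time side and a device of compute
# capability >= 5.3 on the runtime side. Only when both hold should
# numpy.float16 be added to the CUB-supported dtype lists.
def cub_supports_float16(cuda_version, compute_capability):
    return cuda_version >= 9000 and int(compute_capability) >= 53
```

So a cc 3.0 device (compute_capability '30') is rejected at runtime even when the toolkit itself is new enough.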

@y1r
Contributor Author

y1r commented Nov 4, 2019

Please hold off on merging; the current head cannot use CUB. I'll investigate this.

@y1r
Contributor Author

y1r commented Nov 4, 2019

I fixed the issue.

Member

@leofang leofang left a comment


@y1r Thanks for the fast updates and for catching the issue of supported CUDA architectures. I wasn't aware of it (I knew almost nothing about half precision...). I just left a few comments.

leofang added a commit to leofang/cupy that referenced this pull request Nov 4, 2019
Member

@leofang leofang left a comment


LGTM as long as test_cub.py can be passed (apart from the sum failure for float16 of course). @kmaehashi probably can provide you better, more pythonic ways for making the needed changes, so please expect inputs from him. 🙂 Thank you again for working on this @y1r.

@y1r y1r mentioned this pull request Nov 6, 2019


cdef _cub_reduce_dtype_compatible(x_dtype, int op, dtype=None):
dev_id = device.get_device_id()
Member


@y1r I just found that every cupy.ndarray has an attribute device:

>>> import cupy as cp
>>> a = cp.arange(10)
>>> a.device
<CUDA Device 0>
>>> a.device.id
0

...and the underlying MemoryPointer also stores this information:

>>> a.data
<cupy.cuda.memory.MemoryPointer object at 0x7f5b41c9e120>             
>>> a.data.device
<CUDA Device 0>
>>> a.data.device_id
0

Do you think it'd be easier to just use it? It'd involve changing a few function signatures from bottom up, so that the dev id can be passed from a higher level down. Could be a bit ugly.

I am not saying we should do this, just want to share this new finding. Should have known this earlier...😛

@jakirkham
Member

cc @anaruse (who may be interested in this)

@leofang
Member

leofang commented Nov 18, 2019

@y1r Note that after the bug fix #2636 you need to provide template specializations to handle fp16 NaNs. It should be fairly straightforward.

@y1r
Contributor Author

y1r commented Nov 18, 2019

@leofang Thanks for the info! I'll fix.

@y1r
Contributor Author

y1r commented Dec 3, 2019

@leofang I've updated my PR based on your suggestion.
isnan() is not available for __half, so I used __hisnan in device code and a cast to float in host code (I think the performance degradation is small).
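(For reference, both approaches test the same property: a float16 value is NaN exactly when its exponent bits are all ones and its mantissa is nonzero, and casting to float preserves NaN-ness. A bit-level sketch in Python, independent of the PR's C++ code:)

```python
# A float16 bit pattern (as a 16-bit integer) is NaN iff its exponent
# field (bits 10-14, mask 0x7C00) is all ones and its mantissa
# (mask 0x03FF) is nonzero; an all-ones exponent with a zero mantissa
# is +/-inf, not NaN. The sign bit does not matter.
def half_is_nan(bits):
    return (bits & 0x7C00) == 0x7C00 and (bits & 0x03FF) != 0

print(half_is_nan(0x7E00))  # True  (canonical float16 quiet NaN)
print(half_is_nan(0x7C00))  # False (+inf)
```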

@leofang
Member

leofang commented Dec 3, 2019

@grlee77 Given that your recent refactoring work on CUB just went into upstream, could you help review this PR?

@leofang
Member

leofang commented Dec 3, 2019

Given that your recent refactoring work on CUB just went into upstream

Sorry, it was your sparse matrix support that got merged, not the refactoring...

@y1r
Contributor Author

y1r commented Dec 9, 2019

@leofang Refactored my implementation. Could you give me a review again?

Member

@leofang leofang left a comment


LGTM except for one nitpick. Thank you @y1r for the improvements!

Member

@leofang leofang left a comment


LGTM!

@kmaehashi kmaehashi added this to the v8.0.0a1 milestone Jan 28, 2020
@kmaehashi kmaehashi assigned niboshi and unassigned kmaehashi Feb 6, 2020
@niboshi
Member

niboshi commented Feb 10, 2020

Sorry for the delay.
LGTM, but could you resolve the conflict?

@leofang
Member

leofang commented Feb 12, 2020

@y1r If you need help to resolve conflicts, you can add me as a collaborator to your repo.

@niboshi niboshi added the st:awaiting-author Awaiting response from author label Feb 13, 2020
@emcastillo emcastillo modified the milestones: v8.0.0a1, v8.0.0b1 Feb 14, 2020
@y1r
Contributor Author

y1r commented Feb 16, 2020

Thank you for checking my PR. I'll resolve the conflict soon. Sorry for the late response.

@y1r
Contributor Author

y1r commented Feb 17, 2020

@niboshi @leofang I updated my PR. Please review this.

@niboshi niboshi added cat:enhancement Improvements to existing features st:test-and-merge (deprecated) Ready to merge after test pass. and removed st:awaiting-author Awaiting response from author labels Feb 18, 2020
@niboshi
Member

niboshi commented Feb 18, 2020

Thank you!
Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 7bc6d62:

@chainer-ci
Member

Jenkins CI test (for commit 7bc6d62, target branch master) succeeded!
