Conversation
leofang
left a comment
Thanks a lot for filling in this missing support in a timely manner. I wanted to try it myself, but decided that it's not my priority. 😅
I have three questions on this PR:
- As far as I can tell -- correct me if I am wrong -- support for half precision (`__half`) begins only in recent CUDA versions (since 9.0, I suppose? Not sure why CuPy wants to bundle `cuda_fp16.h` for 9.2 and above...). If we want to enable CUB, we should test whether `__CUDACC_VER_MAJOR__ >= 9` and drop support for, say, CUDA 8.0. This needs to be done in both the C++ and the Python code.
- For the CUDA versions that support `__half`, CUB already specializes `NumericTraits` for it, so I don't see why we need to include `<cupy/carray.cuh>` and duplicate the effort. Can you test whether the native CUB support for `__half` just works?
- I am interested in the performance gain for float16. Can you modify the test in #2598 and show the performance comparison for it? (You just need to change the strings `lLfdFD` to `lLefdFD` and `bhilBHILfdFD` to `bhilBHILefdFD` there, i.e., adding `'e'`. Run it with `python -m pytest -rA -v ...` to see the full stdout.)
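For reference, the one-character codes in those strings are NumPy dtype characters, and `'e'` is the code for float16. A quick sanity check with plain NumPy (not the CuPy test itself):

```python
import numpy as np

# 'e' is NumPy's single-character type code for half precision (float16).
assert np.dtype('e') == np.float16
assert np.dtype('e').itemsize == 2  # two bytes per element

# Inserting 'e' into the dtype string enables the float16 case:
dtypes = 'lLefdFD'
assert np.float16 in [np.dtype(c).type for c in dtypes]
```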
Thanks again.
Thank you for the quick response.
Half precision has been supported since CUDA 7.5 (Compute Capability 5.3 and above). I am wondering about `CUDA_VERSION` (i.e. ...)

Yes, it works in my environment (recent CUDA version and compute capability; V100 SXM2 16GB).

My implementation cannot pass the `TestCUBreduction_param_(0,1,2,3)` summation tests.
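As a side note on why a float16 summation can miss its tolerance check: half precision has only a 10-bit mantissa, so a naive running sum stalls once the accumulator's spacing exceeds the addends. A NumPy illustration of this failure mode (an assumption about the cause, not a confirmed diagnosis of this particular test):

```python
import numpy as np

# 10,000 values of 0.1; the exact sum is ~1000.
x = np.full(10000, 0.1, dtype=np.float16)

# Naive float16 accumulation: once the running sum reaches 256, the
# float16 spacing (ulp) is 0.25, so adding 0.1 rounds to no change
# and the sum stalls at 256 while the true sum is ~1000.
acc = np.float16(0)
for v in x:
    acc = np.float16(acc + v)

exact = x.astype(np.float64).sum()
```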
Thanks for the quick tests.
I think CUB chose ... (from https://docs.nvidia.com/cuda/archive/9.0/cuda-toolkit-release-notes/index.html). So it might be possible that CUB relies on this behavior, but unfortunately I don't have a CUDA 8.0 env to test this myself. I think for the dtype dispatcher in ...

Oh, that's great! So you can remove the template specializations, get rid of ...

Yeah, it's possible; don't worry about correctness. It's hard to get agreement for ...

The situation is mixed. I guess it's likely that the test sizes are too small to see the real performance of fp16.
I've updated this PR based on your comment.

Please wait to merge; the current head cannot use CUB. I'll investigate this.

I fixed the issue.
leofang
left a comment
LGTM as long as `test_cub.py` passes (apart from the sum failure for float16, of course). @kmaehashi can probably suggest better, more pythonic ways of making the needed changes, so please expect input from him. 🙂 Thank you again for working on this @y1r.
    cdef _cub_reduce_dtype_compatible(x_dtype, int op, dtype=None):
        dev_id = device.get_device_id()
@y1r I just found that every `cupy.ndarray` has an attribute `device`:

    >>> import cupy as cp
    >>> a = cp.arange(10)
    >>> a.device
    <CUDA Device 0>
    >>> a.device.id
    0

...and the underlying `MemoryPointer` also stores this information:

    >>> a.data
    <cupy.cuda.memory.MemoryPointer object at 0x7f5b41c9e120>
    >>> a.data.device
    <CUDA Device 0>
    >>> a.data.device_id
    0

Do you think it'd be easier to just use it? It'd involve changing a few function signatures from the bottom up, so that the dev id can be passed down from a higher level. Could be a bit ugly.
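The suggested refactoring can be sketched in plain Python (hypothetical stand-in classes, not CuPy's actual internals): the device id travels with the array, so lower-level helpers take it as a parameter instead of querying the current device themselves.

```python
# Hypothetical stand-ins for cupy.ndarray / MemoryPointer, only to show
# how the device id could be threaded through the call chain.
class MemoryPointer:
    def __init__(self, device_id):
        self.device_id = device_id

class NDArray:
    def __init__(self, device_id=0):
        self.data = MemoryPointer(device_id)

def cub_reduce(arr):
    # Top level: read the id off the array once...
    return _cub_reduce_dtype_compatible(arr, dev_id=arr.data.device_id)

def _cub_reduce_dtype_compatible(arr, dev_id):
    # ...instead of calling device.get_device_id() down here.
    return dev_id
```

The trade-off discussed above is that every signature between the two levels must grow a `dev_id` parameter, which is why it "could be a bit ugly".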
I am not saying we should do this, just want to share this new finding. Should have known this earlier...😛
cc @anaruse (who may be interested in this)

@leofang Thanks for the info! I'll fix it.

@leofang I've updated my PR based on your suggestion.

@grlee77 Given that your recent refactoring work on CUB just went into upstream, could you help review this PR?

Sorry, it was your sparse matrix support that got merged, not the refactoring...

@leofang I refactored my implementation. Could you review it again?

Sorry for the delay.

@y1r If you need help resolving the conflicts, you can add me as a collaborator on your repo.

Thank you for checking my PR. I'll resolve the conflicts soon. Sorry for the late reaction.

Thank you!

Successfully created a job for commit 7bc6d62:

Jenkins CI test (for commit 7bc6d62, target branch master) succeeded!
Currently, the CUB binding of CuPy doesn't support fp16, although CUB itself supports it.

I implemented fp16 support in the CUB binding by using the `float16` class from `carray.cuh`. I am not sure which class I should use for this implementation, so please give me comments.