Use CUB to speed up sum/min/max#2090
Conversation
|
I've modified the PR to make it easier to maintain. |
|
Could you tell me how a user can enable CUB and how CuPy checks for it? |
|
I suppose your question is about how to build this PR with cub enabled? |
|
I've modified the PR so that it checks the environment variable CUB_PATH at build time. |
|
Thanks, I'll give it a try! |
setup.py
Outdated
| 'core/include/cupy/_cuda/cuda-*/*.h', | ||
| 'core/include/cupy/_cuda/cuda-*/*.hpp', | ||
| 'cuda/cupy_thrust.cu', | ||
| 'cuda/cupy_cub.cu', |
It seems this line is not needed.
cupy/cuda/cupy_cub.cu
Outdated
| void cub_reduce_sum(void *x, void *y, int num_items, | ||
| void *workspace, size_t &workspace_size, int dtype_id) | ||
| { | ||
| void (*f)(void*, void*, int, void*, size_t&); |
Let's avoid an uninitialized variable by assigning = NULL.
cupy/cuda/cupy_cub.cu
Outdated
| size_t cub_reduce_sum_get_workspace_size(void *x, void *y, int num_items, | ||
| int dtype_id) | ||
| { | ||
| size_t workspace_size; |
cupy/cuda/cupy_cub.cu
Outdated
| void cub_reduce_min(void *x, void *y, int num_items, | ||
| void *workspace, size_t &workspace_size, int dtype_id) | ||
| { | ||
| void (*f)(void*, void*, int, void*, size_t&); |
cupy/cuda/cupy_cub.cu
Outdated
| size_t cub_reduce_min_get_workspace_size(void *x, void *y, int num_items, | ||
| int dtype_id) | ||
| { | ||
| size_t workspace_size; |
| } | ||
|
|
||
| size_t cub_reduce_sum_get_workspace_size(void *x, void *y, int num_items, | ||
| int dtype_id) |
Please fix indentation.
There are some other cases.
cupy/cuda/cupy_cub.cu
Outdated
| case CUPY_CUB_FLOAT32: f = _cub_reduce_sum<float>; break; | ||
| case CUPY_CUB_FLOAT64: f = _cub_reduce_sum<double>; break; | ||
| default: | ||
| std::cerr << "Unsupported dtype ID: " << dtype_id << std::endl; |
Let's throw an exception. std::invalid_argument?
cupy/cuda/cupy_cub.cu
Outdated
| case CUPY_CUB_FLOAT32: f = _cub_reduce_min<float>; break; | ||
| case CUPY_CUB_FLOAT64: f = _cub_reduce_min<double>; break; | ||
| default: | ||
| std::cerr << "Unsupported dtype ID: " << dtype_id << std::endl; |
cupy/cuda/cupy_cub.cu
Outdated
| size_t cub_reduce_max_get_workspace_size(void *x, void *y, int num_items, | ||
| int dtype_id) | ||
| { | ||
| size_t workspace_size; |
cupy/cuda/cupy_cub.cu
Outdated
| std::cerr << "Unsupported dtype ID: " << dtype_id << std::endl; | ||
| break; | ||
| } | ||
| (*f)(x, y, num_items, workspace, workspace_size); |
(*f) can be written as just f. (ditto for other cases).
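Taken together, the review points above (initialize the function pointer, throw std::invalid_argument on an unknown dtype ID, and call the pointer as plain f) could look like the following plain-C++ sketch. The _sum_impl template and the host-side loop here are simplified stand-ins for the PR's CUB-backed kernels, not the actual implementation.

```cpp
#include <cstddef>
#include <stdexcept>

// Simplified stand-ins for the PR's dtype IDs.
enum { CUPY_CUB_FLOAT32 = 0, CUPY_CUB_FLOAT64 = 1 };

// Host-side stand-in for the templated CUB reduction kernels.
template <typename T>
void _sum_impl(void *x, void *y, int num_items,
               void *workspace, size_t &workspace_size) {
    const T *in = static_cast<const T *>(x);
    T acc = 0;
    for (int i = 0; i < num_items; ++i) acc += in[i];
    *static_cast<T *>(y) = acc;
    (void)workspace; (void)workspace_size;  // unused in this mock
}

void cub_reduce_sum(void *x, void *y, int num_items,
                    void *workspace, size_t &workspace_size, int dtype_id) {
    // Initialized to NULL so the pointer is never used uninitialized.
    void (*f)(void *, void *, int, void *, size_t &) = NULL;
    switch (dtype_id) {
    case CUPY_CUB_FLOAT32: f = _sum_impl<float>;  break;
    case CUPY_CUB_FLOAT64: f = _sum_impl<double>; break;
    default:
        // Throw instead of printing to std::cerr and falling through.
        throw std::invalid_argument("Unsupported dtype ID");
    }
    f(x, y, num_items, workspace, workspace_size);  // f, not (*f)
}
```

With this shape, an unsupported dtype can no longer reach the call through a NULL pointer: dispatch either succeeds or raises.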
|
Thanks for your review, and sorry that I pushed a commit without noticing it. Please wait a little while; I will fix the PR based on your feedback soon. |
|
Thanks for your review; I think all of your feedback has now been reflected. |
|
|
||
| cub_path = os.environ.get('CUB_PATH', '') | ||
| if os.path.exists(cub_path): | ||
| include_dirs.append(cub_path) |
Should it be include_dirs.append(os.path.join(cub_path, 'include'))?
Thanks for your comment, and sorry for my very late reply.
Regarding "include": I thought I had forgotten to add it, but actually it is not necessary, since CUB does not have an "include" directory. Please see https://github.com/NVlabs/cub.
Thanks!
|
@niboshi, would it be possible to get another review? Sounds like most concerns raised have been addressed. |
cupy/cuda/cupy_cub.cu
Outdated
| }; | ||
|
|
||
| void cub_reduce_sum(void *x, void *y, int num_items, | ||
| void *workspace=NULL, size_t &workspace_size, int dtype_id) |
| void *workspace=NULL, size_t &workspace_size, int dtype_id) | |
| void *workspace, size_t &workspace_size, int dtype_id) |
Thanks for your suggestion!
Seems it was OK with CUDA 9.x but not with CUDA 10.x.
Ah, I was wrong; it is not OK with CUDA 9.x either.
Yes, this is a C++ limitation:
In a function declaration, after a parameter with a default argument, all subsequent parameters must have a default argument supplied in this or a previous declaration from the same scope
Source: https://en.cppreference.com/w/cpp/language/default_arguments
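A minimal standalone illustration of that rule, using a hypothetical function rather than the PR's code: once one parameter has a default argument, every parameter after it must have one too, so defaults may only appear on trailing parameters.

```cpp
// Ill-formed, mirroring the PR's original declaration (simplified):
//   void reduce(void *workspace = NULL, size_t &size, int dtype_id);
//   error: default argument missing for parameter 'size'

// Well-formed: default arguments only on trailing parameters.
int scaled(int x, int factor = 2, int offset = 0) {
    return x * factor + offset;
}
```

This is why the suggested fix above drops `=NULL` from the workspace parameter instead of reordering the argument list.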
cupy/cuda/cupy_cub.cu
Outdated
| }; | ||
|
|
||
| void cub_reduce_min(void *x, void *y, int num_items, | ||
| void *workspace=NULL, size_t &workspace_size, int dtype_id) |
| void *workspace=NULL, size_t &workspace_size, int dtype_id) | |
| void *workspace, size_t &workspace_size, int dtype_id) |
cupy/cuda/cupy_cub.cu
Outdated
| }; | ||
|
|
||
| void cub_reduce_max(void *x, void *y, int num_items, | ||
| void *workspace=NULL, size_t &workspace_size, int dtype_id) |
| void *workspace=NULL, size_t &workspace_size, int dtype_id) | |
| void *workspace, size_t &workspace_size, int dtype_id) |
|
An environment variable CUPY_DETERMINISTIC has been added so that you can control whether non-deterministic computation is allowed. Non-deterministic computation is allowed when CUPY_DETERMINISTIC=0 (the default is 0). |
|
Following our discussion during the PFN-NVIDIA call today, I was wondering, is there a place where we should document |
|
Thank you for the discussion today. An environment variable "CUB_DISABLED" is used instead of "CUPY_DETERMINISTIC" to avoid unnecessary confusion, as discussed. Now you can disable the use of CUB by setting "CUB_DISABLED=1". By default, CUB is enabled. |
|
Jenkins, test this please |
|
Successfully created a job for commit 48a8c11: |
|
Jenkins CI test (for commit 48a8c11, target branch master) failed with status FAILURE. |
|
Hmm, it seems C++14 is not supported in CUDA 8. |
|
Would it be OK to make "fast reductions by CUB" available only on CUDA 9.0 or later? |
cupy/cuda/__init__.py
Outdated
| return _cuda_path | ||
|
|
||
|
|
||
| def use_cub(): |
Why do you use a function call?
This is a very small function; you can use a single variable, for example cub_use.
Or, you can avoid importing cub if CUB_DISABLED is set:
cub_enabled = False
if int(os.getenv('CUB_DISABLED', 0)):
    try:
        from cupy.cuda import cub  # NOQA
        cub_enabled = True
    except ImportError:
        pass
Thanks for your comment. That looks better; I will fix it that way.
|
Could you run CI tests again? This PR is now available with CUDA 8.0 (C++11). |
|
I'm sorry, I overlooked the last comment. |
|
Successfully created a job for commit 88d6a8d: |
|
Jenkins CI test (for commit 88d6a8d, target branch master) failed with status FAILURE. |
| thrust_enabled = False | ||
|
|
||
| cub_enabled = False | ||
| if int(os.getenv('CUB_DISABLED', 0)) == 0: |
This is contrary to this comment, but I think it's better to have a function use_cub(), because we need to provide a way to switch the mode at runtime instead of import-time (related: #2436).
Even if there's no such immediate use case, it's safer to expose a function instead of a variable, so that we can have greater control in the future.
For example, in chainer/chainer#2967 I tried to reflect the number of available devices in chainer.cuda.available, but gave up.
I'm a bit confused, so let me check.
You think that we should use not only the environment variable CUB_DISABLED but also a function use_cub(), as before, so that users can switch the mode at runtime as well, right?
This is contrary to this comment,
This is because the code snippet there was opposite to his intention :)
Is the following implementation OK?
- If CuPy is NOT built with CUB
  --> CUB is never used.
- If CuPy is built with CUB:
  - If the environment variable CUB_DISABLED is set
    --> CUB is never used.
  - If the environment variable CUB_DISABLED is not set
    --> CUB is used by default.
    - If the function disable_cub() is called
      --> CUB is not used by default.
    - If the function enable_cub() is called
      --> CUB is used by default.
Sorry for being unclear.
We don't need an interface for users to switch the behavior in this PR.
I think the previous implementation was better because after #2436 is implemented CuPy will need to determine the behavior at runtime. If cupy.cuda.cub_enabled is only for internal use, we don't even have to do that now. We can just implement that later.
All right. So you are suggesting that I revert the last commit (88d6a8d), right?
Yes, that's what I meant. But I'll merge the PR now anyway as that can be done later.
|
Jenkins, test this please |
|
Successfully created a job for commit 88d6a8d: |
|
Jenkins CI test (for commit 88d6a8d, target branch master) succeeded! |
|
Thanks! |
|
Thanks for merging the PR! |
This PR aims to improve performance of reductions in CuPy using CUB (https://github.com/NVlabs/cub). This is related to #2085.
The performance of sum/min/max() will be greatly improved by this PR, specifically when the array size is large. The following are the results of performance measurements of max() using the script in #2085.

Note: this PR only supports "single reduction" (reducing the array to a single element) for now, and does not support "batched reduction". Thus, the CUB implementations are not used when you specify the axis argument of sum/min/max().
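For context, the CUB API this PR wraps (cub::DeviceReduce) uses a two-phase calling convention: the first call, with a NULL workspace pointer, only reports the required temporary-storage size; the second call, with the workspace allocated, performs the reduction. Below is a plain-C++ mock of that convention, not actual CUB code; the size value is made up, where real CUB computes it from the problem shape.

```cpp
#include <cstddef>

// Mock of the two-phase cub::DeviceReduce::Sum calling convention.
// Phase 1: d_temp_storage == NULL -> only report the required bytes.
// Phase 2: d_temp_storage != NULL -> perform the reduction.
template <typename T>
void device_reduce_sum(void *d_temp_storage, size_t &temp_storage_bytes,
                       const T *d_in, T *d_out, int num_items) {
    if (d_temp_storage == NULL) {
        temp_storage_bytes = 16;  // hypothetical; real CUB computes this
        return;
    }
    T acc = 0;
    for (int i = 0; i < num_items; ++i) acc += d_in[i];
    *d_out = acc;
}
```

This two-phase protocol is why the PR exposes paired entry points such as cub_reduce_sum_get_workspace_size() and cub_reduce_sum(): the first phase lets CuPy allocate the device workspace before the second call runs the reduction.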