Use CUB to speed up sum/min/max#2090

Merged
niboshi merged 17 commits into cupy:master from anaruse:speedup_reductions
Sep 25, 2019

Conversation

@anaruse
Contributor

@anaruse anaruse commented Mar 7, 2019

This PR aims to improve performance of reductions in CuPy using CUB (https://github.com/NVlabs/cub). This is related to #2085.

The performance of sum/min/max() is greatly improved with this PR, especially when the array size is large. The following are results of measuring the performance of max() using the script in #2085.

cupy w/o CUB : 948.2791748046875 ms
cupy with CUB : 6.752448081970215 ms
numpy : 281.2850341796875 ms

Note: this PR only supports "single reduction" (reducing the array to a single element) for now, and does not support "batched reduction". Thus, the CUB implementations are not used when you specify the axis argument of sum/min/max().

@kmaehashi kmaehashi added the cat:performance Performance in terms of speed or memory consumption label Mar 12, 2019
@anaruse
Contributor Author

anaruse commented Mar 29, 2019

I've modified the PR to make it easier to maintain.
Could you review whether the design is OK?

@niboshi
Member

niboshi commented Apr 1, 2019

Could you tell me how a user can enable CUB and how CuPy checks for it?

@anaruse
Contributor Author

anaruse commented Apr 1, 2019

I suppose your question is about how to build this PR with CUB enabled?
I thought you could do it by just adding the CUB path to the CFLAGS environment variable, but I noticed it did not work. It seems that CFLAGS is not referenced when building *.cu files with nvcc. I will check it, so please wait.
BTW, if there is a link to the CUB directory in ${CUDA_PATH}/include, you can enable CUB.

@anaruse
Contributor Author

anaruse commented Apr 1, 2019

I've modified the PR so that it checks the environment variable CUB_PATH at build time.
You can build this PR with CUB enabled by setting CUB_PATH as below.

CUB_PATH=/opt/cub/cub-1.8.0 python setup.py install

@niboshi
Member

niboshi commented Apr 1, 2019

Thanks, I'll give it a try!

setup.py Outdated
'core/include/cupy/_cuda/cuda-*/*.h',
'core/include/cupy/_cuda/cuda-*/*.hpp',
'cuda/cupy_thrust.cu',
'cuda/cupy_cub.cu',
Member

It seems this line is not needed.

void cub_reduce_sum(void *x, void *y, int num_items,
                    void *workspace, size_t &workspace_size, int dtype_id)
{
    void (*f)(void*, void*, int, void*, size_t&);
Member

Let's avoid an uninitialized variable by assigning = NULL.

size_t cub_reduce_sum_get_workspace_size(void *x, void *y, int num_items,
                                         int dtype_id)
{
    size_t workspace_size;
Member

Ditto. = 0.

void cub_reduce_min(void *x, void *y, int num_items,
                    void *workspace, size_t &workspace_size, int dtype_id)
{
    void (*f)(void*, void*, int, void*, size_t&);
Member

Ditto. = NULL.

size_t cub_reduce_min_get_workspace_size(void *x, void *y, int num_items,
                                         int dtype_id)
{
    size_t workspace_size;
Member

Ditto. = 0.

}

size_t cub_reduce_sum_get_workspace_size(void *x, void *y, int num_items,
                                         int dtype_id)
Member

Please fix indentation.
There are some other cases.

case CUPY_CUB_FLOAT32: f = _cub_reduce_sum<float>; break;
case CUPY_CUB_FLOAT64: f = _cub_reduce_sum<double>; break;
default:
std::cerr << "Unsupported dtype ID: " << dtype_id << std::endl;
Member

Let's throw an exception. std::invalid_argument?

case CUPY_CUB_FLOAT32: f = _cub_reduce_min<float>; break;
case CUPY_CUB_FLOAT64: f = _cub_reduce_min<double>; break;
default:
std::cerr << "Unsupported dtype ID: " << dtype_id << std::endl;
Member

Ditto.

size_t cub_reduce_max_get_workspace_size(void *x, void *y, int num_items,
                                         int dtype_id)
{
    size_t workspace_size;
Member

Ditto. = 0.

std::cerr << "Unsupported dtype ID: " << dtype_id << std::endl;
break;
}
(*f)(x, y, num_items, workspace, workspace_size);
Member

(*f) can be written as just f. (ditto for other cases).

@anaruse
Contributor Author

anaruse commented Apr 2, 2019

Thanks for your review, and sorry that I pushed a commit without noticing your review. Please wait for a while; I will fix the PR based on your feedback soon.

@anaruse
Contributor Author

anaruse commented Apr 2, 2019

Thanks for your review; I think all your feedback has been reflected.
Could you check it again?


cub_path = os.environ.get('CUB_PATH', '')
if os.path.exists(cub_path):
    include_dirs.append(cub_path)
Member

Should it be include_dirs.append(os.path.join(cub_path, 'include'))?

Contributor Author

Thanks for your comment, and sorry for my very late reply.
"include" — I thought I had forgotten to add that, but actually it is not necessary, since CUB does not have an "include" directory. Please see https://github.com/NVlabs/cub.

Thanks!

@jakirkham
Member

@niboshi, would it be possible to get another review? Sounds like most concerns raised have been addressed.

};

void cub_reduce_sum(void *x, void *y, int num_items,
                    void *workspace=NULL, size_t &workspace_size, int dtype_id)
Member

Suggested change
void *workspace=NULL, size_t &workspace_size, int dtype_id)
void *workspace, size_t &workspace_size, int dtype_id)

Contributor Author

Thanks for your suggestion!
Seems it was OK with CUDA 9.x but not with CUDA 10.x.

Contributor Author

Ah, I was wrong. It is not OK with CUDA 9.x either.

Member

@pentschev pentschev Jul 9, 2019

Yes, this is a C++ limitation:

In a function declaration, after a parameter with a default argument, all subsequent parameters must have a default argument supplied in this or a previous declaration from the same scope

Source: https://en.cppreference.com/w/cpp/language/default_arguments

};

void cub_reduce_min(void *x, void *y, int num_items,
                    void *workspace=NULL, size_t &workspace_size, int dtype_id)
Member

Suggested change
void *workspace=NULL, size_t &workspace_size, int dtype_id)
void *workspace, size_t &workspace_size, int dtype_id)

};

void cub_reduce_max(void *x, void *y, int num_items,
                    void *workspace=NULL, size_t &workspace_size, int dtype_id)
Member

Suggested change
void *workspace=NULL, size_t &workspace_size, int dtype_id)
void *workspace, size_t &workspace_size, int dtype_id)

@anaruse
Contributor Author

anaruse commented Jul 9, 2019

An environment variable CUPY_DETERMINISTIC has been added so that you can control whether non-deterministic computation is allowed. Non-deterministic computation is allowed when CUPY_DETERMINISTIC = 0 (the default is 0).
For safety, the reduction implementations using CUB in this branch are used only when non-deterministic computation is allowed; in other words, they are not used when CUPY_DETERMINISTIC = 1.

@pentschev
Member

Following our discussion during the PFN-NVIDIA call today, I was wondering, is there a place where we should document CUPY_DETERMINISTIC (or the name it will ultimately hold)? And as a more general question, is there a correct place to document variables like this and CUPY_EXPERIMENTAL_SLICE_COPY (introduced in #2079)?

@anaruse
Contributor Author

anaruse commented Jul 16, 2019

Thank you for the discussion today. An environment variable "CUB_DISABLED" is used instead of "CUPY_DETERMINISTIC" to avoid unnecessary confusion, as discussed. Now you can disable the use of CUB by setting "CUB_DISABLED=1". The default is CUB enabled.

@niboshi
Member

niboshi commented Jul 29, 2019

Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 48a8c11:

@chainer-ci
Member

Jenkins CI test (for commit 48a8c11, target branch master) failed with status FAILURE.

@niboshi
Member

niboshi commented Jul 29, 2019

@anaruse
Contributor Author

anaruse commented Jul 30, 2019

Would it be OK to make the "fast reductions by CUB" available only with CUDA 9.0 or later?

return _cuda_path


def use_cub():
Member

Why do you use a function call? This is a very small function; you can use a single variable, for example cub_use. Or, you can avoid importing cub if CUB_DISABLED is set.

cub_enabled = False
if int(os.getenv('CUB_DISABLED', 0)):
    try:
        from cupy.cuda import cub  # NOQA
        cub_enabled = True
    except ImportError:
        pass

Contributor Author

Thanks for your comment. That looks better; I will fix it that way.

@anaruse
Contributor Author

anaruse commented Aug 5, 2019

Could you run CI tests again? This PR is now available with CUDA 8.0 (C++11).

@niboshi
Member

niboshi commented Sep 10, 2019

I'm sorry, I overlooked the last comment.
Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 88d6a8d:

@chainer-ci
Member

Jenkins CI test (for commit 88d6a8d, target branch master) failed with status FAILURE.

thrust_enabled = False

cub_enabled = False
if int(os.getenv('CUB_DISABLED', 0)) == 0:
Member

This is contrary to this comment, but I think it's better to have a function use_cub(), because we need to provide a way to switch the mode at runtime instead of import-time (related: #2436).

Even if there's no such immediate use case, it's safer to expose a function instead of a variable, so that we can have a greater control in the future.
For example, in chainer/chainer#2967 I tried to reflect the number of available devices in chainer.cuda.available, but gave up.

Contributor Author

@anaruse anaruse Sep 17, 2019

I'm a bit confused, so let me check.
You think that we should use not only the environment variable CUB_DISABLED but also a function use_cub() as before, so that users can switch the mode at runtime as well, right?

Contributor Author

This is contrary to this comment,

This is because the code snippet there was opposite to his intention :)

Contributor Author

Is the following implementation OK?

  1. If CuPy is NOT built with CUB
    --> CUB is never used.
  2. If CuPy is built with CUB
    1. If an environment variable CUB_DISABLED is set
      --> CUB is never used.
    2. If an environment variable CUB_DISABLED is not set
      --> CUB is used as default.
      1. If a function disable_cub() is called.
        --> CUB is not used as default.
      2. If a function enable_cub() is called.
        --> CUB is used as default.

Member

Sorry for being unclear.
We don't need an interface for users to switch the behavior in this PR.
I think the previous implementation was better because after #2436 is implemented CuPy will need to determine the behavior at runtime. If cupy.cuda.cub_enabled is only for internal use, we don't even have to do that now. We can just implement that later.

Contributor Author

All right. So, you are suggesting to revert the last commit (88d6a8d), right?

Member

Yes, that's what I meant. But I'll merge the PR now anyway as that can be done later.

@niboshi
Member

niboshi commented Sep 24, 2019

Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 88d6a8d:

@niboshi niboshi added the st:test-and-merge (deprecated) Ready to merge after test pass. label Sep 24, 2019
@niboshi niboshi added this to the v7.0.0b4 milestone Sep 24, 2019
@chainer-ci
Member

Jenkins CI test (for commit 88d6a8d, target branch master) succeeded!

@niboshi niboshi changed the title Speedup sum/min/max() Use CUB to speed up sum/min/max Sep 25, 2019
@niboshi
Member

niboshi commented Sep 25, 2019

Thanks!

@niboshi niboshi merged commit f2ef649 into cupy:master Sep 25, 2019
@anaruse
Contributor Author

anaruse commented Sep 25, 2019

Thanks for merging the PR!
