Performance boost: CUB-backed _SimpleReductionKernel #3244

mergify[bot] merged 62 commits into cupy:master

Conversation
- better debug output
- do CUB reduction only for C-contiguous cases
- add an experimental switch
- F-order reduction is OK
- "essentially 1D" reduction is OK
- sometimes a C-order input is misidentified as F-order => why?!
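On the last point, contiguity detection is stride-based. The sketch below is illustrative only (not CuPy's actual implementation); it shows how a naive stride check classifies layouts, and why a 1-D (or otherwise degenerate) array is simultaneously C- and F-contiguous, which is one way a C-order input can look F-ordered to a dispatcher:

```python
# Illustrative stride-based contiguity check (NOT CuPy's actual code).
# Returns (c_contiguous, f_contiguous) for a given shape/strides pair.
def contiguity(shape, itemsize, strides):
    # C order: the expected stride grows from the last axis outward.
    c_stride = itemsize
    c_ok = True
    for dim, st in zip(reversed(shape), reversed(strides)):
        if dim != 1 and st != c_stride:
            c_ok = False
        c_stride *= dim
    # F order: the expected stride grows from the first axis outward.
    f_stride = itemsize
    f_ok = True
    for dim, st in zip(shape, strides):
        if dim != 1 and st != f_stride:
            f_ok = False
        f_stride *= dim
    return c_ok, f_ok
```

Note that for a 1-D array both flags come out True, so a dispatcher that tests F-contiguity first would "misidentify" a C-order input as F-order.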
This is done by (1) separating all compile-time constants from the rest and (2) passing in the latter as runtime arguments.
We need to split the code path cleanly so that we don't accidentally invoke CUB kernels when we shouldn't. This is done by adding a bool try_use_cub to make the hint. It defaults to False for ReductionKernel and True for _SimpleReductionKernel.
… code test_reduction.py and test_sumprod.py pass!
This is a super ugly hack that I don't like... but it seems to work.
It is clear that a two-pass launch is necessary to get close to the performance of cub::DeviceReduce (for reducing a whole array). However, in order to do this we need to allocate a temporary buffer of type type_reduce. This is not possible unless CuPy exposes its memory pool to C/C++ space so that we can call sizeof(type_reduce).
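The two-pass strategy in question can be illustrated in pure Python (a conceptual sketch, not GPU code): pass 1 has each "block" reduce its chunk into a temporary buffer — the buffer whose element type is type_reduce, hence the need for sizeof(type_reduce) when allocating it — and pass 2 reduces that buffer to the final value:

```python
# Pure-Python illustration of a two-pass reduction, mirroring the
# structure of cub::DeviceReduce on the GPU (conceptual only).
def two_pass_reduce(data, block_size=4, op=lambda a, b: a + b):
    # Pass 1: one partial result per block, stored in a temporary
    # buffer (the GPU analogue needs an allocation of type_reduce).
    temp = []
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        acc = chunk[0]
        for x in chunk[1:]:
            acc = op(acc, x)
        temp.append(acc)
    # Pass 2: a single block reduces the temporary buffer.
    acc = temp[0]
    for x in temp[1:]:
        acc = op(acc, x)
    return acc
```

The single-pass alternative would need a grid-wide synchronization between the partial and final reductions, which is exactly what the second launch provides.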
Is it not possible to add a …

Did you mean to inherit …
My concern for No. 1 is that the split point may have to happen as early as …

@asi1024 @emcastillo I think in the above discussions we are kind of mixing these issues together. I will wait for a consensus to be reached and then proceed with the changes.
I think that all the points you mention make complete sense. I will merge the PR in its current state and then try to refactor it myself while working with @asi1024 on an extensible design. Sorry to keep asking for changes.
Jenkins, test this please
Successfully created a job for commit a4acb81:
Jenkins CI test (for commit a4acb81, target branch master) succeeded!
That was quick! Thanks a lot @emcastillo! Yeah, no worries, those are perfectly legit requests. I just don't have good ideas (for now). We can work together to improve the design. I will also, in the meantime, start looking into hijacking …
@leofang What is the difference between …
@asi1024 Thanks for asking, I should have remembered to leave a note 😅 Currently:
After #2584 is merged:
For the purpose of supporting conda-forge, it's best to have all package-specific env vars prefixed by the package name (i.e., …).
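The package-prefixed convention can be sketched like this. Note the variable name below (`CUPY_EXPERIMENTAL_CUB_REDUCTION`) is purely illustrative, not the actual switch this PR introduces:

```python
import os

# Hypothetical sketch of reading a package-prefixed experimental
# switch; the env var name here is invented for illustration.
def cub_reduction_enabled(environ=os.environ):
    value = environ.get('CUPY_EXPERIMENTAL_CUB_REDUCTION', '0')
    return value.lower() in ('1', 'true', 'yes')
```

Prefixing with the package name avoids collisions with other packages' switches in a shared environment such as a conda-forge CI image.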
@leofang Thanks!
Do you have some recent benchmarks with these changes, Leo?
Hi John, see #3244 (comment) (sorry, GitHub folded the long thread...). We might get some slight improvement on master, because after posting that result I also made a few cosmetic changes (loop unrolling, inlining, etc.). If you need, I can rerun the script.
@leofang Can we remove these lines for each routine now?

cupy/core/_routines_math.pyx, lines 94 to 99 at 57235e7
Not really, no. Yeah, it would be great if these branches (which means all reduction routines from …
I would suggest first merging #2584 and then evaluating.
Exactly, the cub reductions in some routines are special cases in cub that are (or should be) more optimized than the general reduction algorithm.
Ah, forgot to add -- perhaps it is obvious -- that the current implementation ensures the calling precedence is: 1. routines from …
Blocked by #2584. Blocked by #3253. Blocked by #3262.

## Important note

DO NOT REVIEW the code yet!

To core devs: I need quick turnaround feedback first, before investing time in polishing this PR. Please read the PR description below. The current implementation is admittedly not perfect, and there are a number of things to be addressed/fixed (see TODO) in order to make code review more comfortable.

## Motivation
With the `cupy.cuda.cub` module we were able to speed up a few commonly encountered reduction operations (`sum`, `min`, `max`, etc.), but it would be nice to speed up all of the reduction operations currently existing in (or to be added to) CuPy (see the list here). However, it is very tedious to add the desired support; we've seen how error-prone it was, and how many additional patches were needed to make it work. Even if we could do this for all reduction functions one by one, the maintenance would be difficult. Therefore, it is preferable, if possible, to build on top of the existing infrastructure for runtime code generation.
## What is being done
This PR achieves the goal by providing an alternative reduction kernel implementation based on `cub::BlockReduce`. All reduction functions returned by `create_reduction_func`, which is in turn based on `_SimpleReductionKernel`, will benefit if the reduced axes are contiguous (in either C or F order). The net effect is as if `cub::DeviceReduce` or `cub::DeviceSegmentedReduce` were used, depending on whether it's a full or partial reduction.

## What is not included
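For the partial-reduction case, the "net effect as if `cub::DeviceSegmentedReduce` were used" can be illustrated in pure Python (a conceptual sketch only): when the reduced axes are contiguous, the flat input splits into equal-length segments, one per output element, and each segment reduces independently:

```python
# Pure-Python sketch of a segmented reduction over a contiguous inner
# axis, mirroring what cub::DeviceSegmentedReduce computes (conceptual).
def segmented_reduce(flat, segment_len, op=lambda a, b: a + b):
    assert len(flat) % segment_len == 0
    out = []
    for start in range(0, len(flat), segment_len):
        # Each segment maps to one block on the GPU and produces
        # exactly one output element.
        acc = flat[start]
        for x in flat[start + 1:start + segment_len]:
            acc = op(acc, x)
        out.append(acc)
    return out

# E.g., summing a 2x3 C-contiguous array over its last axis is
# equivalent to reducing 2 segments of length 3.
```

This is why contiguity of the reduced axes matters: it is what lets each output element own one contiguous segment of the input.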
- `ReductionKernel`: I think it'll have to wait until this PR is merged, as I don't understand a few things there. I will likely need to discuss with the core devs offline.
- `nvcc` (see below) due to NVlabs/cub#131, but I think, as suggested there, this issue can be circumvented by using jitify.

## Implementation detail & limitations
Will add this later. See #3244 (comment).

## Performance benchmarks
Using this script, in which all reduction functions in CuPy (searched via `grep create_reduction_func`) are tested and benchmarked, I see that in most cases, especially full reductions, a significant boost is achieved.

For CUDA 9.2 + P100, this is what I got.
For CUDA 10.0 + GTX 2080 Ti, this is what I got.
(UPDATE): #3244 (comment)
## TODO
- `cupy/core/_cub_simple_reduction.py` back to `cupy/core/_reduction.pyx`. (I added the kernel in a .py file for faster dev & test cycles.)
- `cupy.core.cub_block_reduction_enabled`, but frankly I am not a fan of it.
- `ITEMS_PER_THREAD`, `BLOCK_SIZE`, etc. This is how it's done in CUB, but perhaps we can defer this to the next PR.
- (Optional) Force-push to clean up the commit history (need to delete 3e8eb62)