Performance boost: CUB-backed _SimpleReductionKernel #3244

mergify[bot] merged 62 commits into cupy:master

Conversation
- better debug output
- do CUB reduction only for C-contiguous cases
- add an experimental switch
- F-order reduction is OK
- "essentially 1D" reduction is OK
- sometimes a C-order input is misidentified as F-order => why?!
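On the last point, contiguity detection is stride-based. The sketch below is illustrative only (not CuPy's actual implementation); it shows how a naive stride check classifies layouts, and why a 1-D (or otherwise degenerate) array is simultaneously C- and F-contiguous, which is one way a C-order input can look F-ordered to a dispatcher:

```python
# Illustrative stride-based contiguity check (NOT CuPy's actual code).
# Returns (c_contiguous, f_contiguous) for a given shape/strides pair.
def contiguity(shape, itemsize, strides):
    # C order: the expected stride grows from the last axis outward.
    c_stride = itemsize
    c_ok = True
    for dim, st in zip(reversed(shape), reversed(strides)):
        if dim != 1 and st != c_stride:
            c_ok = False
        c_stride *= dim
    # F order: the expected stride grows from the first axis outward.
    f_stride = itemsize
    f_ok = True
    for dim, st in zip(shape, strides):
        if dim != 1 and st != f_stride:
            f_ok = False
        f_stride *= dim
    return c_ok, f_ok
```

Note that for a 1-D array both flags come out True, so a dispatcher that tests F-contiguity first would "misidentify" a C-order input as F-order.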
This is done by (1) separating all compile-time constants from the rest and (2) passing in the latter as runtime arguments.
We need to split the code path cleanly so that we don't accidentally invoke CUB kernels when we shouldn't. This is done by adding a bool try_use_cub to make the hint. It defaults to False for ReductionKernel and True for _SimpleReductionKernel.
… code test_reduction.py and test_sumprod.py pass!
This is a super ugly hack that I don't like... but it seems to work.
It is clear that a two-pass launch is necessary to get close to the performance of cub::DeviceReduce (for reducing a whole array). However, in order to do this we need to allocate a temporary buffer of type type_reduce. This is not possible unless CuPy exposes its memory pool to C/C++ space so that we can call sizeof(type_reduce).
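The two-pass strategy in question can be illustrated in pure Python (a conceptual sketch, not GPU code): pass 1 has each "block" reduce its chunk into a temporary buffer — the buffer whose element type is type_reduce, hence the need for sizeof(type_reduce) when allocating it — and pass 2 reduces that buffer to the final value:

```python
# Pure-Python illustration of a two-pass reduction, mirroring the
# structure of cub::DeviceReduce on the GPU (conceptual only).
def two_pass_reduce(data, block_size=4, op=lambda a, b: a + b):
    # Pass 1: one partial result per block, stored in a temporary
    # buffer (the GPU analogue needs an allocation of type_reduce).
    temp = []
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        acc = chunk[0]
        for x in chunk[1:]:
            acc = op(acc, x)
        temp.append(acc)
    # Pass 2: a single block reduces the temporary buffer.
    acc = temp[0]
    for x in temp[1:]:
        acc = op(acc, x)
    return acc
```

The single-pass alternative would need a grid-wide synchronization between the partial and final reductions, which is exactly what the second launch provides.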
Is it not possible to add a …

Did you mean to inherit …
My concern for No. 1 is that the split point may have to happen as early as …

@asi1024 @emcastillo I think in the above discussions we are kind of mixing these issues together. I will wait for a consensus to be reached and then proceed with the changes.
I think that all the points you mention make complete sense. I will merge the PR in its current state and then try to refactor it myself while working with @asi1024 on an extensible design. Sorry to keep asking for changes.
Jenkins, test this please
Successfully created a job for commit a4acb81:
Jenkins CI test (for commit a4acb81, target branch master) succeeded!
That was quick! Thanks a lot @emcastillo! Yeah, no worries, those are perfectly legit requests. I just don't have good ideas (for now). We can work together to improve the design. I will also, in the meantime, start looking into hijacking …
@leofang What is the difference between …
@asi1024 Thanks for asking, I should have remembered to leave a note 😅 Currently:
After #2584 is merged:
For the purpose of supporting conda-forge, it's best to have all package-specific env vars prefixed by the package name (i.e., …).
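The package-prefixed convention can be sketched like this. Note the variable name below (`CUPY_EXPERIMENTAL_CUB_REDUCTION`) is purely illustrative, not the actual switch this PR introduces:

```python
import os

# Hypothetical sketch of reading a package-prefixed experimental
# switch; the env var name here is invented for illustration.
def cub_reduction_enabled(environ=os.environ):
    value = environ.get('CUPY_EXPERIMENTAL_CUB_REDUCTION', '0')
    return value.lower() in ('1', 'true', 'yes')
```

Prefixing with the package name avoids collisions with other packages' switches in a shared environment such as a conda-forge CI image.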
@leofang Thanks!
Do you have some recent benchmarks with these changes, Leo?
Hi John, see #3244 (comment) (sorry, GitHub folded the long thread...). We might get some slight improvement on master, because after posting that result I also made a few cosmetic changes (loop unrolling, inlining, etc.). If you need, I can rerun the script.
@leofang Can we remove these lines for each routine now?

cupy/core/_routines_math.pyx, lines 94 to 99 at 57235e7
Not really, no. Yeah, it would be great if these branches (which means all reduction routines from …
I would suggest first merging #2584 and then evaluating.
Exactly, the cub reductions in some routines are special cases in cub that are (or should be) more optimized than the general reduction algorithm.
Ah, forgot to add -- perhaps it is obvious -- that the current implementation ensures the calling precedence is: 1. routines from …
Blocked by #2584. Blocked by #3253. Blocked by #3262.

## Important note

DO NOT REVIEW the code yet!

To core devs: I need quick turnaround feedback first, before investing time in polishing this PR. Please read the PR description below. The current implementation is admittedly not perfect, and there are a number of things to be addressed/fixed (see TODO) in order to make code review more comfortable.

## Motivation
With the `cupy.cuda.cub` module we were able to speed up a few commonly encountered reduction operations (`sum`, `min`, `max`, etc.), but it would be nice to speed up all of the reduction operations currently existing in (or to be added to) CuPy (see the list here). However, it is very tedious to add the desired support; we've seen how error-prone it was, and how many additional patches were needed to make it work. Even if we could do this for all reduction functions one by one, the maintenance would be difficult. Therefore, it is preferable, if possible, to build on top of the existing infrastructure for runtime code generation.
## What is being done
This PR achieves the goal by providing an alternative reduction kernel implementation based on `cub::BlockReduce`. All reduction functions returned by `create_reduction_func`, which is in turn based on `_SimpleReductionKernel`, will benefit if the reduced axes are contiguous (in either C or F order). The net effect is as if `cub::DeviceReduce` or `cub::DeviceSegmentedReduce` were used, depending on whether it's a full or partial reduction.

## What is not included
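For the partial-reduction case, the "net effect as if `cub::DeviceSegmentedReduce` were used" can be illustrated in pure Python (a conceptual sketch only): when the reduced axes are contiguous, the flat input splits into equal-length segments, one per output element, and each segment reduces independently:

```python
# Pure-Python sketch of a segmented reduction over a contiguous inner
# axis, mirroring what cub::DeviceSegmentedReduce computes (conceptual).
def segmented_reduce(flat, segment_len, op=lambda a, b: a + b):
    assert len(flat) % segment_len == 0
    out = []
    for start in range(0, len(flat), segment_len):
        # Each segment maps to one block on the GPU and produces
        # exactly one output element.
        acc = flat[start]
        for x in flat[start + 1:start + segment_len]:
            acc = op(acc, x)
        out.append(acc)
    return out

# E.g., summing a 2x3 C-contiguous array over its last axis is
# equivalent to reducing 2 segments of length 3.
```

This is why contiguity of the reduced axes matters: it is what lets each output element own one contiguous segment of the input.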
- `ReductionKernel`: I think it'll have to wait until this PR is merged, as I don't understand a few things there. I will likely need to discuss with the core devs offline.
- `nvcc` (see below) due to NVlabs/cub#131, but I think, as suggested there, this issue can be circumvented by using jitify.

## Implementation detail & limitations
Will add this later. See #3244 (comment).

## Performance benchmarks
Using this script, in which all reduction functions in CuPy (searched via `grep create_reduction_func`) are tested and benchmarked, I see that in most cases, especially full reductions, a significant boost is achieved.

For CUDA 9.2 + P100, this is what I got.
For CUDA 10.0 + GTX 2080 Ti, this is what I got.
(UPDATE): #3244 (comment)
## TODO
- `cupy/core/_cub_simple_reduction.py` back to `cupy/core/_reduction.pyx`. (I added the kernel in a .py file for faster dev & test cycles.)
- `cupy.core.cub_block_reduction_enabled`, but frankly I am not a fan of it.
- `ITEMS_PER_THREAD`, `BLOCK_SIZE`, etc. This is how it's done in CUB, but perhaps we can defer this to the next PR.
- (Optional) Force-push to clean up the commit history (need to delete 3e8eb62)