
Performance boost: CUB-backed _SimpleReductionKernel#3244

Merged
mergify[bot] merged 62 commits into cupy:master from leofang:cupy_reduce_cub_backend2 on Jun 13, 2020
Conversation

@leofang
Member

@leofang leofang commented Mar 31, 2020

Blocked by #2584. Blocked by #3253. Blocked by #3262.

## Important note
DO NOT REVIEW the code yet!

To core devs: I need quick turnaround feedback first before investing time in polishing this PR. Please read the PR description below. The current implementation is admittedly not perfect, and there are a number of things to be addressed or fixed (see TODO) to make code review more comfortable.

Motivation

With the cupy.cuda.cub module we were able to speed up a few commonly encountered reduction operations (sum, min, max, etc.), but it would be nice to speed up all of the reduction operations currently existing in (or to be added to) CuPy (see the list here).

However, adding the desired support is very tedious; we've seen how error-prone it was and how many additional patches were needed to make it work. Even if we could do this for all reduction functions one by one, the maintenance would be difficult. Therefore, it is preferable, if possible, to build on top of the existing infrastructure for runtime code generation.

What is being done

This PR achieves the goal by providing an alternative reduction kernel implementation based on cub::BlockReduce. All reduction functions returned by create_reduction_func, which is in turn based on _SimpleReductionKernel, will benefit if the reduced axes are contiguous (in either C or F order). The net effect is as if cub::DeviceReduce or cub::DeviceSegmentedReduce were used, depending on whether it is a full or partial reduction.
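The full-vs-segmented mapping described above can be modeled in plain Python (a conceptual sketch only, not CuPy's actual code): when the reduced axes are contiguous, the array flattens into equal-length segments and each segment is reduced independently, which is what cub::DeviceSegmentedReduce does; a full reduction is the special case of a single segment.

```python
import operator
from functools import reduce

def segmented_reduce(flat, segment_size, op=operator.add):
    """Conceptual model of cub::DeviceSegmentedReduce: reduce each
    contiguous segment of length `segment_size` in a flat buffer.
    A full reduction is the case segment_size == len(flat)."""
    assert len(flat) % segment_size == 0
    return [
        reduce(op, flat[i:i + segment_size])
        for i in range(0, len(flat), segment_size)
    ]

# Reducing the last axis of a 2x3 array stored in C order:
data = [1, 2, 3, 4, 5, 6]          # [[1, 2, 3], [4, 5, 6]] flattened
print(segmented_reduce(data, 3))   # per-row sums -> [6, 15]
print(segmented_reduce(data, 6))   # full reduction -> [21]
```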

What is not included

  • Generalization to arbitrary ReductionKernel: I think this will have to wait until this PR is merged, as I don't understand a few things there. I will likely need to discuss with the core devs offline.
  • NVRTC support: currently this PR requires using nvcc (see below) due to NVlabs/cub#131, but as suggested there this issue can be circumvented by using Jitify.
  • Reduction over non-contiguous arrays: I don't think this is possible with CUB...
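The contiguity condition above can be sketched with a small hypothetical helper (names and exact semantics are illustrative; CuPy's real dispatch also handles "essentially 1D" cases): the reduced axes must form a contiguous run at the fast-varying end of the array so that the reduction can be phrased as a (segmented) reduce over a flat buffer.

```python
def reduced_axes_are_contiguous(ndim, axes, order='C'):
    """Hypothetical check: do the reduced axes form a contiguous run
    at the fast-varying end of the array?  For C order the fast axes
    are the trailing ones; for F order, the leading ones."""
    axes = sorted(a % ndim for a in axes)
    if order == 'C':
        # C order: must reduce over the trailing axes
        return axes == list(range(ndim - len(axes), ndim))
    else:
        # F order: must reduce over the leading axes
        return axes == list(range(len(axes)))

print(reduced_axes_are_contiguous(3, (1, 2)))           # True: trailing run
print(reduced_axes_are_contiguous(3, (0, 2)))           # False: not adjacent
print(reduced_axes_are_contiguous(3, (0,), order='F'))  # True: leading axis
```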

Implementation detail & limitations

Will add this later. See #3244 (comment).

Performance benchmarks

Using this script, in which all reduction functions in CuPy (found via grep create_reduction_func) are tested and benchmarked, I see that in most cases, especially full reductions, a significant boost is achieved.

For CUDA 9.2 + P100, this is what I got.
For CUDA 10.0 + GTX 2080 Ti, this is what I got.
(UPDATE): #3244 (comment)

TODO

  • Merge with the current master: The conflict is due to Merge elementwise implementations #2920, which I need more time to understand, so I won't proceed further unless I get a green light from the core devs.
  • Merge cupy/core/_cub_simple_reduction.py back to cupy/core/_reduction.pyx. (I added the kernel in a .py file for faster dev & test cycles.)
  • Discuss how to enable/disable this feature so it is easy for both users and our CIs. Toggling this feature on/off is currently done via the new flag cupy.core.cub_block_reduction_enabled, but frankly I am not a fan of it.
  • Analyze why CUB does not win in certain cases.
  • Provide some architecture tuning parameters such as ITEMS_PER_THREAD, BLOCK_SIZE, etc. This is how it's done in CUB, but perhaps we can defer this to the next PR.
  • Clean up all the mess and polish the code.
  • (Optional) Force-push to clean up the commit history (need to delete 3e8eb62)

leofang added 30 commits March 16, 2020 10:55
- better debug output
- do CUB reduction only for C-contiguous cases
- add an experimental switch
- F-order reduction is OK
- "essentially 1D" reduction is OK
- sometimes a C-order input is misidentified as F-order => why?!
This is done by
1. separating all compile-time constants from the rest
2. passing in the latter as runtime arguments
We need to split the code path cleanly so that we don't accidentally
invoke CUB kernels when we shouldn't. This is done by adding a bool
try_use_cub to make the hint. It defaults to False for ReductionKernel
and True for _SimpleReductionKernel.
… code

test_reduction.py and test_sumprod.py pass!
This is a super ugly hack that I don't like... But it seems to work.
It is clear that a two-pass launch is necessary to come close to the
performance of cub::DeviceReduce (for reducing the whole array). However,
in order to do this we need to allocate a temporary buffer of type_reduce.
This is not possible unless CuPy exposes its memory pool to C/C++
space so that we can call sizeof(type_reduce).
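The two-pass scheme this commit message refers to can be sketched in plain Python (block size and function names are illustrative, not CuPy's internals): pass one reduces fixed-size blocks into a temporary buffer of partial results, and a second launch reduces that buffer.

```python
import operator
from functools import reduce

def two_pass_sum(data, block_size=4):
    """Sketch of a two-pass reduction.  Pass 1 emits one partial result
    per 'thread block' (on the GPU this is where the temporary buffer
    of type_reduce would be allocated); pass 2 reduces the partials."""
    # Pass 1: one partial per block
    partials = [
        reduce(operator.add, data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]
    # Pass 2: reduce the buffer of partials to the final result
    return reduce(operator.add, partials)

print(two_pass_sum(list(range(10))))  # 45
```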
@emcastillo
Member

Is it not possible to add a _AbstractCUBReductionKernel that inherits _AbstractCUBReductionKernel and overrides what's needed?

@leofang
Member Author

leofang commented Jun 12, 2020

Did you mean to inherit _AbstractReductionKernel and create a _SimpleCUBReductionKernel? Yes, that's always possible. The problem is that we must hijack and split the code path somewhere, so the question is three-fold:

  1. If we do any alternative approach (other than the current implementation), where to split?
  2. At the split point, the code would be a bit messy (due to the need to test some conditions to decide which path and prepare inputs). Can we accept that?
  3. Would the proposal be flexible enough that we can easily extend it later (see #3244 (comment))?

My concern for No. 1 is that with this inheritance the split may have to happen as early as when create_reduction_func is called, but at that point we won't yet have enough information to judge whether to use CUB. I don't see where in the existing code path the hypothetical _SimpleCUBReductionKernel could enter.

@asi1024 @emcastillo I think in the above discussions we are kind of mixing these issues together. I will wait for a consensus to be reached and then proceed with the changes.

@emcastillo
Member

I think that all the points that you mention make complete sense.

I will merge the PR in its current state and then try to refactor it myself while working with @asi1024 on an extensible design.

Sorry for repeatedly asking for changes.

@emcastillo
Member

Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit a4acb81:

@emcastillo emcastillo added the st:test-and-merge (deprecated; ready to merge after tests pass) and cat:feature (new features/APIs) labels Jun 13, 2020
@chainer-ci
Member

Jenkins CI test (for commit a4acb81, target branch master) succeeded!

@mergify mergify bot merged commit 5b54311 into cupy:master Jun 13, 2020
@leofang leofang deleted the cupy_reduce_cub_backend2 branch June 13, 2020 06:09
@leofang
Member Author

leofang commented Jun 13, 2020

That was quick! Thanks a lot @emcastillo! Yeah, no worries, those are perfectly legitimate requests. I just don't have good ideas (for now). We can work together to improve the design.

In the meantime I will also start looking into hijacking create_ufunc (don't be mad at me 🙇‍♂️) and see if CUB helps (I hope it does!)

@asi1024
Member

asi1024 commented Jun 15, 2020

@leofang What is the difference between CUB_PATH and CUPY_CUB_PATH?

@leofang
Member Author

leofang commented Jun 15, 2020

@asi1024 Thanks for asking, I should have remembered to leave a note 😅

Currently:

After #2584 is merged:

  • CUB_PATH is used at build time to build cupy.cuda.cub, but will be deprecated
  • CUPY_CUB_PATH is used at build time to build cupy.cuda.cub and at runtime to support this PR

For the purpose of supporting conda-forge, it's best to prefix all package-specific env vars with the package name (i.e., CUPY_) whenever possible, so that we can better distinguish them and reduce interference among packages.
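A minimal sketch of the lookup order implied by the two variables above (the exact resolution logic in CuPy's setup code may differ, and the helper name here is hypothetical): prefer the CUPY_-prefixed variable and fall back to the legacy one.

```python
import os

def get_cub_path():
    """Illustrative helper, not CuPy's actual code: prefer
    CUPY_CUB_PATH, fall back to the legacy CUB_PATH, else None."""
    return os.environ.get('CUPY_CUB_PATH', os.environ.get('CUB_PATH'))

os.environ['CUB_PATH'] = '/opt/cub'
print(get_cub_path())                       # '/opt/cub' (legacy fallback)
os.environ['CUPY_CUB_PATH'] = '/opt/cupy-cub'
print(get_cub_path())                       # '/opt/cupy-cub' takes precedence
```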

@asi1024
Member

asi1024 commented Jun 15, 2020

@leofang Thanks!

@jakirkham
Member

Do you have some recent benchmarks with these changes Leo?

@leofang
Member Author

leofang commented Jun 15, 2020

Hi John, see #3244 (comment) (sorry, GitHub folded the long thread...). We might get some slight improvement in master, because after posting that result I also made a few cosmetic changes (loop unrolling, inlining, etc.). If you need, I can rerun the script.

@asi1024
Member

asi1024 commented Jun 16, 2020

@leofang Can we remove these lines for each routine now?

if cupy.cuda.cub_enabled:
    # result will be None if the reduction is not compatible with CUB
    result = cub.cub_reduction(self, cub.CUPY_CUB_SUM, axis, dtype, out,
                               keepdims)
    if result is not None:
        return result

@leofang
Member Author

leofang commented Jun 16, 2020

Not really, no. It would be great if these branches (which means all reduction routines from cupy.cuda.cub!) could be eliminated, but AFAIK they need to stay for now for a few reasons:

  1. It seems when they are usable, they are usually the fastest, although there are exceptions;
  2. There is a possibility that nvcc is not in PATH (say, using conda as in Import fails although CUDA is present #3403, or on some HPC compute nodes where the full compiler toolchain is not visible), in which case this PR would not be usable:

     # rare event (mainly for conda-forge users): nvcc is not found!
     if _environment.get_nvcc_path() is None:
         return None

I would suggest first merging #2584 and then evaluating.
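The nvcc guard shown in reason 2 can be approximated with the standard library (cupy._environment.get_nvcc_path is a CuPy internal; shutil.which is a portable stand-in used here for illustration):

```python
import shutil

def nvcc_available():
    """Return True when an `nvcc` executable is on PATH.
    Approximates the guard above: when this is False, the
    CUB block-reduction path must decline and return None."""
    return shutil.which('nvcc') is not None

# False on machines without the CUDA toolchain visible on PATH
print(nvcc_available())
```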

@emcastillo
Member

Exactly, the cub reductions in some routines are special cases in cub that are (or should be) more optimized than the general reduction algorithm.

@leofang
Member Author

leofang commented Jun 16, 2020

Ah, forgot to add -- perhaps it is obvious -- that the current implementation ensures the calling precedence is: 1. routines from cupy.cuda.cub, 2. this PR, 3. CuPy's original reduction implementation.
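This precedence can be sketched as a chain of optional handlers (all names here are illustrative, not CuPy's internals): each backend returns None to decline the request, and the first non-None result wins.

```python
def reduce_with_precedence(x, backends):
    """Try each backend in order; a backend returns None to decline.
    Mirrors the dispatch order: cupy.cuda.cub routines first, then the
    CUB-backed _SimpleReductionKernel, then the original kernel."""
    for backend in backends:
        result = backend(x)
        if result is not None:
            return result
    raise RuntimeError('no backend could handle the reduction')

# Illustrative backends: the first two decline, the fallback never does.
cub_routine = lambda x: None        # e.g. dtype/op not supported by cub module
cub_block_reduce = lambda x: None   # e.g. nvcc not found, axes not contiguous
original_kernel = lambda x: sum(x)  # always works

print(reduce_with_precedence(
    [1, 2, 3], [cub_routine, cub_block_reduce, original_kernel]))  # -> 6
```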
