Use CUB to speed up sum/min/max#2090

Merged
niboshi merged 17 commits into cupy:master from anaruse:speedup_reductions
Sep 25, 2019

Conversation

@anaruse
Contributor

@anaruse anaruse commented Mar 7, 2019

This PR aims to improve performance of reductions in CuPy using CUB (https://github.com/NVlabs/cub). This is related to #2085.

The performance of sum/min/max() is greatly improved with this PR, especially when the array size is large. The following are results of measuring the performance of max() using the script in #2085.

cupy w/o CUB : 948.2791748046875 ms
cupy with CUB : 6.752448081970215 ms
numpy : 281.2850341796875 ms

Note: this PR only supports "single reduction" (reducing the array to a single element) for now, and does not support "batched reduction". Thus, the CUB implementations are not used when you specify the axis argument of sum/min/max().

@kmaehashi kmaehashi added the cat:performance Performance in terms of speed or memory consumption label Mar 12, 2019
@anaruse
Contributor Author

anaruse commented Mar 29, 2019

I've modified the PR to make it easier to maintain.
Could you review whether the design is OK?

@niboshi
Member

niboshi commented Apr 1, 2019

Could you tell me how a user can enable CUB and how CuPy checks for it?

@anaruse
Contributor Author

anaruse commented Apr 1, 2019

I suppose your question is about how to build this PR with CUB enabled?
I thought you could do it by just adding the CUB path to the CFLAGS environment variable, but I noticed it did not work. It seems that CFLAGS is not referenced when building *.cu files with nvcc. I will check it, so please wait.
BTW, if there is a link to the CUB directory in ${CUDA_PATH}/include, you can enable CUB.

@anaruse
Contributor Author

anaruse commented Apr 1, 2019

I've modified the PR so that it checks the environment variable CUB_PATH at build time.
You can build this PR with CUB enabled by setting CUB_PATH as below.

CUB_PATH=/opt/cub/cub-1.8.0 python setup.py install

@niboshi
Member

niboshi commented Apr 1, 2019

Thanks, I'll give it a try!

setup.py Outdated
'core/include/cupy/_cuda/cuda-*/*.h',
'core/include/cupy/_cuda/cuda-*/*.hpp',
'cuda/cupy_thrust.cu',
'cuda/cupy_cub.cu',
Member

It seems this line is not needed.

void cub_reduce_sum(void *x, void *y, int num_items,
                    void *workspace, size_t &workspace_size, int dtype_id)
{
    void (*f)(void*, void*, int, void*, size_t&);
Member

Let's avoid an uninitialized variable by assigning = NULL.

size_t cub_reduce_sum_get_workspace_size(void *x, void *y, int num_items,
                                         int dtype_id)
{
    size_t workspace_size;
Member

Ditto. = 0.

void cub_reduce_min(void *x, void *y, int num_items,
                    void *workspace, size_t &workspace_size, int dtype_id)
{
    void (*f)(void*, void*, int, void*, size_t&);
Member

Ditto. = NULL.

size_t cub_reduce_min_get_workspace_size(void *x, void *y, int num_items,
                                         int dtype_id)
{
    size_t workspace_size;
Member

Ditto. = 0.

}

size_t cub_reduce_sum_get_workspace_size(void *x, void *y, int num_items,
                                         int dtype_id)
Member

Please fix indentation.
There are some other cases.

case CUPY_CUB_FLOAT32: f = _cub_reduce_sum<float>; break;
case CUPY_CUB_FLOAT64: f = _cub_reduce_sum<double>; break;
default:
std::cerr << "Unsupported dtype ID: " << dtype_id << std::endl;
Member

Let's throw an exception. std::invalid_argument?

case CUPY_CUB_FLOAT32: f = _cub_reduce_min<float>; break;
case CUPY_CUB_FLOAT64: f = _cub_reduce_min<double>; break;
default:
std::cerr << "Unsupported dtype ID: " << dtype_id << std::endl;
Member

Ditto.

size_t cub_reduce_max_get_workspace_size(void *x, void *y, int num_items,
                                         int dtype_id)
{
    size_t workspace_size;
Member

Ditto. = 0.

std::cerr << "Unsupported dtype ID: " << dtype_id << std::endl;
break;
}
(*f)(x, y, num_items, workspace, workspace_size);
Member

(*f) can be written as just f. (ditto for other cases).

@anaruse
Contributor Author

anaruse commented Apr 2, 2019

Thanks for your review, and sorry that I pushed a commit without noticing your review. Please wait for a while; I will fix the PR based on your feedback soon.

@anaruse
Contributor Author

anaruse commented Apr 2, 2019

Thanks for your review; I think all your feedback has been reflected.
Could you check it again?


cub_path = os.environ.get('CUB_PATH', '')
if os.path.exists(cub_path):
    include_dirs.append(cub_path)
Member

Should it be include_dirs.append(os.path.join(cub_path, 'include'))?

Contributor Author

Thanks for your comment, and sorry for my very late reply.
"include" — I thought I had forgotten to add that, but actually it is not necessary, since CUB does not have an "include" directory. Please see https://github.com/NVlabs/cub.

Thanks!

@jakirkham
Member

@niboshi, would it be possible to get another review? Sounds like most concerns raised have been addressed.

};

void cub_reduce_sum(void *x, void *y, int num_items,
                    void *workspace=NULL, size_t &workspace_size, int dtype_id)
Member

Suggested change
void *workspace=NULL, size_t &workspace_size, int dtype_id)
void *workspace, size_t &workspace_size, int dtype_id)

Contributor Author

Thanks for your suggestion!
Seems it was OK with CUDA 9.x but not with CUDA 10.x.

Contributor Author

Ah, I was wrong. It is not OK with CUDA 9.x either.

Member

@pentschev pentschev Jul 9, 2019

Yes, this is a C++ limitation:

In a function declaration, after a parameter with a default argument, all subsequent parameters must have a default argument supplied in this or a previous declaration from the same scope

Source: https://en.cppreference.com/w/cpp/language/default_arguments

};

void cub_reduce_min(void *x, void *y, int num_items,
                    void *workspace=NULL, size_t &workspace_size, int dtype_id)
Member

Suggested change
void *workspace=NULL, size_t &workspace_size, int dtype_id)
void *workspace, size_t &workspace_size, int dtype_id)

};

void cub_reduce_max(void *x, void *y, int num_items,
                    void *workspace=NULL, size_t &workspace_size, int dtype_id)
Member

Suggested change
void *workspace=NULL, size_t &workspace_size, int dtype_id)
void *workspace, size_t &workspace_size, int dtype_id)

@anaruse
Contributor Author

anaruse commented Jul 9, 2019

An environment variable CUPY_DETERMINISTIC has been added so that you can control whether non-deterministic computation is allowed. Non-deterministic computation is allowed when CUPY_DETERMINISTIC = 0 (the default is 0).
For safety, the reduction implementations using CUB in this branch are used only when non-deterministic computation is allowed; in other words, they are not used when CUPY_DETERMINISTIC = 1.

@pentschev
Member

Following our discussion during the PFN-NVIDIA call today, I was wondering, is there a place where we should document CUPY_DETERMINISTIC (or the name it will ultimately hold)? And as a more general question, is there a correct place to document variables like this and CUPY_EXPERIMENTAL_SLICE_COPY (introduced in #2079)?

@anaruse
Contributor Author

anaruse commented Jul 16, 2019

Thank you for the discussion today. An environment variable "CUB_DISABLED" is used instead of "CUPY_DETERMINISTIC" to avoid unnecessary confusion, as discussed. Now you can disable the use of CUB by setting "CUB_DISABLED=1". The default is CUB enabled.

@niboshi
Member

niboshi commented Jul 29, 2019

Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 48a8c11:

@chainer-ci
Member

Jenkins CI test (for commit 48a8c11, target branch master) failed with status FAILURE.

@niboshi
Member

niboshi commented Jul 29, 2019

@anaruse
Contributor Author

anaruse commented Jul 30, 2019

Would it be OK to make the "fast reductions by CUB" available only with CUDA 9.0 or later?

return _cuda_path


def use_cub():
Member

Why do you use a function call? This is a very small function; you can use a single variable, for example cub_use. Or, you can avoid importing cub if CUB_DISABLED is set.

cub_enabled = False
if int(os.getenv('CUB_DISABLED', 0)):
    try:
        from cupy.cuda import cub  # NOQA
        cub_enabled = True
    except ImportError:
        pass

Contributor Author

Thanks for your comment. That looks better; I will fix it that way.

@anaruse
Contributor Author

anaruse commented Aug 5, 2019

Could you run CI tests again? This PR is now available with CUDA 8.0 (C++11).

@niboshi
Member

niboshi commented Sep 10, 2019

I'm sorry, I overlooked the last comment.
Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 88d6a8d:

@chainer-ci
Member

Jenkins CI test (for commit 88d6a8d, target branch master) failed with status FAILURE.

thrust_enabled = False

cub_enabled = False
if int(os.getenv('CUB_DISABLED', 0)) == 0:
Member

This is contrary to this comment, but I think it's better to have a function use_cub(), because we need to provide a way to switch the mode at runtime instead of import-time (related: #2436).

Even if there's no such immediate use case, it's safer to expose a function instead of a variable, so that we can have a greater control in the future.
For example, in chainer/chainer#2967 I tried to reflect the number of available devices in chainer.cuda.available, but gave up.

Contributor Author

@anaruse anaruse Sep 17, 2019

I'm a bit confused, so let me check.
You think that we should use not only the environment variable CUB_DISABLED but also a function use_cub() as before, so that users can switch the mode at runtime as well, right?

Contributor Author

This is contrary to this comment,

This is because the code snippet there was opposite to his intention :)

Contributor Author

Is the following implementation OK?

  1. If CuPy is NOT built with CUB
    --> CUB is never used.
  2. If CuPy is built with CUB
    1. If an environment variable CUB_DISABLED is set
      --> CUB is never used.
    2. If an environment variable CUB_DISABLED is not set
      --> CUB is used as default.
      1. If a function disable_cub() is called.
        --> CUB is not used as default.
      2. If a function enable_cub() is called.
        --> CUB is used as default.

Member

Sorry for being unclear.
We don't need an interface for users to switch the behavior in this PR.
I think the previous implementation was better because after #2436 is implemented CuPy will need to determine the behavior at runtime. If cupy.cuda.cub_enabled is only for internal use, we don't even have to do that now. We can just implement that later.

Contributor Author

All right. So, you are suggesting to revert the last commit (88d6a8d), right?

Member

Yes, that's what I meant. But I'll merge the PR now anyway as that can be done later.

@niboshi
Member

niboshi commented Sep 24, 2019

Jenkins, test this please

@pfn-ci-bot
Collaborator

Successfully created a job for commit 88d6a8d:

@niboshi niboshi added the st:test-and-merge (deprecated) Ready to merge after test pass. label Sep 24, 2019
@niboshi niboshi added this to the v7.0.0b4 milestone Sep 24, 2019
@chainer-ci
Member

Jenkins CI test (for commit 88d6a8d, target branch master) succeeded!

@niboshi niboshi changed the title Speedup sum/min/max() Use CUB to speed up sum/min/max Sep 25, 2019
@niboshi
Member

niboshi commented Sep 25, 2019

Thanks!

@niboshi niboshi merged commit f2ef649 into cupy:master Sep 25, 2019
@anaruse
Contributor Author

anaruse commented Sep 25, 2019

Thanks for merging the PR!
