Skip to content

Check for maximum numel in NCCL broadcasting#11466

Closed
ssnl wants to merge 3 commits intopytorch:masterfrom
ssnl:nccl_max
Closed

Check for maximum numel in NCCL broadcasting#11466
ssnl wants to merge 3 commits intopytorch:masterfrom
ssnl:nccl_max

Conversation

@ssnl
Copy link
Collaborator

@ssnl ssnl commented Sep 10, 2018

NCCL1 uses int as its numerical type for fields like count, which makes broadcasting tensors larger than 2 << 31 - 1 impossible, and raises opaque error invalid arguments. NCCL2 greatly increase the limit on many platforms by using size_t. This patch statically detects this type, and raises properly if the broadcast tensor exceeds the limit.

No test because I don't think our test suite should broadcast big tensors.

}

namespace {
// NCCL changed the umerical type used for count between NCCL1 and NCCL2.

This comment was marked as off-topic.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SsnL has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ssnl ssnl deleted the nccl_max branch September 10, 2018 21:49
PenghuiCheng pushed a commit to PenghuiCheng/pytorch that referenced this pull request Sep 11, 2018
Summary:
NCCL1 uses `int` as its numerical type for fields like `count`, which makes broadcasting tensors larger than `2 << 31 - 1` impossible, and raises opaque error `invalid arguments`. NCCL2 greatly increase the limit on many platforms by using `size_t`. This patch statically detects this type, and raises properly if the broadcast tensor exceeds the limit.

No test because I don't think our test suite should broadcast big tensors.
Pull Request resolved: pytorch#11466

Differential Revision: D9754753

Pulled By: SsnL

fbshipit-source-id: 73506450cae047e06b5b225b39efdb42d5d26685
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants