[NCCL] Explicitly Abort NCCL Communicators on Process Group Destruction #40241

Closed
osalpekar wants to merge 3 commits into gh/osalpekar/43/base from gh/osalpekar/43/head

osalpekar commented Jun 18, 2020

Stack from ghstack:

We abort incomplete NCCL communicators in the ProcessGroupNCCL
destructor; otherwise, pending NCCL communicators may block other CUDA ops.
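
To illustrate the abort-on-teardown idea described above, here is a minimal toy model in Python. It is a sketch under stated assumptions, not the code from this PR: `ToyCommunicator`, `ToyProcessGroup`, and `CommState` are hypothetical stand-ins, no NCCL or CUDA calls are made, and an explicit `close()` plays the role that the C++ destructor plays in ProcessGroupNCCL.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class CommState(Enum):
    IN_PROGRESS = auto()  # work is still pending on this communicator
    DONE = auto()         # all work on this communicator has finished
    ABORTED = auto()      # explicitly torn down while work was pending


@dataclass
class ToyCommunicator:
    """Hypothetical stand-in for an NCCL communicator handle."""
    name: str
    state: CommState = CommState.IN_PROGRESS

    def finish(self) -> None:
        self.state = CommState.DONE

    def abort(self) -> None:
        # Only incomplete communicators are aborted; finished ones are left alone.
        if self.state is CommState.IN_PROGRESS:
            self.state = CommState.ABORTED


@dataclass
class ToyProcessGroup:
    """Hypothetical process group that owns its communicators."""
    comms: list = field(default_factory=list)
    closed: bool = False

    def new_communicator(self, name: str) -> ToyCommunicator:
        comm = ToyCommunicator(name)
        self.comms.append(comm)
        return comm

    def close(self) -> None:
        # Teardown path (assumed analogue of the destructor): abort whatever
        # is still pending so it cannot keep blocking other work.
        if self.closed:
            return
        for comm in self.comms:
            comm.abort()
        self.closed = True

    def __enter__(self) -> "ToyProcessGroup":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # do not suppress exceptions


if __name__ == "__main__":
    with ToyProcessGroup() as pg:
        finished = pg.new_communicator("finished")
        pending = pg.new_communicator("pending")
        finished.finish()
    print(finished.state.name, pending.state.name)
```

The sketch uses an explicit `close()` and a context manager because Python finalizers do not run at a guaranteed time; a C++ destructor does not have that limitation. Making `close()` idempotent keeps a second teardown call harmless.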

Closes: #32231

Differential Revision: D22103662

NOTE FOR REVIEWERS: This PR has internal Facebook-specific changes or comments; please review them on Phabricator!

osalpekar added a commit that referenced this pull request Jun 18, 2020
ghstack-source-id: 106103869
Pull Request resolved: #40241
osalpekar requested a review from jiayisuse June 18, 2020 21:25
osalpekar added a commit that referenced this pull request Jun 22, 2020
Pull Request resolved: #40241

ghstack-source-id: 106368499

osalpekar added a commit that referenced this pull request Jun 23, 2020
Pull Request resolved: #40241

ghstack-source-id: 106469423


dr-ci Bot commented Jun 24, 2020

💊 CI failures summary and remediations

As of commit d11f7e4 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed



osalpekar commented

Approved internally on Phabricator

facebook-github-bot commented

This pull request has been merged in 527ab13.

zhaojuanmao commented

Reverting this PR, as the c10d tests have been flaky since it landed: https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-xenial-rocm3.3-py3.6-test2/2040//console

facebook-github-bot deleted the gh/osalpekar/43/head branch June 28, 2020 14:17
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…on (pytorch#40241)

Summary:
Pull Request resolved: pytorch#40241

We abort incomplete NCCL communicators in the ProcessGroupNCCL
destructor; otherwise, pending NCCL communicators may block other CUDA ops.

Closes: pytorch#32231
ghstack-source-id: 106469423

Test Plan: CI/Sandcastle

Reviewed By: jiayisuse

Differential Revision: D22103662

fbshipit-source-id: 1f6f88b56bd7a5e9ca5a41698995a76e60e8ad9f