Fix flaky NCCL error handling tests. #42149

Closed
pritamdamania87 wants to merge 1 commit into gh/pritamdamania87/149/base from gh/pritamdamania87/149/head

pritamdamania87 commented Jul 28, 2020

Contributor

Stack from ghstack:

Some of these tests were flaky because a process could be killed without
first cleaning up its ProcessGroup. When that happened, the FileStore was not
cleaned up properly, which caused other processes in the group to crash.

This change fixes the problem by explicitly deleting the process_group before
a process is brought down forcibly.
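
The pattern can be illustrated with a minimal, torch-free sketch. It is not the code from this PR: `FakeFileStore`, `FakeProcessGroup`, and `bring_down_forcibly` are hypothetical stand-ins, and the sketch assumes CPython reference counting and a Unix-like system (it uses `os.fork`). The stand-in store removes its backing file in its finalizer, and `os._exit` plays the role of the forcible shutdown, which skips finalizers.

```python
import os
import shutil
import tempfile


class FakeFileStore:
    """Hypothetical stand-in for a file-backed rendezvous store."""

    def __init__(self, path):
        self.path = path
        with open(path, "w") as f:
            f.write("rendezvous")

    def __del__(self):
        # The finalizer removes the backing file, mimicking store cleanup.
        if os.path.exists(self.path):
            os.remove(self.path)


class FakeProcessGroup:
    """Hypothetical stand-in that owns the only reference to the store."""

    def __init__(self, store):
        self.store = store


def bring_down_forcibly(cleanup_first):
    """Fork a child that builds a group, then exits without finalizers.

    Returns True if the store file was left behind by the child.
    """
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "store")
    pid = os.fork()
    if pid == 0:
        # Child process: create the group, optionally clean up, then die.
        try:
            process_group = FakeProcessGroup(FakeFileStore(path))
            if cleanup_first:
                # Dropping the last reference runs the store finalizer now.
                del process_group
        finally:
            # os._exit terminates immediately and skips object finalizers.
            os._exit(1)
    os.waitpid(pid, 0)
    leaked = os.path.exists(path)
    shutil.rmtree(workdir, ignore_errors=True)
    return leaked


if __name__ == "__main__":
    print("leaked without cleanup:", bring_down_forcibly(cleanup_first=False))
    print("leaked with cleanup:", bring_down_forcibly(cleanup_first=True))
```

Under those assumptions, the child that skips the `del` leaves the store file behind, while the child that deletes the group first does not. The real tests would apply the same ordering to the actual process group object rather than to these stand-ins.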

Differential Revision: D22785042

#Closes: #31924

pritamdamania87 pushed a commit that referenced this pull request Jul 28, 2020
ghstack-source-id: 108629057
Pull Request resolved: #42149
dr-ci Bot commented Jul 28, 2020

💊 CI failures summary and remediations

As of commit 56802b7 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

@facebook-github-bot

Contributor

This pull request has been merged in 8deb4fe.

facebook-github-bot deleted the gh/pritamdamania87/149/head branch August 1, 2020 14:26
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#42149

ghstack-source-id: 108629057

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D22785042

fbshipit-source-id: c31d0f723badbc23b7258e322f75b57e0a1a42cf