Release GIL during DDP construction. #40495
Closed
pritamdamania87 wants to merge 2 commits into gh/pritamdamania87/145/base
As part of debugging flaky ddp_under_dist_autograd tests, I realized we were running into the following deadlock:

1) Rank 0 enters DDP construction, holds the GIL, and waits for the broadcast in DDP construction.
2) Rank 3 is a little slower and performs an RRef fetch call before its own DDP construction.
3) The RRef fetch call is served on Rank 0 and tries to acquire the GIL there.
4) We now have a deadlock, since Rank 0 is waiting for Rank 3 to enter the collective and Rank 3 is waiting for Rank 0 to release the GIL.

Differential Revision: [D22205180](https://our.internmc.facebook.com/intern/diff/D22205180/)
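For readers who want to see the four steps in action, here is a minimal Python sketch of the same wait cycle. It is not the change made in this PR and does not use PyTorch: a `threading.Lock` stands in for Rank 0's GIL, a two-party `threading.Barrier` stands in for the broadcast, and the names `simulate`, `rank0`, and `rank3` are hypothetical. Timeouts are used only so that the "deadlock" case returns instead of hanging.

```python
import threading


def simulate(release_gil: bool, timeout: float = 1.0) -> bool:
    """Return True if both simulated ranks finish, False if they get stuck."""
    gil = threading.Lock()             # stand-in for Rank 0's GIL (assumption)
    collective = threading.Barrier(2)  # stand-in for the broadcast both ranks must join
    in_ctor = threading.Event()        # set once Rank 0 is inside "DDP construction"
    done = []

    def rank0():
        with gil:                      # Rank 0 holds the GIL when construction starts
            in_ctor.set()
            try:
                if release_gil:
                    gil.release()      # the idea of the fix: drop the GIL while blocking
                    try:
                        collective.wait(timeout)
                    finally:
                        gil.acquire()  # take it back before returning to "Python"
                else:
                    collective.wait(timeout)  # blocks while still holding the GIL
                done.append(0)
            except threading.BrokenBarrierError:
                pass                   # timed out: this is the simulated deadlock

    def rank3():
        in_ctor.wait()
        # The RRef fetch is served on Rank 0, so it needs Rank 0's GIL first.
        if not gil.acquire(timeout=timeout):
            return                     # never got the GIL: stuck behind Rank 0
        gil.release()
        try:
            collective.wait(timeout)   # only now does Rank 3 reach the collective
            done.append(3)
        except threading.BrokenBarrierError:
            pass

    threads = [threading.Thread(target=rank0), threading.Thread(target=rank3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done) == [0, 3]


if __name__ == "__main__":
    print(simulate(release_gil=False, timeout=0.5))  # False: neither rank completes
    print(simulate(release_gil=True, timeout=0.5))   # True: both ranks complete
```

With `release_gil=False` neither side can make progress until a timeout fires, which mirrors steps 1 to 4; with `release_gil=True` the fetch gets the lock, Rank 3 joins the barrier, and both ranks finish.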
pritamdamania87 pushed a commit that referenced this pull request on Jun 24, 2020
ghstack-source-id: 106491684
Pull Request resolved: #40495
💊 CI failures summary and remediations
As of commit 53d8462: 💚 Looks good so far! There are no failures yet. 💚
(This comment was automatically generated by Dr. CI.)
zhaojuanmao (Contributor) approved these changes on Jun 24, 2020 and left a comment:
Good catch! It should be safe to release the GIL here, as there are no Python-related calls in the reducer constructor.
pritamdamania87 pushed a commit that referenced this pull request on Jun 24, 2020
Pull Request resolved: #40495
ghstack-source-id: 106534442
This pull request has been merged in ea06db9.
This was referenced Jun 29, 2020
Closed
Closed
malfet pushed a commit to malfet/pytorch that referenced this pull request on Jul 1, 2020
Summary: Pull Request resolved: pytorch#40495
ghstack-source-id: 106534442
Test Plan: 1) Ran ddp_under_dist_autograd 500 times. 2) waitforbuildbot
Differential Revision: D22205180
fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a
seemethere pushed a commit that referenced this pull request on Jul 1, 2020
Summary: Pull Request resolved: #40495
fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a
Co-authored-by: Pritam Damania <pritam.damania@fb.com>
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request on Apr 24, 2026
Summary: Pull Request resolved: pytorch#40495
fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a