Release GIL during DDP construction. #40877

Merged
seemethere merged 1 commit into pytorch:release/1.6 from malfet:malfet/cherry-pick-release-gil-into-1.6 on Jul 1, 2020
@malfet
Contributor

@malfet malfet commented Jul 1, 2020

Cherry-pick #40495 into 1.6

As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.

  1. Rank 0 enters DDP construction, holds the GIL, and waits for the broadcast
    performed inside DDP construction.
  2. Rank 3 is a little slower and performs an RRef fetch call before the DDP
    construction.
  3. The RRef fetch call is served on Rank 0 and tries to acquire the GIL there.
  4. We now have a deadlock: Rank 0 is waiting for Rank 3 to enter the
    collective, and Rank 3 is waiting for Rank 0 to release the GIL.
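
The four steps above can be reproduced in miniature with plain Python threads. This is only an illustrative sketch, not the actual change in this PR: a `threading.Lock` stands in for Rank 0's GIL, a `threading.Barrier` stands in for the broadcast collective, and the names `simulate`, `rank0_ddp_construction`, and `rank3_rref_fetch_then_ddp` are hypothetical. Timeouts are used so the stuck case returns instead of hanging.

```python
import threading


def simulate(release_gil_during_collective: bool, timeout: float = 0.5) -> bool:
    """Return True if both simulated ranks finish, False if they get stuck."""
    gil = threading.Lock()             # stand-in for rank 0's GIL (assumption)
    collective = threading.Barrier(2)  # stand-in for the broadcast in DDP construction
    rank0_in_ddp = threading.Event()   # forces the "rank 3 is a little slower" ordering
    finished = []

    def rank0_ddp_construction():
        with gil:  # the constructor is entered with the GIL held
            rank0_in_ddp.set()
            if release_gil_during_collective:
                gil.release()  # drop the lock around the blocking wait
            try:
                collective.wait(timeout)
                finished.append("rank0")
            except threading.BrokenBarrierError:
                pass  # timed out: the peer never reached the collective
            finally:
                if release_gil_during_collective:
                    gil.acquire()  # re-take the lock before leaving the constructor

    def rank3_rref_fetch_then_ddp():
        rank0_in_ddp.wait()
        # The RRef fetch is served on rank 0, so it needs rank 0's lock first.
        if not gil.acquire(timeout=timeout):
            return  # the fetch cannot complete while rank 0 holds the lock
        gil.release()
        try:
            collective.wait(timeout)  # only now does rank 3 join the collective
            finished.append("rank3")
        except threading.BrokenBarrierError:
            pass

    threads = [
        threading.Thread(target=rank0_ddp_construction),
        threading.Thread(target=rank3_rref_fetch_then_ddp),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(finished) == 2


if __name__ == "__main__":
    print("GIL held during collective:    ", simulate(False))
    print("GIL released during collective:", simulate(True))
```

With the lock held across the wait, neither simulated rank completes; once the lock is released around the blocking wait, both do. That is the behavior the title of this PR describes for DDP construction.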

Test Plan:

  1. Ran ddp_under_dist_autograd 500 times.
  2. waitforbuildbot

Summary:
Pull Request resolved: pytorch#40495

As part of debugging flaky ddp_under_dist_autograd tests, I realized
we were running into the following deadlock.

1) Rank 0 would go into DDP construction, hold GIL and wait for broadcast in
DDP construction.
2) Rank 3 is a little slower and performs an RRef fetch call before the DDP
construction.
3) The RRef fetch call is done on Rank 0 and tries to acquire GIL.
4) We now have a deadlock since Rank 0 is waiting for Rank 3 to enter the
collective and Rank 3 is waiting for Rank 0 to release GIL.
ghstack-source-id: 106534442

Test Plan:
1) Ran ddp_under_dist_autograd 500 times.
2) waitforbuildbot

Differential Revision: D22205180

fbshipit-source-id: 6afd55342e801b9edb9591ff25158a244a8ea66a
@dr-ci

dr-ci Bot commented Jul 1, 2020

💊 CI failures summary and remediations

As of commit 6be6fd3 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_build is failing. Please create an issue with a title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


@seemethere
Member

Going to go ahead and merge this

@seemethere seemethere merged commit b4b8f5b into pytorch:release/1.6 Jul 1, 2020
@seemethere seemethere added this to the 1.6.0 milestone Jul 1, 2020
@malfet malfet deleted the malfet/cherry-pick-release-gil-into-1.6 branch July 1, 2020 22:28