Skip to content

Release GIL as much as possible in dist_autograd pybind.#61593

Closed
pritamdamania87 wants to merge 1 commit intogh/pritamdamania87/249/basefrom
gh/pritamdamania87/249/head
Closed

Release GIL as much as possible in dist_autograd pybind.#61593
pritamdamania87 wants to merge 1 commit intogh/pritamdamania87/249/basefrom
gh/pritamdamania87/249/head

Conversation

@pritamdamania87
Copy link
Contributor

@pritamdamania87 pritamdamania87 commented Jul 13, 2021

Stack from ghstack:

Following the pattern in #61588
to avoid deadlocks as much as possible.

Differential Revision: D29683451

Following the pattern in #61588
to avoid deadlocks as much as possible.

Differential Revision: [D29683451](https://our.internmc.facebook.com/intern/diff/D29683451/)

[ghstack-poisoned]
@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Jul 13, 2021
@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Jul 13, 2021

💊 CI failures summary and remediations

As of commit 1bb9e92 (more details on the Dr. CI page and at hud.pytorch.org/pr/61593):



❄️ 1 failure tentatively classified as flaky

but reruns have not yet been triggered to confirm:

See GitHub Actions build Windows CI (pytorch-win-vs2019-cpu-py3) / test (default, 1, 2, windows.4xlarge) (1/1)

Step: "Run test scripts" (full log | diagnosis details | 🔁 rerun) ❄️

2021-07-13T19:17:39.8469344Z ("Connection broke...ly closed by the remote host', None, 10054, None))
2021-07-13T19:17:26.0005070Z  call conda install -y -q -c conda-forge cmake  
2021-07-13T19:17:26.0005456Z  if 0 NEQ 0 (exit /b 0  ) 
2021-07-13T19:17:26.0005720Z ) 
2021-07-13T19:17:28.1926326Z Collecting package metadata (current_repodata.json): ...working... done
2021-07-13T19:17:28.7809169Z Solving environment: ...working... done
2021-07-13T19:17:39.8466546Z 
2021-07-13T19:17:39.8467061Z 
2021-07-13T19:17:39.8467521Z ## Package Plan ##
2021-07-13T19:17:39.8467768Z 
2021-07-13T19:17:39.8468135Z   environment location: C:\Jenkins\Miniconda3
2021-07-13T19:17:39.8469344Z ("Connection broken: ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)", ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
2021-07-13T19:17:39.8470228Z 
2021-07-13T19:17:39.8470393Z 
2021-07-13T19:17:39.8470685Z   added / updated specs:
2021-07-13T19:17:39.8470985Z     - boto3
2021-07-13T19:17:39.8471268Z     - cffi
2021-07-13T19:17:39.8471539Z     - dataclasses
2021-07-13T19:17:39.8471824Z     - libuv
2021-07-13T19:17:39.8472063Z     - mkl
2021-07-13T19:17:39.8472311Z     - numba
2021-07-13T19:17:39.8472555Z     - numpy

ci.pytorch.org: 1 failed


Preview docs built from this PR

This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

pritamdamania87 pushed a commit that referenced this pull request Jul 13, 2021
Following the pattern in #61588
to avoid deadlocks as much as possible.

Differential Revision: [D29683451](https://our.internmc.facebook.com/intern/diff/D29683451/)

ghstack-source-id: 133497897
Pull Request resolved: #61593
Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@rohan-varma rohan-varma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we've looked into building some tool/automatic analysis to detect these sort of issues? Currently seems pretty manual and easy to forget to release GIL when calling into pybind functions.

Also, do we know why our internal TSAN tests don't catch these sort of deadlocks (at least it seems like that to me)?

@pritamdamania87
Copy link
Contributor Author

@rohan-varma @ezyang pointed me to this: https://clang.llvm.org/docs/ThreadSafetyAnalysis.html, which does seem to be very useful.

@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 7d2ea9a.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/249/head branch July 18, 2021 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged oncall: distributed Add this issue/PR to distributed oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants