Skip to content

clean up GIL usuage#32748

Closed
zhaojuanmao wants to merge 2 commits intogh/zhaojuanmao/30/basefrom
gh/zhaojuanmao/30/head
Closed

clean up GIL usuage#32748
zhaojuanmao wants to merge 2 commits intogh/zhaojuanmao/30/basefrom
gh/zhaojuanmao/30/head

Conversation

@zhaojuanmao
Copy link
Contributor

@zhaojuanmao zhaojuanmao commented Jan 29, 2020

Stack from ghstack:

This is to follow up PR #30630, we need to have GIL when calling jit::toPyObject(), for some binded functions need to be taged with GIL release if underneath C++ codes requires GIL. so

  1. pyRef::to_here() and pyRef::local_value() added GIL
  2. pyRef::pickle and pyRef::unpickle() added GIL release tag
  3. in request_callback_impl, also added GIL as needed
  4. for typeParser, use cached jitCompilationUnit_, also clean it up in cleanUp() function

Differential Revision: D19612337

This is to follow up PR #30630, we need to have GIL when calling jit::toPyObject(), for some binded functions need to be taged with GIL release if underneath C++ codes requires GIL. so
1. pyRef::to_here() and pyRef::local_value() added GIL
2. pyRef::pickle and pyRef::unpickle() added GIL release tag
3. in request_callback_impl, also added GIL as needed
4. for typeParser, use cached jitCompilationUnit_, also clean it up in cleanUp() function

Differential Revision: [D19612337](https://our.internmc.facebook.com/intern/diff/D19612337/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Jan 29, 2020
This is to follow up PR #30630, we need to have GIL when calling jit::toPyObject(), for some binded functions need to be taged with GIL release if underneath C++ codes requires GIL. so
1. pyRef::to_here() and pyRef::local_value() added GIL
2. pyRef::pickle and pyRef::unpickle() added GIL release tag
3. in request_callback_impl, also added GIL as needed
4. for typeParser, use cached jitCompilationUnit_, also clean it up in cleanUp() function

Differential Revision: [D19612337](https://our.internmc.facebook.com/intern/diff/D19612337/)

ghstack-source-id: 97355818
Pull Request resolved: #32748
@zhaojuanmao zhaojuanmao requested a review from xush6528 January 29, 2020 00:36
@zhaojuanmao zhaojuanmao assigned wanchaol and unassigned wanchaol Jan 29, 2020
@zhaojuanmao zhaojuanmao requested a review from wanchaol January 29, 2020 00:37
@kostmo
Copy link
Member

kostmo commented Jan 29, 2020

💊 CircleCI build failures summary and remediations

As of commit 6855011:

  • 1/2 broken upstream at merge base c729614 since Jan 29

    Please rebase on the viable/strict branch (expand for instructions)

    If your commit is newer than viable/strict, you can try basing on an older, stable commit:

    git fetch origin viable/strict
    git rebase --onto viable/strict $(git merge-base origin/master HEAD)
    

    If your commit is older than viable/strict:

    git fetch origin viable/strict
    git rebase viable/strict
    

    Check out the recency history of this "viable master" tracking branch.

  • 1/2 failures introduced in this PR

Detailed failure analysis

One may explore the probable reasons each build failed interactively on the Dr. CI website.

🕵️ 1 new failure recognized by patterns

The following build failures do not appear to be due to upstream breakage:

See CircleCI build pytorch_macos_10_13_py3_test (1/1)

Step: "Test" (full log | pattern match details)

Jan 29 01:01:12 RuntimeError: test_autograd failed!
Jan 29 01:01:12 Ran 1046 tests in 307.708s 
Jan 29 01:01:12  
Jan 29 01:01:12 FAILED (errors=1, skipped=13, expected failures=1) 
Jan 29 01:01:12  
Jan 29 01:01:12 Generating XML reports... 
Jan 29 01:01:12 Traceback (most recent call last): 
Jan 29 01:01:12   File "test/run_test.py", line 457, in <module> 
Jan 29 01:01:12     main() 
Jan 29 01:01:12   File "test/run_test.py", line 450, in main 
Jan 29 01:01:12     raise RuntimeError(message) 
Jan 29 01:01:12 RuntimeError: test_autograd failed! 
Jan 29 01:01:12 + cleanup 
Jan 29 01:01:12 + retcode=1 
Jan 29 01:01:12 + set +x 

🚧 1 upstream failure recognized by patterns:

These builds matched patterns, but were probably caused by upstream breakages:


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

This comment has been revised 6 times.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks!

I just have some minor questions inline.

Regarding the test failure, looks like test_quantized_rnn is cleared on master. Could you please try rebase to the current master to trigger tests again?

{
// acquiring GIL as torch::jit::toPyObject creates new py::object without
// grabbing the GIL.
pybind11::gil_scoped_acquire ag;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(can be done in followup PR) do we need to use PROFILE_GIL_SCOPED_ACQUIRE to profile this?

pySerialize_ = py::none();
pyHandleException_ = py::none();
jitCompilationUnit_ = nullptr;
typeParser_ = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious, it does not seem ScriptTypeParser depends on Python. Is setting it to nullptr here just being defensive or will it indeed hit error in certain destruction order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just be defensive, do you think whether there is an issue to do so?

// __setstate__
return PyRRef::unpickle(t);
}),
py::call_guard<py::gil_scoped_release>())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would I be correct if I assume this GIL acquirement is moved to rref_impl.cpp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this tag, mainly because PyRRef::unpickle() called pythonrpchandler related stuff that requires GIL. but yes, we moved GIL to rref_impl.cpp as well

Copy link
Contributor

@xush6528 xush6528 Jan 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not important, but to be accurate.
In PyRRef::unpickle(), it seems PythonRpcHandler::getInstance().parseTypeFromStr(rfd.type_str_); does not require GIL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but it could wait for pythonrpchandler to be constructed in another thread, and that thread is trying to aquire GIL, if we do not release GIL in unpickle, it could result in deadlock

if (rref_->isPyObj()) {
// UserRRef<py::object>::toHere() calls python_rpc_handler which acquires
// GIL.
return jit::toPyObject(
Copy link
Contributor

@xush6528 xush6528 Jan 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 facts that help me understand jit::toPyObject() (which returns a py::object) requires GIL.

  1. Constructing and destructing an py::object C++ object changes the Python object's ref count.

Copy constructor; always increases the reference count.

Destructor; automatically calls handle::dec_ref().

Ref: https://pybind11.readthedocs.io/en/stable/reference.html#_CPPv4N6object6objectERK6object

  1. Changing Python object's ref count requires acquiring GIL.

The Python interpreter is not fully thread-safe. In order to support multi-threaded Python programs, there’s a global lock, called the global interpreter lock or GIL, that must be held by the current thread before it can safely access Python objects.

Without the lock, even the simplest operations could cause problems in a multi-threaded program: for example, when two threads simultaneously increment the reference count of the same object, the reference count could end up being incremented only once instead of twice.

Ref: https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock

Combing them together, when constructing or using a py:object C++ object, the GIL should be acquired.

Copy link
Contributor

@mrshenli mrshenli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given @xush6528's comment, we need to extend GIL to cover all occurrences of py::object?


py::tuple RRefForkData::toPyTuple() const {
// add GIL as it is contructing a py::object
pybind11::gil_scoped_acquire ag;
Copy link
Contributor

@xush6528 xush6528 Jan 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need GIL for RRefForkData::fromPyTuple as well, since RRefForkData::fromPyTuple is using the py::object subtype, py::tuple? or it has already acquired GIL, how?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, let me add GIL for fromPyTuple as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a const reference of the py::tuple and does not seem to create/delete py::object. Is it still necessary to use GIL in fromPyTuple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess GIL here can help avoid underneath python object pointed by pyTuple could be mutated?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrshenli Accessing (both read and write) a Python object also requires GIL, otherwise, other thread could get GIL and invalidate that Python object.

@zhaojuanmao
Copy link
Contributor Author

@mrshenli I think we added GIL for all py::objects after adding GIL to fromPyTuple

This is to follow up PR #30630, we need to have GIL when calling jit::toPyObject(), for some binded functions need to be taged with GIL release if underneath C++ codes requires GIL. so
1. pyRef::to_here() and pyRef::local_value() added GIL
2. pyRef::pickle and pyRef::unpickle() added GIL release tag
3. in request_callback_impl, also added GIL as needed
4. for typeParser, use cached jitCompilationUnit_, also clean it up in cleanUp() function

Differential Revision: [D19612337](https://our.internmc.facebook.com/intern/diff/D19612337/)

[ghstack-poisoned]
zhaojuanmao added a commit that referenced this pull request Jan 29, 2020
Pull Request resolved: #32748

This is to follow up PR #30630, we need to have GIL when calling jit::toPyObject(), for some binded functions need to be taged with GIL release if underneath C++ codes requires GIL. so
1. pyRef::to_here() and pyRef::local_value() added GIL
2. pyRef::pickle and pyRef::unpickle() added GIL release tag
3. in request_callback_impl, also added GIL as needed
4. for typeParser, use cached jitCompilationUnit_, also clean it up in cleanUp() function
ghstack-source-id: 97373011

Differential Revision: [D19612337](https://our.internmc.facebook.com/intern/diff/D19612337/)
Copy link
Contributor

@xush6528 xush6528 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good to go now.


py::tuple RRefForkData::toPyTuple() const {
// add GIL as it is contructing a py::object
pybind11::gil_scoped_acquire ag;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a const reference of the py::tuple and does not seem to create/delete py::object. Is it still necessary to use GIL in fromPyTuple.

wuhuikx pushed a commit to wuhuikx/pytorch that referenced this pull request Jan 30, 2020
Summary:
Pull Request resolved: pytorch#32748

This is to follow up PR pytorch#30630, we need to have GIL when calling jit::toPyObject(), for some binded functions need to be taged with GIL release if underneath C++ codes requires GIL. so
1. pyRef::to_here() and pyRef::local_value() added GIL
2. pyRef::pickle and pyRef::unpickle() added GIL release tag
3. in request_callback_impl, also added GIL as needed
4. for typeParser, use cached jitCompilationUnit_, also clean it up in cleanUp() function
ghstack-source-id: 97373011

Test Plan: unit test

Differential Revision: D19612337

fbshipit-source-id: 4d09f9b52ba626545ae7d31fea6b671301ed3890
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in b5d8982.

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in b5d8982.

@facebook-github-bot facebook-github-bot deleted the gh/zhaojuanmao/30/head branch February 2, 2020 15:18
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request Mar 4, 2020
Summary:
Pull Request resolved: pytorch#32748

This is to follow up PR pytorch#30630, we need to have GIL when calling jit::toPyObject(), for some binded functions need to be taged with GIL release if underneath C++ codes requires GIL. so
1. pyRef::to_here() and pyRef::local_value() added GIL
2. pyRef::pickle and pyRef::unpickle() added GIL release tag
3. in request_callback_impl, also added GIL as needed
4. for typeParser, use cached jitCompilationUnit_, also clean it up in cleanUp() function
ghstack-source-id: 97373011

Test Plan: unit test

Differential Revision: D19612337

fbshipit-source-id: 4d09f9b52ba626545ae7d31fea6b671301ed3890
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants