Use GIL to guard py::object's destruction#38348
Use GIL to guard py::object's destruction#38348mrshenli wants to merge 9 commits intogh/mrshenli/172/basefrom
Conversation
[ghstack-poisoned]
[ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit c34e59c (more details on the Dr. CI page):
ci.pytorch.org: 1 failedThis comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker. This comment has been revised 22 times. |
| // keep the py::objec alive until this function returns. | ||
| // | ||
| // Note [jit::toIValue barrow py::object refcnt] | ||
| // ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
There was a problem hiding this comment.
Hey @ezyang, I am not sure if my understanding is correct here. Is it true that if we pass a py::object to toIValue, we need to make sure that this py::object is alive until toIValue returns, because py::handle here does not inc the refcnt?
There was a problem hiding this comment.
Yes this is correct, because toIValue mostly used as function argument conversion, and it is almost certain that the caller manages the lifetime of the py::handle passed in.
There was a problem hiding this comment.
Might be a naive question, but what's the problem with changing toIValue to call inc_ref to keep the underlying object alive for the duration of the function call, or even just use py::object here? This would make it a lot easier for callers and potentially avoid bugs/flakiness. Are we concerned about the performance implications?
There was a problem hiding this comment.
Don't we care that the value of the py::object is persisted for the duration of markComplete(), not just jit::toIValue()?
There was a problem hiding this comment.
If below is the code that runs when passing a py::object to py::handle, it actually inc the refcnt?
struct pyobject_caster {
template <typename T = type, enable_if_t<std::is_same<T, handle>::value, int> = 0>
bool load(handle src, bool /* convert */) { value = src; return static_cast<bool>(value); }
template <typename T = type, enable_if_t<std::is_base_of<object, T>::value, int> = 0>
bool load(handle src, bool /* convert */) {
if (!isinstance<type>(src))
return false;
value = reinterpret_borrow<type>(src);
return true;
}
static handle cast(const handle &src, return_value_policy /* policy */, handle /* parent */) {
return src.inc_ref();
}
PYBIND11_TYPE_CASTER(type, handle_type_name<type>::name);
};There was a problem hiding this comment.
Don't we care that the value of the py::object is persisted for the duration of markComplete(), not just jit::toIValue()?
jit::toIValue should have created a copy of the py::object in the cast below. So as long as the IValue is alive, the py::object would also be alive I think?
pytorch/torch/csrc/jit/python/python_ivalue.h
Lines 17 to 21 in ee1ddce
There was a problem hiding this comment.
Might be a naive question, but what's the problem with changing toIValue to call inc_ref to keep the underlying object alive for the duration of the function call, or even just use py::object here? This would make it a lot easier for callers and potentially avoid bugs/flakiness. Are we concerned about the performance implications?
Not sure what are the reason for using py::handle here. @wanchaol is there any specific reason for this?
There was a problem hiding this comment.
I don't remember if there's a specific reason to keep a py::handle, merely to avoid redundant inc_refs as arguments are suppose to be alive when you pass to a function.
cc @zdevito who might have more context on this.
There was a problem hiding this comment.
I am not sure if my understanding is correct here. Is it true that if we pass a
py::objecttotoIValue, we need to make sure that thispy::objectis alive untiltoIValuereturns, because py::handle here does not inc the refcnt?
Explanation for why py::object is alive until toIValue returns, so this change was not necessary.
Say we have
py::object o;
py::handle aa(o); // copy constructor
py::handle aaa = o; // copy constructor, as it's also initialization.The constructor being used here is a compiler-synthesized default copy-constructor, py::handle(const py::handle&).
py::object is statically casted to const py::handle& type as it's a subtype.
Also the lifetime of the RValue(temporary) py::object returned by toPyPObject(..) is prolonged by binding to the const reference (const py::handle&).
Similar for,
py::handle bb; // The empty constructor of py::handle is called.
bb = o; // copy assign operatorThe constructor being used here is a compiler-synthesized default copy-assign-operator, py::handle::operator=(const py::handle&).
Differential Revision: [D21530282](https://our.internmc.facebook.com/intern/diff/D21530282) [ghstack-poisoned]
Differential Revision: [D21530282](https://our.internmc.facebook.com/intern/diff/D21530282) [ghstack-poisoned]
| @@ -109,7 +109,10 @@ PyRRef::PyRRef(const py::object& value, const py::object& type_hint) | |||
| TypePtr elem_type = tryInferTypeWithTypeHint(value, type_hint); | |||
| auto rref = RRefContext::getInstance().createOwnerRRef(elem_type); | |||
| py::object copy(value); // increases refcount | |||
There was a problem hiding this comment.
we already incref in the above by creating a copy
There was a problem hiding this comment.
Hey @wanchaol Do you know when we do std::move in the original code below, what happens to the reference cnt on the py::object?
And why do we need the copy here? The jit::toIValue below should have already inc the refcnt, no?
There was a problem hiding this comment.
When you move copy into toIValue, toIValue may take ownership of copy, and you would no longer have a lifetime guarantee through value. However, because you are calling PyRRef with a const reference to value, the general assumption is that this means that value will stay live for the entirety of the function.
There was a problem hiding this comment.
This is what I read from "C++ Primer (5th edition)", 13.6.2.
Compiler only synthesizes move constructor for a class that has all its non-static members movable (have move constructor).
Since py::handle has a non-movable member, PyObject *m_ptr (https://github.com/pybind/pybind11/blob/master/include/pybind11/pytypes.h#L217), py::handle does not have a synthesized move constructor.
When a class does not have a move constructor defined, but an RValue is passed to its constructor, its copy constructor is called instead.
So these two calls are equivalent. The case converges to the one mentioned above.
IValue ivalue = jit::toIValue(copy, elem_type);
IValue ivalue = jit::toIValue(std::move(copy), elem_type);| // keep the py::objec alive until this function returns. | ||
| // | ||
| // Note [jit::toIValue barrow py::object refcnt] | ||
| // ~~~~~~~~~~~~~~~~~~~~~~~~~~ |
There was a problem hiding this comment.
Yes this is correct, because toIValue mostly used as function argument conversion, and it is almost certain that the caller manages the lifetime of the py::handle passed in.
| // passes a py:object as the first argument, this will be copy by value and the | ||
| // py::handle is only live as long as py::object is, and the py::handle is not | ||
| // reference counted. Hence, the call site needs to keep py::object alive until | ||
| // a new copy is created by jit::toIValue(). |
There was a problem hiding this comment.
Could this just say "keep py::object alive until toIValue()" returns? The caller should not have to be responsible for knowing the internal details, right?
There was a problem hiding this comment.
Yes, let me modify that.
| py::object copy(value); // increases refcount | ||
| IValue ivalue = jit::toIValue(std::move(copy), elem_type); | ||
| // Keep copy alive until jit::toIValue returns. | ||
| // See Note [jit::toIValue barrow py::object refcnt] at jit::toIValue |
|
Wondering if this might fix a "Fatal Python error: GC object already tracked" (P130801732) error that I'm seeing on the master with prod-preset feed run? (I never saw it before Friday, but saw it happen top of tree right after Shihao's change on Fri night, though could be unrelated) |
|
I had a brief 1:1 chat with @mrshenli and I wanted to remind everyone of some important things to remember when doing multithreading and Python GIL stuff:
|
Differential Revision: [D21530282](https://our.internmc.facebook.com/intern/diff/D21530282) [ghstack-poisoned]
| namespace { | ||
|
|
||
| py::object toPyObj(const Message& message) { | ||
| py::gil_scoped_release release; |
There was a problem hiding this comment.
It's hard to understand why it is released here.
I guess this is for the call to PythonRpcHandler::getInstance()?
How about writing it as this instead,
futureResponseMessage->addCallback(
[jitFuture](const FutureMessage& futureResponseMessage) {
if (futureResponseMessage.hasError()) {
jitFuture->setError(futureResponseMessage.error()->what());
} else {
// Don't need to acquire GIL here, as toPyObj acquires GIL
// when creating the py::object
py::obect py_obj = toPyObj(futureResponseMessage.constValue());
jitFuture->markCompleted(jit::toIValue(py_obj, PyObjectType::get()));
// Need GIL to destruct the py::object returned by toPyObj
py::gil_scoped_acquire acquire;
py_obj.dec_ref();
py_obj.ptr() = nullptr;
}
});It's clearer about the requirements that not only creating py::object requires GIL, but also destructing py::object does.
There was a problem hiding this comment.
Yes, it was for PythonRpcHandler. I updated the PR to remove toPyObj and added a toIValue, as I think former is no longer needed.
| PyObjectType::get())); | ||
| IValue value; | ||
| { | ||
| // Need GIL to destruct the py::object returned by toPyObj |
There was a problem hiding this comment.
I see.
Before #37519,
the code was like this,
pytorch/torch/csrc/distributed/rpc/init.cpp
Line 340 in 5809288
The py::object returned by toPyObj(..) was passed to Python directly. pybind11 takes care of its destruction. We took care of acquiring GIL when creating it.
While after the PR, this toPyObj(..) returned py::object is passed to toIValue(..) as const py::handle&, from which PyObjectHolder creates a new py::object copy.
With the destruction of the new copy taken care of by,
Now we have to take care of destruction of the old copy.
There was a problem hiding this comment.
Yes, exactly!
There might be other places that need some change, e.g., UnpickledPythonCall holds a py::object, but relies on caller to call the movePythonUdf. Will fix this on top of this PR.
Differential Revision: [D21530282](https://our.internmc.facebook.com/intern/diff/D21530282) [ghstack-poisoned]
| @@ -32,17 +32,25 @@ py::object toPyObj(const Message& message) { | |||
| Stack stack; | |||
| stack.push_back(ret.value()); | |||
| { | |||
There was a problem hiding this comment.
We don't need this bracket?
| auto& resp = static_cast<PythonResp&>(*response); | ||
| auto& pythonRpcHandler = PythonRpcHandler::getInstance(); | ||
| return pythonRpcHandler.deserialize(resp.serializedPyObj()); | ||
| { |
| // Need GIL to destruct the py::object returned by deserialize() | ||
| py::gil_scoped_acquire acquire; | ||
| return jit::toIValue( | ||
| pythonRpcHandler.deserialize(resp.serializedPyObj()), |
There was a problem hiding this comment.
It works because py::gil_scoped_acquire is constructed before the temp py::object is returned by pythonRpcHandler.deserialize(..), they are destructed in reverse order.
Differential Revision: [D21530282](https://our.internmc.facebook.com/intern/diff/D21530282) [ghstack-poisoned]
Summary: Pull Request resolved: pytorch#38348 Test Plan: Imported from OSS Differential Revision: D21530282 Pulled By: mrshenli fbshipit-source-id: a507402fbbd89618936ac6eecb4a223ab86236c6
Stack from ghstack:
Differential Revision: D21530282