Skip to content

Use GIL to guard py::object's destruction#38348

Closed
mrshenli wants to merge 9 commits intogh/mrshenli/172/basefrom
gh/mrshenli/172/head
Closed

Use GIL to guard py::object's destruction#38348
mrshenli wants to merge 9 commits intogh/mrshenli/172/basefrom
gh/mrshenli/172/head

Conversation

@mrshenli
Copy link
Copy Markdown
Contributor

@mrshenli mrshenli commented May 12, 2020

Stack from ghstack:

Differential Revision: D21530282

@mrshenli mrshenli mentioned this pull request May 12, 2020
@facebook-github-bot facebook-github-bot added the oncall: jit Add this issue/PR to JIT oncall triage queue label May 12, 2020
@dr-ci
Copy link
Copy Markdown

dr-ci Bot commented May 12, 2020

💊 CI failures summary and remediations

As of commit c34e59c (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker.

See how this bot performed.

This comment has been revised 22 times.

Comment thread torch/csrc/jit/python/pybind_utils.h Outdated
// keep the py::objec alive until this function returns.
//
// Note [jit::toIValue barrow py::object refcnt]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ezyang, I am not sure if my understanding is correct here. Is it true that if we pass a py::object to toIValue, we need to make sure that this py::object is alive until toIValue returns, because py::handle here does not inc the refcnt?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is correct, because toIValue mostly used as function argument conversion, and it is almost certain that the caller manages the lifetime of the py::handle passed in.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a naive question, but what's the problem with changing toIValue to call inc_ref to keep the underlying object alive for the duration of the function call, or even just use py::object here? This would make it a lot easier for callers and potentially avoid bugs/flakiness. Are we concerned about the performance implications?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we care that the value of the py::object is persisted for the duration of markComplete(), not just jit::toIValue()?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If below is the code that runs when passing a py::object to py::handle, it actually inc the refcnt?

https://github.com/pybind/pybind11/blob/a86ac538f59db7af596ee192f42b9f39ec8039ed/include/pybind11/cast.h#L1623-L1625

struct pyobject_caster {
    template <typename T = type, enable_if_t<std::is_same<T, handle>::value, int> = 0>
    bool load(handle src, bool /* convert */) { value = src; return static_cast<bool>(value); }

    template <typename T = type, enable_if_t<std::is_base_of<object, T>::value, int> = 0>
    bool load(handle src, bool /* convert */) {
        if (!isinstance<type>(src))
            return false;
        value = reinterpret_borrow<type>(src);
        return true;
    }

    static handle cast(const handle &src, return_value_policy /* policy */, handle /* parent */) {
        return src.inc_ref();
    }
    PYBIND11_TYPE_CASTER(type, handle_type_name<type>::name);
};

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we care that the value of the py::object is persisted for the duration of markComplete(), not just jit::toIValue()?

jit::toIValue should have created a copy of the py::object in the cast below. So as long as the IValue is alive, the py::object would also be alive I think?

static c10::intrusive_ptr<PyObjectHolder> create(const py::handle& handle) {
py::gil_scoped_acquire ag;
return c10::make_intrusive<ConcretePyObjectHolder>(
handle.cast<py::object>());
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be a naive question, but what's the problem with changing toIValue to call inc_ref to keep the underlying object alive for the duration of the function call, or even just use py::object here? This would make it a lot easier for callers and potentially avoid bugs/flakiness. Are we concerned about the performance implications?

Not sure what are the reason for using py::handle here. @wanchaol is there any specific reason for this?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember if there's a specific reason to keep a py::handle, merely to avoid redundant inc_refs as arguments are suppose to be alive when you pass to a function.

cc @zdevito who might have more context on this.

Copy link
Copy Markdown
Contributor

@xush6528 xush6528 May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if my understanding is correct here. Is it true that if we pass a py::object to toIValue, we need to make sure that this py::object is alive until toIValue returns, because py::handle here does not inc the refcnt?

Explanation for why py::object is alive until toIValue returns, so this change was not necessary.

Say we have

py::object o;

py::handle aa(o);  // copy constructor
py::handle aaa = o;  // copy constructor, as it's also initialization.

The constructor being used here is a compiler-synthesized default copy-constructor, py::handle(const py::handle&).
py::object is statically casted to const py::handle& type as it's a subtype.
Also the lifetime of the RValue(temporary) py::object returned by toPyPObject(..) is prolonged by binding to the const reference (const py::handle&).

Similar for,

py::handle bb; // The empty constructor of py::handle is called.
bb = o;  // copy assign operator

The constructor being used here is a compiler-synthesized default copy-assign-operator, py::handle::operator=(const py::handle&).

mrshenli pushed a commit that referenced this pull request May 12, 2020
ghstack-source-id: e7a5406
Pull Request resolved: #38348
@mrshenli mrshenli requested review from ezyang and jjlilley and removed request for jjlilley May 12, 2020 19:59
@@ -109,7 +109,10 @@ PyRRef::PyRRef(const py::object& value, const py::object& type_hint)
TypePtr elem_type = tryInferTypeWithTypeHint(value, type_hint);
auto rref = RRefContext::getInstance().createOwnerRRef(elem_type);
py::object copy(value); // increases refcount
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we already incref in the above by creating a copy

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @wanchaol Do you know when we do std::move in the original code below, what happens to the reference cnt on the py::object?

And why do we need the copy here? The jit::toIValue below should have already inc the refcnt, no?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you move copy into toIValue, toIValue may take ownership of copy, and you would no longer have a lifetime guarantee through value. However, because you are calling PyRRef with a const reference to value, the general assumption is that this means that value will stay live for the entirety of the function.

Copy link
Copy Markdown
Contributor

@xush6528 xush6528 May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I read from "C++ Primer (5th edition)", 13.6.2.

Compiler only synthesizes move constructor for a class that has all its non-static members movable (have move constructor).
Since py::handle has a non-movable member, PyObject *m_ptr (https://github.com/pybind/pybind11/blob/master/include/pybind11/pytypes.h#L217), py::handle does not have a synthesized move constructor.

When a class does not have a move constructor defined, but an RValue is passed to its constructor, its copy constructor is called instead.

So these two calls are equivalent. The case converges to the one mentioned above.

IValue ivalue = jit::toIValue(copy, elem_type);
IValue ivalue = jit::toIValue(std::move(copy), elem_type);

Comment thread torch/csrc/jit/python/pybind_utils.h Outdated
// keep the py::objec alive until this function returns.
//
// Note [jit::toIValue barrow py::object refcnt]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is correct, because toIValue mostly used as function argument conversion, and it is almost certain that the caller manages the lifetime of the py::handle passed in.

Comment thread torch/csrc/jit/python/pybind_utils.h Outdated
// passes a py:object as the first argument, this will be copy by value and the
// py::handle is only live as long as py::object is, and the py::handle is not
// reference counted. Hence, the call site needs to keep py::object alive until
// a new copy is created by jit::toIValue().
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this just say "keep py::object alive until toIValue()" returns? The caller should not have to be responsible for knowing the internal details, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let me modify that.

Comment thread torch/csrc/distributed/rpc/py_rref.cpp Outdated
py::object copy(value); // increases refcount
IValue ivalue = jit::toIValue(std::move(copy), elem_type);
// Keep copy alive until jit::toIValue returns.
// See Note [jit::toIValue barrow py::object refcnt] at jit::toIValue
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

barrow -> borrow

@jjlilley
Copy link
Copy Markdown

Wondering if this might fix a "Fatal Python error: GC object already tracked" (P130801732) error that I'm seeing on the master with prod-preset feed run?

(I never saw it before Friday, but saw it happen top of tree right after Shihao's change on Fri night, though could be unrelated)

@ezyang
Copy link
Copy Markdown
Contributor

ezyang commented May 12, 2020

I had a brief 1:1 chat with @mrshenli and I wanted to remind everyone of some important things to remember when doing multithreading and Python GIL stuff:

  1. Do not put smart Python C++ wrapped objects in structs that do not require GIL protection. They will get destructed when those structs get freed, EVEN if you explicitly write a destructor (destructor does not replace the element-wise destruction of each field!) It is much safer to store a raw pointer.
    1a. Do not put smart Python C++ wrapped objects in a scope where you don't also have a GIL taken out. The object will get destructed at the end of the scope, and when that destruction happens you need to have the GIL.
  2. You must take out the GIL whenever you read/write Python objects. In single-threaded settings this is hard to get wrong, but in multithreading you have to be very careful.

mrshenli pushed a commit that referenced this pull request May 13, 2020
ghstack-source-id: 5dc7b0e
Pull Request resolved: #38348
namespace {

py::object toPyObj(const Message& message) {
py::gil_scoped_release release;
Copy link
Copy Markdown
Contributor

@xush6528 xush6528 May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to understand why it is released here.
I guess this is for the call to PythonRpcHandler::getInstance()?

How about writing it as this instead,

    futureResponseMessage->addCallback(
        [jitFuture](const FutureMessage& futureResponseMessage) {
          if (futureResponseMessage.hasError()) {
            jitFuture->setError(futureResponseMessage.error()->what());
          } else {
            // Don't need to acquire GIL here, as toPyObj acquires GIL	
            // when creating the py::object
            py::obect py_obj = toPyObj(futureResponseMessage.constValue());
            jitFuture->markCompleted(jit::toIValue(py_obj, PyObjectType::get()));
            // Need GIL to destruct the py::object returned by toPyObj
            py::gil_scoped_acquire acquire;
            py_obj.dec_ref();
            py_obj.ptr() = nullptr;
          }
        });

It's clearer about the requirements that not only creating py::object requires GIL, but also destructing py::object does.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was for PythonRpcHandler. I updated the PR to remove toPyObj and added a toIValue, as I think former is no longer needed.

Comment thread torch/csrc/distributed/rpc/python_functions.cpp Outdated
PyObjectType::get()));
IValue value;
{
// Need GIL to destruct the py::object returned by toPyObj
Copy link
Copy Markdown
Contributor

@xush6528 xush6528 May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

Before #37519,
the code was like this,

[&](FutureMessage& fut) { return toPyObj(fut.wait()); },

The py::object returned by toPyObj(..) was passed to Python directly. pybind11 takes care of its destruction. We took care of acquiring GIL when creating it.

While after the PR, this toPyObj(..) returned py::object is passed to toIValue(..) as const py::handle&, from which PyObjectHolder creates a new py::object copy.

handle.cast<py::object>());

With the destruction of the new copy taken care of by,

py_obj_ = py::none();

Now we have to take care of destruction of the old copy.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly!

There might be other places that need some change, e.g., UnpickledPythonCall holds a py::object, but relies on caller to call the movePythonUdf. Will fix this on top of this PR.

mrshenli pushed a commit that referenced this pull request May 13, 2020
ghstack-source-id: af40315
Pull Request resolved: #38348
@@ -32,17 +32,25 @@ py::object toPyObj(const Message& message) {
Stack stack;
stack.push_back(ret.value());
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this bracket?

auto& resp = static_cast<PythonResp&>(*response);
auto& pythonRpcHandler = PythonRpcHandler::getInstance();
return pythonRpcHandler.deserialize(resp.serializedPyObj());
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

// Need GIL to destruct the py::object returned by deserialize()
py::gil_scoped_acquire acquire;
return jit::toIValue(
pythonRpcHandler.deserialize(resp.serializedPyObj()),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works because py::gil_scoped_acquire is constructed before the temp py::object is returned by pythonRpcHandler.deserialize(..), they are destructed in reverse order.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@mrshenli merged this pull request in 756788e.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Merged oncall: jit Add this issue/PR to JIT oncall triage queue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants