[RFC] Reference counting bug when the object ref transits the same worker as a nested return and then arg by ericl · Pull Request #19910 · ray-project/ray

ericl · 2021-10-30T06:05:48Z

Why are these changes needed?

This fixes a reference counting bug in the following scenario.

1. A task generates an object with a custom owner, e.g., return [ray.put(x, _owner=block_owner)]. We record in the ref table for x that the caller owns the object that x is stored in (stored_in=[caller]).
2. A subsequent task that takes x as an argument is scheduled onto the same worker. After the task finishes, we call PopAndClearLocalBorrowers, which erases the stored_in map of the reference table and sends it to the caller. But the caller is not the owner of x; the real owner of x has yet to send the WaitForRefRemoved RPC.
3. The WaitForRefRemoved RPC from the block_owner arrives at the same worker. There are no local refs to x anymore, and the stored_in map is empty. The owner concludes that x can be deleted even though the caller still holds a reference to x, which raises ReferenceCountingAssertionError.
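The three steps above can be replayed in a small, self-contained model. This is only a sketch in plain Python, not Ray's C++ implementation: `ToyWorker`, `Reference`, and the method names are hypothetical stand-ins that loosely mirror the names used in the description.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """Toy per-object entry in a worker's reference table (hypothetical)."""
    local_refs: int = 0
    # Maps an outer (return) object ID to the address of that object's owner.
    stored_in: dict = field(default_factory=dict)


class ToyWorker:
    """Hypothetical stand-in for the reference counter of one worker."""

    def __init__(self):
        self.refs = {}

    def store_in_return(self, inner_id, outer_id, outer_owner):
        # Step 1: the task returns [ray.put(x, _owner=block_owner)], so x is
        # recorded as stored inside a return object owned by the caller.
        ref = self.refs.setdefault(inner_id, Reference())
        ref.stored_in[outer_id] = outer_owner

    def pop_and_clear_local_borrowers(self, inner_id):
        # Step 2 (buggy behavior): a later task that took x as an argument has
        # finished and released its local ref. The entry is reported to the
        # caller and stored_in is wiped, although the real owner of x has not
        # asked about it yet.
        ref = self.refs[inner_id]
        reported = dict(ref.stored_in)
        ref.stored_in.clear()
        return reported

    def handle_wait_for_ref_removed(self, inner_id):
        # Step 3: the owner asks whether this worker still knows of any use.
        ref = self.refs[inner_id]
        return ref.local_refs > 0 or bool(ref.stored_in)


worker = ToyWorker()
worker.store_in_return("x", "return_1", "caller")
reported = worker.pop_and_clear_local_borrowers("x")
still_in_use = worker.handle_wait_for_ref_removed("x")
```

In this model `still_in_use` ends up false even though the caller still holds x inside `return_1`, which corresponds to the state that leads to ReferenceCountingAssertionError in the scenario above.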

The proposed fix here is to not clear stored_in in PopAndClearLocalBorrowers().

Related issue number

This fixes test_dataset_pipeline.py::pipeline_actors and test_dataset.py::test_callable_classes (both mixing tasks/actors returning/consuming the same object ref) in #19907.

ericl · 2021-10-30T06:06:10Z

src/ray/core_worker/reference_count.cc

+    // Don't clear stored_in values, which may be from previous tasks that
+    // created this same object id.
+    RAY_CHECK(GetAndClearLocalBorrowersInternal(borrowed_id, &borrowed_refs,
+                                                /*clear_stored=*/false))


This is the fix.

ericl · 2021-10-30T06:07:37Z

src/ray/core_worker/reference_count.cc

-        // This should be the first time that we have stored this object ID
-        // inside this return ID.
-        RAY_CHECK(inserted);
+        RAY_UNUSED(inner_it->second.stored_in_objects.emplace(object_id, owner_address));


Haven't looked into why this check removal was needed.

stephanie-wang · 2021-10-31T20:53:05Z

I'm a bit worried this will cause a leak in other cases where we actually should clear the stored_in_objects field. When exactly would it get cleared in non-ownership transfer cases?

What if we send the caller's address and outer ID to the new owner during the AssignObjectOwner RPC? Then the new owner should send WaitForRefRemoved to both the previous owner and the task caller, and we wouldn't have a race condition.

ericl · 2021-11-01T19:52:54Z

> I'm a bit worried this will cause a leak in other cases where we actually should clear the stored_in_objects field. When exactly would it get cleared in non-ownership transfer cases?

Hmm, could we address this by only clearing certain entries from stored_in_objects (e.g., filtering on where the owner == the owner of the current task?)

> What if we send the caller's address and outer ID to the new owner during the AssignObjectOwner RPC? Then the new owner should send WaitForRefRemoved to both the previous owner and the task caller, and we wouldn't have a race condition.

That would race though, since the outer owner isn't aware of the object until the task finishes, unless we add logic to fix that up in this case. I also think it fundamentally doesn't address the issue of stored_in being incorrectly cleared and returned to the wrong process.

stephanie-wang · 2021-11-01T20:23:24Z

> > I'm a bit worried this will cause a leak in other cases where we actually should clear the stored_in_objects field. When exactly would it get cleared in non-ownership transfer cases?

> Hmm, could we address this by only clearing certain entries from stored_in_objects (e.g., filtering on where the owner == the owner of the current task?)

That sounds a bit complicated... I think that case happens normally too, not just when there is ownership change.

> > What if we send the caller's address and outer ID to the new owner during the AssignObjectOwner RPC? Then the new owner should send WaitForRefRemoved to both the previous owner and the task caller, and we wouldn't have a race condition.

> That would race though, since the outer owner isn't aware of the object until the task finishes, unless we add logic to fix that up in this case. I also think it's fundamentally not addressing the issue of stored_in being incorrectly cleared and returned to the wrong process.

I don't think that's true, the new owner tells the outer object owner that it has a nested ObjectID during the WaitForRefRemoved RPC (that's why we have the contained_in_id arg).

ericl · 2021-11-02T01:41:39Z

src/ray/core_worker/reference_count.cc

+  // foreign owner from learning about the parent task borrowing this value.
+  if (!it->second.foreign_owner_already_monitoring) {
+    it->second.stored_in_objects.clear();
+  }


This is the fix.

ericl · 2021-11-02T01:47:56Z

Per offline discussion, updated to track when an object is created with a foreign owner process explicitly, and skip clearing in that case. This is a light-weight version of tracking what metadata was added by the task specifically; that might be a needed refactoring but would be a much larger change.

Open to suggestions on how to write a unit test here; otherwise this is covered by test_dataset.py::test_callable_classes, which will immediately fail without this PR.
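A minimal sketch of the "skip clearing when a foreign owner is already monitoring" idea, written as a plain-Python toy rather than the actual C++ change. The field name `foreign_owner_already_monitoring` follows the diff above; the `Reference` class, the `table` dictionary, and the function signature are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """Toy reference-table entry (hypothetical)."""
    # Maps an outer (return) object ID to the address of that object's owner.
    stored_in_objects: dict = field(default_factory=dict)
    # Set when the object was created with an explicit foreign owner, e.g.
    # ray.put(x, _owner=block_owner).
    foreign_owner_already_monitoring: bool = False


def pop_and_clear_local_borrowers(table, object_id):
    """Toy version of the clearing step, guarded by the foreign-owner flag."""
    ref = table[object_id]
    if not ref.foreign_owner_already_monitoring:
        # Ordinary case: the metadata has been handed to the caller, so the
        # local copy can be dropped.
        ref.stored_in_objects.clear()
    return ref


table = {
    # x was created with a foreign owner, so its stored_in_objects must survive.
    "x": Reference({"return_1": "caller"}, foreign_owner_already_monitoring=True),
    # y is an ordinary object and is cleared as before.
    "y": Reference({"return_2": "caller"}),
}
pop_and_clear_local_borrowers(table, "x")
pop_and_clear_local_borrowers(table, "y")
```

The flag is a coarse substitute for tracking which metadata a given task added: it keeps the entry for x around until the foreign owner's WaitForRefRemoved arrives, while leaving the ordinary path unchanged.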

stephanie-wang · 2021-11-02T02:02:36Z

> Per offline discussion, updated to track when an object is created with a foreign owner process explicitly, and skip clearing in that case. This is a light-weight version of tracking what metadata was added by the task specifically; that might be a needed refactoring but would be a much larger change.

> Open to suggestions on how to write a unit test here; otherwise this is covered by test_dataset.py::test_callable_classes, which will immediately fail without this PR.

It should be possible to write this with a unit test. Let me know if you need help with it. Here's an example that might be useful to copy.

ericl · 2021-11-02T05:01:15Z

src/ray/core_worker/reference_count_test.cc

+  owner->rc_.RemoveLocalReference(return_id2, nullptr);
+  ASSERT_FALSE(owner->rc_.HasReference(inner_id));
+  ASSERT_FALSE(foreign_owner->rc_.HasReference(inner_id));
+  ASSERT_FALSE(caller->rc_.HasReference(inner_id));


@stephanie-wang I need a bit of help figuring out why this last assert is failing.

stephanie-wang

The fix looks good! I had some questions about the test case.

stephanie-wang · 2021-11-02T21:44:43Z

src/ray/core_worker/reference_count_test.cc

+  owner->rc_.RemoveLocalReference(return_id2, nullptr);
+  ASSERT_FALSE(owner->rc_.HasReference(inner_id));
+  ASSERT_FALSE(foreign_owner->rc_.HasReference(inner_id));
+  ASSERT_FALSE(caller->rc_.HasReference(inner_id));


I'm not sure about this line, but the line above should fail because the foreign_owner has to send another WaitForRefRemoved RPC to caller (can trigger the handler with caller->FlushBorrowerCallbacks()).

ericl

Updated, thanks for the tips on the test!

ericl · 2021-11-03T02:00:27Z

src/ray/core_worker/reference_count_test.cc

+  owner->rc_.RemoveLocalReference(return_id2, nullptr);
+  ASSERT_FALSE(owner->rc_.HasReference(inner_id));
+  ASSERT_FALSE(foreign_owner->rc_.HasReference(inner_id));
+  ASSERT_FALSE(caller->rc_.HasReference(inner_id));


It turns out this was a real reference leak (had to change the logic to not return the borrower reference in the foreign owner case).

ericl · 2021-11-03T02:01:14Z

src/ray/core_worker/reference_count.cc

-  it->second.borrowers.clear();
-  it->second.stored_in_objects.clear();
+  if (for_ref_removed || !it->second.foreign_owner_already_monitoring) {
+    borrowed_refs->emplace(object_id, it->second);


Had to also move this into the conditional, to avoid a reference leak.

stephanie-wang

Nice! It's great to have more unit testing on this codepath.

…ct from going out of scope (#21719)

When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope. This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.
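The window described in this commit message can be illustrated with a toy counter. This is a hedged sketch in plain Python, not the Ray backend: `ToyRefCounter`, `create_object_ref`, and the `backend_increments` switch are hypothetical names used only to contrast the two orderings.

```python
class ToyRefCounter:
    """Hypothetical stand-in for the backend's local reference counts."""

    def __init__(self):
        self.local_refs = {}

    def add_owned_object(self, object_id, initial_local_ref):
        # Old behavior registers the object with a count of 0; the new
        # behavior takes the initial local ref at creation time.
        self.local_refs[object_id] = 1 if initial_local_ref else 0

    def add_local_reference(self, object_id):
        self.local_refs[object_id] += 1

    def in_scope(self, object_id):
        return self.local_refs.get(object_id, 0) > 0


def create_object_ref(counter, object_id, backend_increments):
    """Create an object and report what an async check would see mid-way."""
    counter.add_owned_object(object_id, initial_local_ref=backend_increments)
    # An async protocol that checks scope could run at this point, before the
    # frontend has had a chance to take its reference.
    seen_in_scope = counter.in_scope(object_id)
    if not backend_increments:
        # Old behavior: the frontend adds the initial local ref afterwards.
        counter.add_local_reference(object_id)
    return seen_in_scope


counter = ToyRefCounter()
old_seen = create_object_ref(counter, "a", backend_increments=False)
new_seen = create_object_ref(counter, "b", backend_increments=True)
```

Both objects end with a local ref count of 1; the difference is only what a check in the middle of creation observes, which is the window the commit message describes.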

…ct from going out of scope (#22120)

When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope. This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.

This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.

ericl added 3 commits October 29, 2021 22:37

fix it

85253ae

wip

565dbd1

wip

fbd7639

ericl assigned scv119 and stephanie-wang Oct 30, 2021

fix

17c8950

ericl assigned rkooo567 Oct 30, 2021

ericl changed the title ~~[RFC] Reference counting bug when the object ref transits the same worker twice~~ [RFC] Reference counting bug when the object ref transits the same worker twice as a nested return and then arg Oct 30, 2021

ericl changed the title ~~[RFC] Reference counting bug when the object ref transits the same worker twice as a nested return and then arg~~ [RFC] Reference counting bug when the object ref transits the same worker as a nested return and then arg Oct 30, 2021

ericl added 3 commits November 1, 2021 18:39

fix

f61869b

fix it

4de10bb

fix

ad7b747

unit test

7ed8862

ericl added 4 commits November 1, 2021 22:02

Merge remote-tracking branch 'upstream/master' into fix-ref-bug4

df63f06

update

2b86849

update

eda9a08

Merge remote-tracking branch 'upstream/master' into fix-ref-bug4

0c87dde

stephanie-wang requested changes Nov 2, 2021

stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 2, 2021

comments

5a36365

ericl force-pushed the fix-ref-bug4 branch from cda5388 to 5a36365 Compare November 3, 2021 01:59

ericl removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 3, 2021

stephanie-wang approved these changes Nov 3, 2021

ericl merged commit 28d4cfb into ray-project:master Nov 3, 2021

stephanie-wang mentioned this pull request Jan 24, 2022

[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope #21719

Merged

stephanie-wang mentioned this pull request Feb 4, 2022

[core] Increment ref count when creating an ObjectRef to prevent object from going out of scope #22120

Merged
