Skip to content

[RFC] Reference counting bug when the object ref transits the same worker as a nested return and then arg#19910

Merged
ericl merged 13 commits intoray-project:masterfrom
ericl:fix-ref-bug4
Nov 3, 2021
Merged

[RFC] Reference counting bug when the object ref transits the same worker as a nested return and then arg#19910
ericl merged 13 commits intoray-project:masterfrom
ericl:fix-ref-bug4

Conversation

@ericl
Copy link
Copy Markdown
Contributor

@ericl ericl commented Oct 30, 2021

Why are these changes needed?

This fixes a reference counting bug in the following scenario.

  1. A task generates an object with a custom owner, e.g., return [ray.put(x, _owner=block_owner)]. We record in the ref table for x that the caller owns the object that x is stored in (stored_in=[caller]).
  2. A subsequent task is scheduled onto the same worker, which takes x as an argument. After the task finishes, we call PopAndClearLocalBorrowers, which erases the stored_in map of the reference table and sends it to the caller. But the caller is not the owner of x--- the real owner of x has yet to send the WaitForRefRemoved RPC.
  3. The WaitForRefRemoved RPC from the block_owner arrives at the same worker. There are no local refs to x any more, and the stored_in map is empty. The owner thinks x can be deleted, however the caller still has a reference to x---> raising ReferenceCountingAssertionError.

The proposed fix here is to not clear stored_in in PopAndClearLocalBorrowers().

Related issue number

This fixes test_dataset_pipeline.py::pipeline_actors and test_dataset.py::test_callable_classes (both mixing tasks/actors returning/consuming the same object ref) in #19907.

// Don't clear stored_in values, which may be from previous tasks that
// created this same object id.
RAY_CHECK(GetAndClearLocalBorrowersInternal(borrowed_id, &borrowed_refs,
/*clear_stored=*/false))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix.

// This should be the first time that we have stored this object ID
// inside this return ID.
RAY_CHECK(inserted);
RAY_UNUSED(inner_it->second.stored_in_objects.emplace(object_id, owner_address));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked into why this check removal was needed.

@ericl ericl changed the title [RFC] Reference counting bug when the object ref transits the same worker twice [RFC] Reference counting bug when the object ref transits the same worker twice as a nested return and then arg Oct 30, 2021
@ericl ericl changed the title [RFC] Reference counting bug when the object ref transits the same worker twice as a nested return and then arg [RFC] Reference counting bug when the object ref transits the same worker as a nested return and then arg Oct 30, 2021
@stephanie-wang
Copy link
Copy Markdown
Contributor

I'm a bit worried this will cause a leak in other cases where we actually should clear the stored_in_objects field. When exactly would it get cleared in non-ownership transfer cases?

What if we send the caller's address and outer ID to the new owner during the AssignObjectOwner RPC? Then the new owner should send WaitForRefRemoved to both the previous owner and the task caller, and we wouldn't have a race condition.

@ericl
Copy link
Copy Markdown
Contributor Author

ericl commented Nov 1, 2021

I'm a bit worried this will cause a leak in other cases where we actually should clear the stored_in_objects field. When exactly would it get cleared in non-ownership transfer cases?

Hmm, could we address this by only clearing certain entries from stored_in_objects (e.g., filtering on where the owner == the owner of the current task?)

What if we send the caller's address and outer ID to the new owner during the AssignObjectOwner RPC? Then the new owner should send WaitForRefRemoved to both the previous owner and the task caller, and we wouldn't have a race condition.

That would race though, since the outer owner isn't aware of the object until the task finishes, unless we add logic to fix that up in this case. I also think it's fundamentally not addressing the issue of store_in being incorrectly cleared and returned to the wrong process.

@stephanie-wang
Copy link
Copy Markdown
Contributor

I'm a bit worried this will cause a leak in other cases where we actually should clear the stored_in_objects field. When exactly would it get cleared in non-ownership transfer cases?

Hmm, could we address this by only clearing certain entries from stored_in_objects (e.g., filtering on where the owner == the owner of the current task?)

That sounds a bit complicated... I think that case happens normally too, not just when there is ownership change.

What if we send the caller's address and outer ID to the new owner during the AssignObjectOwner RPC? Then the new owner should send WaitForRefRemoved to both the previous owner and the task caller, and we wouldn't have a race condition.

That would race though, since the outer owner isn't aware of the object until the task finishes, unless we add logic to fix that up in this case. I also think it's fundamentally not addressing the issue of store_in being incorrectly cleared and returned to the wrong process.

I don't think that's true, the new owner tells the outer object owner that it has a nested ObjectID during the WaitForRefRemoved RPC (that's why we have the contained_in_id arg).

// foreign owner from learning about the parent task borrowing this value.
if (!it->second.foreign_owner_already_monitoring) {
it->second.stored_in_objects.clear();
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix.

@ericl
Copy link
Copy Markdown
Contributor Author

ericl commented Nov 2, 2021

Per offline discussion, updated to track when an object is created with a foreign owner process explicitly, and skip clearing in that case. This is a light-weight version of tracking what metadata was added by the task specifically; that might be a needed refactoring but would be a much larger change.

Open to suggestions on how to write a unit tests here; otherwise this is covered by test_dataset.py::test_callable_classes which will immediately fail without this PR.

@stephanie-wang
Copy link
Copy Markdown
Contributor

Per offline discussion, updated to track when an object is created with a foreign owner process explicitly, and skip clearing in that case. This is a light-weight version of tracking what metadata was added by the task specifically; that might be a needed refactoring but would be a much larger change.

Open to suggestions on how to write a unit tests here; otherwise this is covered by test_dataset.py::test_callable_classes which will immediately fail without this PR.

It should be possible to write this with a unit test. Let me know if you need help with it. Here's an example that might be useful to copy.

owner->rc_.RemoveLocalReference(return_id2, nullptr);
ASSERT_FALSE(owner->rc_.HasReference(inner_id));
ASSERT_FALSE(foreign_owner->rc_.HasReference(inner_id));
ASSERT_FALSE(caller->rc_.HasReference(inner_id));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephanie-wang need a bit a bit of help figuring out why this last assert is failing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this line, but the line above should fail because the foreign_owner has to send another WaitForRefRemoved RPC to caller (can trigger the handler with caller->FlushBorrowerCallbacks()).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out this was a real reference leak (had to change the logic to not return the borrower reference in the foreign owner case).

Copy link
Copy Markdown
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good! I had some questions about the test case.

owner->rc_.RemoveLocalReference(return_id2, nullptr);
ASSERT_FALSE(owner->rc_.HasReference(inner_id));
ASSERT_FALSE(foreign_owner->rc_.HasReference(inner_id));
ASSERT_FALSE(caller->rc_.HasReference(inner_id));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this line, but the line above should fail because the foreign_owner has to send another WaitForRefRemoved RPC to caller (can trigger the handler with caller->FlushBorrowerCallbacks()).

@stephanie-wang stephanie-wang added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 2, 2021
Copy link
Copy Markdown
Contributor Author

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks for the tips on the test!

owner->rc_.RemoveLocalReference(return_id2, nullptr);
ASSERT_FALSE(owner->rc_.HasReference(inner_id));
ASSERT_FALSE(foreign_owner->rc_.HasReference(inner_id));
ASSERT_FALSE(caller->rc_.HasReference(inner_id));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out this was a real reference leak (had to change the logic to not return the borrower reference in the foreign owner case).

@ericl ericl removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 3, 2021
it->second.borrowers.clear();
it->second.stored_in_objects.clear();
if (for_ref_removed || !it->second.foreign_owner_already_monitoring) {
borrowed_refs->emplace(object_id, it->second);
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to also move this into the conditional, to avoid a reference leak.

Copy link
Copy Markdown
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! It's great to have more unit testing on this codepath.

@ericl ericl merged commit 28d4cfb into ray-project:master Nov 3, 2021
stephanie-wang added a commit that referenced this pull request Feb 4, 2022
…ct from going out of scope (#21719)

When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.
stephanie-wang added a commit that referenced this pull request Feb 8, 2022
…ct from going out of scope (#22120)

When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in #19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in #19910, but generalized to all initial ObjectRefs.

This is a re-merge for #21719 with a fix for removing the owned object ref if creation fails.
simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022
…ct from going out of scope (ray-project#22120)

When a Ray program first creates an ObjectRef (via ray.put or task call), we add it with a ref count of 0 in the C++ backend because the language frontend will increment the initial local ref once we return the allocated ObjectID, then delete the local ref once the ObjectRef goes out of scope. Thus, there is a brief window where the object ref will appear to be out of scope.

This can cause problems with async protocols that check whether the object is in scope or not, such as the previous bug fixed in ray-project#19910. Now that we plan to enable lineage reconstruction to automatically recover lost objects, this race condition can also be problematic because we use the ref count to decide whether an object needs to be recovered or not.

This PR avoids these race conditions by incrementing the local ref count in the C++ backend when executing ray.put() and task calls. The frontend is then responsible for skipping the initial local ref increment when creating the ObjectRef. This is the same fix used in ray-project#19910, but generalized to all initial ObjectRefs.

This is a re-merge for ray-project#21719 with a fix for removing the owned object ref if creation fails.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants