[core] Make WaitForActorRefDeleted RPC Fault Tolerant#57116
Conversation
Signed-off-by: joshlee <joshlee@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request makes the WaitForActorRefDeleted RPC fault-tolerant by enabling retries. This is a good improvement for the robustness of actor lifecycle management. The changes include updating the RPC client to be retryable, modifying the WaitForActorRefDeleted method signature to use move semantics for better performance, and adding comprehensive tests. The new C++ tests for idempotency on the server side are well-structured. I've suggested a small improvement to the new Python test to cover response failures in addition to request failures, and a minor style improvement in the C++ tests to mark unused lambda parameters for better readability. Overall, this is a solid contribution to Ray's fault tolerance.
| @pytest.mark.parametrize("deterministic_failure", ["request"]) | ||
| def test_wait_for_actor_ref_deleted_rpc_retry_and_idempotency( | ||
| monkeypatch, shutdown_only, deterministic_failure | ||
| ): | ||
| """Test that WaitForActorRefDeleted RPC retries work correctly. | ||
| Verify that the RPC is idempotent when network failures occur. | ||
| The GCS actor manager will trigger this RPC during actor initialization | ||
| to monitor when the actor handles have gone out of scope and the actor should be destroyed. | ||
| """ | ||
|
|
||
| monkeypatch.setenv( | ||
| "RAY_testing_rpc_failure", | ||
| "CoreWorkerService.grpc_client.WaitForActorRefDeleted=1:100:0", | ||
| ) |
There was a problem hiding this comment.
The test test_wait_for_actor_ref_deleted_rpc_retry_and_idempotency is only parameterized to test request failures. For better test coverage, it should also be parameterized to test response failures, similar to test_get_object_status_rpc_retry_and_idempotency.
Additionally, the deterministic_failure parameter is currently unused in the test body.
| @pytest.mark.parametrize("deterministic_failure", ["request"]) | |
| def test_wait_for_actor_ref_deleted_rpc_retry_and_idempotency( | |
| monkeypatch, shutdown_only, deterministic_failure | |
| ): | |
| """Test that WaitForActorRefDeleted RPC retries work correctly. | |
| Verify that the RPC is idempotent when network failures occur. | |
| The GCS actor manager will trigger this RPC during actor initialization | |
| to monitor when the actor handles have gone out of scope and the actor should be destroyed. | |
| """ | |
| monkeypatch.setenv( | |
| "RAY_testing_rpc_failure", | |
| "CoreWorkerService.grpc_client.WaitForActorRefDeleted=1:100:0", | |
| ) | |
| @pytest.mark.parametrize("deterministic_failure", ["request", "response"]) | |
| def test_wait_for_actor_ref_deleted_rpc_retry_and_idempotency( | |
| monkeypatch, shutdown_only, deterministic_failure | |
| ): | |
| """Test that WaitForActorRefDeleted RPC retries work correctly. | |
| Verify that the RPC is idempotent when network failures occur. | |
| The GCS actor manager will trigger this RPC during actor initialization | |
| to monitor when the actor handles have gone out of scope and the actor should be destroyed. | |
| """ | |
| monkeypatch.setenv( | |
| "RAY_testing_rpc_failure", | |
| "CoreWorkerService.grpc_client.WaitForActorRefDeleted=1:" | |
| + ("100:0" if deterministic_failure == "request" else "0:100"), | |
| ) |
| [&callback_count](Status s, | ||
| std::function<void()> success, | ||
| std::function<void()> failure) { callback_count++; }); |
There was a problem hiding this comment.
The lambda parameters s, success, and failure are unused. It's good practice to either omit their names or mark them as unused to improve readability and signal intent.
| [&callback_count](Status s, | |
| std::function<void()> success, | |
| std::function<void()> failure) { callback_count++; }); | |
| [&callback_count](Status /*s*/, | |
| std::function<void()> /*success*/, | |
| std::function<void()> /*failure*/) { callback_count++; }); |
| [&callback_count](Status s, | ||
| std::function<void()> success, | ||
| std::function<void()> failure) { callback_count++; }); |
There was a problem hiding this comment.
The lambda parameters s, success, and failure are unused. It's good practice to either omit their names or mark them as unused to improve readability and signal intent.
| [&callback_count](Status s, | |
| std::function<void()> success, | |
| std::function<void()> failure) { callback_count++; }); | |
| [&callback_count](Status /*s*/, | |
| std::function<void()> /*success*/, | |
| std::function<void()> /*failure*/) { callback_count++; }); |
| rpc::WaitForActorRefDeletedReply reply2; | ||
|
|
||
| // For requests containing the same ActorID, the send_reply_callback is overwritten | ||
| // and only the last callback is triggered. |
There was a problem hiding this comment.
should assert false in the first callback and assert true in the second callback
There was a problem hiding this comment.
Put an ASSERT_TRUE(false) in the first callback since I wanted to guarantee it wouldn't be called.
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
…-for-actor-ref-deleted-fault-tolerant
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
| assert final_result == [0, 2, 4, 6, 8] | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("deterministic_failure", ["request"]) |
There was a problem hiding this comment.
I think it's a bit tricky to test response because there's another RPC that the owner uses to notify the GCS that the actor goes out of scope: AsyncReportActorOutOfScope so even if the response is lost the GCS will still deallocate the actor.
Request is easy because if it's lost the callback immediately calls DestroyActor so we get an "actor creation cancelled" error
There was a problem hiding this comment.
can you put a comment there so the person after knows
|
|
||
| result = ray.get(actor.ping.remote()) | ||
| assert result == "pong" | ||
|
|
There was a problem hiding this comment.
we should do
del actor
// check to make sure actor is marked as dead with cause equals REF_DELETED
There was a problem hiding this comment.
AsyncReportActorOutOfScope will report this, note the prior comment
There was a problem hiding this comment.
I mean we should add some checks to make sure actor is marked as dead with cause equals REF_DELETED
Signed-off-by: joshlee <joshlee@anyscale.com>
| return false; | ||
| } | ||
| it->second.on_object_ref_delete = callback; | ||
| it->second.object_ref_deleted_callbacks.push_back(callback); |
There was a problem hiding this comment.
| it->second.object_ref_deleted_callbacks.push_back(callback); | |
| it->second.object_ref_deleted_callbacks.push_back(std::move(callback)); |
There was a problem hiding this comment.
Makes sense, done
| /// Returns true if the object was in the reference table and the callback was added | ||
| /// else false. | ||
| bool SetObjectRefDeletedCallback(const ObjectID &object_id, | ||
| bool AddObjectRefDeletedCallback(const ObjectID &object_id, |
There was a problem hiding this comment.
Ah shoot, thanks for catching
…-for-actor-ref-deleted-fault-tolerant
Signed-off-by: joshlee <joshlee@anyscale.com>
…-for-actor-ref-deleted-fault-tolerant
There was a problem hiding this comment.
Bug: Missing `const` Qualifier in Callback Parameter
The AddObjectRefDeletedCallback override in ReferenceCounter has a callback parameter that's missing the const qualifier, which differs from the virtual method in ReferenceCounterInterface. This signature mismatch prevents proper overriding and will cause compilation errors.
src/ray/core_worker/reference_counter.h#L48-L51
ray/src/ray/core_worker/reference_counter.h
Lines 48 to 51 in ad1cdd2
| wait_request, | ||
| auto client = worker_client_pool_.GetOrConnect(it->second.address_); | ||
| client->WaitForActorRefDeleted( | ||
| std::move(wait_request), |
There was a problem hiding this comment.
yea and the client pool too, chill AI
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
|
|
||
| /// The callback for each request is stored in the reference counter due to retries | ||
| /// and message reordering where the callback of the retry of the request could be | ||
| /// overwritten by the callback of the initial request. |
There was a problem hiding this comment.
Bug: Double Move of respond Causes Undefined Behavior
The respond callback is moved into the AsyncWaitForActorRegisterFinish lambda capture. If actor registration succeeds, the respond object, already in a moved-from state, is moved a second time into WaitForActorRefDeleted, causing undefined behavior.
There was a problem hiding this comment.
it got the problem wrong, the problem is actually just that the second move doesn't really move because function captures are const by default
dayshah
left a comment
There was a problem hiding this comment.
also update the Failure line in the .proto file, not sure if we've missed that for more rpc's...
just a nit
| } | ||
| if (it->second.on_object_ref_delete) { | ||
| it->second.on_object_ref_delete(it->first); | ||
| if (it->second.object_ref_deleted_callbacks.size() > 0) { |
There was a problem hiding this comment.
why do you need this if, it's the same without it
There was a problem hiding this comment.
oh bleh good catch thanks
| wait_request, | ||
| auto client = worker_client_pool_.GetOrConnect(it->second.address_); | ||
| client->WaitForActorRefDeleted( | ||
| std::move(wait_request), |
UHHH yea it's a little outdated lol, I went through the list on the doc and updated all the missing ones |
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Why are these changes needed?
Making WaitForActorRefDeleted RPC Fault Tolerant. Added c++ unit tests to verify idempotency of HandleWaitForActorRefDeleted, and added a python integration test.
Related issue number
Closes #53797
Checks
git commit -s) in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.Note
Cursor Bugbot is generating a summary for commit e3234ca. Configure here.