[Core] Transient network failure on RPC WaitForActorRefDeleted causes actor registration fail #53797
Description
What happened + What you expected to happen
When running Ray in our cluster, we observed a bug in which a transient network failure on the WaitForActorRefDeleted RPC caused actor registration to fail. The concrete call site of the RPC that encountered the transient error is in the attached call_site.log.
Here is the output error message.
E class_name: test_actor_api.<locals>.Foo
E actor_id: 9ba41b4ddd98b0554ef05f0204000000
E namespace: 345c7114-a740-49cd-9c7f-af89bb3dea24
E The actor is dead because all references to the actor were removed including lineage ref count.
E The actor never ran - it was cancelled before it started running.
By analyzing the concrete call site, we noticed that the callback of WaitForActorRefDeleted calls DestroyActor without checking the returned status. The root cause may be that the transient error made GCS destroy the actor without actually waiting for its references to be deleted, so the actor was cancelled before it started running.
ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc
Lines 1060 to 1066 in 7337f2a
    if (node_it != owners_.end() && node_it->second.count(owner_id)) {
      // Only destroy the actor if its owner is still alive. The actor may
      // have already been destroyed if the owner died.
      DestroyActor(actor_id,
                   GenActorRefDeletedCause(GetActor(actor_id)),
                   /*force_kill=*/true);
    }
The expected behavior is that the transient network failure is handled and the job executes properly.
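To illustrate the expected behavior, here is a minimal, self-contained Python sketch of a reply handler that checks the RPC status before destroying anything. It is a toy model, not Ray's implementation: `Status`, `ToyActorManager`, `wait_for_actor_ref_deleted`, `send_rpc`, `max_retries`, and `make_flaky_rpc` are all hypothetical names, and the retry-on-failure policy is an assumption about what a fix might look like.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Status:
    """Hypothetical stand-in for an RPC status (not Ray's ray::Status)."""
    ok: bool
    message: str = ""


@dataclass
class ToyActorManager:
    """Toy model of the GCS-side wait; all names are illustrative only."""
    max_retries: int = 3
    destroyed: List[str] = field(default_factory=list)

    def wait_for_actor_ref_deleted(
        self, actor_id: str, send_rpc: Callable[[str], Status]
    ) -> bool:
        """Destroy the actor only after a successful reply.

        A failed status is assumed to be transient and is retried, for a
        total of max_retries + 1 attempts.
        """
        for _ in range(self.max_retries + 1):
            status = send_rpc(actor_id)
            if status.ok:
                # The owner confirmed that all references are gone.
                self.destroyed.append(actor_id)
                return True
        # Retries exhausted: do not destroy the actor here. A real system
        # would need a separate policy (for example, owner-death detection).
        return False


def make_flaky_rpc(failures: int) -> Callable[[str], Status]:
    """Return a fake RPC that fails `failures` times, then succeeds."""
    remaining = {"n": failures}

    def send(actor_id: str) -> Status:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            return Status(ok=False, message="transient network failure")
        return Status(ok=True)

    return send


manager = ToyActorManager(max_retries=3)
survived_transient_errors = manager.wait_for_actor_ref_deleted(
    "actor-1", make_flaky_rpc(failures=2)
)
```

In this sketch, two injected failures are absorbed by the retries and the actor is destroyed only once a successful reply arrives. The snippet quoted above, by contrast, shows no status check before DestroyActor is called.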
Versions / Dependencies
Ray 3.0.0.dev, KubeRay 1.3.0
Reproduction script
Start a RayCluster using KubeRay. Then run the following script with the ray job submit SDK.
    import ray

    ray.init()

    @ray.remote
    class Foo:
        def __init__(self, val):
            self.x = val

        def get(self):
            return self.x

    x = 1
    f = Foo.remote(x)
    assert ray.get(f.get.remote()) == x

Transient network failures can be reproduced with a gRPC interceptor.
Issue Severity
Medium: It is a significant difficulty but I can work around it.