[Core] Worker hangs when actor handles go out of scope before actor information is registered to GCS. #8679
Conversation
src/ray/core_worker/core_worker.cc (outdated)

```cpp
reference_counter_->AddLocalReference(actor_creation_return_id, CurrentCallSite());
direct_actor_submitter_->AddActorQueueIfNotExists(actor_id);
// ...
if (!actor_handle->IsPersistedToGCS()) {
```
We should do this after the insertion, in case we already have a handle to this actor that's been resolved.
That totally makes sense. I will fix it.
Moved to inside `if (inserted)`.
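The ordering fix discussed in this thread — checking `IsPersistedToGCS` only after the handle has actually been inserted, so an already-resolved existing handle isn't re-registered as pending — can be sketched roughly as follows. All names here are illustrative stand-ins, not the actual PR code:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the actor handle's persisted flag.
struct ActorHandle {
  bool persisted_to_gcs = false;
};

class ActorManager {
 public:
  // Returns true if this was the first handle registered for the actor.
  bool AddActorHandle(const std::string &actor_id, ActorHandle handle) {
    auto [it, inserted] = handles_.emplace(actor_id, handle);
    // Check GCS persistence only for newly inserted handles: if a handle
    // for this actor already exists, its location may already have been
    // resolved, and marking the actor pending again would be incorrect.
    if (inserted && !it->second.persisted_to_gcs) {
      pending_location_resolution_.insert(actor_id);
    }
    return inserted;
  }

  bool IsPendingLocation(const std::string &actor_id) const {
    return pending_location_resolution_.count(actor_id) > 0;
  }

 private:
  std::unordered_map<std::string, ActorHandle> handles_;
  std::unordered_set<std::string> pending_location_resolution_;
};
```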
src/ray/core_worker/core_worker.cc (outdated)

```cpp
[this, actor_id](Status status,
                 const boost::optional<gcs::WorkerFailureData> &result) {
  if (status.ok() && result) {
    direct_actor_submitter_->DisconnectActor(actor_id, true);
```
There are two race conditions here that I think we need to handle:
- Since GCS notifications are async, I think it's possible that the actor's location has just been registered, but then the owner dies concurrently, and we receive the lookup reply before the actor's location. I'm actually not sure if this can happen based on the GCS's current implementation, but I think it would be good in general not to depend on ordering between different GCS tables. So here, we should actually also do a lookup to the actor's entry to make sure it hasn't been added yet. The GCS service should also be checking whether the owner has already died when it receives an actor registration request so that it can drop the request, but you can leave that out for now (maybe add a TODO if you don't address in this PR).
- We need to make sure that the GCS persisted flag is still unset in the actor handle when we disconnect the actor. Unfortunately, the mutex internal to the actor handle won't be enough since the caller can't hold the mutex while calling DisconnectActor.
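A rough sketch of the double-check suggested above (all names are hypothetical stand-ins, not the actual PR code): on an owner-failure reply, look up the actor's entry in the GCS again, and re-verify the handle's persisted flag under a lock held across both the check and the state change, before disconnecting:

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <unordered_set>

// Simulated GCS actor table (stand-in for the real async GCS client).
class GcsStub {
 public:
  void RegisterActor(const std::string &actor_id) {
    std::lock_guard<std::mutex> lk(mu_);
    registered_.insert(actor_id);
  }
  bool LookupActor(const std::string &actor_id) {
    std::lock_guard<std::mutex> lk(mu_);
    return registered_.count(actor_id) > 0;
  }

 private:
  std::mutex mu_;
  std::unordered_set<std::string> registered_;
};

class ActorManager {
 public:
  explicit ActorManager(GcsStub *gcs) : gcs_(gcs) {}

  // Called when the GCS reports the actor's owner has failed.
  // Returns true if the actor was marked dead.
  bool OnOwnerDeath(const std::string &actor_id) {
    std::lock_guard<std::mutex> lk(mu_);
    // Race 1: the actor's registration may have landed concurrently, so
    // do not rely on ordering between GCS tables; re-check the actor table.
    if (gcs_->LookupActor(actor_id)) {
      return false;  // Actor is registered; normal fault handling applies.
    }
    // Race 2: the persisted flag must still be unset at the moment we
    // decide to disconnect; holding mu_ across the check and the state
    // change closes that window in this sketch.
    if (persisted_to_gcs_.count(actor_id) > 0) {
      return false;
    }
    dead_.insert(actor_id);  // Stand-in for DisconnectActor(actor_id, true).
    return true;
  }

  void MarkPersisted(const std::string &actor_id) {
    std::lock_guard<std::mutex> lk(mu_);
    persisted_to_gcs_.insert(actor_id);
  }

  bool IsDead(const std::string &actor_id) {
    std::lock_guard<std::mutex> lk(mu_);
    return dead_.count(actor_id) > 0;
  }

 private:
  GcsStub *gcs_;
  std::mutex mu_;
  std::unordered_set<std::string> persisted_to_gcs_;
  std::unordered_set<std::string> dead_;
};
```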
stephanie-wang left a comment:
Looks good, just some nits!
```diff
 }

-void ActorManager::ResolveActorsLocations() {
+void ActorManager::MarkPendingLocationActorsFailed() {
```
What does `!status.ok()` mean in this case? Just wondering, since I'm not sure what the semantics are for the GCS.
I believe this just means something went wrong within the GCS server (based on my reading of the code). I'm not sure it has consistent semantics.
Hi @rkooo567, IIRC, we previously discussed another approach for this problem. I couldn't find the previous doc. The protocol is:

I think this might be better, because workers don't need to periodically check the GCS, and there is no race condition.
Hi @rkooo567
It looks like the protocol you mentioned is similar to the previous one? #8045 Just note that this doesn't call the GCS for every actor handle. It calls the GCS only for actor handles that are

I highly doubt there are 5000 actor creation tasks pending because owners are waiting for dependencies, but if you guys are worried about it, optimizing this is not difficult. As @WangTaoTheTonic mentions, we can batch this call, or we can even just increase the period. @raulchen

So, in conclusion, I don't think one approach is absolutely better than the other. If it turns out this approach actually puts a lot of pressure on the GCS, we can just optimize the current approach, as this PR is already at a completed stage.
I changed the default checking interval to 10 seconds and made it configurable. Please let me know if this affects performance a lot; I will create a PR for optimization.
Just to add on to what @rkooo567 already said: the pressure on the GCS can easily be reduced by sending an RPC to the owner directly instead of checking the worker table in the GCS. Plus, with a 10s timeout, I have a hard time imagining a realistic workload where this would break. I'd prefer to merge this one as is, since this has been a longstanding issue for actor management stability. @raulchen, if you still believe this is a problem, please file a GitHub issue with a reproducible script that illustrates the performance slowdown.
I think the other protocol is better in theory; see my replies below.

> GCS actor management is already super vulnerable to race conditions; I doubt there will be no race condition when we add a new state to it.

This is not a real reason. We can't make it more vulnerable, because it's already vulnerable.

> This protocol won't work when the initial write to the GCS fails but actor handles are still passed. For example, if the owner sends a request to a GCS that is restarting and then dies right away, it won't work. We should make the initial call to Redis synchronous.

This won't happen, because the initial request is synchronous: the GCS will only reply to the client after writing to the DB. If the GCS dies before that, the client will retry.

> the pressure on the GCS can be easily reduced by sending an RPC to the owner directly instead of checking the worker table in the GCS

This only considers supervised actors; we should also consider non-supervised actors.
Dup with #8045
Why are these changes needed?
Overview
Once an actor handle is created, the actor information is persisted to the GCS only after the owner resolves its local dependencies and sends the actor creation task. If the owner exits (due to either node or worker failure) at this moment and the actor handle goes out of scope, the actor dies (except for detached actors), but there's no way for a caller to know the actor is actually dead (because its information is not in the GCS yet). If we call `ray.get` on this dead actor, the caller will hang, as it never learns that the actor is dead. This PR handles the issue.
Term

resolve location: all actor handles whose actor entries have not been reported from the GCS need to resolve their locations.

Protocol
- Each actor handle has a flag `is_persisted_to_gcs` that is set to false when the handle is created.
- Once the actor's entry is reported from the GCS, the handle sets `is_persisted_to_gcs` to true. These handles will be treated like normal actor handles.
- Handles whose locations are not yet resolved are tracked in `actors_pending_location_resolution_`.

Caveat of this approach
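The flag-plus-pending-set protocol described above can be sketched roughly as follows. The names are hypothetical stand-ins (the real implementation lives in the C++ core worker's `ActorManager`), and the dead-owner lookup is passed in directly instead of going through the GCS worker table:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Each handle starts with is_persisted_to_gcs == false and is tracked in
// actors_pending_location_resolution_ until the GCS reports its entry.
struct ActorHandle {
  bool is_persisted_to_gcs = false;
};

class ActorManager {
 public:
  void AddActorHandle(const std::string &actor_id) {
    handles_.emplace(actor_id, ActorHandle{});
    actors_pending_location_resolution_.insert(actor_id);
  }

  // Invoked when the GCS reports the actor's entry: from here on the
  // handle behaves like a normal actor handle.
  void OnActorRegisteredToGcs(const std::string &actor_id) {
    handles_[actor_id].is_persisted_to_gcs = true;
    actors_pending_location_resolution_.erase(actor_id);
  }

  // Periodic check (default interval 10s in this PR): a pending handle
  // whose owner is dead can never be registered, so mark it failed instead
  // of letting callers hang. `dead_owners_actors` stands in for the result
  // of the GCS worker-table lookup. Returns the actors marked failed.
  std::unordered_set<std::string> MarkPendingLocationActorsFailed(
      const std::unordered_set<std::string> &dead_owners_actors) {
    std::unordered_set<std::string> failed;
    for (auto it = actors_pending_location_resolution_.begin();
         it != actors_pending_location_resolution_.end();) {
      if (dead_owners_actors.count(*it) > 0) {
        failed.insert(*it);
        it = actors_pending_location_resolution_.erase(it);
      } else {
        ++it;
      }
    }
    return failed;
  }

  bool IsPending(const std::string &actor_id) const {
    return actors_pending_location_resolution_.count(actor_id) > 0;
  }

 private:
  std::unordered_map<std::string, ActorHandle> handles_;
  std::unordered_set<std::string> actors_pending_location_resolution_;
};
```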
Related issue number
Checks
I've run `scripts/format.sh` to lint the changes in this PR.