
[Core] Worker hangs when actor handles go out of scope before actor information is registered to GCS.#8679

Closed
rkooo567 wants to merge 67 commits intoray-project:masterfrom
rkooo567:cluster-hang-when-actor-not-registered-gcs

Conversation

@rkooo567
Contributor

@rkooo567 rkooo567 commented May 29, 2020

Why are these changes needed?

Overview

Once an actor handle is created, the actor's information is persisted to the GCS only after the owner resolves its local dependencies and sends the actor creation task. If the owner exits (due to node or worker failure) at this moment and the actor handle goes out of scope, the actor dies (except detached actors), but there's no way for a caller to know the actor is actually dead, because its information is not in the GCS yet. If we call ray.get on this dead actor, the caller will hang, since it never learns the actor is dead.

This PR handles the issue.

Term

resolve location: any actor handle whose actor entry has not yet been reported by the GCS still needs to resolve its location.

Protocol

  • Every actor handle has a new field, is_persisted_to_gcs, which is set to false when the handle is created.
  • Whenever the core worker receives a notification from the GCS about an actor's state, it means the actor info has been persisted, so is_persisted_to_gcs is set to true. Such handles are then treated like normal actor handles.
  • Actor IDs that haven't been resolved yet are added to actors_pending_location_resolution_.
  • Every 3 seconds, we check whether (1) the owner's worker has failed or (2) the owner's node has failed.
  • If either condition holds, we check the actor entry in the GCS. This step is needed to avoid a race condition: if the owner dies right after it creates an actor, and a core worker receives the failure notification before the actor-created notification, it would disconnect the actor even though the actor has been persisted to the GCS. This is especially problematic for detached actors, which can stay alive even after the owner's worker or node fails.
  • If it turns out the actors are not persisted to the GCS at this point, we just disconnect them.
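The protocol above can be sketched as a small simulation (Python, for illustration only; the actual implementation is C++ inside the core worker, and the stubbed GCS table and failed-owner set here are assumptions):

```python
class ActorHandle:
    def __init__(self, actor_id):
        self.actor_id = actor_id
        self.is_persisted_to_gcs = False  # set true once a GCS notification arrives
        self.disconnected = False


class ActorManager:
    def __init__(self, gcs_entries, failed_owners):
        self.gcs_entries = gcs_entries      # stub for the GCS actor table
        self.failed_owners = failed_owners  # stub for failed worker/node ids
        # actor_id -> (handle, owner_id), following the PR's naming
        self.actors_pending_location_resolution_ = {}

    def add_pending(self, handle, owner_id):
        self.actors_pending_location_resolution_[handle.actor_id] = (handle, owner_id)

    def on_gcs_actor_notification(self, actor_id):
        # A GCS notification means the actor info is persisted; treat the
        # handle like a normal actor handle from now on.
        entry = self.actors_pending_location_resolution_.pop(actor_id, None)
        if entry:
            entry[0].is_persisted_to_gcs = True

    def resolve_actors_locations(self):
        # Invoked periodically (every 3 seconds in the PR). Only when the
        # owner has failed do we consult the GCS, which avoids the race where
        # a failure notification beats the actor-created notification.
        for actor_id in list(self.actors_pending_location_resolution_):
            handle, owner_id = self.actors_pending_location_resolution_[actor_id]
            if owner_id not in self.failed_owners:
                continue
            if actor_id in self.gcs_entries:
                handle.is_persisted_to_gcs = True  # it was persisted after all
            else:
                handle.disconnected = True         # owner died before registration
            del self.actors_pending_location_resolution_[actor_id]
```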

Caveat of this approach

  • If drivers have many long-running dependencies for actor creation, this can increase load on the GCS: every actor handle that hasn't resolved its location polls the GCS twice every 3 seconds.
  • A couple of the race conditions are hard to test because they require very subtle timing. We should test this heavily with unit tests. There will be another PR that refactors all the actor handle logic into actor_manager to make this testable.
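To make the first caveat concrete, a back-of-the-envelope estimate (the two-polls-per-3-seconds rate is from the bullet above; the handle count is illustrative):

```python
def gcs_polling_qps(num_unresolved_handles, polls_per_check=2, period_s=3.0):
    """Approximate GCS queries per second generated by unresolved actor handles."""
    return num_unresolved_handles * polls_per_check / period_s

# 5000 simultaneously-unresolved handles at the 3s period:
print(gcs_polling_qps(5000))                  # ~3333 queries/second
# the same handles at the 10s period adopted later in the review:
print(gcs_polling_qps(5000, period_s=10.0))   # 1000 queries/second
```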

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins

Can one of the admins verify this patch?

@rkooo567 rkooo567 changed the title [WIP] Cluster hang when actor not registered gcs [WIP] Worker hangs when actor handles go out of scope before actor information is registered to GCS. May 30, 2020
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26538/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26571/

reference_counter_->AddLocalReference(actor_creation_return_id, CurrentCallSite());
direct_actor_submitter_->AddActorQueueIfNotExists(actor_id);

if (!actor_handle->IsPersistedToGCS()) {
Contributor

We should do this after the insertion, in case we already have a handle to this actor that's been resolved.

Contributor Author

That totally makes sense. I will fix it.

Contributor Author

@rkooo567 rkooo567 Jun 1, 2020

Moved to inside if (inserted)

[this, actor_id](Status status,
const boost::optional<gcs::WorkerFailureData> &result) {
if (status.ok() && result) {
direct_actor_submitter_->DisconnectActor(actor_id, true);
Contributor

There are two race conditions here that I think we need to handle:

  1. Since GCS notifications are async, I think it's possible that the actor's location has just been registered, but then the owner dies concurrently, and we receive the lookup reply before the actor's location. I'm actually not sure if this can happen based on the GCS's current implementation, but I think it would be good in general not to depend on ordering between different GCS tables. So here, we should actually also do a lookup to the actor's entry to make sure it hasn't been added yet. The GCS service should also be checking whether the owner has already died when it receives an actor registration request so that it can drop the request, but you can leave that out for now (maybe add a TODO if you don't address in this PR).
  2. We need to make sure that the GCS persisted flag is still unset in the actor handle when we disconnect the actor. Unfortunately, the mutex internal to the actor handle won't be enough since the caller can't hold the mutex while calling DisconnectActor.
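Race (2) can be illustrated with a small Python sketch (a hypothetical simulation, not the PR's C++ code): the persisted flag is re-checked and the state flipped under the lock, while the DisconnectActor-style callback runs outside it, since the handle's internal mutex can't be held across that call.

```python
import threading

class Handle:
    def __init__(self):
        self._lock = threading.Lock()
        self.is_persisted_to_gcs = False
        self.disconnected = False

    def mark_persisted(self):
        with self._lock:
            self.is_persisted_to_gcs = True

    def disconnect_if_not_persisted(self, disconnect_fn):
        # Re-check the flag and flip the state atomically under the lock ...
        with self._lock:
            should_disconnect = not self.is_persisted_to_gcs
            if should_disconnect:
                self.disconnected = True
        # ... but invoke the callback outside the lock.
        if should_disconnect:
            disconnect_fn()
        return should_disconnect
```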

Contributor Author

Addressed both!

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26585/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26595/

@rkooo567 rkooo567 changed the title [WIP] Worker hangs when actor handles go out of scope before actor information is registered to GCS. Worker hangs when actor handles go out of scope before actor information is registered to GCS. Jun 2, 2020
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26608/

@rkooo567 rkooo567 changed the title Worker hangs when actor handles go out of scope before actor information is registered to GCS. [Core] Worker hangs when actor handles go out of scope before actor information is registered to GCS. Jun 2, 2020
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26619/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27942/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27955/

@rkooo567 rkooo567 requested a review from stephanie-wang July 7, 2020 18:29
Contributor

@stephanie-wang stephanie-wang left a comment

Looks good, just some nits!

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27990/

Contributor

@stephanie-wang stephanie-wang left a comment

Nice work!

}

- void ActorManager::ResolveActorsLocations() {
+ void ActorManager::MarkPendingLocationActorsFailed() {
Contributor

What does !status.ok() mean in this case? Just wondering since I'm not sure what the semantics are for the GCS.

Contributor Author

I believe this just means something went wrong within the GCS server (from reading the code). Not sure if it has consistent semantics.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28267/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28276/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28314/

@raulchen
Contributor

Hi @rkooo567, IIRC, previously we discussed another approach for this problem. I couldn't find the previous doc. The protocol is:

  • When Actor.remote is called, the worker synchronously sends a request to GCS to register the actor information. (This ensures that the actor handle is created only after GCS saves actor info.)
  • The GCS saves the actor info in the DB and sets the actor state to PENDING. The GCS then starts monitoring the caller worker.
  • When the actor's dependencies are resolved, the worker sends another request to the GCS (same as today), which updates the actor state to READY.
  • If the caller worker dies before the dependencies are resolved, the GCS notices and marks the actor as DEAD. Other workers holding handles then receive the notification and fail their requests.

I think this might be better, because workers don't need to periodically check GCS. And there is no race condition.
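This alternative can be sketched as a GCS-side state machine (the PENDING/READY/DEAD states are from the comment above; the Python class and method names are illustrative):

```python
class GcsActorRegistry:
    def __init__(self):
        self.state = {}  # actor_id -> "PENDING" | "READY" | "DEAD"
        self.owner = {}  # actor_id -> owner worker id

    def register_actor(self, actor_id, owner_id):
        # Handles the synchronous request sent when Actor.remote is called;
        # the GCS replies only after persisting, so a handle exists only once
        # the GCS already knows about the actor.
        self.state[actor_id] = "PENDING"
        self.owner[actor_id] = owner_id  # GCS starts monitoring this worker

    def on_dependencies_resolved(self, actor_id):
        if self.state.get(actor_id) == "PENDING":
            self.state[actor_id] = "READY"

    def on_worker_failure(self, worker_id):
        # The GCS detects the owner's death itself; no periodic polling by
        # the workers holding handles is required.
        for actor_id, owner in self.owner.items():
            if owner == worker_id and self.state[actor_id] == "PENDING":
                self.state[actor_id] = "DEAD"  # handle holders are notified
```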

@wumuzi520
Contributor

Hi @rkooo567
Just as you said, as long as there is an unresolved actor handle in the worker, it will send a request to the GCS every 3 seconds. When the number of actors is relatively large, this puts great pressure on the GCS server. There are currently about 5000+ actors in a single job, and tens of thousands in multi-tenant scenarios, so this impact cannot be ignored.

@rkooo567
Contributor Author

rkooo567 commented Jul 14, 2020

@wumuzi520

It looks like the protocol you mentioned is similar to the previous one? #8045

Just note that this doesn't call the GCS for every actor handle. It calls the GCS only for actor handles that are

  1. passed from an owner worker, and
  2. whose owner worker is still resolving dependencies to create the actor.

Once a location or a failure is detected, it stops checking their status from the GCS.

I highly doubt there are 5000 actor creation tasks pending because owners are waiting for dependencies, but if you guys are worried about it, optimizing this is not difficult. As @WangTaoTheTonic mentions, we can batch this call, or even just increase the period.

@raulchen
We decided to go with this approach because of several reasons.

  1. GCS actor management is already super vulnerable to race conditions; I doubt there will be no race conditions when we add a new state to it.
  2. This protocol won't work when the initial write to the GCS fails but actor handles are still passed. For example, if the owner sends a request to a GCS that is restarting and then dies right away, it won't work. We would have to make the initial call to Redis synchronous.

So, in conclusion, I don't think one approach is absolutely better than the other. If it turns out this approach actually puts a lot of pressure on the GCS, we can just optimize it, as the PR is already at a completed stage.

@rkooo567
Contributor Author

I changed the default checking interval to 10 seconds and made it configurable. Please let me know if this affects performance a lot; I will create a PR for optimization.

@stephanie-wang
Contributor

Just to add on to what @rkooo567 already said, the pressure on the GCS can be easily reduced by sending an RPC to the owner directly instead of checking the worker table in the GCS. Plus with a 10s timeout, I have a hard time imagining a realistic workload where this would break.

I'd prefer to merge this one as is since this has been a longstanding issue for actor management stability. @raulchen, if you still believe this is a problem, please file a github issue with a reproducible script that illustrates the performance slowdown.

Contributor

@raulchen raulchen left a comment

I think the other protocol is better in theory, see my replies below.

> GCS actor management is already super vulnerable to race condition, I doubt there will be no race condition when we add a new state to it.

This is not a real reason. We can't make it more vulnerable, because it's already vulnerable.

> This protocol won't work when the initial write to GCS fails, but actor handles are still passed. For example, if the owner sends a request to GCS that is restarting and it dies right away, it won't work. We should make the initial call to Redis synchronous.

This won't happen, because the initial request is synchronous: the GCS only replies to the client after writing to the DB. If the GCS dies before that, the client will retry.

> the pressure on the GCS can be easily reduced by sending an RPC to the owner directly instead of checking the worker table in the GCS

This only considers supervised actors; we should also consider non-supervised actors.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28351/

@rkooo567
Contributor Author

Dup with #8045

@rkooo567 rkooo567 closed this Jul 27, 2020
@rkooo567 rkooo567 deleted the cluster-hang-when-actor-not-registered-gcs branch February 17, 2022 00:29