
[Core] Worker hangs when actor handles go out of scope before actor information is registered to GCS.#8679

Closed
rkooo567 wants to merge 67 commits intoray-project:masterfrom
rkooo567:cluster-hang-when-actor-not-registered-gcs

Conversation

@rkooo567
Contributor

@rkooo567 rkooo567 commented May 29, 2020

Why are these changes needed?

Overview

Once an actor handle is created, the actor's information is persisted to the GCS only after the owner resolves its local dependencies and sends the actor creation task. If the owner exits (due to node or worker failure) at this moment and the actor handle goes out of scope, the actor dies (except detached actors), but there's no way for a caller to know the actor is actually dead, because its information is not in the GCS yet. If we call ray.get on this dead actor, the caller will hang, since it never learns the actor is dead.

This PR handles the issue.

Term

resolve location: any actor handle whose actor entry has not yet been reported by the GCS still needs to resolve its location.

Protocol

  • Every actor handle has a new field, is_persisted_to_gcs, which is set to false when the handle is created.
  • Whenever the core worker receives a notification from the GCS about an actor's state, it means the actor info has been persisted, so is_persisted_to_gcs is set to true. Such handles are then treated like normal actor handles.
  • Actor IDs that haven't been resolved yet are added to actors_pending_location_resolution_.
  • Every 3 seconds, we check whether (1) the owner's worker has failed or (2) the owner's node has failed.
  • If either condition holds, we check the actor entry in the GCS. This step is needed to avoid a race condition: if the owner dies right after it creates an actor, and a core worker receives the failure notification before the actor-created notification, it would disconnect the actor even though the actor has been persisted to the GCS. This is especially problematic for detached actors, which can stay alive even after the owner's worker or node fails.
  • If it turns out the actors are not persisted to the GCS at this point, we just disconnect them.
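The protocol above can be sketched as a small simulation (Python, for illustration only; the actual implementation is C++ inside the core worker, and the stubbed GCS table and failed-owner set here are assumptions):

```python
class ActorHandle:
    def __init__(self, actor_id):
        self.actor_id = actor_id
        self.is_persisted_to_gcs = False  # set true once a GCS notification arrives
        self.disconnected = False


class ActorManager:
    def __init__(self, gcs_entries, failed_owners):
        self.gcs_entries = gcs_entries      # stub for the GCS actor table
        self.failed_owners = failed_owners  # stub for failed worker/node ids
        # actor_id -> (handle, owner_id), following the PR's naming
        self.actors_pending_location_resolution_ = {}

    def add_pending(self, handle, owner_id):
        self.actors_pending_location_resolution_[handle.actor_id] = (handle, owner_id)

    def on_gcs_actor_notification(self, actor_id):
        # A GCS notification means the actor info is persisted; treat the
        # handle like a normal actor handle from now on.
        entry = self.actors_pending_location_resolution_.pop(actor_id, None)
        if entry:
            entry[0].is_persisted_to_gcs = True

    def resolve_actors_locations(self):
        # Invoked periodically (every 3 seconds in the PR). Only when the
        # owner has failed do we consult the GCS, which avoids the race where
        # a failure notification beats the actor-created notification.
        for actor_id in list(self.actors_pending_location_resolution_):
            handle, owner_id = self.actors_pending_location_resolution_[actor_id]
            if owner_id not in self.failed_owners:
                continue
            if actor_id in self.gcs_entries:
                handle.is_persisted_to_gcs = True  # it was persisted after all
            else:
                handle.disconnected = True         # owner died before registration
            del self.actors_pending_location_resolution_[actor_id]
```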

Caveat of this approach

  • If drivers have many long-running dependencies for actor creation, this can increase load on the GCS: every actor handle that hasn't resolved its location polls the GCS twice every 3 seconds.
  • A couple of the race conditions are hard to test because they require very subtle timing. We should test this heavily with unit tests. There will be another PR that refactors all the actor handle logic into actor_manager to make this testable.
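To make the first caveat concrete, a back-of-the-envelope estimate (the two-polls-per-3-seconds rate is from the bullet above; the handle count is illustrative):

```python
def gcs_polling_qps(num_unresolved_handles, polls_per_check=2, period_s=3.0):
    """Approximate GCS queries per second generated by unresolved actor handles."""
    return num_unresolved_handles * polls_per_check / period_s

# 5000 simultaneously-unresolved handles at the 3s period:
print(gcs_polling_qps(5000))                  # ~3333 queries/second
# the same handles at the 10s period adopted later in the review:
print(gcs_polling_qps(5000, period_s=10.0))   # 1000 queries/second
```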

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins

Can one of the admins verify this patch?

@rkooo567 rkooo567 changed the title [WIP] Cluster hang when actor not registered gcs [WIP] Worker hangs when actor handles go out of scope before actor information is registered to GCS. May 30, 2020
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26538/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26571/

reference_counter_->AddLocalReference(actor_creation_return_id, CurrentCallSite());
direct_actor_submitter_->AddActorQueueIfNotExists(actor_id);

if (!actor_handle->IsPersistedToGCS()) {
Contributor

We should do this after the insertion, in case we already have a handle to this actor that's been resolved.

Contributor Author

That totally makes sense. I will fix it.

Contributor Author

@rkooo567 rkooo567 Jun 1, 2020

Moved to inside if (inserted)

[this, actor_id](Status status,
const boost::optional<gcs::WorkerFailureData> &result) {
if (status.ok() && result) {
direct_actor_submitter_->DisconnectActor(actor_id, true);
Contributor

There are two race conditions here that I think we need to handle:

  1. Since GCS notifications are async, I think it's possible that the actor's location has just been registered, but then the owner dies concurrently, and we receive the lookup reply before the actor's location. I'm actually not sure if this can happen based on the GCS's current implementation, but I think it would be good in general not to depend on ordering between different GCS tables. So here, we should actually also do a lookup to the actor's entry to make sure it hasn't been added yet. The GCS service should also be checking whether the owner has already died when it receives an actor registration request so that it can drop the request, but you can leave that out for now (maybe add a TODO if you don't address in this PR).
  2. We need to make sure that the GCS persisted flag is still unset in the actor handle when we disconnect the actor. Unfortunately, the mutex internal to the actor handle won't be enough since the caller can't hold the mutex while calling DisconnectActor.
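Race (2) can be illustrated with a small Python sketch (a hypothetical simulation, not the PR's C++ code): the persisted flag is re-checked and the state flipped under the lock, while the DisconnectActor-style callback runs outside it, since the handle's internal mutex can't be held across that call.

```python
import threading

class Handle:
    def __init__(self):
        self._lock = threading.Lock()
        self.is_persisted_to_gcs = False
        self.disconnected = False

    def mark_persisted(self):
        with self._lock:
            self.is_persisted_to_gcs = True

    def disconnect_if_not_persisted(self, disconnect_fn):
        # Re-check the flag and flip the state atomically under the lock ...
        with self._lock:
            should_disconnect = not self.is_persisted_to_gcs
            if should_disconnect:
                self.disconnected = True
        # ... but invoke the callback outside the lock.
        if should_disconnect:
            disconnect_fn()
        return should_disconnect
```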

Contributor Author

Addressed both!

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26585/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26595/

@rkooo567 rkooo567 changed the title [WIP] Worker hangs when actor handles go out of scope before actor information is registered to GCS. Worker hangs when actor handles go out of scope before actor information is registered to GCS. Jun 2, 2020
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26608/

@rkooo567 rkooo567 changed the title Worker hangs when actor handles go out of scope before actor information is registered to GCS. [Core] Worker hangs when actor handles go out of scope before actor information is registered to GCS. Jun 2, 2020
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/26619/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27942/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27955/

@rkooo567 rkooo567 requested a review from stephanie-wang July 7, 2020 18:29
Contributor

@stephanie-wang stephanie-wang left a comment

Looks good, just some nits!

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27990/

Contributor

@stephanie-wang stephanie-wang left a comment

Nice work!

}

- void ActorManager::ResolveActorsLocations() {
+ void ActorManager::MarkPendingLocationActorsFailed() {
Contributor

What does !status.ok() mean in this case? Just wondering since I'm not sure what the semantics are for the GCS.

Contributor Author

I believe this just means something went wrong within the GCS server (from reading the code). Not sure if it has consistent semantics.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28267/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28276/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28314/

@raulchen
Contributor

Hi @rkooo567, IIRC, previously we discussed another approach for this problem. I couldn't find the previous doc. The protocol is:

  • When Actor.remote is called, the worker synchronously sends a request to GCS to register the actor information. (This ensures that the actor handle is created only after GCS saves actor info.)
  • The GCS saves the actor info in the DB and sets the actor state to PENDING. The GCS then starts monitoring the caller worker.
  • When the actor's dependencies are resolved, the worker sends another request to the GCS (same as today), which updates the actor state to READY.
  • If the caller worker dies before the dependencies are resolved, the GCS notices and marks the actor as DEAD. Other workers holding handles then receive the notification and fail their requests.

I think this might be better, because workers don't need to periodically check GCS. And there is no race condition.
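This alternative can be sketched as a GCS-side state machine (the PENDING/READY/DEAD states are from the comment above; the Python class and method names are illustrative):

```python
class GcsActorRegistry:
    def __init__(self):
        self.state = {}  # actor_id -> "PENDING" | "READY" | "DEAD"
        self.owner = {}  # actor_id -> owner worker id

    def register_actor(self, actor_id, owner_id):
        # Handles the synchronous request sent when Actor.remote is called;
        # the GCS replies only after persisting, so a handle exists only once
        # the GCS already knows about the actor.
        self.state[actor_id] = "PENDING"
        self.owner[actor_id] = owner_id  # GCS starts monitoring this worker

    def on_dependencies_resolved(self, actor_id):
        if self.state.get(actor_id) == "PENDING":
            self.state[actor_id] = "READY"

    def on_worker_failure(self, worker_id):
        # The GCS detects the owner's death itself; no periodic polling by
        # the workers holding handles is required.
        for actor_id, owner in self.owner.items():
            if owner == worker_id and self.state[actor_id] == "PENDING":
                self.state[actor_id] = "DEAD"  # handle holders are notified
```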

@wumuzi520
Contributor

Hi @rkooo567
Just as you said, as long as there is an unresolved actor handle in the worker, it will send a request to the GCS every 3 seconds. When the number of actors is relatively large, this puts great pressure on the GCS server. There are currently about 5000+ actors in a single job, and tens of thousands in multi-tenant scenarios, so this impact cannot be ignored.

@rkooo567
Contributor Author

rkooo567 commented Jul 14, 2020

@wumuzi520

It looks like the protocol you mentioned is similar to the previous one? #8045

Just note that this doesn't call the GCS for every actor handle. It calls the GCS only for actor handles that are

  1. passed from an owner worker, and
  2. whose owner worker is still resolving dependencies to create the actor.

Once a location or a failure is detected, it stops checking their status from the GCS.

I highly doubt there are 5000 actor creation tasks pending because owners are waiting for dependencies, but if you guys are worried about it, optimizing this is not difficult. As @WangTaoTheTonic mentions, we can batch this call, or even just increase the period.

@raulchen
We decided to go with this approach because of several reasons.

  1. GCS actor management is already super vulnerable to race conditions; I doubt there will be no race conditions when we add a new state to it.
  2. This protocol won't work when the initial write to the GCS fails but actor handles are still passed. For example, if the owner sends a request to a GCS that is restarting and then dies right away, it won't work. We would have to make the initial call to Redis synchronous.

So, in conclusion, I don't think one approach is absolutely better than the other. If it turns out this approach actually puts a lot of pressure on the GCS, we can just optimize it, as the PR is already at a completed stage.

@rkooo567
Contributor Author

I changed the default checking interval to 10 seconds and made it configurable. Please let me know if this affects performance a lot; I will create a PR for optimization.

@stephanie-wang
Contributor

Just to add on to what @rkooo567 already said, the pressure on the GCS can be easily reduced by sending an RPC to the owner directly instead of checking the worker table in the GCS. Plus with a 10s timeout, I have a hard time imagining a realistic workload where this would break.

I'd prefer to merge this one as is since this has been a longstanding issue for actor management stability. @raulchen, if you still believe this is a problem, please file a github issue with a reproducible script that illustrates the performance slowdown.

Contributor

@raulchen raulchen left a comment

I think the other protocol is better in theory, see my replies below.

> GCS actor management is already super vulnerable to race condition, I doubt there will be no race condition when we add a new state to it.

This is not a real reason. We can't make it more vulnerable, because it's already vulnerable.

> This protocol won't work when the initial write to GCS fails, but actor handles are still passed. For example, if the owner sends a request to GCS that is restarting and it dies right away, it won't work. We should make the initial call to Redis synchronous.

This won't happen, because the initial request is synchronous: the GCS only replies to the client after writing to the DB. If the GCS dies before that, the client will retry.

> the pressure on the GCS can be easily reduced by sending an RPC to the owner directly instead of checking the worker table in the GCS

This only considers supervised actors; we should also consider non-supervised actors.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28351/

@rkooo567
Contributor Author

Dup with #8045

@rkooo567 rkooo567 closed this Jul 27, 2020
@rkooo567 rkooo567 deleted the cluster-hang-when-actor-not-registered-gcs branch February 17, 2022 00:29