GCS server error handling for actor creation#8899
GCS server error handling for actor creation#8899raulchen merged 29 commits intoray-project:masterfrom
Conversation
|
Can one of the admins verify this patch? |
|
Test PASSed. |
|
Test PASSed. |
892fe29 to
ab9d2be
Compare
|
Test FAILed. |
|
Test FAILed. |
|
Test PASSed. |
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
|
Test FAILed. |
|
Test PASSed. |
|
Test PASSed. |
|
Test PASSed. |
There was a problem hiding this comment.
No need to record every completed task. We just need to check: 1) the new task is an actor creation task; 2) this worker is already an actor; 3) the actor id matches.
There was a problem hiding this comment.
Now we can put the code for leasing worker (Line 65) right here. And should update DB after leasing a worker.
|
Test FAILed. |
86689f2 to
90d0919
Compare
|
Test FAILed. |
There was a problem hiding this comment.
| def test_gcs_server_re_schedule_actor(): | |
| def test_gcs_server_restart_during_actor_creation(): |
There was a problem hiding this comment.
this is not needed any more?
src/ray/core_worker/core_worker.cc
Outdated
There was a problem hiding this comment.
| RAY_LOG(DEBUG) << "Execute task, task info = " << task_spec.DebugString(); | |
| RAY_LOG(DEBUG) << "Executing task, task info = " << task_spec.DebugString(); |
There was a problem hiding this comment.
| // If GCS server is restarted after sending create actor task to the core worker, the | |
| // If GCS server is restarted after sending an actor creation task to this core worker, the |
There was a problem hiding this comment.
Can you add a check to make sure the actor isn't associated with any node id and work id?
There was a problem hiding this comment.
Don't need to call Schedule here, just call LeaseWorkerFromNode
src/ray/gcs/gcs_server/gcs_server.cc
Outdated
There was a problem hiding this comment.
| auto on_done = [this]() { | |
| auto actor_manager_load_initial_data_callback = [this]() { |
Avoid using the same var name. And document that actor data needs be to loaded lastly.
src/ray/raylet/node_manager.cc
Outdated
There was a problem hiding this comment.
| local_queues_.RemoveTasks(task_ids); | |
| local_queues_.RemoveTasks({task_id}); |
src/ray/raylet/node_manager.cc
Outdated
There was a problem hiding this comment.
Just realize that we should only do this for actor creation tasks when gcs service is enabled.
There was a problem hiding this comment.
Submitted actor creation task ... is already queue. This is most likely due to a GCS restart. We will remove the old one from the queue, and enqueue the new one.
90d0919 to
8eeff6f
Compare
|
Test FAILed. |
8eeff6f to
09a936a
Compare
|
Test FAILed. |
|
Test PASSed. |
There was a problem hiding this comment.
| // the restarted GCS server will send the same create actor task to the core worker | |
| // the restarted GCS server will send the same actor creation task to the core worker |
There was a problem hiding this comment.
If this is only for testing, it'd be better to just provide a GetCreatedActors method and do the loop in test code. So people won't accidentally use this.
There was a problem hiding this comment.
| << " owners a leased worker. Create actor directly on worker."; | |
| << " is already tied to a leased worker. Create actor directly on worker."; |
src/ray/raylet/node_manager.cc
Outdated
There was a problem hiding this comment.
if (spec.IsActorCreationTask()) {
// ...
} else {
// same as before.
}
d2169c6 to
8f86e29
Compare
|
Test FAILed. |
| @@ -226,20 +226,16 @@ void GcsActorScheduler::HandleWorkerLeasedReply( | |||
| // node, and then try again on the new node. | |||
| RAY_CHECK(!retry_at_raylet_address.raylet_id().empty()); | |||
| actor->UpdateAddress(retry_at_raylet_address); | |||
There was a problem hiding this comment.
if (auto node = gcs_node_manager_.GetNode(actor->GetNodeID())) {
} else {
}
|
Test PASSed. |
|
LGTM! |
|
Test FAILed. |
|
Test PASSed. |
src/ray/protobuf/gcs.proto
Outdated
| double timestamp = 13; | ||
| // The task specification of this actor's creation task. | ||
| TaskSpec task_spec = 14; | ||
| // Resource mapping ids assigned to the worker executing the task. |
There was a problem hiding this comment.
// Resource mapping ids acquired by the leased worker. This field is only set when this actor already has a leased worker.
| ASSERT_EQ(actor->GetWorkerID(), worker_id); | ||
| } | ||
|
|
||
| TEST_F(GcsActorSchedulerTest, TestReschedule) { |
There was a problem hiding this comment.
should test 2 cases: 1) actor is already tied to a leased worker; 2) not tied.
|
Test FAILed. |
Why are these changes needed?
Design document:
https://docs.google.com/document/d/1Cuqxlw53abEZPVYVF-pUWpbnwrFEgVTel_sQGwt8DmU/edit#heading=h.s3ogmk8m7308
This PR implements actor management error handling. Lease worker resource recovery will be implemented in the next pr.
Related issue number
Checks
scripts/format.shto lint the changes in this PR.