[core] Correctly fail worker lease request if a task becomes infeasible after scheduling#52295
[core] Correctly fail worker lease request if a task becomes infeasible after scheduling#52295jjyao merged 21 commits intoray-project:masterfrom
Conversation
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
src/ray/raylet/local_task_manager.cc
Outdated
| // This should ideally only happen when resources on the node are changed, | ||
| // and the only dynamic resource is placement group bundles. | ||
| for (const auto &work : dispatch_queue) { | ||
| CancelTask(work->task.GetTaskSpecification().TaskId(), | ||
| rpc::RequestWorkerLeaseReply::SCHEDULING_CANCELLED_UNSCHEDULABLE, | ||
| "Scheduling failed due to the task becoming infeasible."); | ||
| } |
There was a problem hiding this comment.
Actually after #51125 (comment), this should be a check failure.
There was a problem hiding this comment.
Ya makes sense, so these tasks should just be cancelled from node manager when resources are lost, and we should never hit this case.
I'm still a bit nervous about turning this into a check failure though because of signs of possibly hitting this case even without using placement groups. I can update the comment and do an RAY_LOG(ERROR) that this case should never be hit and reported as an issue on github maybe?
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
|
@jjyao to review |
src/ray/raylet/local_task_manager.cc
Outdated
| if (is_infeasible) { | ||
| // TODO(scv119): fail the request. | ||
| // Call CancelTask | ||
| TaskSpecification front_task = dispatch_queue.front()->task.GetTaskSpecification(); |
There was a problem hiding this comment.
Should we add a unit test even though we think it should never happen?
There was a problem hiding this comment.
Added a unit test, good thing you asked for a unit test, discovered a bug 💀
It would segfault before because the erase_if call erases the queue from the map if the queue size gets to 0, and the reference to the queue would be invalidated.
Signed-off-by: dayshah <dhyey2019@gmail.com>
| int num_callbacks_called = 0; | ||
| auto callback = [&num_callbacks_called]() { ++num_callbacks_called; }; | ||
| rpc::RequestWorkerLeaseReply reply1; | ||
| local_task_manager_->WaitForTaskArgsRequests(std::make_shared<internal::Work>( |
There was a problem hiding this comment.
We should use QueueAndScheduleTask to add the task. To avoid the check failure, we should update the total resource after the tasks are queued to make the task infeasible.
There was a problem hiding this comment.
done
the test above actually tests behavior that's impossible with the public api
Signed-off-by: dayshah <dhyey2019@gmail.com>
|
@jjyao ready for re-re-review |
Signed-off-by: dayshah <dhyey2019@gmail.com>
Why are these changes needed?
Here it's possible for the Work object to be erased from the dispatch queue without having the callback that replies to the WorkerLeaseRequest called (this happens when the task becomes infeasible after its added to the local task manager dispatch queue). The reply callback exists inside the Work object. The task submitter will then be left hanging forever while the reply callback is nowhere to be found.
Ideally this should only happen in the local task manager if resources total value change after the task is scheduled on a node and the only resource whose total can be dynamically changed is placement group bundles when the placement group is removed. However when placement group is removed, all tasks inside the local task manager should be cancelled immediately so this case should ideally not be hit.
Even though we think this case should never happen, we're deciding to log an error and reply instead of check fail since there are signs that it can possibly happen in a way that we are not aware of as of today.
To do this we also need to refactor a bit to have a
CancelTaskToDispatchthat doesn't erase from tasks_to_dispatch_.Related issue number
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/under thecorresponding
.rstfile.