Revert "Revert "[core] Fix wrong local resource view in raylet (#1991…#19996
fishbone merged 5 commits into ray-project:master
…roject#19911)" (ray-project#19992)" This reverts commit f1eedb1.
    << ". Updating resource map. skip=" << (node_id == self_node_id_);
  }

  if (node_id == self_node_id_) {
Just curious: by what mechanism does processing resource deletion for this node slow down the test and make it flaky?
Oh, actually it's not slowing down the test. The test hangs and eventually times out.
Basically, the local resource was deleted before scheduling, so the raylet found no resource to run the job.
Luckily, I can always reproduce it locally.
The Windows test failure is flaky. The Mac build hangs forever. Everything else looks good.
  // Updating the local node could result in an inconsistent view in the cluster
  // resource scheduler, which could make tasks hang.
  if (node_id == self_node_id_) {
    return;
Consider adding cluster_task_manager_->ScheduleAndDispatchTasks(); here?
It seems some tests are still failing after this PR, and this is the only behavior change I can think of (the changes below seem to apply only to the gRPC resource broadcast). I don't think calling the function one more time here will hurt performance.
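The suggestion above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyTaskManager` and `OnResourceUpdate` are hypothetical names and do not reflect Ray's real classes; only the `ScheduleAndDispatchTasks` name and the `node_id == self_node_id_` check come from the thread. The point is that the early return for the self node still runs one scheduling pass first, so queued tasks are not left waiting.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-in for a task manager; not Ray's real class.
class ToyTaskManager {
 public:
  void Queue(int cpus_needed) { pending_.push_back(cpus_needed); }
  void SetAvailableCpus(int cpus) { available_cpus_ = cpus; }

  // Dispatches queued tasks in FIFO order while enough CPUs remain.
  void ScheduleAndDispatchTasks() {
    while (!pending_.empty() && pending_.front() <= available_cpus_) {
      available_cpus_ -= pending_.front();
      pending_.pop_front();
      ++dispatched_;
    }
  }

  int dispatched() const { return dispatched_; }
  std::size_t pending() const { return pending_.size(); }

 private:
  std::deque<int> pending_;
  int available_cpus_ = 0;
  int dispatched_ = 0;
};

// Hypothetical handler for a resource update received from GCS.
void OnResourceUpdate(const std::string &node_id,
                      const std::string &self_node_id,
                      ToyTaskManager &task_manager) {
  if (node_id == self_node_id) {
    // Skip applying the update to the local view, but still run one
    // scheduling pass before returning, as suggested in the review.
    task_manager.ScheduleAndDispatchTasks();
    return;
  }
  // Updating the view of a remote node is omitted in this sketch.
}
```

In this toy, a task queued before the self-node update arrives is dispatched by the extra pass even though the update itself is ignored.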
This reverts commit f1eedb1.
Why are these changes needed?
The self node should ignore node resource updates from GCS, since it maintains its own local view.
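A minimal sketch of this rule, assuming a toy model rather than the raylet's actual data structures: `ToyNodeManager`, `ResourceDeleted`, and `HasResource` here are hypothetical names chosen for illustration. A deletion reported by GCS for the self node is dropped, so the local view cannot lose a resource that the scheduler still expects to find.

```cpp
#include <map>
#include <string>
#include <utility>

// Toy per-node resource view; all names here are hypothetical.
using ResourceMap = std::map<std::string, double>;

class ToyNodeManager {
 public:
  ToyNodeManager(std::string self_node_id, ResourceMap local_resources)
      : self_node_id_(std::move(self_node_id)) {
    views_[self_node_id_] = std::move(local_resources);
  }

  // Handles a resource deletion reported by GCS.
  // Returns true if the deletion was applied, false if it was ignored.
  bool ResourceDeleted(const std::string &node_id,
                       const std::string &resource) {
    if (node_id == self_node_id_) {
      // The local view is maintained locally; ignore updates about ourselves.
      return false;
    }
    views_[node_id].erase(resource);
    return true;
  }

  bool HasResource(const std::string &node_id,
                   const std::string &resource) const {
    auto it = views_.find(node_id);
    return it != views_.end() && it->second.count(resource) > 0;
  }

 private:
  std::string self_node_id_;
  std::map<std::string, ResourceMap> views_;
};
```

For example, constructing `ToyNodeManager("self", {{"CPU", 4.0}})` and then calling `ResourceDeleted("self", "CPU")` leaves the `CPU` entry in place, while the same call for any other node ID removes it from that node's view.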
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.