Fix the bug of unregistered workers in worker pool#7343
jovany-wang merged 14 commits into ray-project:master
kfstorm
left a comment
Thanks for the fix! LGTM. Please fix the checkstyle warnings though.
Also, consider changing the title to something like |
raulchen
left a comment
Looks good to me. Please fix the small issues before merging.
Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Fix
* Fix
* Fix complie
* Fix lint
* Fix linting
* Fix testDeleteObject
* Fix linting
* Update src/ray/raylet/worker_pool.cc Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Update src/ray/raylet/worker_pool.cc Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Update src/ray/raylet/worker_pool.h Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Update src/ray/raylet/worker_pool.cc Co-Authored-By: Hao Chen <chenh1024@gmail.com>
* Address comments.
* FIx linting

Co-authored-by: Hao Chen <chenh1024@gmail.com>
What's the issue
The cause of the Java CI failures has been identified:
In CheckpointableTest and ReconstructionTest, we kill a worker process to trigger actor failover. A worker process contains multiple worker threads, so if we kill a worker process in which some worker threads have not yet registered with the raylet, those threads become zombie workers that never register. StartWorkerProcess will then return early at https://github.com/ray-project/ray/blob/master/src/ray/raylet/worker_pool.cc#L137

This also fixes another case, testDeleteObject in direct call, which otherwise prevented the CI from passing.

How to Fix
Add a timer for each worker process to check whether its workers have timed out registering with the raylet.
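
The registration-timeout idea can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in this PR: the class StartingWorkerTracker, its methods, and the concurrency limit are all hypothetical names, and the assumption that the early return comes from a cap on pending worker processes is an inference rather than something stated in the description. The sketch is driven by explicit timestamps instead of a real timer so that it stays self-contained.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch, NOT Ray's WorkerPool: tracks worker processes that
// have been started but whose worker threads have not all registered yet.
class StartingWorkerTracker {
 public:
  using Clock = std::chrono::steady_clock;

  StartingWorkerTracker(int max_startup_concurrency,
                        std::chrono::milliseconds register_timeout)
      : max_startup_concurrency_(max_startup_concurrency),
        register_timeout_(register_timeout) {
    assert(max_startup_concurrency > 0);
  }

  // Returns false (the assumed "early return") while too many processes are
  // still waiting for their workers to register.
  bool StartWorkerProcess(int64_t pid, int num_workers, Clock::time_point now) {
    if (static_cast<int>(starting_.size()) >= max_startup_concurrency_) {
      return false;
    }
    starting_.emplace(pid, Pending{num_workers, now + register_timeout_});
    return true;
  }

  // Called once per worker thread that registers successfully.
  void OnWorkerRegistered(int64_t pid) {
    auto it = starting_.find(pid);
    if (it == starting_.end()) return;
    if (--it->second.unregistered_workers == 0) starting_.erase(it);
  }

  // What a timer callback would do: drop every process whose workers did not
  // all register before the deadline, and report which ones were dropped.
  std::vector<int64_t> ExpireTimedOut(Clock::time_point now) {
    std::vector<int64_t> expired;
    for (auto it = starting_.begin(); it != starting_.end();) {
      if (now >= it->second.deadline) {
        expired.push_back(it->first);
        it = starting_.erase(it);
      } else {
        ++it;
      }
    }
    return expired;
  }

  std::size_t NumStarting() const { return starting_.size(); }

 private:
  struct Pending {
    int unregistered_workers;    // worker threads still expected to register
    Clock::time_point deadline;  // registration must finish before this time
  };

  int max_startup_concurrency_;
  std::chrono::milliseconds register_timeout_;
  std::unordered_map<int64_t, Pending> starting_;
};
```

In this sketch, a process killed after only some of its workers registered stays in the pending map and blocks StartWorkerProcess until ExpireTimedOut removes it, which is the situation the timer is meant to resolve.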