[core] Fix wrong local resource view in raylet#19911
[core] Fix wrong local resource view in raylet#19911fishbone merged 28 commits intoray-project:masterfrom
Conversation
|
I don't have a good idea about how to test this since it's related to grpc and raylet. If the code looks good, I'd prefer to merge it first so that we can observe it from the nightly test to see whether it's working or not. |
|
I'm also building wheel to test it in testing env to make sure it's working. |
|
still hangs sometimes.. investigating |
|
I've run it several times in test env and things look good. I'm trying to inject latency in gcs to test the change here. |
rkooo567
left a comment
There was a problem hiding this comment.
Nice catch! So, the bug is the local information becomes inconsistent because we update the local resources by the resource update message from GCS, which is more stale right?
I think how we can test this is to ensure local resources are never updated by GCS reports from cpp unit tests, but it seems not simple with the current code structure (maybe we should move all resource update code path into other class or something).
|
Btw, the CI seems to have the compiler failure. Can you fix that? (I'd like to see if it will pass all CI tests) |
|
Thanks, @rkooo567 I'll update this PR with py test and then maybe have another PR for the testing framework which might be useful for other usage. |
Actually, it's an existing bug but easy to trigger by grpc resource broadcasting. Difference between grpc based and redis based: For redis based one, since we haven't eat all the local resources, it's ok since the next schedule will update it to the correct one. For grpc based one, since it'll delay 100ms, all local resource will be eat and the next update will change the local resource to be the one with full resources. So it'll hang since ray will always select the local instead of remote. |
|
test failure looks wired... all of them can't be reproduced locally |
|
test failure looks related to the master failure. holding this pr |
|
verified the failure looks the same as trunk failure and also windows failure (the one before merge) is the same as trunk failure. I need this to run on nightly test for several days. |
) This reverts commit a907168. ## Why are these changes needed? This PR seems to have some huge perf regression on `placement_group_test_2.py`. It took 128s before, and after this PR was merged, it took 315 seconds. ## Related issue number
…roject#19911)" (ray-project#19992)" This reverts commit f1eedb1.
Why are these changes needed?
When gcs broad cast node resource change, raylet will use that to update local node as well which will lead to local node instance and nodes_ inconsistent.
Related issue number
#19438
Checks
scripts/format.shto lint the changes in this PR.