
[core] Fix wrong local resource view in raylet#19911

Merged
fishbone merged 28 commits into ray-project:master from fishbone:grpc-broadcasting-race-condition
Nov 2, 2021

Conversation

@fishbone
Contributor

@fishbone fishbone commented Oct 30, 2021

Why are these changes needed?

When GCS broadcasts a node resource change, the raylet uses it to update the local node as well, which makes the local node instance and nodes_ inconsistent:

  1. The local node has used up all of some placement group (pg) resource.
  2. GCS broadcasts node resources.
  3. The local node now appears to have resources again.
  4. The scheduler picks the local node.
  5. The local node can't actually schedule the task.
  6. Since there is only one type of job and the local node hasn't finished any tasks, it goes back to step 4 ==> hangs.
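The hang above can be sketched as a small simulation. This is a minimal illustration with hypothetical names, not the actual raylet code: a delayed GCS broadcast carries a snapshot taken before the local allocations and blindly overwrites the fresher local view.

```python
# Hypothetical sketch of the stale-broadcast race: the raylet's local
# resource view is overwritten by a delayed GCS broadcast that reflects
# an older snapshot of the node's resources.

class LocalResourceView:
    def __init__(self, total):
        self.available = total

    def allocate(self, amount):
        """Consume resources locally (e.g. for a placement group)."""
        if self.available >= amount:
            self.available -= amount
            return True
        return False

    def apply_gcs_update(self, snapshot):
        # Bug: blindly trusting the broadcast resets local state to a
        # stale snapshot taken before the recent local allocations.
        self.available = snapshot


view = LocalResourceView(total=4)
stale_snapshot = view.available        # GCS captured state: 4 available
view.allocate(4)                       # local tasks consume everything
assert view.available == 0

view.apply_gcs_update(stale_snapshot)  # delayed broadcast arrives (step 2)
# The scheduler now wrongly believes the local node has 4 free units,
# picks it again (step 4), the task cannot actually run (step 5), and
# the loop repeats (step 6): a hang.
assert view.available == 4
```

The fix direction implied by this PR is to stop letting broadcast messages overwrite the raylet's own local view, which is always fresher than anything relayed through GCS.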

Related issue number

#19438

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone
Contributor Author

I don't have a good idea of how to test this since it's related to gRPC and the raylet.
I also don't know how to construct a unit test for it, since it has a lot of components involved. :(
Any good suggestions about how to test this one?

If the code looks good, I'd prefer to merge it first so that we can observe the nightly tests to see whether it's working.

@fishbone
Contributor Author

I'm also building a wheel to test it in the testing environment to make sure it's working.

@fishbone fishbone marked this pull request as draft October 30, 2021 08:40
@fishbone
Contributor Author

Still hangs sometimes... investigating.

@fishbone fishbone changed the title [core] Fix race condition in raylet which result in scheduling hang [wip][core] Fix race condition in raylet which result in scheduling hang Oct 31, 2021
@fishbone
Contributor Author

fishbone commented Oct 31, 2021

I've run it several times in the test env and things look good. I'm trying to inject latency into GCS to test the change here.

@fishbone
Contributor Author

fishbone commented Oct 31, 2021

This PR might be related to task #19326 @scv119

Contributor

@rkooo567 rkooo567 left a comment


Nice catch! So the bug is that the local information becomes inconsistent because we update the local resources with the resource update message from GCS, which is more stale, right?

I think the way we can test this is to ensure local resources are never updated by GCS reports, via C++ unit tests, but that doesn't seem simple with the current code structure (maybe we should move all the resource-update code paths into another class or something).

@rkooo567
Contributor

rkooo567 commented Nov 1, 2021

Btw, the CI seems to have a compiler failure. Can you fix that? (I'd like to see whether it passes all CI tests.)

@fishbone
Contributor Author

fishbone commented Nov 1, 2021

Thanks, @rkooo567
I actually implemented a simple asio deferred-execution framework for testing... but then realized I could reproduce it more simply by just increasing the reporting interval.

I'll update this PR with a py test and then maybe open another PR for the testing framework, which might be useful for other cases.

@fishbone
Contributor Author

fishbone commented Nov 1, 2021

> Nice catch! So the bug is that the local information becomes inconsistent because we update the local resources with the resource update message from GCS, which is more stale, right?

Actually, it's an existing bug, but it's easy to trigger with gRPC resource broadcasting.

Difference between the gRPC-based and Redis-based paths:
Redis-based: always sends the update when node resources change.
gRPC-based: sends it with the periodic resource broadcast (every 100ms).

For the Redis-based path, since we haven't consumed all the local resources by the time the update arrives, it's OK: the next scheduling pass updates the view to the correct one.

For the gRPC-based path, since the broadcast is delayed by 100ms, all local resources will have been consumed and the next update resets the local view to one with full resources. So it hangs, since Ray will always select the local node instead of a remote one.
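The fix direction the reviewers describe (never let GCS reports overwrite the local view) can be sketched as follows. Names here are hypothetical, not the actual raylet API: broadcast entries for the local node ID are simply skipped, so only remote nodes' views are updated from GCS.

```python
# Hypothetical sketch of the fix: skip broadcast entries for the local
# node so the raylet's own view stays the authoritative source of truth.

def apply_broadcast(views, node_id, snapshot, local_node_id):
    """Apply one node's resource snapshot from a GCS broadcast."""
    if node_id == local_node_id:
        # The local view already reflects every allocation we made;
        # a relayed snapshot can only be equally fresh or staler.
        return
    views[node_id] = snapshot


views = {"node-1": 0, "node-2": 4}   # node-1 is the local node, fully used

apply_broadcast(views, "node-1", 4, local_node_id="node-1")  # stale: ignored
apply_broadcast(views, "node-2", 2, local_node_id="node-1")  # remote: applied

assert views == {"node-1": 0, "node-2": 2}
```

With this guard, the 100ms broadcast delay no longer matters for the local node: the scheduler keeps seeing 0 available locally and falls back to remote nodes instead of looping.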

@fishbone fishbone changed the title [wip][core] Fix wrong local resource view in raylet [core] Fix wrong local resource view in raylet Nov 1, 2021
Member

@mwtian mwtian left a comment


Good catch!

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Nov 1, 2021
@fishbone
Contributor Author

fishbone commented Nov 1, 2021

The test failures look weird... none of them can be reproduced locally.

@fishbone
Contributor Author

fishbone commented Nov 1, 2021

The test failures look related to the master failure. Holding this PR.

@fishbone
Contributor Author

fishbone commented Nov 2, 2021

Verified that the failure looks the same as the trunk failure; the Windows failure (the one before merge) is also the same as the trunk failure.
The lint failure is likewise unrelated to this PR and fails on trunk as well.

I need this to run on the nightly tests for several days.

@fishbone fishbone merged commit a907168 into ray-project:master Nov 2, 2021
rkooo567 added a commit to rkooo567/ray that referenced this pull request Nov 2, 2021
fishbone pushed a commit that referenced this pull request Nov 2, 2021

This reverts commit a907168.

## Why are these changes needed?

This PR seems to have caused a large perf regression on `placement_group_test_2.py`: it took 128 seconds before, and after this PR was merged it took 315 seconds.

fishbone added a commit to fishbone/ray that referenced this pull request Nov 2, 2021