[refactor cluster-task-manage 1/n] separate resource reporting logic into helper class#22215

Merged
scv119 merged 3 commits into ray-project:master from scv119:resource-reporting
Feb 9, 2022

Conversation

@scv119
Contributor

@scv119 scv119 commented Feb 8, 2022

Why are these changes needed?

Move the scheduler resource reporting logic into a separate class for better readability and maintainability.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@scv119 scv119 marked this pull request as ready for review February 8, 2022 21:46
@scv119
Contributor Author

scv119 commented Feb 8, 2022

Also we think there are two bugs in the scheduler resource reporting logic:

  1. The autoscaler expects resource_demands to be unique per resource shape; however, the scheduler resource reporting implementation doesn't dedupe resource shapes between tasks_to_schedule_ and tasks_to_dispatch_.

  2. We report client-side pending requests from the worker to the local raylet (https://github.com/ray-project/ray/blob/master/src/ray/raylet/scheduling/cluster_task_manager.cc#L843); however, when the lease request is sent to a remote raylet (spilling, or locality-aware scheduling), this might undercount the resource requirement.

cc @wuisawesome @jjyao

namespace raylet {

/// Helper class that reports resource_load and resource_load_by_shape to gcs.
class SchedulerResourceReporter {
Contributor


Is this local scheduler or cluster?

Contributor Author


Yeah, this is both.

@fishbone
Contributor

fishbone commented Feb 8, 2022

Just want to make sure I understand it correctly, after this one,
src/ray/raylet/scheduling/scheduler_resource_reporter.h basically will be something like the Reporter concept we talked about before, right?

@jjyao
Contributor

jjyao commented Feb 9, 2022

We report client-side pending requests from the worker to the local raylet (https://github.com/ray-project/ray/blob/master/src/ray/raylet/scheduling/cluster_task_manager.cc#L843); however, when the lease request is sent to a remote raylet (spilling, or locality-aware scheduling), this might undercount the resource requirement.

We only report tasks without pending lease requests to local raylet. For tasks with pending lease requests, they will be reported by the raylets that receive the lease requests. Combined together, gcs should see an almost accurate resource requirement.

@scv119 scv119 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Feb 9, 2022
@scv119 scv119 merged commit 1abe69e into ray-project:master Feb 9, 2022
@wuisawesome
Contributor

The autoscaler expects resource_demands to be unique per resource shape; however, the scheduler resource reporting implementation doesn't dedupe resource shapes between tasks_to_schedule_ and tasks_to_dispatch_.

Let's remove that comment. The code doesn't seem to make that assumption. https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/autoscaler/_private/monitor.py?L85

We report client-side pending requests from the worker to the local raylet (https://github.com/ray-project/ray/blob/master/src/ray/raylet/scheduling/cluster_task_manager.cc#L843); however, when the lease request is sent to a remote raylet (spilling, or locality-aware scheduling), this might undercount the resource requirement.

I think that line reference changed? But if you're referring to the worker backlog reporting, I think the lease request will get counted, but at the other raylet? (There is a short race condition when it's in transit and not counted, I guess.)

simonsays1980 pushed a commit to simonsays1980/ray that referenced this pull request Feb 27, 2022
…into helper class (ray-project#22215)

Separate Scheduler Resource Reporting logic into a separate class for better readability and maintainability.

5 participants