[Core] Report cluster config from autoscaler #49568

Merged
jjyao merged 11 commits into ray-project:master from jjyao:jjyao/sstaticc
Jan 8, 2025

Contributor

@jjyao jjyao commented Jan 3, 2025

Why are these changes needed?

Example ClusterConfig values:

ClusterConfig {
  max_resources: {}
  node_group_configs: [NodeGroupConfig {
    resources: {"CPU": 1, "GPU": 1}
    max_num_nodes: 1
  }, NodeGroupConfig {
    resources: {"CPU": 16}
    max_num_nodes: 4
  }]
}

ClusterConfig {
  max_resources: {"CPU": 10}
  node_group_configs: [NodeGroupConfig {
    resources: {"CPU": 1, "GPU": 1}
    max_num_nodes: 1
  }, NodeGroupConfig {
    resources: {"CPU": 16}
    max_num_nodes: 4
  }]
}

Libraries can use this information to make better autoscaling decisions.
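
As a rough illustration of how a library might consume this information, the sketch below computes an upper bound on each resource the cluster could ever provide. The Python dataclasses are hypothetical stand-ins that mirror the example fields above; they are not Ray's generated proto classes or any public Ray API. It also assumes that a missing `max_num_nodes` means "no limit" and that a missing key in `max_resources` means "no cap", neither of which this PR description states.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeGroupConfig:
    # Resources provided by a single node of this group.
    resources: Dict[str, float]
    # Upper bound on nodes in this group; None is assumed to mean "no limit".
    max_num_nodes: Optional[int] = None


@dataclass
class ClusterConfig:
    # Cluster-wide cap per resource; a missing key is assumed to mean "no cap".
    max_resources: Dict[str, float] = field(default_factory=dict)
    node_group_configs: List[NodeGroupConfig] = field(default_factory=list)


def max_cluster_resources(config: ClusterConfig) -> Dict[str, float]:
    """Upper bound on each resource the cluster could ever provide."""
    totals: Dict[str, float] = {}
    for group in config.node_group_configs:
        for name, amount in group.resources.items():
            if group.max_num_nodes is None:
                # Unbounded group: any positive amount can grow without limit.
                group_total = float("inf") if amount > 0 else 0.0
            else:
                group_total = float(amount * group.max_num_nodes)
            totals[name] = totals.get(name, 0.0) + group_total
    # Apply the cluster-wide cap where one is set.
    for name, cap in config.max_resources.items():
        if name in totals:
            totals[name] = min(totals[name], float(cap))
    return totals


node_groups = [
    NodeGroupConfig(resources={"CPU": 1, "GPU": 1}, max_num_nodes=1),
    NodeGroupConfig(resources={"CPU": 16}, max_num_nodes=4),
]
uncapped = max_cluster_resources(ClusterConfig(node_group_configs=node_groups))
capped = max_cluster_resources(
    ClusterConfig(max_resources={"CPU": 10}, node_group_configs=node_groups)
)
print(uncapped)  # {'CPU': 65.0, 'GPU': 1.0}
print(capped)    # {'CPU': 10.0, 'GPU': 1.0}
```

With the two example configs, the node groups alone allow up to 65 CPUs (1 x 1 + 16 x 4), while the second config's cluster-wide `max_resources` lowers that bound to 10. A library could use such a bound to avoid requesting resources the autoscaler can never satisfy.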

Related issue number

Closes #49501

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

message NodeGroupConfig {
  map<string, double> resources = 1;
  optional uint64 max_num_nodes = 2;

Count as

    // The minimum number of instances to launch.
    int32 min_count = 3;
    // The maximum number of instances to launch.
    int32 max_count = 4;

Contributor Author

Can it be negative?

Contributor

-1 for max means infinite. Let's sync offline first and decide how to move forward.
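
A minimal sketch of the sentinel convention described in this comment, where a max count of -1 means "no upper bound". The helper names are hypothetical, and the thread defers the final decision to an offline sync, so this illustrates the reviewer's description only and not necessarily the merged proto.

```python
import math

# Sentinel from the review comment: -1 for max means infinite.
UNLIMITED = -1


def effective_max_count(max_count: int) -> float:
    """Translate a raw max_count into a bound usable in comparisons."""
    if max_count == UNLIMITED:
        return math.inf
    if max_count < 0:
        # Any other negative value has no meaning under this convention.
        raise ValueError(f"invalid max_count: {max_count}")
    return float(max_count)


def can_launch(current_nodes: int, max_count: int) -> bool:
    """Return True if one more node fits under the (possibly infinite) limit."""
    return current_nodes < effective_max_count(max_count)
```

A signed field makes this sentinel possible, whereas the `optional uint64` field under review would express "no limit" by leaving the field unset, which is the trade-off behind the "Can it be negative?" question.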


message ClusterConfig {
  // Max resources for the entire cluster.
  map<string, double> max_resources = 1;

Cluster-level resources in Anyscale:

    // If set, the global resource minimum in the cluster (based on a Ray defined or custom resource).
    map<string, int64> min_resources = 106;

    // If set, the global resource maximum in the cluster (based on a Ray defined or custom resource).
    map<string, int64> max_resources = 107;

Contributor

@brucez-anyscale brucez-anyscale left a comment

In general LGTM. Please sync the Ray proto types with the Anyscale proto types, namely int64 and int32.

@jjyao jjyao added the "go" label (add ONLY when ready to merge, run all tests) Jan 5, 2025
jjyao added 5 commits January 5, 2025 10:04
@jjyao jjyao marked this pull request as ready for review January 6, 2025 22:10
@jjyao jjyao requested review from a team and hongchaodeng as code owners January 6, 2025 22:10
@kevin85421 kevin85421 self-assigned this Jan 8, 2025
Contributor

@brucez-anyscale brucez-anyscale left a comment

proto lgtm.

@jjyao jjyao merged commit 7ffad78 into ray-project:master Jan 8, 2025
@jjyao jjyao deleted the jjyao/sstaticc branch January 8, 2025 05:19
roshankathawate pushed a commit to roshankathawate/ray that referenced this pull request Jan 9, 2025
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Roshan Kathawate <roshankathawate@gmail.com>
HYLcool pushed a commit to HYLcool/ray that referenced this pull request Jan 13, 2025
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: lielin.hyl <lielin.hyl@alibaba-inc.com>
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Labels

community-backlog, go (add ONLY when ready to merge, run all tests)

Development

Successfully merging this pull request may close these issues.

[Core][Autoscaler] API for getting max cluster resources

5 participants