[autoscaler] Initial support for multiple worker types #9096

ericl merged 19 commits into ray-project:master from
Conversation
Can one of the admins verify this patch?

Test FAILed.
wuisawesome left a comment:
Left a few comments, but it generally looks good and I'm excited to use it!
> return status
> def request_resources(num_cpus=None, bundles=None):

wuisawesome:
It seems weird to special-case CPUs here. Wouldn't it make more sense to either (1) keep the same signature as the old autoscaler.request_resources, (2) make CPU part of the bundle, or (3) forget the bundle altogether and just use kwargs?
ericl:
Yeah, agreed, but this is an experimental API so I'm not worried about it. I updated the comment to say so.
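To illustrate the shape of the experimental API under discussion, here is a hypothetical helper showing how `num_cpus` can be treated as a convenience special case of `bundles` (option 2 from the review comment). The helper name and its exact behavior are illustrative assumptions, not code from this PR.

```python
# Hypothetical sketch: num_cpus is just a shorthand for one extra
# CPU bundle, where each bundle is a dict of resource name -> quantity.
# The function name "normalize_request" is illustrative, not the PR's code.
def normalize_request(num_cpus=None, bundles=None):
    """Return a single list of resource bundles for the autoscaler."""
    demands = list(bundles) if bundles else []
    if num_cpus:
        demands.append({"CPU": num_cpus})
    return demands

# A caller asking for 4 CPUs plus one GPU bundle:
print(normalize_request(num_cpus=4, bundles=[{"GPU": 1}]))
# → [{'GPU': 1}, {'CPU': 4}]
```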
> return node_resources, instance_type_counts
> def get_instances_to_launch(self, nodes: List[str],

wuisawesome:
There's a lot of INFO-level debug logging going on here that should probably be at DEBUG. I don't think we should generate any logs unless we're actually making an autoscaling decision or the cluster state changes.
ericl:
I think this is fine since these logs aren't user visible and they only run a few times a minute. It probably at most doubles the number of log messages for the autoscaler.
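As a sketch of the kind of decision the get_instances_to_launch function referenced above has to make, the following hypothetical first-fit routine maps outstanding resource demands onto the configured instance types. Function and variable names are illustrative assumptions, not the PR's implementation.

```python
# Hypothetical sketch: given outstanding resource demands (bundles) and
# the configured instance types, decide how many of each type to launch.
# Uses naive first-fit; unsatisfiable bundles are silently skipped here.
# Names are illustrative, not code from this PR.
def instances_to_launch(demands, instance_types):
    """demands: list of {resource: qty};
    instance_types: {type name: {resource: qty}}.
    Returns {type name: count to launch}."""
    counts = {}
    for bundle in demands:
        # First-fit: pick the first instance type that can hold the bundle.
        for name, resources in instance_types.items():
            if all(resources.get(r, 0) >= q for r, q in bundle.items()):
                counts[name] = counts.get(name, 0) + 1
                break
    return counts

types = {"m4.xlarge": {"CPU": 4}, "p2.xlarge": {"CPU": 4, "GPU": 1}}
print(instances_to_launch([{"CPU": 2}, {"GPU": 1}], types))
# → {'m4.xlarge': 1, 'p2.xlarge': 1}
```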
> logger.info(self.log_prefix +
>             "Running {} on {}...".format(cmd, self.ssh_ip))
> logger.info("Begin remote output from {}".format(self.ssh_ip))
> "Running {}".format(" ".join(final_cmd)))

wuisawesome:
Can we not change this? Or if we are going to change it, at least keep the KubernetesCommandRunner output consistent? Or maybe just change this in a different PR.
ericl:
I changed the kubectl one too. This is a nice incidental fix since it makes the command much more readable.
> # Setting this configuration enables the Tune resource demand scheduler.
> available_instance_types:
>     m4.xlarge:
>         resources: {"CPU": 4}

wuisawesome:
Is it necessary to specify the resource specs for each instance type? Couldn't we look this information up? This seems like it could be brittle if the resource spec is inaccurate.
ericl:
We can, but it would require additional API calls specific to each cloud or storing the metadata somewhere. It's probably easiest to specify it manually for now.
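A fuller sketch of the config shape being discussed, assuming a second GPU instance type is declared the same way; only `available_instance_types` and `resources` appear in the snippet above, and the second entry here is an illustrative assumption:

```yaml
# Sketch: declaring multiple worker types with manually specified
# resource specs, as in the quoted config snippet.
available_instance_types:
    m4.xlarge:
        resources: {"CPU": 4}
    p2.xlarge:
        resources: {"CPU": 4, "GPU": 1}
```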
Updated types and docs.

Test PASSed.
Why are these changes needed?
This is the initial PR for multiple worker types. It is not possible to use except through the request_resources() API. In the future, it will be integrated with the Ray scheduler to act upon resource demands directly (i.e., based on queue and placement group stats). This PR builds on #9091 (please ignore changes from that PR).
Related issue number
#8649
Checks

- I've run scripts/format.sh to lint the changes in this PR.