[autoscaler] Initial support for multiple worker types #9096

Merged

ericl merged 19 commits into ray-project:master from ericl:multi-worker-autoscaler
Jun 25, 2020
Conversation

@ericl (Contributor) commented Jun 23, 2020

Why are these changes needed?

This is the initial PR for multiple worker types. For now it can only be used through the request_resources() API. In the future, it will be integrated with the Ray scheduler to act on resource demands directly (i.e., based on queue and placement group stats).

This PR builds on #9091 (please ignore the changes carried over from that PR).

Related issue number

#8649

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27453/

@wuisawesome (Contributor) left a comment:
Left a few comments, but it generally looks good and I'm excited to use it!

Code context:

    return status

    def request_resources(num_cpus=None, bundles=None):
Contributor:

It seems weird to special-case CPUs here. Wouldn't it make more sense to either (1) keep the same signature as the old autoscaler.request_resources, (2) make CPU part of the bundle, or (3) forget the bundle altogether and just use kwargs?

Contributor Author:

Yeah, agreed, but this is an experimental API, so I'm not worried about it. I updated the comment to say so.
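For context, option (2) above — folding the num_cpus special case into the bundle form — could be sketched as a thin normalization layer. This is a hypothetical helper for illustration, not the PR's actual code, and it assumes a CPU request can be represented as single-CPU bundles:

```python
def normalize_request(num_cpus=None, bundles=None):
    """Hypothetical helper: fold the num_cpus special case into bundle
    form so downstream logic only ever sees one input shape."""
    bundles = list(bundles or [])
    if num_cpus:
        # Represent the CPU request as single-CPU bundles (one possible
        # convention; the real API might choose a different encoding).
        bundles.extend({"CPU": 1} for _ in range(num_cpus))
    return bundles

# Both call styles now reduce to a single list of bundles.
normalize_request(num_cpus=2)            # [{"CPU": 1}, {"CPU": 1}]
normalize_request(bundles=[{"CPU": 4}])  # [{"CPU": 4}]
```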


Code context:

    return node_resources, instance_type_counts

    def get_instances_to_launch(self, nodes: List[str],
Contributor:

There's a lot of INFO-level logging going on here that should probably be DEBUG. I don't think we should generate any logs unless we're actually making an autoscaling decision or the cluster state changes.

Contributor Author:

I think this is fine since these logs aren't user visible and only run a few times a minute. It probably at most doubles the number of log messages for the autoscaler.

Code context:

    logger.info(self.log_prefix +
                "Running {} on {}...".format(cmd, self.ssh_ip))
    logger.info("Begin remote output from {}".format(self.ssh_ip))

    "Running {}".format(" ".join(final_cmd)))
Contributor:

Can we not change this? Or if we are going to change it, at least keep the KubernetesCommandRunner output consistent? Or maybe just change this in a different PR.

Contributor Author:

I changed the kubectl one too. This is a nice incidental fix since it makes the command much more readable.
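The readability fix under discussion amounts to logging a joined command string rather than the raw argument list. A minimal illustration (the command values here are made up for the example):

```python
# Hypothetical argument list of the kind a command runner assembles.
final_cmd = ["ssh", "-i", "key.pem", "ubuntu@1.2.3.4", "uptime"]

# Formatting the list directly prints Python's list repr, which is noisy.
noisy = "Running {}".format(final_cmd)

# Joining the arguments yields a copy-pasteable shell command instead.
readable = "Running {}".format(" ".join(final_cmd))
```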

Code context:

    # Setting this configuration enables the Tune resource demand scheduler.
    available_instance_types:
        m4.xlarge:
            resources: {"CPU": 4}
Contributor:

Is it necessary to specify the resource specs for each instance type? Couldn't we look this information up? This seems like it could be brittle if the resource spec is inaccurate.

Contributor Author:

We can, but it would require additional API calls specific to each cloud or storing the metadata somewhere. It's probably easiest to specify it manually for now.
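As a rough illustration of how the manually specified resource specs feed the scaling decision, here is a hypothetical first-fit sketch that maps demanded bundles to instance counts. This is not the PR's actual get_instances_to_launch logic; the function name, the greedy strategy, and the silent handling of unsatisfiable bundles are all assumptions made for the example:

```python
def instances_for_demand(demand_bundles, instance_types):
    """Greedy first-fit sketch: count how many nodes of each instance
    type to launch so every demanded bundle fits on some node.
    Bundles that fit no instance type are silently skipped."""
    launched = []  # planned nodes as (type_name, remaining_resources)
    counts = {}
    for bundle in demand_bundles:
        # First, try to fit the bundle on an already-planned node.
        for _, remaining in launched:
            if all(remaining.get(k, 0) >= v for k, v in bundle.items()):
                for k, v in bundle.items():
                    remaining[k] -= v
                break
        else:
            # Otherwise, launch the first instance type that can hold it.
            for name, spec in instance_types.items():
                resources = spec["resources"]
                if all(resources.get(k, 0) >= v for k, v in bundle.items()):
                    remaining = dict(resources)
                    for k, v in bundle.items():
                        remaining[k] -= v
                    launched.append((name, remaining))
                    counts[name] = counts.get(name, 0) + 1
                    break
    return counts

# Four 2-CPU bundles pack onto two 4-CPU m4.xlarge nodes.
types = {"m4.xlarge": {"resources": {"CPU": 4}}}
instances_for_demand([{"CPU": 2}] * 4, types)  # {"m4.xlarge": 2}
```

If the configured resource spec is inaccurate (the brittleness raised above), this packing over- or under-provisions, which is why looking the specs up from the cloud provider could be more robust.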

@ericl (Contributor Author) commented Jun 24, 2020

Updated types and docs.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27547/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27539/

@ericl ericl merged commit 536795e into ray-project:master Jun 25, 2020