[autoscaler] Initial support for multiple worker types #9096

Merged

ericl merged 19 commits into ray-project:master from ericl:multi-worker-autoscaler
Jun 25, 2020
Conversation

@ericl (Contributor) commented Jun 23, 2020

Why are these changes needed?

This is the initial PR for multiple worker types. For now it can only be used through the request_resources() API. In the future, it will be integrated with the Ray scheduler to act on resource demands directly (i.e., based on queue and placement group stats).

This PR builds on #9091 (please ignore the changes carried over from that PR).

Related issue number

#8649

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27453/

@wuisawesome (Contributor) left a comment:
Left a few comments, but it generally looks good and I'm excited to use it!

Code context:

    return status

    def request_resources(num_cpus=None, bundles=None):
Contributor:

It seems weird to special-case CPUs here. Wouldn't it make more sense to either (1) keep the same signature as the old autoscaler.request_resources, (2) make CPU part of the bundle, or (3) forget the bundle altogether and just use kwargs?

Contributor Author:

Yeah, agreed, but this is an experimental API, so I'm not worried about it. I updated the comment to say so.
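For context, option (2) above — folding the num_cpus special case into the bundle form — could be sketched as a thin normalization layer. This is a hypothetical helper for illustration, not the PR's actual code, and it assumes a CPU request can be represented as single-CPU bundles:

```python
def normalize_request(num_cpus=None, bundles=None):
    """Hypothetical helper: fold the num_cpus special case into bundle
    form so downstream logic only ever sees one input shape."""
    bundles = list(bundles or [])
    if num_cpus:
        # Represent the CPU request as single-CPU bundles (one possible
        # convention; the real API might choose a different encoding).
        bundles.extend({"CPU": 1} for _ in range(num_cpus))
    return bundles

# Both call styles now reduce to a single list of bundles.
normalize_request(num_cpus=2)            # [{"CPU": 1}, {"CPU": 1}]
normalize_request(bundles=[{"CPU": 4}])  # [{"CPU": 4}]
```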


Code context:

    return node_resources, instance_type_counts

    def get_instances_to_launch(self, nodes: List[str],
Contributor:

There's a lot of INFO-level logging going on here that should probably be DEBUG. I don't think we should generate any logs unless we're actually making an autoscaling decision or the cluster state changes.

Contributor Author:

I think this is fine since these logs aren't user visible and only run a few times a minute. It probably at most doubles the number of log messages for the autoscaler.

Code context:

    logger.info(self.log_prefix +
                "Running {} on {}...".format(cmd, self.ssh_ip))
    logger.info("Begin remote output from {}".format(self.ssh_ip))

    "Running {}".format(" ".join(final_cmd)))
Contributor:

Can we not change this? Or if we are going to change it, at least keep the KubernetesCommandRunner output consistent? Or maybe just change this in a different PR.

Contributor Author:

I changed the kubectl one too. This is a nice incidental fix since it makes the command much more readable.
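The readability fix under discussion amounts to logging a joined command string rather than the raw argument list. A minimal illustration (the command values here are made up for the example):

```python
# Hypothetical argument list of the kind a command runner assembles.
final_cmd = ["ssh", "-i", "key.pem", "ubuntu@1.2.3.4", "uptime"]

# Formatting the list directly prints Python's list repr, which is noisy.
noisy = "Running {}".format(final_cmd)

# Joining the arguments yields a copy-pasteable shell command instead.
readable = "Running {}".format(" ".join(final_cmd))
```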

Code context:

    # Setting this configuration enables the Tune resource demand scheduler.
    available_instance_types:
        m4.xlarge:
            resources: {"CPU": 4}
Contributor:

Is it necessary to specify the resource specs for each instance type? Couldn't we look this information up? This seems like it could be brittle if the resource spec is inaccurate.

Contributor Author:

We can, but it would require additional API calls specific to each cloud or storing the metadata somewhere. It's probably easiest to specify it manually for now.
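As a rough illustration of how the manually specified resource specs feed the scaling decision, here is a hypothetical first-fit sketch that maps demanded bundles to instance counts. This is not the PR's actual get_instances_to_launch logic; the function name, the greedy strategy, and the silent handling of unsatisfiable bundles are all assumptions made for the example:

```python
def instances_for_demand(demand_bundles, instance_types):
    """Greedy first-fit sketch: count how many nodes of each instance
    type to launch so every demanded bundle fits on some node.
    Bundles that fit no instance type are silently skipped."""
    launched = []  # planned nodes as (type_name, remaining_resources)
    counts = {}
    for bundle in demand_bundles:
        # First, try to fit the bundle on an already-planned node.
        for _, remaining in launched:
            if all(remaining.get(k, 0) >= v for k, v in bundle.items()):
                for k, v in bundle.items():
                    remaining[k] -= v
                break
        else:
            # Otherwise, launch the first instance type that can hold it.
            for name, spec in instance_types.items():
                resources = spec["resources"]
                if all(resources.get(k, 0) >= v for k, v in bundle.items()):
                    remaining = dict(resources)
                    for k, v in bundle.items():
                        remaining[k] -= v
                    launched.append((name, remaining))
                    counts[name] = counts.get(name, 0) + 1
                    break
    return counts

# Four 2-CPU bundles pack onto two 4-CPU m4.xlarge nodes.
types = {"m4.xlarge": {"resources": {"CPU": 4}}}
instances_for_demand([{"CPU": 2}] * 4, types)  # {"m4.xlarge": 2}
```

If the configured resource spec is inaccurate (the brittleness raised above), this packing over- or under-provisions, which is why looking the specs up from the cloud provider could be more robust.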

@ericl (Contributor Author) commented Jun 24, 2020

Updated types and docs.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27547/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/27539/

@ericl ericl merged commit 536795e into ray-project:master Jun 25, 2020