
Autoscaler hotfix for #4555. #4653

Merged

robertnishihara merged 4 commits into ray-project:master from romilbhardwaj:autoscaler_hotfix on May 8, 2019

Conversation

@romilbhardwaj
Member

What do these changes do?

These changes may fix the autoscaler bug introduced in #4555:

With a head node with num-cpus=1, this commit results in the autoscaler LoadMetrics reporting an empty dict for both static and dynamic resources once 1 job is scheduled on the head node.

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/500/

@romilbhardwaj
Member Author

@ls-daniel Can you verify if this fixes the autoscaler? Or would it be possible to share the autoscaler config that was failing?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13834/

@jovany-wang
Contributor

Is this PR also able to fix #4655?

@ls-daniel
Contributor

ls-daniel commented Apr 18, 2019

This doesn't work.

If we add print("static_resources", static_resources) to that function, and start running some tasks, the logs look like this:

2019-04-18 13:55:00,805 DEBUG static_resources {b'CPU': 1.0}
2019-04-18 13:55:00,921 DEBUG static_resources {b'CPU': 1.0}
2019-04-18 13:55:01,030 DEBUG static_resources {}
2019-04-18 13:55:01,142 DEBUG static_resources {}
2019-04-18 13:55:01,253 DEBUG static_resources {}

that is, the static resources disappear, as I mentioned earlier. So you can't use the static_resources to imply a dynamic resource should be zero. The static resources should not change, a node with 1 CPU always has 1 CPU regardless of what's running on it.

Start ray with --num-cpus=1, min_workers=0. Other configuration parameters shouldn't matter.

You'll never scale the first worker, because the static resources disappear as soon as you do anything on the head node.

@ls-daniel
Contributor

If I revert the other commit it works fine. I'm fairly sure it's not an issue in the autoscaler. I've had a look at the commit to try and figure out which bit of the code is adjusting static resources.

It makes sense to me that you may wish for a dynamic resource to disappear instead of being zero. A static resource should a) never be zero and b) never disappear, it's static :)
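The distinction can be sketched in a few lines (a hypothetical illustration of the intended semantics, not Ray's actual code):

```python
# Static resources describe a node's capacity and should never change;
# dynamic resources describe what is currently available and may shrink
# to zero (or be reported as absent) while tasks run.
static_resources = {"CPU": 1.0}   # fixed when the node starts
dynamic_resources = {"CPU": 1.0}  # varies as tasks are scheduled

def schedule_task(cpus):
    """Consume available CPUs; capacity is untouched."""
    dynamic_resources["CPU"] = max(0.0, dynamic_resources["CPU"] - cpus)

schedule_task(1.0)
assert static_resources == {"CPU": 1.0}  # still 1 CPU of capacity
assert dynamic_resources["CPU"] == 0.0   # nothing free right now
```

The bug described above is that the first invariant fails: the static map itself becomes empty once a task lands on the head node.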

@romilbhardwaj
Member Author

Thanks @ls-daniel. I agree, static resources should indeed never be zero or disappear.

The LoadMetrics.update method is called from monitor.py, which updates it with values from the raylet heartbeats:

ray/python/ray/monitor.py

Lines 118 to 131 in be2cbdf

for i in range(num_resources):
    dyn = heartbeat_message.ResourcesAvailableLabel(i)
    static = heartbeat_message.ResourcesTotalLabel(i)
    dynamic_resources[dyn] = (
        heartbeat_message.ResourcesAvailableCapacity(i))
    static_resources[static] = (
        heartbeat_message.ResourcesTotalCapacity(i))

# Update the load metrics for this raylet.
client_id = ray.utils.binary_to_hex(heartbeat_message.ClientId())
ip = self.raylet_id_to_ip_map.get(client_id)
if ip:
    self.load_metrics.update(ip, static_resources,
                             dynamic_resources)

This behavior is a bit unexpected since the resources_total_ in cluster_resource_map is never modified once created from the GCS entry, and thus ResourcesTotalCapacity in the heartbeat should never change. cc @robertnishihara
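One possible defensive guard at this point in the monitor loop (an illustration only, not the fix that was eventually merged): skip heartbeats whose static resource map is empty, since a node's total capacity should never legitimately vanish.

```python
# Hypothetical guard for the monitor's heartbeat handling: an empty
# total-capacity map indicates a corrupted or incomplete heartbeat,
# so the previously known static resources should be kept instead.
def should_apply_heartbeat(static_resources):
    if not static_resources:
        # Refuse to overwrite known capacity with an empty map.
        return False
    return True

assert should_apply_heartbeat({"CPU": 1.0})
assert not should_apply_heartbeat({})
```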

@romilbhardwaj
Member Author

Is this PR also able to fix #4655?

Yes, I think this is related to #4655 and should have fixed it.

dynamic_resources_update[resource_name] = 0.0
self.dynamic_resources_by_ip[ip] = dynamic_resources_update

self.dynamic_resources_by_ip[ip] = dynamic_resources
Contributor


Lines 166 and 168 do not make sense together; should line 168 be removed?

Collaborator


Yes I think so, thanks!
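The problem flagged above can be sketched in isolation (names simplified from the PR diff; this is an illustration, not the actual LoadMetrics code): zero-filling missing dynamic resources and then overwriting the result with the raw heartbeat dict discards the zero-fill entirely.

```python
# "update_buggy" mirrors the two conflicting lines in the diff;
# "update_fixed" keeps only the zero-filled dict.
def update_buggy(static, dynamic):
    update = dict(dynamic)
    for name in static:
        if name not in dynamic:
            update[name] = 0.0  # zero-fill resources missing from heartbeat
    result = update             # corresponds to "line 166"
    result = dynamic            # corresponds to "line 168": clobbers the fill
    return result

def update_fixed(static, dynamic):
    update = dict(dynamic)
    for name in static:
        if name not in dynamic:
            update[name] = 0.0
    return update

assert update_buggy({"CPU": 1.0}, {}) == {}            # zero-fill lost
assert update_fixed({"CPU": 1.0}, {}) == {"CPU": 0.0}  # zero-fill kept
```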

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/537/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13886/

@nikola-j

This change solved #4655 for me.

@ls-daniel
Contributor

I've tried with @robertnishihara's change and that doesn't work either. Same behaviour.

The autoscaler on Ray master is broken for me. Further changes to the resource mechanisms mean reverting #4555 is no longer easily possible either, so I'm stuck about 30 commits back now.

@robertnishihara
Collaborator

@ls-daniel thanks, @williamma12 is looking into it.

@williamma12
Contributor

@ls-daniel I was able to reproduce the error and pushed a fix that resolved it for me. Can you let me know if it fixes it for you too?

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/678/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/14082/

@ls-daniel
Contributor

@williamma12 this works perfectly cherry-picked into my branch - Thank you :)

I'm now rebasing on master to test again and will report back.

@robertnishihara robertnishihara merged commit 0421cba into ray-project:master May 8, 2019
@robertnishihara
Collaborator

Thanks @williamma12! @ls-daniel The autoscaler is unfortunately not as well tested because it interacts with AWS or GCP. Do you have thoughts about how to improve the testing setup?


9 participants