[tune] Change <= 0 to < 0 in over-commit check#4600

Closed
hartikainen wants to merge 2 commits into ray-project:master from hartikainen:fix-overcommit-check

Conversation

@hartikainen
Contributor

I was trying to run a Tune experiment where my head node had 0 GPUs and the workers have > 0 GPUs, using queue_trials=True. I expected the trials (which required 1 GPU each) to be queued until the first worker came up, but instead the cluster failed with:

ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 4 CPUs, 1.0 GPUs but the cluster has only 4 CPUs, 0 GPUs. Pass queue_trials=True in ray.tune.run() or on the command line to queue trials until the cluster scales up.

If I understand Tune's resource usage correctly, the over-commit check in the trial executor should use a strict inequality when deciding whether the cluster is saturated. Correct me if I'm wrong.
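The distinction can be sketched as follows. This is a simplified model, not Ray's actual trial-executor code; the names `committed` and `available` are placeholders for the resources Tune has committed to trials and the totals the cluster reports:

```python
def is_saturated_lte(committed: float, available: float) -> bool:
    """Original over-commit check: the cluster counts as saturated
    whenever no headroom is left (available - committed <= 0)."""
    return available - committed <= 0


def is_saturated_lt(committed: float, available: float) -> bool:
    """Proposed check: strict inequality, so a resource that is simply
    absent (0 committed out of 0 available) is not flagged as
    over-committed, and queued trials can wait for the cluster to scale."""
    return available - committed < 0


# A GPU-less head node: 0 GPUs committed, 0 GPUs available.
print(is_saturated_lte(0, 0))  # True  -> raises TuneError despite queue_trials=True
print(is_saturated_lt(0, 0))   # False -> trial stays queued
```

With the strict inequality, only a genuinely negative balance (more committed than available) is treated as over-commitment.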

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/422/

@hartikainen hartikainen requested a review from richardliaw April 11, 2019 00:40
@richardliaw richardliaw requested a review from ericl April 11, 2019 00:56
@richardliaw
Contributor

cc @ericl who is more familiar with this code

@hartikainen
Contributor Author

hartikainen commented Apr 11, 2019

I was hoping to get around this by running ray start --head --num-gpus=0.01, but that gives me Error: Invalid value for "--num-gpus": 0.01 is not a valid integer.

Another workaround could be to run ray start --head --num-gpus=1 --num-cpus=1 and have trials request > 1 CPU. This leads to the following Tune resource allocation:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 2/1 CPUs, 0.5/1 GPUs
Memory usage on this node: 1.0/15.8 GB

Ray does not actually allocate anything, though, and the autoscaler thinks there's no resource usage:

2019-04-11 01:01:05,030 INFO autoscaler.py:625 -- StandardAutoscaler: 0/0 target nodes (0 pending)

which is why the cluster doesn't scale at all.

That said, I don't actually know if this PR would fix the issue.

@hartikainen hartikainen changed the title Change <= 0 to < 0 in over-commit check [tune] Change <= 0 to < 0 in over-commit check Apr 11, 2019
@ericl
Contributor

ericl commented Apr 11, 2019

This won't actually trigger scaling up unless you set min worker nodes to 1. This is a limitation of the scheduler right now.

@hartikainen
Contributor Author

hartikainen commented Apr 11, 2019

Specifying initial_workers > 1 together with ray start --num-gpus=1 --num-cpus=1 ... allows a trial to be started on a worker, but things still hang.

@ericl Do you know of any workaround for running a cluster with a non-GPU head node and GPU workers?

Edit: Never mind, the worker starts running fine with this setup after all. For some reason the training iteration took almost 15 minutes, which confused me.

@ericl
Contributor

ericl commented Apr 11, 2019 via email

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13712/

@richardliaw
Contributor

I think we can close this for now, as it seems like it's not the right fix.
