[tune] Change <= 0 to < 0 in over-commit check#4600

Closed
hartikainen wants to merge 2 commits into ray-project:master from hartikainen:fix-overcommit-check

Conversation

@hartikainen
Contributor

I was trying to run a Tune experiment where my head node had 0 GPUs and the workers have > 0 GPUs, using queue_trials=True. I expected the trials (which required 1 GPU each) to be queued until the first worker came up, but instead the cluster failed with:

ray.tune.error.TuneError: Insufficient cluster resources to launch trial: trial requested 4 CPUs, 1.0 GPUs but the cluster has only 4 CPUs, 0 GPUs. Pass queue_trials=True in ray.tune.run() or on the command line to queue trials until the cluster scales up.

If I understand Tune's resource usage correctly, the over-commit check in the trial executor should use a strict inequality when deciding whether the cluster is saturated. Correct me if I'm wrong.
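The distinction can be sketched as follows. This is a simplified model, not Ray's actual trial-executor code; the names `committed` and `available` are placeholders for the resources Tune has committed to trials and the totals the cluster reports:

```python
def is_saturated_lte(committed: float, available: float) -> bool:
    """Original over-commit check: the cluster counts as saturated
    whenever no headroom is left (available - committed <= 0)."""
    return available - committed <= 0


def is_saturated_lt(committed: float, available: float) -> bool:
    """Proposed check: strict inequality, so a resource that is simply
    absent (0 committed out of 0 available) is not flagged as
    over-committed, and queued trials can wait for the cluster to scale."""
    return available - committed < 0


# A GPU-less head node: 0 GPUs committed, 0 GPUs available.
print(is_saturated_lte(0, 0))  # True  -> raises TuneError despite queue_trials=True
print(is_saturated_lt(0, 0))   # False -> trial stays queued
```

With the strict inequality, only a genuinely negative balance (more committed than available) is treated as over-commitment.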

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/422/

@hartikainen hartikainen requested a review from richardliaw April 11, 2019 00:40
@richardliaw richardliaw requested a review from ericl April 11, 2019 00:56
@richardliaw
Contributor

cc @ericl who is more familiar with this code

@hartikainen
Contributor Author

hartikainen commented Apr 11, 2019

I was hoping to get around this by running ray start --head --num-gpus=0.01, but that gives me Error: Invalid value for "--num-gpus": 0.01 is not a valid integer.

Another workaround could be to run ray start --head --num-gpus=1 --num-cpus=1 and have trials request > 1 CPU. This leads to the following Tune resource allocation:

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 2/1 CPUs, 0.5/1 GPUs
Memory usage on this node: 1.0/15.8 GB

Ray does not actually allocate anything, though, and the autoscaler thinks there's no resource usage:

2019-04-11 01:01:05,030 INFO autoscaler.py:625 -- StandardAutoscaler: 0/0 target nodes (0 pending)

which is why the cluster doesn't scale at all.

That said, I don't actually know if this PR would fix the issue.

@hartikainen hartikainen changed the title Change <= 0 to < 0 in over-commit check [tune] Change <= 0 to < 0 in over-commit check Apr 11, 2019
@ericl
Contributor

ericl commented Apr 11, 2019

This won't actually trigger scaling up unless you set min worker nodes to 1. This is a limitation of the scheduler right now.

@hartikainen
Contributor Author

hartikainen commented Apr 11, 2019

Specifying initial_workers > 1 together with ray start --num-gpus=1 --num-cpus=1 ... allows a trial to be started on a worker, but things still hang.

@ericl Do you know of any workaround for running a cluster with a non-GPU head node and GPU workers?

Edit: Never mind, the worker starts running fine with this setup after all. For some reason the training iteration took almost 15 minutes, which confused me.

@ericl
Contributor

ericl commented Apr 11, 2019 via email

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13712/

@richardliaw
Contributor

I think we can close this for now, as it seems like it's not the right fix.
