[tune] Change <= 0 to < 0 in over-commit check #4600
hartikainen wants to merge 2 commits into ray-project:master
Test PASSed.
cc @ericl who is more familiar with this code |
I was hoping to get around this with a workaround, but Ray does not allocate anything and the autoscaler thinks that there's no resource usage, which is why the cluster doesn't scale at all. That said, I don't actually know if this PR would fix the issue.
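The scaling behavior described here can be modeled with a toy sketch. All names below (`target_workers`, `pending_gpu_demand`, `gpus_per_worker`) are hypothetical, not the real autoscaler's fields; this only illustrates why zero reported usage keeps the cluster at its minimum size.

```python
import math

# Toy model of the autoscaler decision described above; the function and
# argument names are illustrative, not taken from Ray's autoscaler code.
def target_workers(pending_gpu_demand: int, gpus_per_worker: int,
                   min_workers: int) -> int:
    """Number of workers the autoscaler would request for the reported demand."""
    needed = math.ceil(pending_gpu_demand / gpus_per_worker) if gpus_per_worker else 0
    return max(needed, min_workers)

# Queued trials hold no resources, so the reported demand stays 0 and the
# cluster never grows beyond `min_workers`.
print(target_workers(pending_gpu_demand=0, gpus_per_worker=1, min_workers=0))  # 0
```

With zero reported demand, the only way to get a GPU worker is to raise `min_workers`, which matches the workaround discussed below.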
This won't actually trigger scaling up unless you set min worker nodes = 1. This is a limitation of the scheduler right now.
Specifying initial_workers > 1 together with ray start --num-gpus=1 --num-cpus=1 ... allows a trial to be started on a worker, but things still hang. @ericl Do you know any workaround for running a cluster with a non-gpu head node and gpu workers? Edit: Nevermind, the worker starts running fine with this setup after all. For some reason the training iteration took almost 15 minutes, which got me confused.
I don't think you need to do anything special, in fact you shouldn't set
num CPUs or GPUs. If the trial fits on the worker node then it should run.
On Wed, Apr 10, 2019, 7:07 PM Kristian Hartikainen wrote:
Specifying initial_workers > 1 together with ray start --num-gpus=1
--num-cpus=1 ... allows a trial to be started on a worker, but things
still hang.
@ericl <https://github.com/ericl> Do you know any workaround for running
cluster with non-gpu head node and gpu workers?
Test FAILed.
I think we can close this for now, as it seems like it's not the right fix?
I was trying to run a Tune experiment where my head node had 0 gpus and the workers had > 0 gpus, using queue_trials=True. I expected the trials (which required 1 gpu each) to be queued until the first worker came up, but instead the cluster failed with an error.

If I understand the Tune resource usage correctly, the over-commit check in the trial executor should use a strict inequality when deciding whether the cluster is saturated. Correct me if I'm wrong.
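The proposed change can be illustrated with a minimal sketch. The function and argument names here are hypothetical (the real check lives in Tune's trial executor); the sketch only shows how the choice of `<= 0` versus `< 0` changes the exact-fit case.

```python
# Minimal sketch of the over-commit check this PR changes; function and
# argument names are hypothetical, not Tune's actual identifiers.
def has_resources(committed_gpus: float, requested_gpus: float,
                  available_gpus: float) -> bool:
    """Return True if the cluster can accept the request without over-committing."""
    remaining = available_gpus - committed_gpus - requested_gpus
    # With `<= 0`, an exact fit (remaining == 0) is flagged as over-commit;
    # with `< 0`, only a genuine deficit blocks the trial.
    return not (remaining < 0)

# Exact fit: 1 GPU available, nothing committed, trial requests exactly 1 GPU.
print(has_resources(committed_gpus=0, requested_gpus=1, available_gpus=1))  # True
```

Under the old `<= 0` comparison the exact-fit case above would be rejected, which is the behavior this PR set out to change.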
Linter
Run scripts/format.sh to lint the changes in this PR.