Autoscaler hotfix for #4555.#4653
Conversation
|
Test PASSed. |
|
@ls-daniel Can you verify if this fixes the autoscaler? Or would it be possible to share the autoscaler config that was failing? |
|
Test FAILed. |
|
Is this PR also able to fix #4655? |
|
This doesn't work. If we add that is, the static resources disappear, as I mentioned earlier. So you can't use the static_resources to imply a dynamic resource should be zero. The static resources should not change, a node with 1 CPU always has 1 CPU regardless of what's running on it. Start ray with --num-cpus=1, min_workers=0. Other configuration parameters shouldn't matter. You'll never scale the first worker, because the static resources disappear as soon as you do anything on the head node. |
|
If I revert the other commit it works fine. I'm fairly sure it's not an issue in the autoscaler. I've had a look at the commit to try and figure out which bit of the code is adjusting static resources. It makes sense to me that you may wish for a dynamic resource to disappear instead of being zero. A static resource should a) never be zero and b) never disappear, it's static :) |
|
Thanks @ls-daniel. I agree, static resources should indeed never be zero or disappear. The Lines 118 to 131 in be2cbdf This behavior is a bit unexpected since the |
python/ray/autoscaler/autoscaler.py
Outdated
| dynamic_resources_update[resource_name] = 0.0 | ||
| self.dynamic_resources_by_ip[ip] = dynamic_resources_update | ||
|
|
||
| self.dynamic_resources_by_ip[ip] = dynamic_resources |
There was a problem hiding this comment.
Line 168 and Line 166 do not make sense together, should 168 be removed?
There was a problem hiding this comment.
Yes I think so, thanks!
|
Test PASSed. |
|
Test FAILed. |
|
This change solved #4655 for me. |
|
I've tried with @Rnishihara's change and that doesn't work either. Same behaviour. The autoscaler on Ray master is broken for me, further changes to the resource mechanisms have meant reverting 4555 isn't easily possible any more either, so I'm stuck about 30 commits back now |
|
@ls-daniel thanks, @williamma12 is looking into it. |
|
@ls-daniel I was able to reproduce the error and pushed a fix that fixed it for me, can you let me know if it fixes it for you? |
|
Test PASSed. |
|
Test FAILed. |
|
@williamma12 this works perfectly cherry-picked into my branch - Thank you :) I'm now rebasing on master to test again and will report back. |
|
Thanks @williamma12! @ls-daniel The autoscaler is unfortunately not as well tested because it interacts with AWS or GCP. Do you have thoughts about how to improve the testing setup? |
What do these changes do?
These changes may fix the autoscaler bug introduced in #4555:
Linter
scripts/format.shto lint the changes in this PR.