[autoscalerv2] use replicas in workerGroupSpecs as current workers number when initialize scale request to fix scale up target is wrong#47967
…mber when initialize scale request to fix scale up target is wrong Signed-off-by: dujunling <dujunling@bytedance.com>
When Ray requests a large number of workers, the autoscaler calculates how many workers need to be started and adds that number to the current number of instances to get the new replica count in the CR. However, the current number of instances lags behind the replica count set in the previous round, because some workers have not finished starting yet. As a result, the target number of instances computed in this round is wrong.
I see. Just so that I understand correctly: you are saying that the "current instances" we pull from the API server here might be outdated? IIUC, the … Am I missing something here?
Yes, the current instance count from the API server is smaller than the replicas in the CR. The target should therefore be replicas plus the number of workers to add.
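To make the difference concrete, here is a minimal sketch of the two ways of computing the target. It is not the code in this PR: the function names and the numbers are hypothetical, and it assumes the autoscaler has already decided how many additional workers it wants in this round.

```python
# Illustrative only: these helpers are hypothetical and do not exist in Ray.

def target_from_running(running_workers: int, workers_to_add: int) -> int:
    """Old behaviour: base the target on workers the API server reports."""
    return running_workers + workers_to_add


def target_from_replicas(spec_replicas: int, workers_to_add: int) -> int:
    """Proposed behaviour: base the target on replicas in workerGroupSpecs."""
    return spec_replicas + workers_to_add


# Assumed scenario: the previous round set replicas to 10, but only 4 of
# those workers have started. This round the autoscaler wants 5 more.
spec_replicas = 10
running_workers = 4
workers_to_add = 5

old_target = target_from_running(running_workers, workers_to_add)
new_target = target_from_replicas(spec_replicas, workers_to_add)

print(f"target from running workers: {old_target}")
print(f"target from CR replicas:     {new_target}")
```

In this assumed scenario the first calculation produces a target below the replica count that was already requested, while the second keeps the pending workers and adds the new ones on top.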
Ah I see, this makes sense to me! Good point. (also cc @kevin85421) Could you add more comments to the change so that it's clear for future reference?
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Not stale.