[autoscalerv2] use replicas in workerGroupSpecs as current workers number when initialize scale request to fix scale up target is wrong#47967

Closed
dujl wants to merge 1 commit intoray-project:masterfrom
dujl:scale-slow

Conversation

@dujl

@dujl dujl commented Oct 10, 2024

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@dujl dujl requested review from a team and hongchaodeng as code owners October 10, 2024 07:40
…mber when initialize scale request to fix scale up target is wrong

Signed-off-by: dujunling <dujunling@bytedance.com>
@anyscalesam anyscalesam added triage Needs triage (eg: priority, bug/not-bug, and owning component) core-autoscaler autoscaler related issues core Issues that should be addressed in Ray Core labels Oct 16, 2024
@jjyao jjyao removed the triage Needs triage (eg: priority, bug/not-bug, and owning component) label Oct 28, 2024
@jjyao jjyao self-assigned this Oct 28, 2024
Member

@rickyyx rickyyx left a comment

Hey @dujl - thanks for the PR, would you mind adding some more details on the bug/fix here?

@dujl
Author

dujl commented Nov 7, 2024

When Ray requests a large number of workers, the autoscaler calculates how many new workers need to be started and adds the current number of instances to produce the updated replicas value in the CR. However, the current number of instances lags behind the target from the previous round, because some workers have not yet actually started. As a result, the target number of instances computed in this round is too low.
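A minimal sketch of the arithmetic described above (the function names and numbers are illustrative, not the actual Ray autoscaler code):

```python
# Hypothetical sketch of the scale-up target calculation described in
# this comment; names are illustrative, not the real autoscaler code.

def buggy_target(running_instances: int, workers_to_add: int) -> int:
    # Uses the instance count reported by the API server, which can lag
    # behind the replicas already requested in the CR.
    return running_instances + workers_to_add

def fixed_target(cr_replicas: int, workers_to_add: int) -> int:
    # Uses replicas from workerGroupSpecs in the RayCluster CR, which
    # already accounts for workers that are requested but not yet running.
    return cr_replicas + workers_to_add

# Example: the CR already asks for 10 replicas, but only 4 are running so
# far, and the autoscaler now wants 5 more workers.
print(buggy_target(4, 5))   # 9  -- lower than the 10 already requested
print(fixed_target(10, 5))  # 15 -- the intended new target
```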

@rickyyx
Member

rickyyx commented Nov 7, 2024

I see - so just so that I understand correctly, you are saying that the "current instances" we pulled from the API server here might be out-dated?
https://github.com/ray-project/ray/pull/47967/files#diff-23efc7e6b062e7701d8e1c8727079203e4250abd671e5821f2dbb86f4de85abaL208

IIUC, the current_instances should translate to the existing number of replicas that you are using here from ray_cluster?

Am I missing something here?

@dujl
Author

dujl commented Nov 8, 2024

Yes, the current instance count from the api-server is smaller than the replicas value in the CR. So the target should be replicas + the number of workers to add.
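The fix, then, is to read the base count from the RayCluster CR rather than from the possibly stale API-server view. A hedged sketch (the CR structure mirrors KubeRay's RayCluster spec; the helper and field access here are assumptions for illustration):

```python
# Hypothetical sketch: derive the scale-up target from replicas in
# workerGroupSpecs of the RayCluster CR instead of the lagging
# API-server instance count. Field names follow KubeRay's CRD layout.

ray_cluster = {
    "spec": {
        "workerGroupSpecs": [
            {"groupName": "small-group", "replicas": 10},
        ]
    }
}

def current_replicas(cr: dict, group: str) -> int:
    # Look up the replicas already requested for the given worker group.
    for spec in cr["spec"]["workerGroupSpecs"]:
        if spec["groupName"] == group:
            return spec.get("replicas", 0)
    return 0

workers_to_add = 5
target = current_replicas(ray_cluster, "small-group") + workers_to_add
print(target)  # 15, even if only a few of the 10 replicas are running yet
```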

@rickyyx
Member

rickyyx commented Nov 8, 2024

Ah I see, this makes sense to me! Good point. (also cc @kevin85421 )

Could you add more comments to the change so that it's clear for future reference?
Also, it would be nice to add some testing info (it doesn't have to be a unit test, but a description of the actual repro you ran into and how the PR solves it).

@kevin85421 kevin85421 changed the title [autoscalerv2] use replicas in workerGroupSpecs as current workers nu… [autoscalerv2] use replicas in workerGroupSpecs as current workers number when initialize scale request to fix scale up target is wrong Nov 9, 2024
@kevin85421 kevin85421 self-assigned this Nov 9, 2024
@stale

stale bot commented Feb 1, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Feb 1, 2025
@hainesmichaelc hainesmichaelc added the community-contribution Contributed by the community label Apr 4, 2025
@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Apr 4, 2025
@stale

stale bot commented May 6, 2025

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label May 6, 2025
@kevin85421 kevin85421 removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label May 6, 2025
@kevin85421
Member

not stale

@jjyao jjyao assigned rueian and unassigned jjyao and rickyyx May 13, 2025
@kevin85421 kevin85421 closed this May 15, 2025

Labels

community-contribution Contributed by the community core Issues that should be addressed in Ray Core core-autoscaler autoscaler related issues

Projects

None yet


8 participants