
[Serve] Group DeploymentHandle autoscaling metrics pushes by process #45957

Closed
JoshKarpel wants to merge 9 commits into ray-project:master from JoshKarpel:issue-45777-amortize-metrics-pushing-v2

Conversation

Contributor

@JoshKarpel JoshKarpel commented Jun 14, 2024

Why are these changes needed?

We're seeing a lot of pressure on the Serve controller from metrics push tasks when running thousands of Serve apps. Much of this pressure comes purely from the overhead of many incoming RPC connections to the controller. We might be able to amortize this overhead (and presumably similar overhead in the handles) by pushing metrics at the per-process level instead of the per-handle level.

We've made this change in our setup and it has reduced CPU time spent on metrics pushing both on the controller and on our ingress application replicas, which hold all the handles.
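To illustrate the per-process idea, here is a minimal hypothetical sketch (illustrative names only, not Ray's actual API): every handle registers a metrics callback with one shared pusher per process, and a single background loop batches all handles' metrics into one push instead of one RPC per handle.

```python
import threading
import time


class SharedMetricsPusher:
    """Hypothetical per-process pusher; one instance shared by all handles."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self, interval_s: float = 10.0):
        self._interval_s = interval_s
        self._callbacks = {}  # handle_id -> zero-arg callable returning metrics
        self._lock = threading.Lock()
        self._started = False

    @classmethod
    def shared(cls) -> "SharedMetricsPusher":
        # One pusher per process, created lazily.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def register(self, handle_id, get_metrics) -> None:
        # Each DeploymentHandle registers here instead of running its own loop.
        with self._lock:
            self._callbacks[handle_id] = get_metrics
            if not self._started:
                self._started = True
                threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        while True:
            time.sleep(self._interval_s)
            with self._lock:
                batch = {hid: cb() for hid, cb in self._callbacks.items()}
            self._push_to_controller(batch)

    def _push_to_controller(self, batch) -> None:
        # Stand-in for a single controller RPC carrying all handles' metrics.
        print(f"pushing metrics for {len(batch)} handles in one call")
```

Under this sketch, RPC count to the controller scales with the number of processes rather than the number of handles.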

Related issue number

Closes #45777

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(


@venkatkalluru venkatkalluru left a comment


Need to understand a bit more; left some nit comments.

@JoshKarpel JoshKarpel changed the title Group DeploymentHandle autoscaling metrics pushes by process [Serve] Group DeploymentHandle autoscaling metrics pushes by process Jun 17, 2024
@anyscalesam anyscalesam added the triage (Needs triage: priority, bug/not-bug, and owning component) and serve (Ray Serve Related Issue) labels Aug 8, 2024
    logger.info(
        f"Registered {self._handle_id} with shared metrics pusher {shared}."
    )
    else:
Contributor Author


I'm not sure I understand this else block - it seems like even if RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE = False we still register the pushing task? It just won't have any data because the _add_autoscaling_metrics_point task isn't registered? 🤔

    self.metrics_pusher.register_or_update_task(
        self.PUSH_METRICS_TO_CONTROLLER_TASK_NAME,
        self.push_autoscaling_metrics_to_controller,
        autoscaling_config.metrics_interval_s,
    )
Contributor Author

@JoshKarpel JoshKarpel Dec 18, 2024


@zcin @edoakes I'm working on resurrecting this PR and I think this was the main sticking point: the shared metrics pusher wouldn't respect autoscaling_config.metrics_interval_s in the current form of the PR. Can we deprecate that parameter? Would we need to feature-flag this PR so it can be introduced gradually? Or maybe we want to preserve it and have a different shared pusher for each metrics_interval_s?
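The last option mentioned here could be sketched roughly as follows (a hypothetical illustration, not code from this PR): handles with the same metrics_interval_s share one pusher, so the number of push loops scales with the number of distinct intervals rather than the number of handles.

```python
import threading

_pushers = {}
_pushers_lock = threading.Lock()


class IntervalPusher:
    """Hypothetical shared pusher for one metrics interval."""

    def __init__(self, interval_s: float):
        self.interval_s = interval_s
        self.handles = []

    def register(self, handle_id) -> None:
        self.handles.append(handle_id)


def shared_pusher_for(interval_s: float) -> IntervalPusher:
    # All handles configured with the same interval get the same pusher.
    with _pushers_lock:
        if interval_s not in _pushers:
            _pushers[interval_s] = IntervalPusher(interval_s)
        return _pushers[interval_s]
```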

Collaborator


I am fine with making the metrics_interval_s a cluster-level option configured via env var. WDYT Cindy?

Contributor Author


Ah! I like that option!

Contributor


Yes I think making the metrics interval a cluster level option makes sense.

Contributor Author


Sounds good - I will take that approach when I get back from the holidays

Contributor Author


Oh, it turns out there already is an env var for this, HANDLE_METRIC_PUSH_INTERVAL_S:

    # Handle metric push interval. (This interval will affect the cold start time period.)
    HANDLE_METRIC_PUSH_INTERVAL_S = float(
        os.environ.get("RAY_SERVE_HANDLE_METRIC_PUSH_INTERVAL_S", "10")
    )

and I'm already using it:

    HANDLE_METRIC_PUSH_INTERVAL_S,

The autoscaling_config.metrics_interval_s is also used in a variety of other places, like for controlling how often metrics are recorded:

    min(
        RAY_SERVE_HANDLE_AUTOSCALING_METRIC_RECORD_PERIOD_S,
        autoscaling_config.metrics_interval_s,
    ),

Should I remove all of those and just use the single env var, or keep the other uses and only "ignore" the setting here for the shared pusher?

I'm inclined to remove all of them for consistency but that's a much bigger blast radius...

Contributor


I think we should remove autoscaling_config.metrics_interval_s and just use RAY_SERVE_HANDLE_AUTOSCALING_METRIC_RECORD_PERIOD_S for recording metrics, because the interval at which we record metrics is usually more frequent than the interval at which we push metrics to the controller. Then we can use HANDLE_METRIC_PUSH_INTERVAL_S for pushing, though renaming it to include "autoscaling" might be better going forward, if it is to replace autoscaling_config.metrics_interval_s.

zcin pushed a commit that referenced this pull request Jan 16, 2025
## Why are these changes needed?

In our use case we use Ray Serve with many hundreds/thousands of apps,
plus a "router" app that routes traffic to those apps using
`DeploymentHandle`s. Right now, that means we have a `LongPollClient`
for each `DeploymentHandle` in each router app replica, which could be
tens or hundreds of thousands of `LongPollClient`s. This is expensive on
both the Serve Controller and on the router app replicas. It can be
particularly problematic in resource usage on the Serve Controller - the
main thing blocking us from having as many router replicas as we'd like
is the stability of the controller.

This PR aims to amortize this cost of having so many `LongPollClient`s
by going from one-long-poll-client-per-handle to
one-long-poll-client-per-process. Each `DeploymentHandle`'s `Router` now
registers itself with a shared `LongPollClient` held by a singleton.

The actual implementation that I've gone with is a bit clunky because
I'm trying to bridge the gap between the current solution and a design
that *only* has shared `LongPollClient`s. This could potentially be
cleaned up in the future. Right now, each `Router` still gets a
dedicated `LongPollClient` that only runs temporarily, until the shared
client tells it to stop.

Related: #45957 is the same idea
but for handle autoscaling metrics pushing.
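The one-long-poll-client-per-process design described above can be sketched as follows; this is a hypothetical illustration with made-up names, not the actual Ray Serve implementation. Routers register callbacks with one process-wide client, which alone talks to the controller and fans updates out locally.

```python
import threading
from typing import Callable, Dict, List


class SharedLongPollClient:
    """Hypothetical process-wide long-poll client shared by all Routers."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        self.listeners: Dict[str, List[Callable]] = {}  # key -> callbacks

    @classmethod
    def shared(cls) -> "SharedLongPollClient":
        # Singleton: one long-poll connection per process, not per handle.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def register(self, key: str, callback: Callable) -> None:
        # Each Router adds its callback; only this client polls the controller.
        self.listeners.setdefault(key, []).append(callback)

    def notify(self, key: str, snapshot) -> None:
        # Called when a long-poll response for `key` arrives; fan out locally.
        for cb in self.listeners.get(key, []):
            cb(snapshot)
```

With this shape, the controller sees one long-poll connection per process regardless of how many DeploymentHandles that process holds.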


## Related issue number


## Checks

- [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

---------

Signed-off-by: Josh Karpel <josh.karpel@gmail.com>
srinathk10 pushed a commit that referenced this pull request Feb 2, 2025
park12sj pushed a commit to park12sj/ray that referenced this pull request Mar 18, 2025
@hainesmichaelc hainesmichaelc added the community-contribution (Contributed by the community) label Apr 4, 2025
@akshay-anyscale
Contributor

@JoshKarpel were you planning to pick this back up?

@JoshKarpel
Contributor Author

@JoshKarpel were you planning to pick this back up?

Thanks for the bump - yes, but it keeps getting pushed off my plate with higher-priority internal projects and things like #53173. I really am hoping to get to it in the next few weeks, but I've been saying that for months...

@github-actions

github-actions bot commented Jun 7, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale (The issue is stale. It will be closed within 7 days unless there is further conversation.) label Jun 7, 2025
@github-actions

This pull request has been automatically closed because there has been no more activity in the 14 days
since being marked stale.

Please feel free to reopen or open a new pull request if you'd still like this to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for your contribution!

@github-actions github-actions bot closed this Jun 22, 2025
@JoshKarpel
Contributor Author

Woops, I let the PR get closed by the bot :(

I'll make a new PR and link it back here when I get back to this

Development

Successfully merging this pull request may close these issues.

[Serve] Amortize handle metrics pushing by grouping metrics by process
