[Serve] Group DeploymentHandle autoscaling metrics pushes by process #55016

Open

JoshKarpel wants to merge 75 commits into ray-project:master from JoshKarpel:issue-45777-amortize-metrics-pushing-v2

Conversation

@JoshKarpel (Contributor) commented Jul 29, 2025

Why are these changes needed?

We're seeing heavy pressure on the Serve Controller from metrics push tasks when running thousands of Serve apps. Much of this pressure comes purely from the overhead of the many RPC connections incoming to the controller. We can amortize this overhead (and presumably similar overhead in the handles) by pushing metrics once per process instead of once per handle.

We've made this change in our setup, and it has reduced CPU time spent on metrics pushing both in the Controller and in our ingress application replicas, which hold all the handles.
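A minimal sketch of the idea, with hypothetical names (not the actual Serve internals): every handle in a process registers a metrics source with one shared pusher, so a single batched RPC per interval replaces one RPC per handle.

```python
import threading
from typing import Callable, Dict, List


class SharedMetricsPusher:
    """One instance per process: batches every handle's metrics into one RPC."""

    def __init__(self, push_bulk: Callable[[List[dict]], None]):
        # push_bulk stands in for a bulk metrics RPC to the controller.
        self._push_bulk = push_bulk
        self._sources: Dict[str, Callable[[], dict]] = {}
        self._lock = threading.Lock()

    def register(self, handle_id: str, get_report: Callable[[], dict]) -> None:
        with self._lock:
            self._sources[handle_id] = get_report

    def push_once(self) -> None:
        # One bulk call per interval replaces one RPC per handle.
        with self._lock:
            reports = [get_report() for get_report in self._sources.values()]
        if reports:
            self._push_bulk(reports)


received: List[List[dict]] = []
pusher = SharedMetricsPusher(push_bulk=received.append)
pusher.register("handle-a", lambda: {"handle": "a", "ongoing_requests": 3})
pusher.register("handle-b", lambda: {"handle": "b", "ongoing_requests": 1})
pusher.push_once()
assert len(received) == 1 and len(received[0]) == 2  # one RPC, two reports
```

The pusher would run `push_once` on a timer in the real system; the point is that the controller-side connection count scales with processes, not handles.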

See also #55102

Related issue number

Closes #45777

Second attempt on #45957 , see previous discussion there for more context.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @JoshKarpel, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the autoscaling metrics reporting within Ray Serve to alleviate RPC overhead on the Serve Controller. The core change involves grouping DeploymentHandle autoscaling metrics at the process level before transmitting them, rather than sending individual reports. This refactoring is expected to improve the scalability and performance of the Serve system, particularly when managing a large number of applications. Additionally, it introduces more explicit and configurable intervals for metrics collection and pushing.

Highlights

  • Metrics Grouping for Performance: I've introduced a SharedHandleMetricsPusher class that aggregates autoscaling metrics from multiple DeploymentHandles within the same process. This allows for sending these metrics to the controller in a single bulk RPC call, significantly reducing network overhead and pressure on the Serve Controller, especially in large-scale deployments.
  • Standardized Metrics Reporting: I've refactored the internal metrics reporting by creating a dedicated HandleMetricReport dataclass. This new structure encapsulates all necessary handle metrics, simplifying the API for AutoscalingStateManager and the Serve Controller, which now directly consume these unified report objects.
  • Granular Control over Intervals: I've deprecated the metrics_interval_s field in AutoscalingConfig in favor of new environment variables (e.g., RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PUSH_INTERVAL_S, RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S). This provides more precise and independent control over the recording and pushing intervals for both replica and handle autoscaling metrics.
  • Controller API Enhancement: I've added a new bulk_record_handle_metrics method to the Serve Controller. This method is designed to efficiently receive and process sequences of HandleMetricReport objects, supporting the new grouped metrics push mechanism.
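The report dataclass and bulk controller API described above can be sketched roughly as follows. This is a hedged reconstruction: the `HandleMetricReport` and `bulk_record_handle_metrics` names come from the PR summary, but the field names and the toy `Controller` wrapper are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence


@dataclass
class HandleMetricReport:
    # Field names are illustrative; the PR defines the real dataclass.
    deployment_id: str
    handle_id: str
    queued_requests: float
    running_requests: Dict[str, float] = field(default_factory=dict)  # replica id -> count


class Controller:
    """Toy stand-in for the Serve Controller's metrics-recording surface."""

    def __init__(self) -> None:
        self.reports_by_handle: Dict[str, HandleMetricReport] = {}

    def bulk_record_handle_metrics(self, reports: Sequence[HandleMetricReport]) -> None:
        # One RPC delivers many handles' reports at once.
        for report in reports:
            self.reports_by_handle[report.handle_id] = report


controller = Controller()
controller.bulk_record_handle_metrics(
    [
        HandleMetricReport("app1:Model", "h1", queued_requests=2.0),
        HandleMetricReport("app2:Model", "h2", queued_requests=0.0),
    ]
)
assert set(controller.reports_by_handle) == {"h1", "h2"}
```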
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

Overall, this is a well-executed pull request that addresses a significant performance bottleneck in Ray Serve. The introduction of SharedHandleMetricsPusher to batch metric reports is a smart solution. The related code refactoring, such as using the HandleMetricReport dataclass, improves code clarity and maintainability.

My review includes one high-priority comment regarding a potential bug in an environment variable name and one medium-priority comment about a future performance consideration for the new shared pusher. Once these are addressed, the PR should be in great shape.

JoshKarpel and others added 8 commits July 29, 2025 14:07
zcin pushed a commit that referenced this pull request Aug 7, 2025
…deprecated (#55102)

## Why are these changes needed?

Deprecation announcements for this future PR
#55016

## Related issue number

Related to #45777

sampan-s-nayak pushed a commit that referenced this pull request Aug 12, 2025
…deprecated (#55102) (same commit message as above)
Comment on lines -133 to +147

```diff
-            autoscaling_config={"max_replicas": 2, "metrics_interval_s": 10}
+            autoscaling_config={"max_replicas": 2, "look_back_period_s": 10}
         ),
         {},
     )
     v2 = DeploymentVersion(
         "1",
         DeploymentConfig(
-            autoscaling_config={"max_replicas": 5, "metrics_interval_s": 10}
+            autoscaling_config={"max_replicas": 5, "look_back_period_s": 10}
         ),
         {},
     )
     v3 = DeploymentVersion(
         "1",
         DeploymentConfig(
-            autoscaling_config={"max_replicas": 2, "metrics_interval_s": 3}
+            autoscaling_config={"max_replicas": 2, "look_back_period_s": 3}
```
Contributor Author

This changed because `metrics_interval_s` no longer requires a reconfigure (it no longer does anything); see https://github.com/ray-project/ray/pull/55016/files#diff-75ae406c06c8505d98c21ff05842d770fc5b52870864bfb9e70fd5e38ba7bd81R194
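A toy illustration of why this follows (hypothetical hashing scheme, not Serve's actual DeploymentVersion): once a config field is excluded from the version hash, changing it no longer yields a new version, so no redeploy or reconfigure is triggered.

```python
import hashlib
import json


def version_hash(autoscaling_config: dict) -> str:
    # Hypothetical: drop fields that no longer affect running replicas.
    relevant = {k: v for k, v in autoscaling_config.items() if k != "metrics_interval_s"}
    return hashlib.sha256(json.dumps(relevant, sort_keys=True).encode()).hexdigest()


v1 = version_hash({"max_replicas": 2, "metrics_interval_s": 10})
v2 = version_hash({"max_replicas": 2, "metrics_interval_s": 3})
v3 = version_hash({"max_replicas": 5, "metrics_interval_s": 10})
assert v1 == v2  # changing metrics_interval_s: same version, no redeploy
assert v1 != v3  # changing max_replicas still produces a new version
```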

Comment on lines +1105 to +1114

```python
with patch.object(metrics_manager, "metrics_pusher"):
    metrics_manager.update_deployment_config(DeploymentConfig(), 0)
    metrics_manager.metrics_pusher.register_or_update_task.assert_not_called()

    # With autoscaling config, register or update task should be called
    metrics_manager.update_deployment_config(
        DeploymentConfig(autoscaling_config=AutoscalingConfig()), 0
    )
    metrics_manager.metrics_pusher.register_or_update_task.assert_called()

with patch.object(metrics_manager, "metrics_pusher"):
    metrics_manager.update_deployment_config(
        DeploymentConfig(autoscaling_config=AutoscalingConfig()), 0
    )
    metrics_manager.metrics_pusher.register_or_update_task.assert_called()
```
Contributor Author

This changed because the shared metrics pusher does have `register_or_update_task` called in these cases, so instead of mocking all metrics pushers, we mock only the one directly attached to the metrics manager under test.

Comment on lines +748 to +749

```python
print(time.ctime(), "Redeploying with `look_back_period_s` updated to 0.5s.")
config_template["deployments"][0]["autoscaling_config"]["look_back_period_s"] = 0.5
```
Contributor Author

Changing `metrics_interval_s` no longer triggers a redeploy.

```python
"target_ongoing_requests": 1,
"min_replicas": 1,
"max_replicas": 10,
"metrics_interval_s": 15,
```
Contributor Author

@abrarsheikh I'm running into some trouble fixing this test and could use some guidance. It looks like the intent of this test was that, by setting metrics_interval_s=15, the wait_for_condition at https://github.com/ray-project/ray/pull/55016/files#diff-fb12df9046242e153209be3f7daa63c0dbedbbfb6eb0fd8d7fff48467fd1887bR746 (with its default 10-second timeout) would fail, because autoscaling metrics wouldn't be pushed, and thus no scale-up would happen.

Since the push interval can no longer be controlled like this, this part of the test doesn't make sense, but the name of the test doesn't really describe what the intent is here. Do you know what behavior this test is supposed to be covering so that I can figure out how to test that behavior now?

Contributor Author

Ok, well, apparently CI passed anyway 😅

Contributor Author

Nope, it's ephemeral, which isn't surprising - so yeah, I could use some help on this one when you get a chance @abrarsheikh

@abrarsheikh
Contributor

@JoshKarpel quite busy this week, I will pick this up after Ray Summit.

@JoshKarpel
Contributor Author

@JoshKarpel quite busy this week, I will pick this up after Ray Summit.

Same here! See you soon 😁

@JoshKarpel JoshKarpel requested a review from a team as a code owner December 2, 2025 18:56
@JoshKarpel JoshKarpel force-pushed the issue-45777-amortize-metrics-pushing-v2 branch from 6b1f2f5 to 4eb3514 on December 2, 2025 18:57
```python
pusher = cls(controller_handle=controller_handle)
pusher.start()
logger.info(f"Created {pusher} for Serve Controller {controller_handle}.")
return pusher
```

Bug: Missing event loop in cache key causes potential thread-safety issues

The SharedHandleMetricsPusher.get_or_create method only uses controller_handle as the lru_cache key, while the analogous SharedRouterLongPollClient.get_or_create correctly uses both controller_handle AND event_loop as cache keys. This inconsistency means that routers running on different event loops (e.g., SingletonThreadRouter vs CurrentLoopRouter in the same process) could share the same SharedHandleMetricsPusher instance. Since the pusher's MetricsPusher tasks run on whichever event loop first created the pusher, this can lead to cross-thread calls to _get_metrics_report() which accesses non-thread-safe data structures like metrics_store and the WeakSet containing RouterMetricsManager instances.
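Assuming the fix mirrors SharedRouterLongPollClient, the cache key would include the event loop as well as the controller handle. A toy sketch (class and arguments are hypothetical stand-ins, not the actual Serve code):

```python
import asyncio
import functools


class SharedHandleMetricsPusherSketch:
    """Hypothetical: one pusher per (controller handle, event loop) pair."""

    def __init__(self, controller_handle, event_loop):
        self.controller_handle = controller_handle
        self.event_loop = event_loop

    @classmethod
    @functools.lru_cache(maxsize=None)
    def get_or_create(cls, controller_handle, event_loop):
        # Keying the cache on the loop too means routers on different loops
        # never share a pusher, avoiding cross-thread metric reads.
        return cls(controller_handle, event_loop)


loop_a, loop_b = asyncio.new_event_loop(), asyncio.new_event_loop()
p1 = SharedHandleMetricsPusherSketch.get_or_create("controller", loop_a)
p2 = SharedHandleMetricsPusherSketch.get_or_create("controller", loop_a)
p3 = SharedHandleMetricsPusherSketch.get_or_create("controller", loop_b)
assert p1 is p2       # same loop: shared pusher
assert p1 is not p3   # different loop: its own pusher
loop_a.close()
loop_b.close()
```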


Contributor Author

@abrarsheikh thoughts on this one? I'm not too familiar with the recent changes around event loops in this code.

```python
    self._router_metrics_managers.add(router_metrics_manager)

def unregister(self, router_metrics_manager: RouterMetricsManager) -> None:
    self._router_metrics_managers.discard(router_metrics_manager)
```

Bug: Missing thread-safety protection for WeakSet iteration during push_metrics

SharedHandleMetricsPusher.register() and unregister() directly modify _router_metrics_managers WeakSet without thread-safety protection, while push_metrics() iterates over it. Unlike SharedRouterLongPollClient, which uses call_soon_threadsafe to prevent "RuntimeError: Set changed size during iteration", SharedHandleMetricsPusher lacks this protection. If register/unregister and push_metrics execute on different threads, or if garbage collection removes weak references during iteration, a RuntimeError could occur.
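A sketch of the call_soon_threadsafe approach this comment alludes to (names are hypothetical; the real fix would live in SharedHandleMetricsPusher): mutations are marshaled onto the pusher's own event loop, so iteration in push_metrics never overlaps a concurrent add/discard.

```python
import asyncio
import threading
import weakref


class PusherSketch:
    """Hypothetical pusher: WeakSet mutations run only on the pusher's loop."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop
        self._managers: weakref.WeakSet = weakref.WeakSet()

    def register(self, manager) -> None:
        # Safe from any thread: the actual mutation runs on self._loop.
        self._loop.call_soon_threadsafe(self._managers.add, manager)

    def unregister(self, manager) -> None:
        self._loop.call_soon_threadsafe(self._managers.discard, manager)

    def push_metrics(self) -> list:
        # Runs on self._loop, so iteration never races a mutation.
        return [m.report() for m in self._managers]


class Manager:
    def report(self) -> str:
        return "ok"


loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

pusher = PusherSketch(loop)
manager = Manager()  # keep a strong reference so the WeakSet entry survives
pusher.register(manager)


async def push():
    return pusher.push_metrics()


# The register() callback was queued first, so it runs before this task.
assert asyncio.run_coroutine_threadsafe(push(), loop).result() == ["ok"]
loop.call_soon_threadsafe(loop.stop)
```

Holding a plain lock around both mutation and iteration would also work; the loop-affine version matches what SharedRouterLongPollClient reportedly does.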


@harshit-anyscale
Contributor

@JoshKarpel can you please resolve the merge conflicts?

@JoshKarpel
Contributor Author

@JoshKarpel can you please resolve the merge conflicts?

@harshit-anyscale @abrarsheikh apologies for the lack of progress on this - my focus has shifted to another project and I'm hoping that my coworker @jugalshah291 can pick up where I left off on this one.

@harshit-anyscale
Contributor

no issues @JoshKarpel, thanks for contributing :)
@jugalshah291 let us know if you need any help in reviving this PR or have any questions.


Labels

community-contribution: Contributed by the community
serve: Ray Serve Related Issue
unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.


Development

Successfully merging this pull request may close these issues.

[Serve] Amortize handle metrics pushing by grouping metrics by process

3 participants