[Draft] [serve] Prometheus metrics for AutoscalingConfig by arcyleung · Pull Request #56851 · ray-project/ray

arcyleung · 2025-09-23T21:40:29Z

Implementation of AutoscalingConfig prometheus_custom_metrics:

Why are these changes needed?

Support collecting Prometheus exporter metrics at each replica and reporting them to the controller when the Serve config's AutoscalingConfig specifies them.

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Note

Adds optional per-replica Prometheus metric collection for autoscaling, refactors Prometheus helpers into a shared module, and updates proto/config/constants/replica logic and tests.

Serve autoscaling (backend):
- Enable collecting custom Prometheus metrics per replica when AutoscalingConfig.prometheus_metrics is set; fetched at RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST and merged into autoscaling reports.
- Extend replica metrics manager to asynchronously fetch/filter metrics by replica label and include alongside ongoing-requests/custom metrics.
Config/Proto:
- Add prometheus_metrics to AutoscalingConfig (Python) and new proto messages PrometheusCustomMetrics/PrometheusMetric; map list ↔ proto on (de)serialization.
- Update JSON surfaces and defaults accordingly.
Constants:
- Introduce RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST env var for exporter host.
Prometheus utilities:
- New python/ray/_private/prometheus_utils.py with fetch/parse/filter helpers and async fetch; remove duplicated logic from test_utils and update imports across tests/benchmarks.
Tests/BUILD:
- Add test_prometheus_autoscaling_metrics.py; adjust existing tests and BUILD to include env and new imports; minor test stability tweaks.
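The "fetch/filter metrics by replica label" step in the summary above can be pictured with a minimal sketch. It assumes the metrics arrive as the JSON body of a Prometheus instant query (`/api/v1/query`) and that each sample carries a `replica` label, as the query strings later in this thread suggest. The function name `filter_by_replica` is hypothetical and is not taken from the PR.

```python
from typing import Any, Dict, Iterable


def filter_by_replica(
    response: Dict[str, Any], names: Iterable[str], replica_id: str
) -> Dict[str, float]:
    """Keep only the requested metrics whose `replica` label matches replica_id."""
    wanted = set(names)
    values: Dict[str, float] = {}
    # A failed query carries no usable samples.
    if response.get("status") != "success":
        return values
    for sample in response.get("data", {}).get("result", []):
        labels = sample.get("metric", {})
        name = labels.get("__name__")
        if name in wanted and labels.get("replica") == replica_id:
            # An instant-vector sample is [unix_timestamp, "value-as-string"].
            values[name] = float(sample["value"][1])
    return values
```

A metric with several label combinations (for example histogram buckets) would yield several samples under one name, so a real implementation would need an explicit rule for combining them rather than letting the last sample win as this sketch does.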

Written by Cursor Bugbot for commit 02a540c. This will update automatically on new commits.

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

gemini-code-assist

Code Review

This pull request adds support for collecting Prometheus metrics for autoscaling decisions in Ray Serve. The changes include updates to the configuration and protobuf definitions, and new logic in the replica to fetch metrics from a Prometheus endpoint. While the feature is a valuable addition, I've found a few critical issues in the implementation. The logic for fetching metrics is incorrect for labeled metrics (like histograms), which will lead to unexpected behavior. There's also a potential TypeError if Prometheus metrics are not configured. Additionally, the tests for this new feature could be strengthened to catch these kinds of issues. My review includes specific suggestions to address these points.

python/ray/serve/_private/replica.py

gemini-code-assist · 2025-09-23T21:42:02Z

python/ray/serve/tests/test_prometheus_autoscaling_metrics.py

+def check_autoscaling_metrics_include_prometheus(
+    client, deployment_id: DeploymentID, expected_metrics: List[str]
+) -> bool:
+    """Check that autoscaling metrics include the expected prometheus metrics."""
+
+    try:
+        metrics = get_autoscaling_metrics_from_controller(client, deployment_id)
+        # The metrics should include both ongoing requests and prometheus custom metrics
+        if not metrics:
+            print("No metrics returned from controller!")
+            return False
+
+        # For prometheus custom metrics, we expect the keys to be present in the dict
+        for expected_metric in expected_metrics:
+            if expected_metric not in metrics:
+                print(f"Expected metric {expected_metric} not found")
+                return False
+        return True
+    except Exception as e:
+        print(f"Error checking metrics: {e}")
+        return False


The test helper check_autoscaling_metrics_include_prometheus and the associated tests are not strong enough. They only check for the presence of the metric key in the results, not the correctness of its value. Because the implementation falls back to 0.0 on any error (including metric not found), the tests will pass even if the metric fetching logic is broken.

To make the tests more robust, they should be updated to:

- Use a simple, predictable metric for testing (e.g., a gauge that you can set manually).
- Assert that the fetched metric value is correct, not just that the key exists.

This will help catch bugs like the incorrect handling of histogram metrics.
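One way to act on this suggestion is a helper that compares values instead of keys. The sketch below is hypothetical: `check_prometheus_metric_values` is not part of the PR, and it assumes the controller's autoscaling metrics can be reduced to a flat mapping of metric name to float.

```python
from typing import Dict


def check_prometheus_metric_values(
    metrics: Dict[str, float], expected: Dict[str, float], tol: float = 1e-6
) -> bool:
    """Return True only if every expected metric is present with the expected value."""
    for name, want in expected.items():
        if name not in metrics:
            print(f"Expected metric {name} not found")
            return False
        if abs(metrics[name] - want) > tol:
            print(f"Metric {name} is {metrics[name]}, expected {want}")
            return False
    return True
```

With a gauge set to a known non-zero value, a silent fallback to 0.0 no longer passes: `check_prometheus_metric_values({"my_gauge": 0.0}, {"my_gauge": 5.0})` returns False.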

cursor · 2025-11-01T03:13:25Z

python/ray/serve/_private/autoscaling_state.py

+                    f"Fetching prometheus metric {metric_name} for deployment {self._deployment_id}"
+                )
+                # Call the prometheus query function with the correct signature
+                result = self._prometheus_query_func("localhost:9090", metric_name)


Bug: Hardcoded Prometheus Host; needs Configurable Override

The method _get_prometheus_metrics hardcodes "localhost:9090" as the Prometheus server address when calling self._prometheus_query_func("localhost:9090", metric_name). This is inconsistent with how the replica code handles this in replica.py (line 462), which uses the configurable RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST environment variable. The hardcoded value should be replaced with a configurable host parameter or environment variable to ensure consistency and allow for flexible Prometheus server configurations.
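A minimal sketch of the suggested fix reads the host from the same environment variable the replica uses. The helper name `get_prometheus_host` and the `localhost:9090` fallback are assumptions for illustration; the PR may define the default elsewhere, for example in the Serve constants module.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST"
DEFAULT_HOST = "localhost:9090"  # assumed fallback, mirrors the hardcoded value


def get_prometheus_host(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured Prometheus host, falling back to the default."""
    env = os.environ if environ is None else environ
    # An unset or empty variable falls back to the default host.
    return env.get(ENV_VAR) or DEFAULT_HOST
```

The controller call would then become `self._prometheus_query_func(get_prometheus_host(), metric_name)`, so both sides resolve the address the same way.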

github-actions · 2025-11-15T12:25:03Z

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

cursor · 2025-11-18T15:17:39Z

python/ray/serve/_private/replica.py

+                # Add label selector for this replica
+                query = f'{metric}{{replica="{self._replica_id.unique_id}"}}'
+                logger.info(f"Querying prometheus with: {query}")
+                response = prometheus_handler(


Bug: Broken Dependency Injection Prevents Mocking

The _fetch_prometheus_metrics method calls the module-level prometheus_handler function directly instead of using self._prometheus_handler. This breaks the dependency injection pattern established in the constructor and the set_prometheus_handler method, preventing test code from mocking the Prometheus handler as intended. The call should use self._prometheus_handler(...) instead of prometheus_handler(...).
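The pattern this comment asks for can be sketched as follows. The class `MetricsManagerSketch` and its constructor arguments are hypothetical stand-ins for the replica metrics manager; only the names `set_prometheus_handler`, `_prometheus_handler`, and the query format come from the thread.

```python
from typing import Any, Callable


def default_prometheus_handler(addr: str, query: str, timeout: float) -> Any:
    """Placeholder for the real fetch function; this sketch never contacts a server."""
    raise RuntimeError("no Prometheus server in this sketch")


class MetricsManagerSketch:
    def __init__(
        self,
        replica_unique_id: str,
        prom_addr: str,
        prometheus_handler: Callable[..., Any] = default_prometheus_handler,
    ):
        self._replica_unique_id = replica_unique_id
        self._prom_addr = prom_addr
        self._prometheus_handler = prometheus_handler

    def set_prometheus_handler(self, handler: Callable[..., Any]) -> None:
        self._prometheus_handler = handler

    def fetch(self, metric: str, timeout: float = 1.0) -> Any:
        # Add label selector for this replica.
        query = f'{metric}{{replica="{self._replica_unique_id}"}}'
        # Go through the instance attribute so an injected handler is honored.
        return self._prometheus_handler(self._prom_addr, query, timeout=timeout)
```

Because `fetch` resolves the handler through `self`, a test can call `set_prometheus_handler(fake)` and observe every query without a running Prometheus server.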

cursor · 2025-11-18T17:04:41Z

python/ray/serve/_private/replica.py

+                    prom_addr,
+                    query,
+                    timeout=RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S,
+                )


Bug: Mocked Handler Ignored, Breaking Tests

The _fetch_prometheus_metrics method calls prometheus_handler instead of self._prometheus_handler. This bypasses the instance variable that can be set via set_prometheus_handler for testing, causing the method to always use the module-level import rather than the potentially mocked handler. This breaks the testing infrastructure where set_prometheus_handler is used to inject mock implementations.

cursor · 2025-11-18T17:56:07Z

python/ray/serve/_private/replica.py

            event_loop=self._event_loop,
            autoscaling_config=self._deployment_config.autoscaling_config,
            ingress=ingress,
+            prometheus_handler=prometheus_handler,


Bug: Replica init fails: Unhandled parameter.

The ReplicaBase.__init__ method doesn't accept a prometheus_handler parameter, but line 647 attempts to pass it to create_replica_metrics_manager. When create_replica_impl(**kwargs) is called with prometheus_handler in kwargs, it will fail because ReplicaBase.__init__ doesn't accept this parameter. The prometheus_handler parameter needs to be added to ReplicaBase.__init__'s signature and stored so it can be passed to create_replica_metrics_manager.
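The failure mode described here is ordinary Python keyword-argument handling, and a small sketch shows both the error and the suggested fix. The classes and the `create_replica` helper below are hypothetical reductions, not the actual Ray Serve code.

```python
from typing import Any, Callable, Optional


class ReplicaBaseBefore:
    """Constructor without the new parameter."""

    def __init__(self, deployment_id: str):
        self._deployment_id = deployment_id


class ReplicaBaseAfter:
    """Constructor that accepts and stores the handler."""

    def __init__(
        self,
        deployment_id: str,
        prometheus_handler: Optional[Callable[..., Any]] = None,
    ):
        self._deployment_id = deployment_id
        # Stored so it can later be forwarded to the metrics manager.
        self._prometheus_handler = prometheus_handler


def create_replica(cls, **kwargs):
    # Forwarding unknown keywords raises TypeError in cls.__init__.
    return cls(**kwargs)
```

Calling `create_replica(ReplicaBaseBefore, deployment_id="d", prometheus_handler=print)` raises `TypeError` for the unexpected keyword, while the same call with `ReplicaBaseAfter` succeeds.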

cursor · 2025-11-18T17:56:07Z

python/ray/serve/_private/replica.py

+                # Add label selector for this replica
+                query = f'{metric}{{replica="{self._replica_id.unique_id}"}}'
+                logger.info(f"Querying prometheus with: {query}")
+                response = prometheus_handler(


Bug: Prometheus Handler Bypass Breaks Testability

The _fetch_prometheus_metrics method calls prometheus_handler directly instead of self._prometheus_handler. This bypasses the instance variable that can be set via set_prometheus_handler for testing purposes, making it impossible to inject mock handlers. The call should use self._prometheus_handler to respect the configured handler.

cursor · 2025-11-18T19:53:24Z

python/ray/serve/_private/replica.py

+                    prom_addr,
+                    query,
+                    timeout=RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S,
+                )


Bug: Prometheus Tests Hit Real Server

The _fetch_prometheus_metrics method calls prometheus_handler instead of self._prometheus_handler, which prevents the injected test handler from being used. This breaks the testing infrastructure where set_prometheus_handler is used to inject mock handlers, causing tests to call the actual Prometheus server instead of the mock.

cursor · 2025-11-18T21:57:00Z

python/ray/serve/_private/replica.py

+                # Add label selector for this replica
+                query = f'{metric}{{replica="{self._replica_id.unique_id}"}}'
+                logger.info(f"Querying prometheus with: {query}")
+                response = prometheus_handler(


Bug: Prometheus Handler: Mocked Logic Bypassed

The _fetch_prometheus_metrics method calls the global prometheus_handler (which is fetch_from_prom_server) on line 474, even though a handler is passed in during initialization and stored as self._prometheus_handler. This means the injected mock handler for testing won't be used, breaking the testing infrastructure. The call should use self._prometheus_handler instead of the global prometheus_handler.

cursor · 2025-11-18T21:57:00Z

python/ray/serve/_private/autoscaling_state.py

+                    f"Fetching prometheus metric {metric_name} for deployment {self._deployment_id}"
+                )
+                # Call the prometheus query function with the correct signature
+                result = self._prometheus_query_func("localhost:9090", metric_name)


Bug: Prometheus Address Inconsistency Breaks Metrics

The Prometheus server address is hardcoded to "localhost:9090" in _get_prometheus_metrics, but replicas use the configurable RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST environment variable. This inconsistency means the controller and replicas may query different Prometheus servers, causing metrics collection to fail or return incorrect data. The controller should use the same configurable address as replicas.

cursor · 2025-11-18T22:45:05Z

python/ray/serve/tests/test_prometheus_autoscaling_metrics.py

+
+
+# Patch the prometheus_handler at module level
+ray.serve._private.replica.prometheus_handler = mock_prometheus_handler


Bug: Prometheus Handler: Module Variable Attribute Error

The test attempts to set ray.serve._private.replica.prometheus_handler but this module-level variable doesn't exist in replica.py. The module only imports default_prometheus_handler from ray._common.prometheus_utils and uses it as a default parameter. This will cause an AttributeError or create a new module attribute that won't affect the actual code behavior since the default parameter value is evaluated at function definition time, not at call time.
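The point about default parameters can be reproduced in isolation. In this sketch `replica_module` is a `SimpleNamespace` standing in for the real module, and all function names are hypothetical.

```python
import types


def real_handler(query: str) -> str:
    return "real"


def mock_handler(query: str) -> str:
    return "mock"


# Stand-in for a module that exposes a handler attribute.
replica_module = types.SimpleNamespace(prometheus_handler=real_handler)


def fetch_with_default(query: str, handler=replica_module.prometheus_handler) -> str:
    # The default was bound once, when this `def` statement executed.
    return handler(query)


def fetch_with_lookup(query: str) -> str:
    # The attribute is looked up on every call, so later patches are visible.
    return replica_module.prometheus_handler(query)
```

After `replica_module.prometheus_handler = mock_handler`, `fetch_with_lookup` uses the mock but `fetch_with_default` still calls the original function. A test that needs the mock has to pass it explicitly or patch something that is looked up at call time.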

cursor · 2025-11-18T22:45:05Z

python/ray/serve/tests/BUILD.bazel

                "RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE": "0",
                "RAY_SERVE_REPLICA_AUTOSCALING_METRIC_RECORD_INTERVAL_S": "0.5",
                "RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S": "3",
+                "RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST": "http://localhost:9090",


Bug: Redundant HTTP Prefix Malforms Prometheus URL

The environment variable RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST is set to "http://localhost:9090" which includes the http:// prefix. However, fetch_from_prom_server in prometheus_utils.py expects just the hostname:port and prepends http:// itself, resulting in a malformed URL http://http://localhost:9090/api/v1/query that will cause requests to fail.

cursor · 2025-11-18T22:45:06Z

python/ray/serve/tests/BUILD.bazel

                "RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE": "0",
                "RAY_SERVE_REPLICA_AUTOSCALING_METRIC_RECORD_INTERVAL_S": "0.5",
                "RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S": "3",
+                "RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST": "http://localhost:9090",


Bug: URL Prefix Duplication Breaks Prometheus Calls

The environment variable RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST is set to "http://localhost:9090" which includes the http:// prefix. However, fetch_from_prom_server in prometheus_utils.py expects just the hostname:port and prepends http:// itself, resulting in a malformed URL http://http://localhost:9090/api/v1/query that will cause requests to fail.

cursor · 2025-11-21T15:48:20Z

python/ray/serve/tests/BUILD.bazel

                "RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE": "0",
                "RAY_SERVE_REPLICA_AUTOSCALING_METRIC_RECORD_INTERVAL_S": "0.5",
                "RAY_SERVE_RECORD_AUTOSCALING_STATS_TIMEOUT_S": "3",
+                "RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST": "http://localhost:9090",


Bug: Double HTTP prefix in Prometheus URL construction

The environment variable RAY_SERVE_REPLICA_AUTOSCALING_METRIC_PROMETHEUS_HOST is set to "http://localhost:9090" in the test configuration, but the fetch_from_prom_server function in prometheus_utils.py already prepends "http://" to the address parameter when constructing the URL. This results in a malformed URL like "http://http://localhost:9090/api/v1/query" with a double HTTP prefix, which will cause Prometheus queries to fail. The environment variable should be set to just "localhost:9090" without the protocol prefix.
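Besides changing the test configuration, the URL construction could tolerate both forms of the address. The helper below is a hypothetical sketch; the actual body of fetch_from_prom_server is not shown in this thread, so only the `/api/v1/query` path and the `http://` prefix behavior are taken from the comments.

```python
def build_query_url(addr: str) -> str:
    """Build the instant-query URL, adding a scheme only when one is missing."""
    addr = addr.rstrip("/")
    if addr.startswith(("http://", "https://")):
        return f"{addr}/api/v1/query"
    return f"http://{addr}/api/v1/query"
```

Both `build_query_url("localhost:9090")` and `build_query_url("http://localhost:9090")` produce `http://localhost:9090/api/v1/query`, which avoids the doubled prefix regardless of how the environment variable is set.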

Arthur Leung added 23 commits September 15, 2025 21:15

Include custom metrics method and report to controller

72441c0

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Add timeout handler

9dba402

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Add integration test for default record_autoscaling_stats method

d41b399

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Cleanup and add non-default method test

c697c19

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Add metric timeout test

7946111

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Fix test_deployment_state

97aa8c4

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Pass user_callable_wrapper into ReplicaMetricsManager

1368247

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Address PR comments

b96a0f3

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Update test

e0d567e

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Fixing test_custom_serve_timeout

3ff3d8a

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Fix timeout test and add bazel test env var

c3f166b

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Cleanup async processing and add runtime check

57de88f

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Set autoscaling metric env vars only for test_custom_metrics.py

97d69bf

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Add initialization checks

673c75e

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Lint

89f2f24

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Fix record_request_metrics_for_replica condition

90ca08d

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Address comments part 3

35742fb

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Fix linter errors

c803566

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Fix tag keys

9340f65

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Add validation check and test

da71014

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Merge remote-tracking branch 'origin/master' into prometheus-metrics

1dca9bb

Merge upstream changes

6a30354

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Cleanup integration test

027da70

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

arcyleung requested a review from a team as a code owner September 23, 2025 21:40

arcyleung changed the title ~~[serve] Prometheus metrics~~ [serve] Prometheus metrics for AutoscalingConfig Sep 23, 2025

gemini-code-assist bot reviewed Sep 23, 2025

Fix init logic

2e1c169

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

ray-gardener bot added the serve (Ray Serve Related Issue) and docs (An issue or change related to documentation) labels Sep 24, 2025

arcyleung changed the title ~~[serve] Prometheus metrics for AutoscalingConfig~~ [Draft] [serve] Prometheus metrics for AutoscalingConfig Oct 20, 2025

Remove asyncio for now

c38f8ad

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Arthur Leung added 2 commits October 30, 2025 00:16

Fix unittest

57d0f9f

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Merge remote-tracking branch 'origin/master' into prometheus-metrics

beeb5b0

arcyleung mentioned this pull request Oct 30, 2025

[Serve] Ray Serve Autoscaling supports the configuration of custom-metrics and policy #51632

Closed

Merge branch 'master' into prometheus-metrics

9e66e00

Signed-off-by: Arthur Leung <arcyleung+github@gmail.com>

Fix merge conflicts

94e5c39

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

cursor bot reviewed Nov 1, 2025

github-actions bot added the stale label (The issue is stale. It will be closed within 7 days unless there are further conversation) Nov 15, 2025

Merge branch 'master' into prometheus-metrics

1df5a28

Signed-off-by: Arthur Leung <arcyleung+github@gmail.com>

cursor bot reviewed Nov 18, 2025

Fix lint

ecb5807

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

cursor bot reviewed Nov 18, 2025

Fix import paths

45e9be2

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

cursor bot reviewed Nov 18, 2025

Fix import path for test_reporter.py

65c5952

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

cursor bot reviewed Nov 18, 2025

Merge branch 'master' into prometheus-metrics

10b9d47

cursor bot reviewed Nov 18, 2025

Arthur Leung added 2 commits November 18, 2025 17:42

Use self._prometheus_handler

ea75d59

Signed-off-by: Arthur Leung <arcyleung@gmail.com>

Merge branch 'prometheus-metrics' of https://github.com/arcyleung/ray …

7d3d65b

…into prometheus-metrics

cursor bot reviewed Nov 18, 2025

github-actions bot added the unstale label (A PR that has been marked unstale. It will not get marked stale again if this label is on it.) and removed the stale label (The issue is stale. It will be closed within 7 days unless there are further conversation) Nov 19, 2025

Merge branch 'master' into prometheus-metrics

2510071

Signed-off-by: Arthur Leung <arcyleung+github@gmail.com>

cursor bot reviewed Nov 21, 2025
