[Data] Add autoscaler metrics to Data Dashboard by KaisennHu · Pull Request #60472 · ray-project/ray

KaisennHu · 2026-01-24T07:37:53Z

Description

- Add three new charts to the Ray Data dashboard called "Cluster utilization % ({resource})" for each of CPU, GPU, and object store memory
- Add a dotted line at the DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD
- The three charts should go in a new row named "Cluster autoscaler"
@tianyi-ge

Related issues

Fixes #60342

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

gemini-code-assist

Code Review

This pull request adds new metrics and dashboard panels for monitoring cluster resource utilization (CPU, GPU, and Object Store Memory) for Ray Data. This is a valuable addition for observing autoscaling behavior. The implementation is solid. I have a couple of minor suggestions to improve code clarity and reduce duplication.

gemini-code-assist · 2026-01-24T07:39:04Z

python/ray/dashboard/modules/metrics/dashboards/data_dashboard_panels.py

 )

+# Ray Data Metrics (Cluster Autoscaler)
+# Default threshold for scaling up is 75% (0.75)


The comment here is a bit confusing. It says "75% (0.75)", but the constant DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD is set to 75 and used as such since the panel's unit is percent. The (0.75) part could be misleading. I suggest updating the comment for clarity.

Suggested change

-# Default threshold for scaling up is 75% (0.75)
+# Default threshold for scaling up is 75%.

gemini-code-assist · 2026-01-24T07:39:04Z

python/ray/data/_internal/execution/streaming_executor.py

+        # Calculate utilization percentages (0-100)
+        cpu_util = (
+            (global_usage.cpu / global_limits.cpu * 100) if global_limits.cpu else 0
+        )
+        gpu_util = (
+            (global_usage.gpu / global_limits.gpu * 100) if global_limits.gpu else 0
+        )
+        osm_util = (
+            (global_usage.object_store_memory / global_limits.object_store_memory * 100)
+            if global_limits.object_store_memory
+            else 0
+        )


The logic for calculating utilization percentage is repeated for CPU, GPU, and object store memory. To improve maintainability and reduce code duplication, you could extract this logic into a local helper function.

        def _calculate_util(usage, limit):
            return (usage / limit * 100) if limit else 0

        # Calculate utilization percentages (0-100)
        cpu_util = _calculate_util(global_usage.cpu, global_limits.cpu)
        gpu_util = _calculate_util(global_usage.gpu, global_limits.gpu)
        osm_util = _calculate_util(
            global_usage.object_store_memory, global_limits.object_store_memory
        )

looks like an elegant suggestion :)
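For readers who want to try the suggested helper outside the executor, here is a minimal, self-contained sketch. The `Resources` dataclass and the sample numbers are hypothetical stand-ins for the executor's usage and limit objects, not Ray's actual types; only the guarded division mirrors the suggestion above.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical stand-in for the executor's resource usage/limit objects."""

    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0


def calculate_util(usage: float, limit: float) -> float:
    """Return usage as a percentage (0-100) of limit, or 0 if the limit is zero or unset."""
    return (usage / limit * 100) if limit else 0.0


# Sample values chosen for illustration only.
global_usage = Resources(cpu=6, gpu=0, object_store_memory=2e9)
global_limits = Resources(cpu=8, gpu=0, object_store_memory=8e9)

cpu_util = calculate_util(global_usage.cpu, global_limits.cpu)
gpu_util = calculate_util(global_usage.gpu, global_limits.gpu)  # no GPUs: guard returns 0
osm_util = calculate_util(
    global_usage.object_store_memory, global_limits.object_store_memory
)
```

The guard matters most for the GPU case: on a CPU-only cluster the limit is zero, and the helper reports 0 instead of raising `ZeroDivisionError`.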

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

python/ray/dashboard/modules/metrics/dashboards/data_dashboard_panels.py

bveeramani

Nice. Skimmed and overall LGTM.

Did you get a chance to test this locally and see what charts look like? If so, would you mind sharing a screenshot of that? Just want to make sure the charts display correctly

tianyi-ge · 2026-01-25T15:53:00Z

python/ray/dashboard/modules/metrics/dashboards/data_dashboard_panels.py

+    stack=False,
+    thresholds=[
+        {"color": "green", "value": None},
+        {"color": "yellow", "value": DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD},


Will a dotted line automatically be added for a new threshold?
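As a hedged note on this question: in Grafana's time series panel, as far as I know, threshold steps only define colors and values, and whether a line is drawn is controlled separately by the `thresholdsStyle` display option. The sketch below builds that kind of field configuration as a plain Python dict. How Ray's panel template turns the `thresholds` argument into dashboard JSON is not shown in this thread, so the function name and the overall shape are assumptions for illustration.

```python
import json

# Mirrors the constant named in the PR; the panel unit is percent, so 75 means 75%.
DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD = 75


def threshold_field_config(threshold: float) -> dict:
    """Build a Grafana-style fieldConfig fragment that draws a dashed threshold line."""
    return {
        "defaults": {
            "unit": "percent",
            "thresholds": {
                "mode": "absolute",
                "steps": [
                    {"color": "green", "value": None},  # base step, no lower bound
                    {"color": "yellow", "value": threshold},
                ],
            },
            # Assumption: the steps alone do not draw anything on the graph;
            # this display option asks for a dashed line at each step.
            "custom": {"thresholdsStyle": {"mode": "dashed"}},
        }
    }


field_config = threshold_field_config(DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD)
print(json.dumps(field_config, indent=2))
```

If the generated dashboard JSON lacks an equivalent of the `thresholdsStyle` entry, checking the rendered panel (as requested elsewhere in this review) is the most reliable way to confirm the line appears.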

tianyi-ge · 2026-01-25T15:54:21Z

python/ray/data/_internal/execution/streaming_executor.py


looks like an elegant suggestion :)

KaisennHu · 2026-01-26T01:30:23Z

> Nice. Skimmed and overall LGTM.
>
> Did you get a chance to test this locally and see what charts look like? If so, would you mind sharing a screenshot of that? Just want to make sure the charts display correctly

I’m currently debugging the implementation. I followed the standard steps to start the Ray cluster and Grafana on a single node. While the new panels are visible in the Data Dashboard, both the new panels and the Overview panels show "No data" even when running test cases. Details were sent on Slack. Thanks for your help!

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

KaisennHu · 2026-01-30T08:28:10Z

> Nice. Skimmed and overall LGTM.
>
> Did you get a chance to test this locally and see what charts look like? If so, would you mind sharing a screenshot of that? Just want to make sure the charts display correctly

Thank you so much for your patient guidance! I’ve identified the root cause: a local network proxy was interfering with the connection between Grafana and Prometheus. After fixing the proxy settings, the Ray Data dashboard is now working perfectly. Really appreciate your help!

bveeramani · 2026-01-30T21:47:31Z

python/ray/data/_internal/execution/streaming_executor.py

+        # Cluster autoscaler utilization gauges
+        self._cluster_cpu_utilization_gauge: Gauge = Gauge(
+            "data_cluster_cpu_utilization",
+            description="Cluster utilization % (CPU)",
+            tag_keys=("dataset",),
+        )
+        self._cluster_gpu_utilization_gauge: Gauge = Gauge(
+            "data_cluster_gpu_utilization",
+            description="Cluster utilization % (GPU)",
+            tag_keys=("dataset",),
+        )
+        self._cluster_object_store_memory_utilization_gauge: Gauge = Gauge(
+            "data_cluster_object_store_memory_utilization",
+            description="Cluster utilization % (Object Store Memory)",
+            tag_keys=("dataset",),
+        )
+


These metrics are specific to an autoscaling implementation detail. Rather than placing them in the streaming executor (core scheduling logic), could you move them to RollingLogicalUtilizationGauge? I think we can add an optional execution_id parameter to the constructor, define the gauges in the constructor, and, if the execution_id isn't None, update the metrics.

Resolved it.
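The refactor requested above can be sketched roughly as follows. This is not the merged implementation: `UtilizationReporter` is a hypothetical analogue of the class named in the review, `FakeGauge` is a stand-in that only mimics the `Gauge(name, description=..., tag_keys=...)` and `set(value, tags=...)` shape seen in the diff, and using the execution ID as the `dataset` tag value is an assumption.

```python
from typing import Dict, Optional


class FakeGauge:
    """Stand-in for a metrics gauge; records samples instead of exporting them."""

    def __init__(self, name, description="", tag_keys=()):
        self.name = name
        self.description = description
        self.tag_keys = tag_keys
        self.samples = []

    def set(self, value, tags=None):
        self.samples.append((value, dict(tags or {})))


class UtilizationReporter:
    """Hypothetical holder for the gauges, kept outside the core scheduling loop."""

    _RESOURCES = (
        ("cpu", "CPU"),
        ("gpu", "GPU"),
        ("object_store_memory", "Object Store Memory"),
    )

    def __init__(self, execution_id: Optional[str] = None, gauge_cls=FakeGauge):
        self._execution_id = execution_id
        # Gauges are defined once, in the constructor, as the review suggests.
        self._gauges = {
            key: gauge_cls(
                f"data_cluster_{key}_utilization",
                description=f"Cluster utilization % ({label})",
                tag_keys=("dataset",),
            )
            for key, label in self._RESOURCES
        }

    def report(self, utilization: Dict[str, float]) -> None:
        """Publish utilization percentages; do nothing when no execution_id was given."""
        if self._execution_id is None:
            return
        tags = {"dataset": self._execution_id}  # assumption: tag value is the execution ID
        for key, value in utilization.items():
            self._gauges[key].set(value, tags=tags)


reporter = UtilizationReporter(execution_id="dataset_1")
reporter.report({"cpu": 75.0, "gpu": 0.0, "object_store_memory": 25.0})

silent = UtilizationReporter()  # no execution_id: metrics are skipped
silent.report({"cpu": 75.0})
```

Making `execution_id` optional keeps existing callers unchanged, since they construct the object without it and never emit the new metrics.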

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

bveeramani

Nice

@tianyi-ge

## Description

- Add three new charts to the Ray Data dashboard called "Cluster utilization % ({resource})" for each of CPU, GPU, and object store memory
- Add a dotted line at the DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD
- The three charts should go in a new row named "Cluster autoscaler"

@tianyi-ge

## Related issues

Fixes ray-project#60342

---------

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>


[Data] Add autoscaler metrics to Data Dashboard

20d43df

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

KaisennHu requested a review from a team as a code owner January 24, 2026 07:37

gemini-code-assist bot reviewed Jan 24, 2026

cursor bot reviewed Jan 24, 2026

ray-gardener bot added the data, observability, and community-contribution labels Jan 24, 2026

bveeramani reviewed Jan 24, 2026

tianyi-ge reviewed Jan 25, 2026

[Data] Add autoscaler metrics to Data Dashboard

be5e31f

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

KaisennHu requested a review from bveeramani January 30, 2026 08:33

bveeramani reviewed Jan 30, 2026

[Data] Add autoscaler metrics to Data Dashboard

234c8b9

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

KaisennHu requested a review from bveeramani February 2, 2026 00:51

bveeramani approved these changes Feb 2, 2026

bveeramani enabled auto-merge (squash) February 3, 2026 03:11

github-actions bot added the go label Feb 3, 2026

Merge branch 'master' into add-autoscaler-metrics

2cc35c7

github-actions bot disabled auto-merge February 3, 2026 03:42

bveeramani merged commit 3cf2400 into ray-project:master Feb 3, 2026
6 checks passed

KaisennHu deleted the add-autoscaler-metrics branch February 3, 2026 06:30
