[Data] Add autoscaler metrics to Data Dashboard #60472

Merged
bveeramani merged 4 commits into ray-project:master from KaisennHu:add-autoscaler-metrics on Feb 3, 2026
@KaisennHu (Contributor) commented Jan 24, 2026

Description

  • Add three new charts to the Ray Data dashboard called "Cluster utilization % ({resource})" for each of CPU, GPU, and object store memory
  • Add a dotted line at the DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD
  • The three charts should go in a new row named "Cluster autoscaler"
    @tianyi-ge

Related issues

Fixes #60342

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
@KaisennHu KaisennHu requested a review from a team as a code owner January 24, 2026 07:37
@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request adds new metrics and dashboard panels for monitoring cluster resource utilization (CPU, GPU, and Object Store Memory) for Ray Data. This is a valuable addition for observing autoscaling behavior. The implementation is solid. I have a couple of minor suggestions to improve code clarity and reduce duplication.

)

# Ray Data Metrics (Cluster Autoscaler)
# Default threshold for scaling up is 75% (0.75)
medium

The comment here is a bit confusing. It says "75% (0.75)", but the constant DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD is set to 75 and used as such since the panel's unit is percent. The (0.75) part could be misleading. I suggest updating the comment for clarity.

Suggested change
- # Default threshold for scaling up is 75% (0.75)
+ # Default threshold for scaling up is 75%.

Comment on lines +401 to +412
        # Calculate utilization percentages (0-100)
        cpu_util = (
            (global_usage.cpu / global_limits.cpu * 100) if global_limits.cpu else 0
        )
        gpu_util = (
            (global_usage.gpu / global_limits.gpu * 100) if global_limits.gpu else 0
        )
        osm_util = (
            (global_usage.object_store_memory / global_limits.object_store_memory * 100)
            if global_limits.object_store_memory
            else 0
        )
medium

The logic for calculating utilization percentage is repeated for CPU, GPU, and object store memory. To improve maintainability and reduce code duplication, you could extract this logic into a local helper function.

        def _calculate_util(usage, limit):
            return (usage / limit * 100) if limit else 0

        # Calculate utilization percentages (0-100)
        cpu_util = _calculate_util(global_usage.cpu, global_limits.cpu)
        gpu_util = _calculate_util(global_usage.gpu, global_limits.gpu)
        osm_util = _calculate_util(
            global_usage.object_store_memory, global_limits.object_store_memory
        )
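
A minimal, self-contained sketch of the suggested helper, to show how the zero-limit guard behaves when a cluster has no GPUs. The `Resources` dataclass and `utilization_percentages` function are hypothetical stand-ins for illustration; only the attribute names `cpu`, `gpu`, and `object_store_memory` come from the diff above.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical stand-in for the usage/limit objects in the diff."""

    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0


def calculate_util(usage: float, limit: float) -> float:
    """Return utilization as a percentage (0-100), or 0 when the limit is falsy."""
    return (usage / limit * 100) if limit else 0


def utilization_percentages(usage: Resources, limits: Resources) -> dict:
    """Apply the same guard to each of the three resources."""
    return {
        "cpu": calculate_util(usage.cpu, limits.cpu),
        "gpu": calculate_util(usage.gpu, limits.gpu),
        "object_store_memory": calculate_util(
            usage.object_store_memory, limits.object_store_memory
        ),
    }


# A CPU-only cluster: the GPU limit is 0, so the guard returns 0 instead of
# raising ZeroDivisionError.
usage = Resources(cpu=6, gpu=0, object_store_memory=2e9)
limits = Resources(cpu=8, gpu=0, object_store_memory=8e9)
print(utilization_percentages(usage, limits))
```

Note that the guard reports 0 for a missing resource, so a chart for a cluster without GPUs would show a flat line at 0% rather than a gap.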

Contributor replied:

looks like an elegant suggestion :)

@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@ray-gardener ray-gardener bot added the data, observability, and community-contribution labels Jan 24, 2026
@bveeramani (Member) left a comment

Nice. Skimmed and overall LGTM.

Did you get a chance to test this locally and see what charts look like? If so, would you mind sharing a screenshot of that? Just want to make sure the charts display correctly

        stack=False,
        thresholds=[
            {"color": "green", "value": None},
            {"color": "yellow", "value": DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD},
Contributor commented:

Will a dotted line automatically be added for a new threshold?
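
For context on this question: in Grafana's time series panel, the threshold steps and the way they are drawn are configured separately, and the drawing style is set through the `thresholdsStyle` option in the field config. The sketch below builds such a fragment as a plain Python dict. It illustrates the Grafana panel JSON shape only and is not the code in this PR; `build_threshold_field_config` is a hypothetical helper, and this thread does not confirm whether Ray's panel templates already set the option.

```python
import json

# Value taken from the review discussion (the panel unit is percent, so 75 means 75%).
DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD = 75


def build_threshold_field_config(threshold: float) -> dict:
    """Hypothetical helper: Grafana-style fieldConfig with a dashed threshold line."""
    return {
        "defaults": {
            "unit": "percent",
            "custom": {
                # "dashed" asks the time series panel to draw a dashed line
                # at each threshold value; "off" would hide the thresholds.
                "thresholdsStyle": {"mode": "dashed"},
            },
            "thresholds": {
                "mode": "absolute",
                "steps": [
                    {"color": "green", "value": None},  # base step
                    {"color": "yellow", "value": threshold},
                ],
            },
        }
    }


config = build_threshold_field_config(DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD)
print(json.dumps(config, indent=2))
```

If the generated dashboard JSON leaves the style mode at "off", the steps exist but no line appears, which is one way to check the answer against a locally rendered panel.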

@KaisennHu (Contributor, Author) commented:

> Nice. Skimmed and overall LGTM.
>
> Did you get a chance to test this locally and see what charts look like? If so, would you mind sharing a screenshot of that? Just want to make sure the charts display correctly

I’m currently debugging the implementation. I followed the standard steps to start the Ray cluster and Grafana on a single node. While the new panels are visible in the Data Dashboard, both the new panels and the Overview panels show "No data" even when running test cases. Details were sent on Slack. Thanks for your help!

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
@KaisennHu (Contributor, Author) commented:

> Nice. Skimmed and overall LGTM.
>
> Did you get a chance to test this locally and see what charts look like? If so, would you mind sharing a screenshot of that? Just want to make sure the charts display correctly

Thank you so much for your patient guidance! I’ve identified the root cause: a local network proxy was interfering with the connection between Grafana and Prometheus. After fixing the proxy settings, the Ray Data dashboard is now working perfectly. Really appreciate your help!
image
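
For anyone who hits the same "No data" symptom, one way to rule out a proxy is to query Prometheus directly while ignoring proxy environment variables. The sketch below uses only the Python standard library. The host and port (`localhost:9090`) and the `ray_` prefix on the metric name are assumptions about a default local setup, not details confirmed in this thread, and both helper functions are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def query_without_proxy(url: str, timeout: float = 5.0) -> dict:
    """Fetch the URL with an opener that ignores any configured proxy."""
    # An empty ProxyHandler mapping disables proxy auto-detection.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=timeout) as response:
        return json.load(response)


# Assumed defaults: adjust the address and metric name for your setup.
url = build_query_url("http://localhost:9090", "ray_data_cluster_cpu_utilization")
print(url)
```

Calling `query_without_proxy(url)` against a running Prometheus returns the parsed JSON response. If that call returns samples while Grafana still shows "No data", the connection between Grafana and Prometheus is the more likely culprit than the metrics themselves.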

@KaisennHu KaisennHu requested a review from bveeramani January 30, 2026 08:33
Comment on lines +158 to +174
        # Cluster autoscaler utilization gauges
        self._cluster_cpu_utilization_gauge: Gauge = Gauge(
            "data_cluster_cpu_utilization",
            description="Cluster utilization % (CPU)",
            tag_keys=("dataset",),
        )
        self._cluster_gpu_utilization_gauge: Gauge = Gauge(
            "data_cluster_gpu_utilization",
            description="Cluster utilization % (GPU)",
            tag_keys=("dataset",),
        )
        self._cluster_object_store_memory_utilization_gauge: Gauge = Gauge(
            "data_cluster_object_store_memory_utilization",
            description="Cluster utilization % (Object Store Memory)",
            tag_keys=("dataset",),
        )

Member commented:

These metrics are specific to an autoscaling implementation detail. Rather than placing them in the streaming executor (core scheduling logic), could you move them to RollingLogicalUtilizationGauge? I think we can add an optional execution_id parameter to the constructor, define the gauges in the constructor, and, if the execution_id isn't None, update the metrics.
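
A rough sketch of the shape described in this comment, under stated assumptions. `FakeGauge` is a hypothetical stand-in for a metrics gauge that exposes `set(value, tags)`, and `UtilizationReporterSketch` is an illustrative class, not the actual `RollingLogicalUtilizationGauge`. Using the execution ID as the value of the `dataset` tag is also an assumption. Only the metric name, description, and tag key come from the diff above.

```python
from typing import Dict, Optional


class FakeGauge:
    """Hypothetical stand-in for a metrics gauge exposing set(value, tags)."""

    def __init__(self, name: str, description: str = "", tag_keys: tuple = ()):
        self.name = name
        self.description = description
        self.tag_keys = tag_keys
        self.last_value: Optional[float] = None
        self.last_tags: Optional[Dict[str, str]] = None

    def set(self, value: float, tags: Optional[Dict[str, str]] = None) -> None:
        # Record the most recent observation so it can be inspected.
        self.last_value = value
        self.last_tags = tags


class UtilizationReporterSketch:
    """Illustrative only: owns its gauge and reports when an execution_id is set."""

    def __init__(self, execution_id: Optional[str] = None):
        self._execution_id = execution_id
        # The gauge is defined in the constructor, as the comment suggests.
        self._cpu_gauge = FakeGauge(
            "data_cluster_cpu_utilization",
            description="Cluster utilization % (CPU)",
            tag_keys=("dataset",),
        )

    def report_cpu(self, utilization_percent: float) -> None:
        # Without an execution_id there is no tag value, so skip the update.
        if self._execution_id is None:
            return
        self._cpu_gauge.set(utilization_percent, tags={"dataset": self._execution_id})


reporter = UtilizationReporterSketch(execution_id="dataset_1")
reporter.report_cpu(75.0)
print(reporter._cpu_gauge.last_value, reporter._cpu_gauge.last_tags)
```

Keeping the gauges inside the autoscaling-specific class means the core scheduling loop does not need to know these metrics exist, and callers that omit `execution_id` get the previous behavior unchanged.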

Contributor Author replied:

Resolved it.

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
@KaisennHu KaisennHu requested a review from bveeramani February 2, 2026 00:51
@bveeramani (Member) left a comment

Nice

@bveeramani bveeramani enabled auto-merge (squash) February 3, 2026 03:11
@github-actions github-actions bot added the go label Feb 3, 2026
@github-actions github-actions bot disabled auto-merge February 3, 2026 03:42
@bveeramani bveeramani merged commit 3cf2400 into ray-project:master Feb 3, 2026
6 checks passed
@KaisennHu KaisennHu deleted the add-autoscaler-metrics branch February 3, 2026 06:30
rayhhome pushed a commit to rayhhome/ray that referenced this pull request Feb 4, 2026
## Description
- Add three new charts to the Ray Data dashboard called "Cluster
utilization % ({resource})" for each of CPU, GPU, and object store
memory
- Add a dotted line at the DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD
- The three charts should go in a new row named "Cluster autoscaler"
@tianyi-ge

## Related issues
Fixes ray-project#60342

---------

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
elliot-barn pushed a commit that referenced this pull request Feb 9, 2026
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Labels

community-contribution (Contributed by the community)
data (Ray Data-related issues)
go (add ONLY when ready to merge, run all tests)
observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling)

Development

Successfully merging this pull request may close these issues.

[Data] Add autoscaler metrics to Data Dashboard

3 participants