[Data] Add autoscaler metrics to Data Dashboard #60472
bveeramani merged 4 commits into ray-project:master
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
Code Review
This pull request adds new metrics and dashboard panels for monitoring cluster resource utilization (CPU, GPU, and Object Store Memory) for Ray Data. This is a valuable addition for observing autoscaling behavior. The implementation is solid. I have a couple of minor suggestions to improve code clarity and reduce duplication.
)

# Ray Data Metrics (Cluster Autoscaler)
# Default threshold for scaling up is 75% (0.75)
The comment here is a bit confusing. It says "75% (0.75)", but the constant DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD is set to 75 and used as such since the panel's unit is percent. The (0.75) part could be misleading. I suggest updating the comment for clarity.
Suggested change:
- # Default threshold for scaling up is 75% (0.75)
+ # Default threshold for scaling up is 75%.
# Calculate utilization percentages (0-100)
cpu_util = (
    (global_usage.cpu / global_limits.cpu * 100) if global_limits.cpu else 0
)
gpu_util = (
    (global_usage.gpu / global_limits.gpu * 100) if global_limits.gpu else 0
)
osm_util = (
    (global_usage.object_store_memory / global_limits.object_store_memory * 100)
    if global_limits.object_store_memory
    else 0
)
The logic for calculating utilization percentage is repeated for CPU, GPU, and object store memory. To improve maintainability and reduce code duplication, you could extract this logic into a local helper function.
def _calculate_util(usage, limit):
    return (usage / limit * 100) if limit else 0

# Calculate utilization percentages (0-100)
cpu_util = _calculate_util(global_usage.cpu, global_limits.cpu)
gpu_util = _calculate_util(global_usage.gpu, global_limits.gpu)
osm_util = _calculate_util(
    global_usage.object_store_memory, global_limits.object_store_memory
)
looks like an elegant suggestion :)
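For anyone who wants to try the suggested helper outside the executor, here is a minimal standalone sketch. The `Resources` dataclass and the sample numbers are hypothetical stand-ins for the `global_usage` and `global_limits` objects in the diff; only the helper's logic comes from the suggestion itself.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical stand-in for the usage/limit objects used in the diff."""

    cpu: float = 0.0
    gpu: float = 0.0
    object_store_memory: float = 0.0


def calculate_util(usage: float, limit: float) -> float:
    """Return utilization as a percentage (0-100), or 0 when the limit is zero or unset."""
    return (usage / limit * 100) if limit else 0.0


# Made-up sample values: a cluster with 8 CPUs, no GPUs, and 8 GB of object store.
global_usage = Resources(cpu=6, gpu=0, object_store_memory=2e9)
global_limits = Resources(cpu=8, gpu=0, object_store_memory=8e9)

cpu_util = calculate_util(global_usage.cpu, global_limits.cpu)
gpu_util = calculate_util(global_usage.gpu, global_limits.gpu)
osm_util = calculate_util(
    global_usage.object_store_memory, global_limits.object_store_memory
)
```

The `if limit` guard is what keeps a cluster without GPUs from raising `ZeroDivisionError`; it reports 0 instead.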
python/ray/dashboard/modules/metrics/dashboards/data_dashboard_panels.py
bveeramani left a comment
Nice. Skimmed and overall LGTM.
Did you get a chance to test this locally and see what the charts look like? If so, would you mind sharing a screenshot? I just want to make sure the charts display correctly.
    stack=False,
    thresholds=[
        {"color": "green", "value": None},
        {"color": "yellow", "value": DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD},
Will a dotted line automatically be added for a new threshold?
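As a point of reference for this question: in Grafana's time series panel, threshold steps alone do not draw a line; the line appears only when the panel's `thresholdsStyle` mode is set to something other than `"off"`. The sketch below builds such a field config as a plain Python dict. It assumes the panel is rendered as a Grafana time series panel; the function name `threshold_field_config` is hypothetical, and this is not how Ray's own panel template is necessarily wired.

```python
from typing import Any, Dict

# Value from the discussion above: the threshold is expressed in percent (0-100).
DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD = 75


def threshold_field_config(threshold: float, line_mode: str = "dashed") -> Dict[str, Any]:
    """Build a Grafana time series fieldConfig with a visible threshold line.

    line_mode follows Grafana's thresholdsStyle modes, e.g. "off", "line", "dashed".
    """
    steps = [
        {"color": "green", "value": None},  # base step, applies below the threshold
        {"color": "yellow", "value": threshold},
    ]
    return {
        "defaults": {
            "unit": "percent",  # values are 0-100, not 0.0-1.0
            "thresholds": {"mode": "absolute", "steps": steps},
            "custom": {"thresholdsStyle": {"mode": line_mode}},
        }
    }


field_config = threshold_field_config(DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD)
```

With `line_mode="off"` the steps would still be stored in the panel JSON but no line would be rendered, which is why a screenshot is the easiest way to confirm the behavior.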
I'm still debugging the implementation. I followed the standard steps to start a Ray cluster and Grafana on a single node. The new panels appear in the Data Dashboard, but both the new panels and the Overview panels show "No data", even while test cases are running. I sent the details on Slack. Thanks for your help!
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
# Cluster autoscaler utilization gauges
self._cluster_cpu_utilization_gauge: Gauge = Gauge(
    "data_cluster_cpu_utilization",
    description="Cluster utilization % (CPU)",
    tag_keys=("dataset",),
)
self._cluster_gpu_utilization_gauge: Gauge = Gauge(
    "data_cluster_gpu_utilization",
    description="Cluster utilization % (GPU)",
    tag_keys=("dataset",),
)
self._cluster_object_store_memory_utilization_gauge: Gauge = Gauge(
    "data_cluster_object_store_memory_utilization",
    description="Cluster utilization % (Object Store Memory)",
    tag_keys=("dataset",),
)
These metrics are specific to an autoscaling implementation detail. Rather than placing them in the streaming executor (core scheduling logic), could you move them to RollingLogicalUtilizationGauge? I think we can add an optional execution_id parameter to the constructor, define the gauges there, and update the metrics only if execution_id isn't None.
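A rough sketch of the shape being proposed here, under several assumptions: `FakeGauge` is a hypothetical stand-in that mimics only the `set(value, tags=...)` call of `ray.util.metrics.Gauge` so the example runs without Ray, and `UtilizationGaugeSketch` with its `observe` method is an illustrative name, not the real `RollingLogicalUtilizationGauge` interface. Using `execution_id` as the `dataset` tag value is also an assumption.

```python
from typing import Dict, Optional, Tuple


class FakeGauge:
    """Hypothetical stand-in for ray.util.metrics.Gauge; records the last value per tag set."""

    def __init__(self, name: str, description: str = "", tag_keys: Tuple[str, ...] = ()):
        self.name = name
        self.description = description
        self.tag_keys = tag_keys
        self.values: Dict[tuple, float] = {}

    def set(self, value: float, tags: Optional[Dict[str, str]] = None) -> None:
        self.values[tuple(sorted((tags or {}).items()))] = value


class UtilizationGaugeSketch:
    """Owns the three utilization gauges; exports only when an execution_id is given."""

    _RESOURCES = (
        ("cpu", "CPU"),
        ("gpu", "GPU"),
        ("object_store_memory", "Object Store Memory"),
    )

    def __init__(self, execution_id: Optional[str] = None):
        self._execution_id = execution_id
        # Gauges are defined in the constructor, as suggested in the review.
        self._gauges = {
            key: FakeGauge(
                f"data_cluster_{key}_utilization",
                description=f"Cluster utilization % ({label})",
                tag_keys=("dataset",),
            )
            for key, label in self._RESOURCES
        }

    def observe(self, utilization_percent: Dict[str, float]) -> None:
        """Record utilization values (0-100); a no-op when execution_id is None."""
        if self._execution_id is None:
            return
        tags = {"dataset": self._execution_id}  # assumed tag mapping
        for key, value in utilization_percent.items():
            self._gauges[key].set(value, tags=tags)


tracked = UtilizationGaugeSketch(execution_id="dataset_1")
tracked.observe({"cpu": 80.0, "gpu": 0.0})

untracked = UtilizationGaugeSketch()
untracked.observe({"cpu": 80.0})
```

Keeping the `None` check inside the gauge class means the streaming executor does not need to know whether metrics are being exported.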
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
## Description
- Add three new charts to the Ray Data dashboard called "Cluster
utilization % ({resource})" for each of CPU, GPU, and object store
memory
- Add a dotted line at the DEFAULT_CLUSTER_SCALING_UP_UTIL_THRESHOLD
- The three charts should go in a new row named "Cluster autoscaler"
@tianyi-ge
## Related issues
Fixes ray-project#60342
---------
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

