[train] add proper filtering to metrics#53788
matthewdeng merged 7 commits into ray-project:master
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Pull Request Overview
This PR enhances dashboard filtering for training metrics and streamlines the Train Run view.
- Added GPU filters (`GpuIndex`, `GpuDeviceName`) to the base JSON dashboard
- Updated all Train panels to use explicit `SessionName`, `TrainRunName`, and `TrainRunId` filters
- Simplified the Train Run view by removing system resource panels and standardizing variable quoting
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/ray/dashboard/modules/metrics/dashboards/train_grafana_dashboard_base.json | Added GPU filter variables (GpuIndex, GpuDeviceName) |
| python/ray/dashboard/modules/metrics/dashboards/train_dashboard_panels.py | Scoped queries by explicit train-run filters, removed system panels, and adjusted filter quoting |
Comments suppressed due to low confidence (1)
python/ray/dashboard/modules/metrics/dashboards/train_grafana_dashboard_base.json:191
- The GPU variable queries still use `{global_filters}` and may include data from all sessions; consider adding `SessionName=~\"$SessionName\"` to scope GPU values to the current session.
"definition": "label_values(ray_node_gpus_utilization{{{global_filters}}}, GpuIndex)",
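As a rough illustration of the scoping that comment describes, the sketch below assembles a `label_values(...)` variable query from a list of label matchers. The helper name `label_values_query` is hypothetical and is not part of the Ray dashboard code; it only shows how adding a `SessionName` matcher would narrow the GPU variable to one session.

```python
def label_values_query(metric, label, filters):
    """Build a Grafana-style label_values() query string (illustrative helper)."""
    # Join the label matchers into one selector body, e.g. a="x",b="y".
    selector = ",".join(filters)
    # In an f-string, "{{" and "}}" emit literal braces around the selector.
    return f"label_values({metric}{{{selector}}}, {label})"


# Scope the GPU index variable to the currently selected session.
scoped = label_values_query(
    "ray_node_gpus_utilization",
    "GpuIndex",
    ['SessionName=~"$SessionName"'],
)
print(scoped)
```

With an empty filter list the same helper yields an unscoped query, which is the case the comment warns about.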
  default_uid="rayTrainDashboard",
  rows=TRAIN_GRAFANA_ROWS,
- standard_global_filters=['SessionName=~"$SessionName"'],
+ standard_global_filters=["SessionName=~'$SessionName'"],
The filter string uses single quotes around `$SessionName`, which may break the regex match in Grafana; switch to double quotes: `SessionName=~\"$SessionName\"`.
Suggested change:
- standard_global_filters=["SessionName=~'$SessionName'"],
+ standard_global_filters=["SessionName=~\"$SessionName\""],
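For reference, the three spellings in this thread can be compared directly as Python string literals. This is only a quick check of the quoting, not a statement about how Grafana or Prometheus treats each form; the variable names are made up for the comparison.

```python
# The three filter spellings from this thread, written as Python literals.
original = 'SessionName=~"$SessionName"'    # single-quoted literal, double quotes inside
changed = "SessionName=~'$SessionName'"     # double-quoted literal, single quotes inside
suggested = "SessionName=~\"$SessionName\""  # double-quoted literal, escaped double quotes

# The escaped form is the same string as the original single-quoted literal.
print(original == suggested)  # True
# The single-quotes-inside form is a different filter string.
print(changed == suggested)   # False
```

So the suggestion yields the same filter text as the pre-change line; only the Python quoting style differs.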
Signed-off-by: Matthew Deng <matt@anyscale.com>
  targets=[
      Target(
-         expr="sum(ray_train_controller_state{{{global_filters}}}) by (ray_train_run_name, ray_train_controller_state)",
+         expr='sum(ray_train_controller_state{{SessionName=~"$SessionName", ray_train_run_name=~"$TrainRunName", ray_train_run_id=~"$TrainRunId"}}) by (ray_train_run_name, ray_train_controller_state)',
@alanwguo I made these all explicit, but is it better to have {{global_filters}} instead?
You need to keep `global_filters` on every single panel because there are env vars that let users add additional global filters.
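A minimal sketch of why the placeholder matters, assuming the panel expressions are filled in with Python's `str.format` (suggested by the doubled braces in the diff, but not confirmed here). The env var name `EXAMPLE_EXTRA_GLOBAL_FILTERS` and both helper functions are hypothetical, not Ray's actual configuration.

```python
import os

# Hypothetical env var name, used only for this illustration.
EXTRA_FILTERS_ENV = "EXAMPLE_EXTRA_GLOBAL_FILTERS"


def build_global_filters(standard_filters, env=None):
    """Combine the dashboard's standard filters with user-supplied ones."""
    env = os.environ if env is None else env
    raw = env.get(EXTRA_FILTERS_ENV, "")
    # Naive comma split; a real implementation would need to handle commas
    # that appear inside quoted label values.
    extra = [part.strip() for part in raw.split(",") if part.strip()]
    return ",".join(list(standard_filters) + extra)


def render_expr(template, standard_filters, env=None):
    """Fill the {global_filters} placeholder; '{{' and '}}' become braces."""
    return template.format(
        global_filters=build_global_filters(standard_filters, env)
    )


standard = ['SessionName=~"$SessionName"']
user_env = {EXTRA_FILTERS_ENV: 'cluster="prod"'}

with_placeholder = "sum(ray_train_controller_state{{{global_filters}}})"
explicit_only = 'sum(ray_train_controller_state{{SessionName=~"$SessionName"}})'

# The user's extra filter reaches the query only through the placeholder.
print(render_expr(with_placeholder, standard, user_env))
# An expression with hard-coded matchers silently ignores the extra filter.
print(render_expr(explicit_only, standard, user_env))
```

Under these assumptions, a panel that wants run-level scoping can keep the placeholder and append its own matchers after it, rather than replacing it.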
Signed-off-by: Matthew Deng <matt@anyscale.com>
alanwguo left a comment
worked e2e! just need one change
python/ray/dashboard/modules/metrics/dashboards/train_grafana_dashboard_base.json
1. Added GPU filtering:
- Added `GpuIndex` and `GpuDeviceName` variables to dashboard
- Updated GPU panels to use these variables
- Variables hidden by default
2. Improved Train metrics:
- Added `TrainRunName` and `TrainRunId` to all Train queries
- Standardized quote usage
- Removed `{global_filters}` in favor of explicit variables
3. Simplified Train Run view:
- Reduced to only Train-specific metrics
- Removed system resource panels
---------
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>