Skip to content

[train] add proper filtering to metrics#53788

Merged
matthewdeng merged 7 commits intoray-project:masterfrom
matthewdeng:train-dash-filters
Jun 18, 2025
Merged

[train] add proper filtering to metrics#53788
matthewdeng merged 7 commits intoray-project:masterfrom
matthewdeng:train-dash-filters

Conversation

@matthewdeng
Copy link
Copy Markdown
Contributor

  1. Added GPU filtering:

    • Added GpuIndex and GpuDeviceName variables to dashboard
    • Updated GPU panels to use these variables
    • Variables hidden by default
  2. Improved Train metrics:

    • Added TrainRunName and TrainRunId to all Train queries
    • Standardized quote usage
    • Removed {global_filters} in favor of explicit variables
  3. Simplified Train Run view:

    • Reduced to only Train-specific metrics
    • Removed system resource panels

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: Matthew Deng <matt@anyscale.com>
Copilot AI review requested due to automatic review settings June 13, 2025 03:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances dashboard filtering for training metrics and streamlines the Train Run view.

  • Added GPU filters (GpuIndex, GpuDeviceName) to the base JSON dashboard
  • Updated all Train panels to use explicit SessionName, TrainRunName, and TrainRunId filters
  • Simplified the Train Run view by removing system resource panels and standardizing variable quoting

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
python/ray/dashboard/modules/metrics/dashboards/train_grafana_dashboard_base.json Added GPU filter variables (GpuIndex, GpuDeviceName)
python/ray/dashboard/modules/metrics/dashboards/train_dashboard_panels.py Scoped queries by explicit train-run filters, removed system panels, and adjusted filter quoting
Comments suppressed due to low confidence (1)

python/ray/dashboard/modules/metrics/dashboards/train_grafana_dashboard_base.json:191

  • The GPU variable queries still use {global_filters} and may include data from all sessions; consider adding SessionName=~\"$SessionName\" to scope GPU values to the current session.
"definition": "label_values(ray_node_gpus_utilization{{{global_filters}}}, GpuIndex)",

default_uid="rayTrainDashboard",
rows=TRAIN_GRAFANA_ROWS,
standard_global_filters=['SessionName=~"$SessionName"'],
standard_global_filters=["SessionName=~'$SessionName'"],
Copy link

Copilot AI Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filter string uses single quotes around $SessionName, which may break the regex match in Grafana; switch to double quotes: SessionName=~\"$SessionName\".

Suggested change
standard_global_filters=["SessionName=~'$SessionName'"],
standard_global_filters=["SessionName=~\"$SessionName\""],

Copilot uses AI. Check for mistakes.
Signed-off-by: Matthew Deng <matt@anyscale.com>
targets=[
Target(
expr="sum(ray_train_controller_state{{{global_filters}}}) by (ray_train_run_name, ray_train_controller_state)",
expr='sum(ray_train_controller_state{{SessionName=~"$SessionName", ray_train_run_name=~"$TrainRunName", ray_train_run_id=~"$TrainRunId"}}) by (ray_train_run_name, ray_train_controller_state)',
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alanwguo I made these all explicit, but is it better to have {{global_filters}} instead?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to keep global_filters on every single panel because there are env vars that lets users add additional global filters.

Signed-off-by: Matthew Deng <matt@anyscale.com>
@matthewdeng matthewdeng requested a review from alanwguo June 17, 2025 01:54
Copy link
Copy Markdown
Contributor

@alanwguo alanwguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worked e2e! just need one change

Signed-off-by: Matthew Deng <matt@anyscale.com>
@matthewdeng matthewdeng enabled auto-merge (squash) June 18, 2025 21:38
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Jun 18, 2025
@matthewdeng matthewdeng merged commit 3c80765 into ray-project:master Jun 18, 2025
6 of 7 checks passed
minerharry pushed a commit to minerharry/ray that referenced this pull request Jun 27, 2025
1. Added GPU filtering:
   - Added `GpuIndex` and `GpuDeviceName` variables to dashboard
   - Updated GPU panels to use these variables
   - Variables hidden by default

2. Improved Train metrics:
   - Added `TrainRunName` and `TrainRunId` to all Train queries
   - Standardized quote usage
   - Removed `{global_filters}` in favor of explicit variables

3. Simplified Train Run view:
   - Reduced to only Train-specific metrics
   - Removed system resource panels

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Jul 2, 2025
1. Added GPU filtering:
   - Added `GpuIndex` and `GpuDeviceName` variables to dashboard
   - Updated GPU panels to use these variables
   - Variables hidden by default

2. Improved Train metrics:
   - Added `TrainRunName` and `TrainRunId` to all Train queries
   - Standardized quote usage
   - Removed `{global_filters}` in favor of explicit variables

3. Simplified Train Run view:
   - Reduced to only Train-specific metrics
   - Removed system resource panels

---------

Signed-off-by: Matthew Deng <matt@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants