
[Serve] Expose health metrics from controller#60473

Merged
abrarsheikh merged 12 commits into master from controller-health
Jan 29, 2026

Conversation

@abrarsheikh
Contributor

@abrarsheikh abrarsheikh commented Jan 24, 2026

Adds a new get_health_metrics() API to the Serve controller and a serve controller-health CLI command that expose performance metrics for diagnosing controller issues as the cluster grows.

Test Plan

  • Unit tests for DurationStats, ControllerHealthMetrics, ControllerHealthMetricsTracker
  • E2E test for get_health_metrics() API
  • CLI test for serve controller-health
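The `event_loop_delay_s` field in the output below is essentially the overshoot of the control loop's periodic sleep. A minimal sketch of how such a delay can be measured (an assumed implementation, not taken from this PR):

```python
import asyncio
import time


async def measure_event_loop_delay(expected_sleep_s: float = 0.1) -> float:
    """Sleep for a fixed interval and return how much longer it actually took.

    A positive delay means the event loop was busy running other tasks
    between when the sleep timer expired and when this coroutine resumed.
    """
    start = time.monotonic()
    await asyncio.sleep(expected_sleep_s)
    actual_sleep_s = time.monotonic() - start
    return actual_sleep_s - expected_sleep_s


delay = asyncio.run(measure_event_loop_delay())
```

In the JSON below, the last sleep of ~0.10063 s minus the expected 0.1 s matches the reported `event_loop_delay_s` of ~0.00063 s.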
❯ ray start --head --metrics-export-port=8080
2026-01-28 01:11:24,923 - INFO - Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2026-01-28 01:11:24,923 - INFO - NumExpr defaulting to 8 threads.
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 172.31.7.228

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='172.31.7.228:6379'

  To connect to this Ray cluster:
    import ray
    ray.init()

  To submit a Ray job using the Ray Jobs CLI:
    RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265' ray job submit --working-dir . -- python my_script.py

  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
  for more information on submitting Ray jobs to the Ray cluster.

  To terminate the Ray runtime, run
    ray stop

  To view the status of the cluster, use
    ray status

  To monitor and debug Ray, view the dashboard at
    127.0.0.1:8265

  If connection to the dashboard fails, check your firewall settings and network configuration.
❯ serve start
2026-01-28 01:11:31,295	INFO worker.py:1818 -- Connecting to existing Ray cluster at address: 172.31.7.228:6379...
2026-01-28 01:11:31,315	INFO worker.py:1998 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
/home/ubuntu/ray/python/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
(ProxyActor pid=1976778) INFO 2026-01-28 01:11:32,716 proxy 172.31.7.228 -- Proxy starting on node c4e8e7cbced234146fc4e47ad45a94f2b04dbb7b4ba03aee4542ca98 (HTTP port: 8000).
INFO 2026-01-28 01:11:32,774 serve 1976714 -- Started Serve in namespace "serve".
(ProxyActor pid=1976778) INFO 2026-01-28 01:11:32,771 proxy 172.31.7.228 -- Got updated endpoints: {}.
❯ serve controller-health --json | jq .
2026-01-28 01:11:38,690	INFO worker.py:1818 -- Connecting to existing Ray cluster at address: 172.31.7.228:6379...
2026-01-28 01:11:38,710	INFO worker.py:1998 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
/home/ubuntu/ray/python/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
{
  "timestamp": 1769562698.7340927,
  "controller_start_time": 1769562692.1990602,
  "uptime_s": 6.535032510757446,
  "num_control_loops": 65,
  "loop_duration_s": {
    "mean": 0.0011578083038330078,
    "std": 0.0012476540718215343,
    "min": 0.0007350444793701172,
    "max": 0.011061429977416992
  },
  "loops_per_second": 9.946392752140438,
  "last_sleep_duration_s": 0.10063314437866211,
  "expected_sleep_duration_s": 0.1,
  "event_loop_delay_s": 0.0006331443786621038,
  "num_asyncio_tasks": 5,
  "deployment_state_update_duration_s": {
    "mean": 1.5174425565279447e-05,
    "std": 1.9936262596089162e-06,
    "min": 8.344650268554688e-06,
    "max": 2.5510787963867188e-05
  },
  "application_state_update_duration_s": {
    "mean": 2.6226043701171875e-06,
    "std": 3.18502181802785e-07,
    "min": 1.9073486328125e-06,
    "max": 3.5762786865234375e-06
  },
  "proxy_state_update_duration_s": {
    "mean": 0.00020424769474909857,
    "std": 0.0013016703229816826,
    "min": 2.2172927856445312e-05,
    "max": 0.010605335235595703
  },
  "node_update_duration_s": {
    "mean": 6.070503821739783e-06,
    "std": 7.122869784728825e-07,
    "min": 2.6226043701171875e-06,
    "max": 7.3909759521484375e-06
  },
  "handle_metrics_delay_ms": {
    "mean": 0,
    "std": 0,
    "min": 0,
    "max": 0
  },
  "replica_metrics_delay_ms": {
    "mean": 0,
    "std": 0,
    "min": 0,
    "max": 0
  },
  "process_memory_mb": 178.44921875
}
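Each `*_duration_s` block above is a `DurationStats` summary (mean/std/min/max). The class name comes from the test plan; the implementation below is a hypothetical sketch using a bounded rolling window:

```python
import math
from collections import deque


class DurationStats:
    """Rolling window of recent durations, summarized as mean/std/min/max.

    Hypothetical sketch; the PR's actual implementation may differ.
    """

    def __init__(self, window_size: int = 1000):
        # Bounded deque so memory stays constant no matter how long
        # the controller runs.
        self._samples = deque(maxlen=window_size)

    def record(self, duration_s: float) -> None:
        self._samples.append(duration_s)

    def summary(self) -> dict:
        if not self._samples:
            return {"mean": 0, "std": 0, "min": 0, "max": 0}
        n = len(self._samples)
        mean = sum(self._samples) / n
        variance = sum((x - mean) ** 2 for x in self._samples) / n
        return {
            "mean": mean,
            "std": math.sqrt(variance),
            "min": min(self._samples),
            "max": max(self._samples),
        }


stats = DurationStats()
for d in (0.001, 0.002, 0.003):
    stats.record(d)
print(stats.summary()["mean"])  # approximately 0.002
```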

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh added the `go` (add ONLY when ready to merge, run all tests) label Jan 24, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive health metrics tracking system for the Serve controller. A new ControllerHealthMetricsTracker is added to collect various performance indicators like control loop duration, event loop delay, and memory usage. These metrics are then exposed via a new get_health_metrics method on the controller. The changes are well-implemented and include thorough unit and integration tests.

My main feedback is to improve the cross-platform compatibility of the memory usage calculation, which is currently only correct for Linux. I've left a specific suggestion to address this.
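The reviewer's point about Linux-only memory accounting likely stems from `resource.getrusage`, whose `ru_maxrss` field is reported in kilobytes on Linux but in bytes on macOS. A sketch of a platform-aware conversion (hypothetical; the PR may resolve this differently, e.g. via psutil):

```python
import resource
import sys


def peak_memory_mb() -> float:
    """Peak resident set size of the current process in MiB.

    ru_maxrss is in kilobytes on Linux but in bytes on macOS, so a
    naive kilobyte conversion would overstate usage 1024x on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return rss / (1024 * 1024)
    return rss / 1024


print(f"{peak_memory_mb():.1f} MB")
```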

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh marked this pull request as ready for review January 26, 2026 17:53
@abrarsheikh abrarsheikh requested a review from a team as a code owner January 26, 2026 17:53
@ray-gardener ray-gardener bot added the `serve` (Ray Serve Related Issue) and `observability` (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Jan 26, 2026
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh enabled auto-merge (squash) January 28, 2026 07:59
Signed-off-by: abrar <abrar@anyscale.com>
@github-actions github-actions bot disabled auto-merge January 28, 2026 21:20
@abrarsheikh abrarsheikh merged commit 38571f1 into master Jan 29, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the controller-health branch January 29, 2026 04:54
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
limarkdcunha pushed a commit to limarkdcunha/ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
Signed-off-by: 400Ping <jiekaichang@apache.org>
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Signed-off-by: peterxcli <peterxcli@gmail.com>

Labels

  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
  • serve: Ray Serve Related Issue


2 participants