
[Umbrella][serve] Advanced Observability for Serve Autoscaler #55833

@nadongjun

Description


This issue tracks the implementation of advanced observability for the Serve Autoscaler,
as proposed in #41135 (comment) and specified in detail in this design document.

The goal is to make it easier to debug scaling behavior by exposing structured logs, metrics, and detailed CLI outputs (serve status -v).

This work depends on the ongoing implementation of the Serve custom autoscaler (deployment-level, application-level, and external scalers).

Each observability feature builds on top of the corresponding autoscaler logic, so the sub-issues should be tackled in order: Skeleton -> Deployment -> Application -> External -> Docs.

Sub-issues

Use case

The current serve status command only shows basic information such as replica counts and health.
As custom autoscaling (deployment-level, application-level, external scalers) becomes available, users need more detailed visibility to understand why scaling decisions are made.

serve status -v will let users:

  • See scaling decisions and the policies/metrics that triggered them.
  • Check metrics freshness (normal vs. delayed).
  • Understand errors or abnormal events during autoscaler operation.
  • Track application-level scaling when multiple deployments scale together.
  • Debug external scaler behavior, e.g. webhook response codes and delivery history.

This extended visibility is essential for debugging complex autoscaling behavior and building confidence in custom scaling logic.

Example Output (from RFC)

$ serve status -v

Example 1: Deployment using Default Autoscaling Policy (queue-length based)

======== Serve Autoscaler status: 2025-08-19T15:05:30Z ========
Deployment status
---------------------------------------------------------------
deployment_default_policy:
    Current replicas: 3
    Target replicas: 5
    Replicas allowed: min=1, max=10
    Scaling status: scaling up
    Scaling decisions:
        2025-08-19T14:00:00Z - scaled down from 5 -> 3 (low traffic)
        2025-08-19T15:05:00Z - scaled up from 3 -> 5 (12 requests queued)
    Policy: Default (queue-length based)
    Metrics (look_back_period_s=30):
        queued_requests: 12
    Metric collection: delayed (last update 30s ago)
    Errors: (none)
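The queue-length-based default policy shown above can be approximated by a small pure-Python sketch. This is illustrative only, not Serve's actual implementation: the function name and the target_queue_per_replica knob are made up, and the real policy averages metrics over look_back_period_s rather than using a single instantaneous queue length.

```python
import math

def target_replicas(queued_requests: int,
                    min_replicas: int, max_replicas: int,
                    target_queue_per_replica: float = 2.0) -> int:
    """Illustrative queue-length-based target calculation (not Serve's code)."""
    if queued_requests <= 0:
        desired = min_replicas
    else:
        # One replica for every target_queue_per_replica queued requests.
        desired = math.ceil(queued_requests / target_queue_per_replica)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))

print(target_replicas(12, min_replicas=1, max_replicas=10))  # 6
```

The "Replicas allowed: min=1, max=10" line in the output corresponds to the clamping step at the end.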


Example 2: Deployment using a Custom Autoscaling Policy (latency-based)

======== Serve Autoscaler status: 2025-08-19T12:10:00Z ========
Deployment status
---------------------------------------------------------------
deployment_custom_latency_policy:
    Current replicas: 8
    Target replicas: 8
    Replicas allowed: min=1, max=20
    Scaling status: stable
    Scaling decisions:
        2025-08-19T11:30:00Z - scaled up from 2 -> 4 (cpu_usage_percent 85% > 80%)
        2025-08-19T11:50:00Z - scaled up from 4 -> 8 (latency_p95_ms 450ms > 300ms)
    Policy: Custom (my_custom_policy)
    Metrics (look_back_period_s=60):
        latency_p95_ms: 450.0
        cpu_usage_percent: 62.5
    Metric collection: healthy (last update 5s ago)
    Errors:
        2025-08-19T12:05:00Z - PolicyError: Exception in user policy (ZeroDivisionError) - scaling skipped
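A custom policy is user code, so the autoscaler has to guard against exceptions like the ZeroDivisionError in the Errors section above and skip the scaling step instead of crashing. A minimal, hypothetical sketch of that guard (safe_apply_policy and the error-line format are assumptions, not Serve's actual API):

```python
import time
from typing import Callable

def safe_apply_policy(policy: Callable[[dict], int],
                      metrics: dict,
                      current_target: int,
                      errors: list) -> int:
    """Run a user-supplied policy; on failure, record the error and skip scaling."""
    try:
        return policy(metrics)
    except Exception as exc:
        # Mirrors the "PolicyError: ... - scaling skipped" line in the status output.
        timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        errors.append(f"{timestamp} - PolicyError: Exception in user policy "
                      f"({type(exc).__name__}) - scaling skipped")
        return current_target  # keep the current target replicas

errors: list = []
bad_policy = lambda metrics: 1 // 0  # raises ZeroDivisionError
result = safe_apply_policy(bad_policy, {}, current_target=8, errors=errors)
print(result)  # 8 (target unchanged, error recorded for serve status -v)
```

Surfacing the recorded error lines in serve status -v is what lets users see that a scaling cycle was skipped rather than silently absorbed.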


Example 3: Deployment using an External Webhook Scaler

======== Serve Autoscaler status: 2025-08-19T04:12:00Z ========
Deployment status
---------------------------------------------------------------
deployment_webhook_policy:
    Current replicas: 5
    Target replicas: 3
    Replicas allowed: min=0, max=10
    Scaling status: scaling down
    Scaling decisions:
        2025-08-19T03:59:00Z - scaled up from 3 -> 5 (external scaler: cpu_usage_percent 92% > 90%)
        2025-08-19T04:10:00Z - scaled down from 5 -> 3 (external scaler: cpu_usage_percent 5% < 10%)
    Policy: External (external scaler)
    Metrics: n/a (decisions made externally)
    Metric collection: healthy (last update 2s ago)
    Webhook history:
        2025-08-19T03:59:01Z - scale up to 5 replicas (200 OK)
        2025-08-19T04:10:01Z - scale down to 3 replicas (500 ERROR)
    Errors: (none)
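The webhook history above implies that every delivery (timestamp, requested replicas, HTTP response code) is recorded whether or not it succeeds, so failed calls like the 500 remain visible. A hedged sketch of such a record type (illustrative shape only, not Serve's actual data model):

```python
from dataclasses import dataclass

@dataclass
class WebhookDelivery:
    """One external-scaler webhook call, kept for the serve status -v history."""
    timestamp: str
    target_replicas: int
    status_code: int

    def render(self) -> str:
        # Treat any 2xx response as success, everything else as an error.
        outcome = "OK" if 200 <= self.status_code < 300 else "ERROR"
        return (f"{self.timestamp} - scale to {self.target_replicas} replicas "
                f"({self.status_code} {outcome})")

history = [
    WebhookDelivery("2025-08-19T03:59:01Z", 5, 200),
    WebhookDelivery("2025-08-19T04:10:01Z", 3, 500),
]
for delivery in history:
    print(delivery.render())
```

Keeping the raw status codes in the history is what makes external-scaler debugging possible: a 500 on a scale-down explains why the target was not reached.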

Example 4: Application using a Custom Application-Level Policy

======== Serve Autoscaler status: 2025-08-20T10:00:00Z ========
Application status
---------------------------------------------------------------
application_default_policy:
    Scaling status: scaling up
    Policy: Custom (example_application_policy)
    Scaling decisions:
        2025-08-20T09:55:00Z - scaled up frontend: 2 -> 4, backend: 4 -> 6 (total_requests=200)
    Metrics (look_back_period_s=45):
        total_requests: 200
    Errors: (none)

Deployments:
    frontend:
        Current replicas: 4
        Target replicas: 4
        Replicas allowed: min=1, max=10
    backend:
        Current replicas: 6
        Target replicas: 6
        Replicas allowed: min=2, max=20
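An application-level policy like example_application_policy above takes application-wide metrics and returns a target per deployment, so several deployments can scale together in one decision. The sketch below is hypothetical (the signature, the per-100-requests rule, and the bounds dict are assumptions); it happens to reproduce the frontend 2 -> 4, backend 4 -> 6 numbers from the example:

```python
def example_application_policy(metrics: dict, current: dict, bounds: dict) -> dict:
    """Hypothetical application-level policy.

    metrics: application-wide metrics, e.g. {"total_requests": 200}
    current: current replicas per deployment
    bounds:  (min_replicas, max_replicas) per deployment
    """
    total = metrics.get("total_requests", 0)
    extra = total // 100  # naive rule: +1 replica per deployment per 100 requests
    targets = {}
    for name, replicas in current.items():
        lo, hi = bounds[name]
        # Clamp each deployment to its own "Replicas allowed" bounds.
        targets[name] = max(lo, min(hi, replicas + extra))
    return targets

targets = example_application_policy(
    {"total_requests": 200},
    {"frontend": 2, "backend": 4},
    {"frontend": (1, 10), "backend": (2, 20)},
)
print(targets)  # {'frontend': 4, 'backend': 6}
```

Because one decision covers multiple deployments, the status output above logs a single scaling-decision line listing both deployments, with the per-deployment replica details broken out underneath.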

Metadata

Assignees

Labels

P2 (important issue, but not time-critical), community-backlog,
docs (an issue or change related to documentation),
enhancement (request for new feature and/or capability),
observability (issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling),
serve (Ray Serve related issue), usability


Projects

Status

Todo
