[Umbrella][serve] Advanced Observability for Serve Autoscaler #55833
Description
This issue tracks the implementation of advanced observability for the Serve Autoscaler,
as proposed in #41135 (comment) and specified in detail in this design document.
The goal is to make it easier to debug scaling behavior by exposing structured logs, metrics, and detailed CLI outputs (serve status -v).
This work depends on the ongoing implementation of the Serve custom autoscaler (deployment-level, application-level, and external scaler).
Each observability feature builds on top of the corresponding autoscaler logic, so the sub-issues should be tackled in order: Skeleton -> Deployment -> Application -> External -> Docs.
Sub-issues
- 1. [WIP] [Serve][Autoscaler] Add Skeleton CLI and Backend API for serve status -v #55834
  - Backend API PR: [Serve][1/N] Add autoscaler observability core API schema #55919
  - CLI
- 2. Integrate Deployment-level autoscaling metrics and decision history PR: [Serve][2/N] Add deployment-level autoscaling snapshot and event summarizer #56225
- 3. [WIP] Support Application-level custom policy observability PR: [Serve][3/N] Add application-level autoscaling snapshot #59995
- 4. Add External scaler observability
- 5. Update docs with examples of serve status -v outputs, error cases, and troubleshooting
Use case
The current serve status command only shows basic information such as replica counts and health.
As custom autoscaling (deployment-level, application-level, external scalers) becomes available, users need more detailed visibility to understand why scaling decisions are made.
serve status -v will let users:
- See scaling decisions and the policies/metrics that triggered them.
- Check metrics freshness (normal vs. delayed).
- Understand errors or abnormal events during autoscaler operation.
- Track application-level scaling when multiple deployments scale together.
- Debug external scaler behavior, e.g. webhook response codes and delivery history.
This extended visibility is essential for debugging complex autoscaling behavior and building confidence in custom scaling logic.
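The metrics freshness check listed above (normal vs. delayed) could be modeled roughly as follows. This is a minimal sketch, not the Serve implementation: the function name `classify_metric_freshness`, the `delay_threshold_s` parameter, and the rule that an update counts as delayed once its age reaches the threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def classify_metric_freshness(
    last_update: datetime,
    now: datetime,
    delay_threshold_s: float,
) -> str:
    """Build a status line shaped like the 'Metric collection' field.

    Hypothetical helper: the threshold rule is an assumption, not Serve's logic.
    """
    age_s = (now - last_update).total_seconds()
    # Assumed rule: an update at least `delay_threshold_s` old is "delayed".
    status = "delayed" if age_s >= delay_threshold_s else "healthy"
    return f"Metric collection: {status} (last update {int(age_s)}s ago)"


now = datetime(2025, 8, 19, 15, 5, 30, tzinfo=timezone.utc)
fresh = classify_metric_freshness(now - timedelta(seconds=5), now, delay_threshold_s=30)
stale = classify_metric_freshness(now - timedelta(seconds=30), now, delay_threshold_s=30)
print(fresh)
print(stale)
```

Passing `now` explicitly instead of reading the clock inside the function keeps the check deterministic and easy to unit test.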
Example Output (from RFC)
$ serve status -v
Example 1: Deployment using Default Autoscaling Policy (queue-length based)
======== Serve Autoscaler status: 2025-08-19T15:05:30Z ========
Deployment status
---------------------------------------------------------------
deployment_default_policy:
  Current replicas: 3
  Target replicas: 5
  Replicas allowed: min=1, max=10
  Scaling status: scaling up
  Scaling decisions:
    2025-08-19T14:00:00Z - scaled down from 5 -> 3 (low traffic)
    2025-08-19T15:05:00Z - scaled up from 3 -> 5 (12 requests queued)
  Policy: Default (queue-length based)
  Metrics (look_back_period_s=30):
    queued_requests: 12
  Metric collection: delayed (last update 30s ago)
  Errors: (none)
Example 2: Deployment using a Custom Autoscaling Policy (latency-based)
======== Serve Autoscaler status: 2025-08-19T12:10:00Z ========
Deployment status
---------------------------------------------------------------
deployment_custom_latency_policy:
  Current replicas: 8
  Target replicas: 8
  Replicas allowed: min=1, max=20
  Scaling status: stable
  Scaling decisions:
    2025-08-19T11:30:00Z - scaled up from 2 -> 4 (cpu_usage_percent 85% > 80%)
    2025-08-19T11:50:00Z - scaled up from 4 -> 8 (latency_p95_ms 450ms > 300ms)
  Policy: Custom (my_custom_policy)
  Metrics (look_back_period_s=60):
    latency_p95_ms: 450.0
    cpu_usage_percent: 62.5
  Metric collection: healthy (last update 5s ago)
  Errors:
    2025-08-19T12:05:00Z - PolicyError: Exception in user policy (ZeroDivisionError) – scaling skipped
Example 3: Deployment using an External Webhook Scaler
======== Serve Autoscaler status: 2025-08-19T04:12:00Z ========
Deployment status
---------------------------------------------------------------
deployment_webhook_policy:
  Current replicas: 5
  Target replicas: 3
  Replicas allowed: min=0, max=10
  Scaling status: scaling down
  Scaling decisions:
    2025-08-19T03:59:00Z - scaled up from 3 -> 5 (external scaler: cpu_usage_percent 92% > 90%)
    2025-08-19T04:10:00Z - scaled down from 5 -> 3 (external scaler: cpu_usage_percent 5% < 10%)
  Policy: External (external scaler)
  Metrics: n/a (decisions made externally)
  Metric collection: healthy (last update 2s ago)
  Webhook history:
    2025-08-19T03:59:01Z - scale up to 5 replicas (200 OK)
    2025-08-19T04:10:01Z - scale down to 3 replicas (500 ERROR)
  Errors: (none)
Example 4: Application using a Custom Application-Level Policy
======== Serve Autoscaler status: 2025-08-20T10:00:00Z ========
Application status
---------------------------------------------------------------
application_default_policy:
  Scaling status: scaling up
  Policy: Custom (example_application_policy)
  Scaling decisions:
    2025-08-20T09:55:00Z - scaled up frontend: 2 -> 4, backend: 4 -> 6 (total_requests=200)
  Metrics (look_back_period_s=45):
    total_requests: 200
  Errors: (none)
  Deployments:
    frontend:
      Current replicas: 4
      Target replicas: 4
      Replicas allowed: min=1, max=10
    backend:
      Current replicas: 6
      Target replicas: 6
      Replicas allowed: min=2, max=20
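
To illustrate how a structured snapshot might map onto the deployment-level text output in the examples above, here is a minimal sketch. The `ScalingDecision` and `DeploymentSnapshot` classes, their field names, and `render_deployment` are hypothetical and are not the schema from the linked PRs. Deriving the scaling status by comparing target and current replicas is likewise an assumption that happens to match Examples 1 to 3.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScalingDecision:
    # Hypothetical record of one autoscaling decision.
    timestamp: str
    prev_replicas: int
    new_replicas: int
    reason: str


@dataclass
class DeploymentSnapshot:
    # Hypothetical per-deployment view; field names are illustrative only.
    name: str
    current_replicas: int
    target_replicas: int
    min_replicas: int
    max_replicas: int
    decisions: List[ScalingDecision] = field(default_factory=list)

    @property
    def scaling_status(self) -> str:
        # Assumed rule: status follows from target vs. current replica count.
        if self.target_replicas > self.current_replicas:
            return "scaling up"
        if self.target_replicas < self.current_replicas:
            return "scaling down"
        return "stable"


def render_deployment(snapshot: DeploymentSnapshot) -> str:
    """Render one deployment block in a layout similar to the examples."""
    lines = [
        f"{snapshot.name}:",
        f"  Current replicas: {snapshot.current_replicas}",
        f"  Target replicas: {snapshot.target_replicas}",
        f"  Replicas allowed: min={snapshot.min_replicas}, max={snapshot.max_replicas}",
        f"  Scaling status: {snapshot.scaling_status}",
        "  Scaling decisions:",
    ]
    for d in snapshot.decisions:
        direction = "up" if d.new_replicas > d.prev_replicas else "down"
        lines.append(
            f"    {d.timestamp} - scaled {direction} from "
            f"{d.prev_replicas} -> {d.new_replicas} ({d.reason})"
        )
    return "\n".join(lines)


snapshot = DeploymentSnapshot(
    name="deployment_default_policy",
    current_replicas=3,
    target_replicas=5,
    min_replicas=1,
    max_replicas=10,
    decisions=[
        ScalingDecision("2025-08-19T14:00:00Z", 5, 3, "low traffic"),
        ScalingDecision("2025-08-19T15:05:00Z", 3, 5, "12 requests queued"),
    ],
)
rendered = render_deployment(snapshot)
print(rendered)
```

Keeping the snapshot as plain data and the rendering as a separate function would let the same structure back both the CLI text and structured logs or metrics, which is the split the sub-issues suggest (backend API first, CLI second).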