Description
This issue tracks the implementation of advanced observability for the Serve Autoscaler,
as proposed in #41135 (comment) and specified in detail in this design document.
The goal is to make it easier to debug scaling behavior by exposing structured logs, metrics, and detailed CLI outputs (serve status -v).
This work depends on the ongoing implementation of the Serve custom autoscaler (deployment-level, application-level, and external scaler).
Each observability feature builds on top of the corresponding autoscaler logic, so the sub-issues should be tackled in order: Skeleton -> Deployment -> Application -> External -> Docs.
Sub-issues
- serve status -v (#55834)
Use case
The current serve status command only shows basic information such as replica counts and health.
As custom autoscaling (deployment-level, application-level, external scalers) becomes available, users need more detailed visibility to understand why scaling decisions are made.
serve status -v will let users:
- See scaling decisions and the policies/metrics that triggered them.
- Check metrics freshness (normal vs. delayed).
- Understand errors or abnormal events during autoscaler operation.
- Track application-level scaling when multiple deployments scale together.
- Debug external scaler behavior, e.g. webhook response codes and delivery history.
This extended visibility is essential for debugging complex autoscaling behavior and building confidence in custom scaling logic.
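The fields the bullets above ask for can be pictured as a small structured record. A minimal sketch, assuming hypothetical names throughout (none of these classes or fields are part of the Ray Serve API; they only illustrate the kind of data `serve status -v` would surface):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ScalingDecision:
    timestamp: str       # e.g. "2025-08-19T15:05:00Z"
    old_replicas: int
    new_replicas: int
    reason: str          # e.g. "12 requests queued"

@dataclass
class DeploymentAutoscalerStatus:
    # Hypothetical per-deployment record mirroring the verbose output fields.
    name: str
    current_replicas: int
    target_replicas: int
    min_replicas: int
    max_replicas: int
    scaling_status: str                      # "scaling up" / "scaling down" / "stable"
    policy: str                              # e.g. "Default (queue-length based)"
    decisions: list[ScalingDecision] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)
    look_back_period_s: Optional[int] = None
    last_metric_update_s_ago: Optional[float] = None
    errors: list[str] = field(default_factory=list)

    def summary(self) -> str:
        # One-line digest in the spirit of the verbose output.
        direction = "->" if self.target_replicas != self.current_replicas else "="
        return (f"{self.name}: {self.current_replicas} {direction} "
                f"{self.target_replicas} ({self.scaling_status})")

status = DeploymentAutoscalerStatus(
    name="deployment_default_policy", current_replicas=3, target_replicas=5,
    min_replicas=1, max_replicas=10, scaling_status="scaling up",
    policy="Default (queue-length based)")
print(status.summary())  # deployment_default_policy: 3 -> 5 (scaling up)
```

An application-level status would hold one such record per deployment, plus its own decisions and metrics, matching Example 4 below.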
Example Output (from RFC)
```
$ serve status -v
```

Example 1: Deployment using Default Autoscaling Policy (queue-length based)

```
======== Serve Autoscaler status: 2025-08-19T15:05:30Z ========
Deployment status
---------------------------------------------------------------
deployment_default_policy:
  Current replicas: 3
  Target replicas: 5
  Replicas allowed: min=1, max=10
  Scaling status: scaling up
  Scaling decisions:
    2025-08-19T14:00:00Z - scaled down from 5 -> 3 (low traffic)
    2025-08-19T15:05:00Z - scaled up from 3 -> 5 (12 requests queued)
  Policy: Default (queue-length based)
  Metrics (look_back_period_s=30):
    queued_requests: 12
  Metric collection: delayed (last update 30s ago)
  Errors: (none)
```

Example 2: Deployment using a Custom Autoscaling Policy (latency-based)

```
======== Serve Autoscaler status: 2025-08-19T12:10:00Z ========
Deployment status
---------------------------------------------------------------
deployment_custom_latency_policy:
  Current replicas: 8
  Target replicas: 8
  Replicas allowed: min=1, max=20
  Scaling status: stable
  Scaling decisions:
    2025-08-19T11:30:00Z - scaled up from 2 -> 4 (cpu_usage_percent 85% > 80%)
    2025-08-19T11:50:00Z - scaled up from 4 -> 8 (latency_p95_ms 450ms > 300ms)
  Policy: Custom (my_custom_policy)
  Metrics (look_back_period_s=60):
    latency_p95_ms: 450.0
    cpu_usage_percent: 62.5
  Metric collection: healthy (last update 5s ago)
  Errors:
    2025-08-19T12:05:00Z - PolicyError: Exception in user policy (ZeroDivisionError) - scaling skipped
```

Example 3: Deployment using an External Webhook Scaler

```
======== Serve Autoscaler status: 2025-08-19T04:12:00Z ========
Deployment status
---------------------------------------------------------------
deployment_webhook_policy:
  Current replicas: 5
  Target replicas: 3
  Replicas allowed: min=0, max=10
  Scaling status: scaling down
  Scaling decisions:
    2025-08-19T03:59:00Z - scaled up from 3 -> 5 (external scaler: cpu_usage_percent 92% > 90%)
    2025-08-19T04:10:00Z - scaled down from 5 -> 3 (external scaler: cpu_usage_percent 5% < 10%)
  Policy: External (external scaler)
  Metrics: n/a (decisions made externally)
  Metric collection: healthy (last update 2s ago)
  Webhook history:
    2025-08-19T03:59:01Z - scale up to 5 replicas (200 OK)
    2025-08-19T04:10:01Z - scale down to 3 replicas (500 ERROR)
  Errors: (none)
```

Example 4: Application using a Custom Application-Level Policy

```
======== Serve Autoscaler status: 2025-08-20T10:00:00Z ========
Application status
---------------------------------------------------------------
application_default_policy:
  Scaling status: scaling up
  Policy: Custom (example_application_policy)
  Scaling decisions:
    2025-08-20T09:55:00Z - scaled up frontend: 2 -> 4, backend: 4 -> 6 (total_requests=200)
  Metrics (look_back_period_s=45):
    total_requests: 200
  Errors: (none)
  Deployments:
    frontend:
      Current replicas: 4
      Target replicas: 4
      Replicas allowed: min=1, max=10
    backend:
      Current replicas: 6
      Target replicas: 6
      Replicas allowed: min=2, max=20
```
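The `Metric collection: healthy/delayed` lines in the examples imply some freshness rule. A minimal sketch of one plausible rule, assuming (purely for illustration, not from the RFC) that metrics count as delayed once the last update is at least one full look-back window old:

```python
def classify_metric_freshness(last_update_s_ago: float,
                              look_back_period_s: float) -> str:
    """Render a 'Metric collection' status line.

    Assumed rule: metrics are "delayed" once the most recent update is
    at least as old as the look-back window, otherwise "healthy".
    """
    state = "delayed" if last_update_s_ago >= look_back_period_s else "healthy"
    return f"{state} (last update {last_update_s_ago:.0f}s ago)"

# Consistent with the example outputs above:
print(classify_metric_freshness(30, 30))  # delayed (last update 30s ago)
print(classify_metric_freshness(5, 60))   # healthy (last update 5s ago)
```

Whatever threshold the implementation settles on, surfacing it alongside `look_back_period_s` keeps the verbose output self-explanatory.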