
[WIP][Serve] Fix perf regression in autoscaling snapshot cache refresh#61611

Open
nadongjun wants to merge 4 commits into ray-project:master from nadongjun:fix/snapshot-cache-perf-regression
Conversation

Contributor

@nadongjun nadongjun commented Mar 10, 2026

Description

Fixes a controller performance regression at scale (2048 replicas: loop_duration +15%, handle_metrics_delay +28%) introduced by PR #56225.

Root causes:

  1. _refresh_autoscaling_deployments_cache() called list_deployment_details(), constructing O(N) ReplicaDetails Pydantic objects per deployment on every control loop iteration
  2. _create_deployment_snapshot() built a 14-field Pydantic DeploymentSnapshot on every autoscaling tick, regardless of whether scaling state had changed

Fixes:

  1. Replaced list_deployment_details() with O(1) get_deployment() lookups
  2. Replaced eager Pydantic construction with tuple key caching (~0.3us) + lazy DeploymentSnapshot construction (only when scaling state changes)
  3. Separated _emit try/except from application state update to prevent error masking
  4. Moved cache refresh after ASM timer to avoid polluting application_state_update_duration_s metrics
  5. Removed dead code (_create_deployment_snapshot, is_scaling_equivalent)
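The tuple-key cache plus lazy construction described in fix 2 can be illustrated with a small self-contained sketch. The names below (`ScalingState`, `snapshot_key`, `emit_if_changed`) are hypothetical stand-ins for the Ray Serve internals and do not match the real identifiers; the point is the pattern: compare a cheap tuple each tick, and only build the expensive object when the tuple changes.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for DeploymentAutoscalingState; field names are
# illustrative, not the actual Ray Serve attributes.
@dataclass
class ScalingState:
    app: str
    deployment: str
    current_replicas: int
    target_replicas: int

    def snapshot_key(self) -> Tuple:
        # Cheap tuple (~sub-microsecond) of the fields that define
        # "scaling state" for change detection.
        return (self.app, self.deployment,
                self.current_replicas, self.target_replicas)

    def build_snapshot(self) -> Dict:
        # Expensive construction (a 14-field Pydantic model in the real
        # code), deferred until the key shows the state actually changed.
        return {"app": self.app, "deployment": self.deployment,
                "current": self.current_replicas,
                "target": self.target_replicas}

_last_keys: Dict[str, Tuple] = {}

def emit_if_changed(dep_id: str, state: ScalingState) -> Optional[Dict]:
    key = state.snapshot_key()
    if key == _last_keys.get(dep_id):
        return None  # unchanged: skip snapshot construction entirely
    _last_keys[dep_id] = key
    return state.build_snapshot()
```

On a steady-state tick the only per-deployment cost is the tuple build and dictionary compare, which is why the per-tick overhead no longer scales with replica count.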

Related issues

Re-lands #56225 (reverted in #61557) with performance fix.

Additional information

Benchmark

bench_fix_validation.py

Measured per-tick cost by importing actual Ray Serve modules (DeploymentAutoscalingState, get_decision_num_replicas(), etc.) without requiring a cluster. bench_fix_validation.py compares three code paths:

A) No snapshot (baseline) — autoscale logic only, no snapshot creation:

ctx = state.get_autoscaling_context(n)
decision, _ = state._policy(ctx)
decision = state.apply_bounds(decision)

B) Original PR #56225 — Pydantic DeploymentSnapshot created every tick:

ctx = state.get_autoscaling_context(n)
decision, _ = state._policy(ctx)
decision = state.apply_bounds(decision)
# 14-field Pydantic model constructed every tick (includes time.strftime, enum lookup)
DeploymentSnapshot(
    timestamp_str=time.strftime(...), app=..., deployment=...,
    current_replicas=..., target_replicas=..., min_replicas=...,
    max_replicas=..., scaling_status=..., policy_name=...,
    look_back_period_s=..., queued_requests=..., ongoing_requests=...,
    metrics_health=..., errors=...,
)

C) This fix — tuple cache + lazy Pydantic:

# Inside get_decision_num_replicas(): only caches tuples
state.get_decision_num_replicas(curr_target_num_replicas=n)
# In _emit(): tuple key comparison, Pydantic only built on state change
key = state.get_cached_snapshot_key()
if key != last_keys.get(dep_id):
    snap = state.get_deployment_snapshot()  # lazy Pydantic construction
    snap.dict(exclude_none=True)

create_state(n_replicas) creates a DeploymentAutoscalingState at 1–2048 replica scale. Each path is timed with time.perf_counter_ns() over 200 warmup iterations followed by 2000 measured iterations:

  Replicas   A) No snapshot   B) Original #56225   C) This fix    B vs C
  --------   --------------   ------------------   -----------    ------
         1           3.2us               16.9us          7.0us     2.43x
        64           3.2us               16.5us          6.9us     2.38x
      1024           3.3us               17.0us          7.3us     2.34x
      2048           3.4us               17.1us          7.3us     2.35x
  • Original PR overhead vs baseline: +405% (17.1us vs 3.4us)
  • This fix overhead vs baseline: +115% (7.3us vs 3.4us, ~4us added)
  • 2.35x faster than original PR, consistent across all replica counts
  • O(1) per deployment: cost does not scale with replica count
  • 4us added cost is < 0.01% of controller loop duration at 2048 replicas
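The timing loop used above can be sketched as follows. This is a minimal, self-contained harness under the stated methodology (warmup then measured iterations via time.perf_counter_ns); the real bench_fix_validation.py drives the actual Ray Serve code paths rather than the toy workload shown here.

```python
import statistics
import time
from typing import Callable

def bench(fn: Callable[[], None], warmup: int = 200, iters: int = 2000) -> float:
    """Return the median per-call cost of fn in microseconds."""
    for _ in range(warmup):
        fn()  # warm caches and interned objects before measuring
    samples = []
    for _ in range(iters):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return statistics.median(samples) / 1000.0  # ns -> us

if __name__ == "__main__":
    # Toy workload standing in for path A/B/C above.
    cost_us = bench(lambda: sum(range(100)))
    print(f"{cost_us:.2f}us per call")
```

Using the median rather than the mean keeps one-off scheduler hiccups from skewing the per-tick numbers.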

Test results

All 11 tests in test_controller.py pass, including 4 snapshot-specific tests.

nadongjun and others added 2 commits March 10, 2026 10:30
…arizer (ray-project#56225)


This PR introduces deployment-level autoscaling observability in Serve.
The controller now emits a single, structured JSON log line
(serve_autoscaling_snapshot) per autoscaling-enabled deployment each
control-loop tick.

This avoids recomputation in the controller call sites and provides a
stable, machine-parsable surface for tooling and debugging.

- Add get_observability_snapshot in AutoscalingState and manager wrapper
to generate compact snapshots (replica counts, queued/total requests,
metric freshness).
- Add ServeEventSummarizer to build payloads, reduce duplicate logs, and
summarize recent scaling decisions.

Logs can be found in controller log files, e.g.
`/tmp/ray/session_2025-09-03_21-12-01_095657_13385/logs/serve/controller_13474.log`.

```
serve_autoscaling_snapshot {"ts":"2025-09-04T06:12:11Z","app":"default","deployment":"worker","current_replicas":2,"target_replicas":2,"replicas_allowed":{"min":1,"max":8},"scaling_status":"stable","policy":"default","metrics":{"look_back_period_s":10.0,"queued_requests":0.0,"total_requests":0.0},"metrics_health":"ok","errors":[],"decisions":[{"ts":"2025-09-04T06:12:11Z","from":0,"to":2,"reason":"current=0, proposed=2"},{"ts":"2025-09-04T06:12:11Z","from":2,"to":2,"reason":"current=2, proposed=2"}]}
```
- Expose the same snapshot data via `serve status -v` and CLI/SDK
surfaces.
- Aggregate per-app snapshots and external scaler history.
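The serve_autoscaling_snapshot line shown above can be consumed by stripping the fixed prefix and parsing the remainder as JSON. A minimal sketch, using a shortened copy of the example payload (field names follow that payload; this is not an official Serve API):

```python
import json

# Shortened copy of the example log line from the commit message above.
line = ('serve_autoscaling_snapshot {"ts":"2025-09-04T06:12:11Z",'
        '"app":"default","deployment":"worker","current_replicas":2,'
        '"target_replicas":2,"replicas_allowed":{"min":1,"max":8},'
        '"scaling_status":"stable"}')

PREFIX = "serve_autoscaling_snapshot "
if line.startswith(PREFIX):
    snap = json.loads(line[len(PREFIX):])
    print(snap["deployment"], snap["scaling_status"],
          snap["current_replicas"], "of", snap["replicas_allowed"]["max"])
```

Because the payload is a single JSON object per line, tools like `grep` plus `jq` can filter controller logs the same way.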

- [x] I've signed off every commit (using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Co-authored-by: akyang-anyscale <alexyang@anyscale.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun nadongjun requested a review from a team as a code owner March 10, 2026 01:49
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a performance regression in the autoscaling snapshot cache refresh by replacing expensive object constructions with efficient lookups. The introduction of dedicated logging for autoscaling snapshots is a great addition for observability. The code is well-structured, and the new tests are comprehensive.

I have a few minor suggestions to improve maintainability and clarity:

  • Refactor duplicated code for updating timestamps in autoscaling_state.py.
  • Simplify the is_scaling_equivalent method in common.py by removing redundant checks.
  • Streamline the log parsing logic in the test file.

Overall, this is a solid improvement.

…aling tick

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



Labels

community-contribution (Contributed by the community), serve (Ray Serve Related Issue)
