
[WIP][Serve] Fix perf regression in autoscaling snapshot cache refresh#61611

Open
nadongjun wants to merge 4 commits into ray-project:master from nadongjun:fix/snapshot-cache-perf-regression
Conversation

Contributor

@nadongjun nadongjun commented Mar 10, 2026

Description

Fixes a controller performance regression at scale (2048 replicas: loop_duration +15%, handle_metrics_delay +28%) introduced by PR #56225.

Root causes:

  1. _refresh_autoscaling_deployments_cache() called list_deployment_details(), constructing O(N) ReplicaDetails Pydantic objects per deployment on every control loop iteration
  2. _create_deployment_snapshot() built a 14-field Pydantic DeploymentSnapshot on every autoscaling tick, regardless of whether scaling state had changed

Fixes:

  1. Replaced list_deployment_details() with O(1) get_deployment() lookups
  2. Replaced eager Pydantic construction with tuple key caching (~0.3us) + lazy DeploymentSnapshot construction (only when scaling state changes)
  3. Separated _emit try/except from application state update to prevent error masking
  4. Moved cache refresh after ASM timer to avoid polluting application_state_update_duration_s metrics
  5. Removed dead code (_create_deployment_snapshot, is_scaling_equivalent)
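The tuple-key cache plus lazy construction described in fix 2 can be illustrated with a small self-contained sketch. The names below (`ScalingState`, `snapshot_key`, `emit_if_changed`) are hypothetical stand-ins for the Ray Serve internals and do not match the real identifiers; the point is the pattern: compare a cheap tuple each tick, and only build the expensive object when the tuple changes.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical stand-in for DeploymentAutoscalingState; field names are
# illustrative, not the actual Ray Serve attributes.
@dataclass
class ScalingState:
    app: str
    deployment: str
    current_replicas: int
    target_replicas: int

    def snapshot_key(self) -> Tuple:
        # Cheap tuple (~sub-microsecond) of the fields that define
        # "scaling state" for change detection.
        return (self.app, self.deployment,
                self.current_replicas, self.target_replicas)

    def build_snapshot(self) -> Dict:
        # Expensive construction (a 14-field Pydantic model in the real
        # code), deferred until the key shows the state actually changed.
        return {"app": self.app, "deployment": self.deployment,
                "current": self.current_replicas,
                "target": self.target_replicas}

_last_keys: Dict[str, Tuple] = {}

def emit_if_changed(dep_id: str, state: ScalingState) -> Optional[Dict]:
    key = state.snapshot_key()
    if key == _last_keys.get(dep_id):
        return None  # unchanged: skip snapshot construction entirely
    _last_keys[dep_id] = key
    return state.build_snapshot()
```

On a steady-state tick the only per-deployment cost is the tuple build and dictionary compare, which is why the per-tick overhead no longer scales with replica count.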

Related issues

Re-lands #56225 (reverted in #61557) with performance fix.

Additional information

Benchmark

bench_fix_validation.py

Measured per-tick cost by importing actual Ray Serve modules (DeploymentAutoscalingState, get_decision_num_replicas(), etc.) without requiring a cluster. bench_fix_validation.py compares three code paths:

A) No snapshot (baseline) — autoscale logic only, no snapshot creation:

ctx = state.get_autoscaling_context(n)
decision, _ = state._policy(ctx)
decision = state.apply_bounds(decision)

B) Original PR #56225 — Pydantic DeploymentSnapshot created every tick:

ctx = state.get_autoscaling_context(n)
decision, _ = state._policy(ctx)
decision = state.apply_bounds(decision)
# 14-field Pydantic model constructed every tick (includes time.strftime, enum lookup)
DeploymentSnapshot(
    timestamp_str=time.strftime(...), app=..., deployment=...,
    current_replicas=..., target_replicas=..., min_replicas=...,
    max_replicas=..., scaling_status=..., policy_name=...,
    look_back_period_s=..., queued_requests=..., ongoing_requests=...,
    metrics_health=..., errors=...,
)

C) This fix — tuple cache + lazy Pydantic:

# Inside get_decision_num_replicas(): only caches tuples
state.get_decision_num_replicas(curr_target_num_replicas=n)
# In _emit(): tuple key comparison, Pydantic only built on state change
key = state.get_cached_snapshot_key()
if key != last_keys.get(dep_id):
    snap = state.get_deployment_snapshot()  # lazy Pydantic construction
    snap.dict(exclude_none=True)

create_state(n_replicas) creates a DeploymentAutoscalingState at 1–2048 replica scale. Each path is timed with time.perf_counter_ns() over 200 warmup iterations followed by 2000 measured iterations:

  Replicas   A) No snapshot   B) Original #56225   C) This fix    B vs C
  --------   --------------   ------------------   -----------    ------
         1           3.2us               16.9us          7.0us     2.43x
        64           3.2us               16.5us          6.9us     2.38x
      1024           3.3us               17.0us          7.3us     2.34x
      2048           3.4us               17.1us          7.3us     2.35x
  • Original PR overhead vs baseline: +405% (17.1us vs 3.4us)
  • This fix overhead vs baseline: +115% (7.3us vs 3.4us, ~4us added)
  • 2.35x faster than original PR, consistent across all replica counts
  • O(1) per deployment: cost does not scale with replica count
  • 4us added cost is < 0.01% of controller loop duration at 2048 replicas
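The timing loop used above can be sketched as follows. This is a minimal, self-contained harness under the stated methodology (warmup then measured iterations via time.perf_counter_ns); the real bench_fix_validation.py drives the actual Ray Serve code paths rather than the toy workload shown here.

```python
import statistics
import time
from typing import Callable

def bench(fn: Callable[[], None], warmup: int = 200, iters: int = 2000) -> float:
    """Return the median per-call cost of fn in microseconds."""
    for _ in range(warmup):
        fn()  # warm caches and interned objects before measuring
    samples = []
    for _ in range(iters):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return statistics.median(samples) / 1000.0  # ns -> us

if __name__ == "__main__":
    # Toy workload standing in for path A/B/C above.
    cost_us = bench(lambda: sum(range(100)))
    print(f"{cost_us:.2f}us per call")
```

Using the median rather than the mean keeps one-off scheduler hiccups from skewing the per-tick numbers.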

Test results

All 11 tests in test_controller.py pass, including 4 snapshot-specific tests.

nadongjun and others added 2 commits March 10, 2026 10:30
…arizer (ray-project#56225)


This PR introduces deployment-level autoscaling observability in Serve.
The controller now emits a single, structured JSON log line
(serve_autoscaling_snapshot) per autoscaling-enabled deployment each
control-loop tick.

This avoids recomputation in the controller call sites and provides a
stable, machine-parsable surface for tooling and debugging.

- Add get_observability_snapshot in AutoscalingState and manager wrapper
to generate compact snapshots (replica counts, queued/total requests,
metric freshness).
- Add ServeEventSummarizer to build payloads, reduce duplicate logs, and
summarize recent scaling decisions.

Logs can be found in controller log files, e.g.
`/tmp/ray/session_2025-09-03_21-12-01_095657_13385/logs/serve/controller_13474.log`.

```
serve_autoscaling_snapshot {"ts":"2025-09-04T06:12:11Z","app":"default","deployment":"worker","current_replicas":2,"target_replicas":2,"replicas_allowed":{"min":1,"max":8},"scaling_status":"stable","policy":"default","metrics":{"look_back_period_s":10.0,"queued_requests":0.0,"total_requests":0.0},"metrics_health":"ok","errors":[],"decisions":[{"ts":"2025-09-04T06:12:11Z","from":0,"to":2,"reason":"current=0, proposed=2"},{"ts":"2025-09-04T06:12:11Z","from":2,"to":2,"reason":"current=2, proposed=2"}]}
```
- Expose the same snapshot data via `serve status -v` and CLI/SDK
surfaces.
- Aggregate per-app snapshots and external scaler history.
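The serve_autoscaling_snapshot line shown above can be consumed by stripping the fixed prefix and parsing the remainder as JSON. A minimal sketch, using a shortened copy of the example payload (field names follow that payload; this is not an official Serve API):

```python
import json

# Shortened copy of the example log line from the commit message above.
line = ('serve_autoscaling_snapshot {"ts":"2025-09-04T06:12:11Z",'
        '"app":"default","deployment":"worker","current_replicas":2,'
        '"target_replicas":2,"replicas_allowed":{"min":1,"max":8},'
        '"scaling_status":"stable"}')

PREFIX = "serve_autoscaling_snapshot "
if line.startswith(PREFIX):
    snap = json.loads(line[len(PREFIX):])
    print(snap["deployment"], snap["scaling_status"],
          snap["current_replicas"], "of", snap["replicas_allowed"]["max"])
```

Because the payload is a single JSON object per line, tools like `grep` plus `jq` can filter controller logs the same way.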

- [x] I've signed off every commit (using the -s flag, i.e., `git commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Co-authored-by: akyang-anyscale <alexyang@anyscale.com>
Co-authored-by: Abrar Sheikh <abrar2002as@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun nadongjun requested a review from a team as a code owner March 10, 2026 01:49
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a performance regression in the autoscaling snapshot cache refresh by replacing expensive object constructions with efficient lookups. The introduction of dedicated logging for autoscaling snapshots is a great addition for observability. The code is well-structured, and the new tests are comprehensive.

I have a few minor suggestions to improve maintainability and clarity:

  • Refactor duplicated code for updating timestamps in autoscaling_state.py.
  • Simplify the is_scaling_equivalent method in common.py by removing redundant checks.
  • Streamline the log parsing logic in the test file.

Overall, this is a solid improvement.

…aling tick

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



Labels

community-contribution (Contributed by the community), serve (Ray Serve Related Issue)
