[Serve][3/N] Add application-level autoscaling snapshot#59995

Open
nadongjun wants to merge 19 commits into ray-project:master from nadongjun:serve-obsv-application

Conversation

@nadongjun
Contributor

Description

Add application-level autoscaling snapshot support for observability.

This PR extends the existing deployment-level autoscaling snapshot feature (PR #56225) to support application-level autoscaling. When an app-level autoscaling policy is configured, the controller now emits ApplicationSnapshot logs containing aggregated metrics across all deployments in the application.
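The aggregation described above can be pictured with a small, self-contained sketch. This is not the PR's code: `AppSnapshotSketch` and `summarize_app` are hypothetical names, the field names are borrowed from the sample log under "Additional information", the per-deployment inputs are made up, and the "scaling down" branch is an assumption (the sample only shows "scaling up" and "stable").

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class AppSnapshotSketch:
    # Hypothetical stand-in for the PR's ApplicationSnapshot; field names
    # are taken from the sample log, not from the actual implementation.
    app: str
    num_deployments: int
    total_current_replicas: int
    total_target_replicas: int
    scaling_status: str


def summarize_app(app: str, replicas: Dict[str, Tuple[int, int]]) -> AppSnapshotSketch:
    """Sum per-deployment (current, target) replica counts for one app."""
    total_current = sum(cur for cur, _ in replicas.values())
    total_target = sum(tgt for _, tgt in replicas.values())
    if total_current < total_target:
        status = "scaling up"
    elif total_current > total_target:
        status = "scaling down"  # assumed label; not shown in the sample log
    else:
        status = "stable"
    return AppSnapshotSketch(app, len(replicas), total_current, total_target, status)


# Two hypothetical deployments, each with 0 running and 1 target replica.
snap = summarize_app("app_snap_1767934578", {"d1": (0, 1), "d2": (0, 1)})
print(snap)
```

With these made-up inputs the totals line up with the first sample log entry: two deployments, zero current replicas, two target replicas, status "scaling up".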

Related issues

Related to #55833

Additional information

bash % cat /tmp/ray/session_latest/logs/serve/autoscaling_snapshot_6668.log 
{"asctime": "2026-01-09 13:56:19,481", "levelname": "INFO", "message": "{'snapshots': [{'snapshot_type': 'application', 'timestamp_str': '2026-01-09T04:56:19Z', 'app': 'app_snap_1767934578', 'num_deployments': 2, 'total_current_replicas': 0, 'total_target_replicas': 2, 'scaling_status': 'scaling up', 'policy_name': 'ray.serve.tests.test_controller.simple_app_policy_for_test', 'errors': []}]}", "filename": "controller.py", "lineno": 511, "process": 6668, "timestamp_ns": 1767934579481838000}
{"asctime": "2026-01-09 13:56:19,999", "levelname": "INFO", "message": "{'snapshots': [{'snapshot_type': 'application', 'timestamp_str': '2026-01-09T04:56:19Z', 'app': 'app_snap_1767934578', 'num_deployments': 2, 'total_current_replicas': 2, 'total_target_replicas': 2, 'scaling_status': 'stable', 'policy_name': 'ray.serve.tests.test_controller.simple_app_policy_for_test', 'errors': []}]}", "filename": "controller.py", "lineno": 511, "process": 6668, "timestamp_ns": 1767934579999085000}
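For anyone consuming these lines, note that the outer record is JSON while the `message` value looks like a Python dict repr (single-quoted keys). A minimal parsing sketch under that assumption follows; `parse_snapshot_line` is a hypothetical helper, not part of Ray, and the sample line is rebuilt in code with the same shape as the log above rather than read from disk.

```python
import ast
import json
from typing import Any, Dict, List


def parse_snapshot_line(line: str) -> List[Dict[str, Any]]:
    """Return the snapshot dicts embedded in one log line.

    Assumes the outer record is JSON and the "message" field is a Python
    literal (a dict repr), as in the sample output in this description.
    """
    record = json.loads(line)
    # literal_eval only accepts Python literals, so it does not execute code.
    payload = ast.literal_eval(record["message"])
    return payload["snapshots"]


# Rebuild a line shaped like the second sample entry (fields abbreviated).
message = str({
    "snapshots": [{
        "snapshot_type": "application",
        "app": "app_snap_1767934578",
        "num_deployments": 2,
        "total_current_replicas": 2,
        "total_target_replicas": 2,
        "scaling_status": "stable",
    }]
})
sample_line = json.dumps({"levelname": "INFO", "message": message, "process": 6668})

for snap in parse_snapshot_line(sample_line):
    print(snap["app"], snap["scaling_status"],
          snap["total_current_replicas"], "/", snap["total_target_replicas"])
```

If the log format changes so that `message` becomes nested JSON, the `ast.literal_eval` step would need to be replaced with a second `json.loads`.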

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun nadongjun requested a review from a team as a code owner January 9, 2026 05:39
@nadongjun nadongjun changed the title from "Serve obsv application" to "[Serve][3/N] Add application-level autoscaling snapshot" Jan 9, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request extends autoscaling observability to the application level by introducing ApplicationSnapshot logs. The changes are well-structured, reusing existing patterns from deployment-level snapshots, and include comprehensive tests. I've identified a couple of areas for improvement to enhance code clarity and correctness. Overall, this is a solid addition to Ray Serve's observability features.

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling), and community-contribution (Contributed by the community) labels Jan 9, 2026
@harshit-anyscale harshit-anyscale added the go (add ONLY when ready to merge, run all tests) label Jan 20, 2026
@harshit-anyscale
Contributor

@nadongjun can you please resolve the merge conflicts?

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun
Contributor Author

@harshit-anyscale Resolved and pushed. Thanks!

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
…name/deployment_name

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@harshit-anyscale
Contributor

@nadongjun the tests seem to be failing (link), can you take a look at them?

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@harshit-anyscale
Contributor

@nadongjun can you take a look at the failing anyscale docs builder step and the remaining unresolved comments? I think only a few are remaining. Let me know if you need any help, happy to jump in!

@nadongjun
Contributor Author

@harshit-anyscale Thanks for the follow-up.

Regarding the errors field: It’s not being populated yet, but I kept it in the data structure for future observability/error-tracking. I think it’s better for future-proofing, but let me know if you’d rather have it removed for now.

Also, the docs build error seems to be resolved in the latest CI.

Ready for another look!

@harshit-anyscale
Contributor

> I think it’s better for future-proofing, but let me know if you’d rather have it removed for now.

@nadongjun I think it's better to remove the errors field for now; we can add it later whenever we have a solid plan for it.

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun
Contributor Author

@harshit-anyscale Agreed. I've gone ahead and removed it as suggested.

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>

# Conflicts:
#	python/ray/serve/_private/common.py

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale (The issue is stale. It will be closed within 7 days unless there are further conversation) label Mar 12, 2026
@nadongjun
Contributor Author

ping

@github-actions github-actions bot added the unstale (A PR that has been marked unstale. It will not get marked stale again if this label is on it.) label and removed the stale (The issue is stale. It will be closed within 7 days unless there are further conversation) label Mar 13, 2026
@harshit-anyscale
Contributor

@nadongjun please fix the merge conflicts.

@nadongjun
Contributor Author

@harshit-anyscale I'm working on performance fixes in PR #61611. I'll resolve the conflicts and apply those improvements here once that PR is merged.

Labels

community-contribution: Contributed by the community
go: add ONLY when ready to merge, run all tests
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
serve: Ray Serve Related Issue
unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

2 participants