[Serve] Improve scalability of Serve DeploymentHandles #44784
Description
What happened + What you expected to happen
More context: https://ray-distributed.slack.com/archives/CNCKBBRJL/p1713194071772759
In a previous issue (#44226) I described our use of Ray Serve to create dynamic applications/deployments. Well, we're hosting a lot of models, and we just ran into this warning:
ray/python/ray/serve/_private/client.py, lines 456 to 464 at commit 9cb1dc9:

```python
if cache_key in self._evicted_handle_keys:
    logger.warning(
        "You just got a ServeHandle that was evicted from internal "
        "cache. This means you are getting too many ServeHandles in "
        "the same process, this will bring down Serve's performance. "
        "Please post a github issue at "
        "https://github.com/ray-project/ray/issues to let the Serve "
        "team to find workaround for your use case."
    )
```
So we’re doing pretty much exactly what this warns against: getting lots of handles in our ingress application, order one handle per deployed model per ingress replica, which right now is something like ~500 models * 10 ingress replicas. The ingress application routes requests to the model applications via those handles.
I see that `MAX_CACHED_HANDLES` is only 100, so we're definitely blowing past that in each replica:

```python
MAX_CACHED_HANDLES = 100
```
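To make the failure mode concrete, here's a simplified, illustrative stand-in for a size-bounded handle cache (this is NOT Serve's actual implementation, just a sketch of the eviction behavior): once more than 100 distinct handles are requested in one process, the least-recently-used entry is evicted, and re-requesting an evicted key is what triggers the warning above.

```python
from collections import OrderedDict

MAX_CACHED_HANDLES = 100  # mirrors the constant in Serve's client

class HandleCache:
    """Illustrative LRU cache; a simplified stand-in for Serve's handle cache."""

    def __init__(self, max_size=MAX_CACHED_HANDLES):
        self.max_size = max_size
        self._cache = OrderedDict()
        self.evicted_keys = set()

    def get(self, key, factory):
        """Return (handle, was_previously_evicted)."""
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as recently used
            return self._cache[key], False
        # Re-requesting an evicted key is the case that logs the warning.
        was_evicted = key in self.evicted_keys
        self._cache[key] = factory()
        if len(self._cache) > self.max_size:
            old_key, _ = self._cache.popitem(last=False)  # evict LRU entry
            self.evicted_keys.add(old_key)
        return self._cache[key], was_evicted
```

With ~500 models per ingress replica, every replica cycles through the cache many times over, so the warning fires constantly.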
We're also consuming a significant chunk of `CONTROLLER_MAX_CONCURRENCY` (15000), which I assume means that if we exceed 15k handles they'll suddenly stop working: https://github.com/ray-project/ray/blob/9cb1dc9e682a087a32f47838fa02ca35f9b1b6ba/python/ray/serve/_private/constants.py#L94C1-L94C27
What we actually observed is that serve.get_app_handle in our ingress application got really slow. Seems like the Serve controller was too busy to respond to the two .remote calls that get_app_handle makes to the controller? (See #44782 for some discussion around making those calls async.)
In the short term, we're looking at creating DeploymentHandles manually (without going through get_app_handle), because we already know the application and deployment name to target and don't need to ask the controller anything. That resolves the initial latency of getting the handles, but doesn't fix the problem of the controller getting bogged down with all the tasks that scale with the number of handles (listen_for_change and record_handle_metrics). The concurrency limits in the Serve Controller will also put a hard block on our ability to scale the number of dynamic apps/deployments we're hosting.
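Here's a minimal sketch of the short-term workaround we're planning: cache one handle per (app, deployment) pair in the ingress replica so each handle is constructed exactly once instead of calling serve.get_app_handle on the request path. The handle construction itself is injected as a factory so this sketch runs without a cluster; in production it would be something like constructing `ray.serve.handle.DeploymentHandle` directly, which is a private API and an assumption on our part.

```python
# Process-local cache of handles, keyed by (app_name, deployment_name).
_handle_cache = {}

def get_cached_handle(app_name, deployment_name, make_handle):
    """Construct each handle at most once per process.

    `make_handle` is a callable (app_name, deployment_name) -> handle.
    In production it would wrap Serve's handle construction, e.g.
    (hypothetical, private API): DeploymentHandle(deployment_name, app_name).
    It is injectable here so the sketch can be exercised without Ray.
    """
    key = (app_name, deployment_name)
    if key not in _handle_cache:
        _handle_cache[key] = make_handle(app_name, deployment_name)
    return _handle_cache[key]
```

This avoids the two controller `.remote` calls on every lookup, but as noted above it doesn't reduce the per-handle background load on the controller.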
What we expected to happen is that it shouldn't matter how many handles we make. That expectation was wrong, because the handles need some state from the controller to do their scheduling, but hopefully this can scale more efficiently than it does right now.
Versions / Dependencies
Ray 2.9.3, though it looks like this didn't change in Ray 2.10.x
Python 3.10.x
Reproduction script
Working on this, but TL;DR: create a lot of apps/deployments (>100), get a lot (>100) of handles to them in the same process, and observe the load on the Serve Controller.
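An untested sketch of that reproduction, under the assumption that `serve.run(..., name=..., route_prefix=...)` and `serve.get_app_handle(...)` behave as in Ray 2.9.x (the Ray imports are deferred into the function so the file can be read without Ray installed):

```python
NUM_APPS = 150  # anything > MAX_CACHED_HANDLES (100) should trigger evictions

def reproduce():
    # Deferred imports: requires a Ray installation and cluster to actually run.
    import ray
    from ray import serve

    @serve.deployment
    class Echo:
        async def __call__(self, x):
            return x

    ray.init()
    for i in range(NUM_APPS):
        serve.run(Echo.bind(), name=f"app_{i}", route_prefix=f"/app_{i}")

    # One handle per app in a single process: blows past the handle cache and
    # adds per-handle listen_for_change / record_handle_metrics load on the
    # controller.
    return {i: serve.get_app_handle(f"app_{i}") for i in range(NUM_APPS)}

if __name__ == "__main__":
    reproduce()
```

While this runs, watch the controller's CPU usage and the client logs for the "evicted from internal cache" warning.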
Issue Severity
High: It blocks me from completing my task.