[Serve] Allow Serve+KubeRay to handle dynamically-created Serve applications smoothly #44226
Description
This ticket tracks issues originally discussed at https://ray-distributed.slack.com/archives/CNCKBBRJL/p1710792087920309; I've copied some of the conversation here to summarize the issue and potential solutions.
The way we’re using Serve is that we have two “static” Serve applications which are in our start.yaml/KubeRay RayService CR config. One is an HTTP ingress (and is marked as an ingress app), the other is a controller.
- The controller reconciles between Ray Serve and our model registry by dynamically creating and deleting Serve apps (i.e., we call `serve.run(...)` on a `Deployment` with dynamically-set `options` and `bind` parameters) for the models in the registry.
- The HTTP ingress is the public API for our clients, and it routes requests to those Serve apps (via `serve.get_app_handle(...).remote(...)`).
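For illustration, the static half of this setup might look roughly like the following `serveConfigV2` fragment on the RayService CR (the app names and import paths below are made up for the sketch, not our actual config):

```yaml
serveConfigV2: |
  applications:
    - name: http_ingress          # hypothetical: the ingress app
      route_prefix: /
      import_path: ingress.app
    - name: model_controller      # hypothetical: the reconciling controller
      route_prefix: /-/controller
      import_path: controller.app
```

Everything else (one Serve app per model in the registry) is created at runtime by the controller and never appears in this config.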
This obviously isn’t the standard way of using either Serve 1.x or 2.x, but it was working well under 1.x (we were using `Deployment.deploy(...)` and whatnot instead), and it’s working under 2.x, except for some edge cases which are described below.
Problem 1: because we can’t have bare deployments under 2.x anymore (without them being attached to apps), whenever the KubeRay operator does a `PUT /api/serve/applications/` to deploy the Serve config, the logic here deletes all other applications - namely, the dynamic ones that our controller created!
KubeRay doesn’t only send that PUT during Ray cluster startup: it also sends it whenever the KubeRay operator restarts, because the state of “have I sent a PUT to this cluster yet” is stored in memory here (and here when the RayService changes). This is what triggers the problem for us (our KubeRay operator restarted and caused some downtime because it deleted all of our deployed models, which is how we found out about this problem).
What’s tricky here is that I don’t think Serve or KubeRay are doing anything wrong here (except maybe that the “have I sent a PUT” cache shouldn’t be in-memory). All of the above is exactly how a reconciling/idempotent operator flow should work, given how the Serve 2.x API treats apps. But it has this unfortunate interaction with dynamic applications.
Our immediate workaround is that we’re hacking our KubeRay operator so that it only sends the PUT in this case, i.e., on startup/crash. So existing clusters won’t pick up any changes in the Serve config on the RayService, but that’s fine because we always roll a new container image version when we change those anyway, so the new cluster will get the new config.
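To make the failure mode concrete, here is a toy Python simulation (not KubeRay or Serve code; all class and method names are invented for the sketch) of why an in-memory "already sent the PUT" cache wipes dynamically-created apps when the operator restarts:

```python
# Toy model: the declarative PUT /api/serve/applications/ replaces the
# *entire* set of Serve apps with whatever is in the static config.

class ServeCluster:
    """Stands in for the Serve controller's declarative applications API."""
    def __init__(self):
        self.apps = {}

    def put_applications(self, config_apps):
        # Declarative semantics: apps absent from the config are deleted.
        self.apps = {name: "RUNNING" for name in config_apps}

    def run(self, name):
        # Imperative serve.run(...): adds an app outside the static config.
        self.apps[name] = "RUNNING"


class Operator:
    """Stands in for the KubeRay operator's reconcile loop."""
    def __init__(self, static_config):
        self.static_config = static_config
        self.sent_put = False  # in-memory only: lost on operator restart

    def reconcile(self, cluster):
        if not self.sent_put:
            cluster.put_applications(self.static_config)
            self.sent_put = True


static = ["http_ingress", "model_controller"]
cluster = ServeCluster()

op = Operator(static)
op.reconcile(cluster)
cluster.run("model_a")           # controller deploys a model dynamically
assert "model_a" in cluster.apps

op = Operator(static)            # operator restarts; the cache is gone
op.reconcile(cluster)            # ...so it re-sends the PUT...
assert "model_a" not in cluster.apps  # ...which deletes the dynamic app
```

Both actors behave correctly in isolation; the data loss comes purely from the combination of declarative PUT semantics and the non-persistent cache.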
Problem 2:
When KubeRay is checking whether Serve is ready, it checks the status of all applications that it gets back from the dashboard API, even ones that aren’t in the static RayService config (see this loop). We want KubeRay to ignore these - we’re managing the state of those applications ourselves, and we don’t want one failing to take down the whole cluster (e.g., if a user uploads a malformed model to the registry, we’ll just fail to load it and let the rest of the models keep running).
In particular, we were hitting these two blocks in KubeRay v1.0.0 https://github.com/ray-project/kuberay/blob/dcb5d3daa66175b1ff61f5ef4af90f4b44de2916/ray-operator/controllers/ray/rayservice_controller.go#L819-L842. If a single dynamic deployment was permanently unhealthy, it would mark the cluster as unhealthy in the first block, and unready in the second block. To fix that we patched that whole loop to only consider Serve apps that are present in the static config in the CR.
We’re patching our KubeRay so that it only considers the apps defined in the static RayService config there, which again fixes the problem for us - maybe another thing that could be an option in upstream KubeRay? Again, we’d be happy to contribute the code for this.
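The patched readiness logic amounts to something like the following (a Python sketch of what the Go loop does after our patch; function and status names are ours, not KubeRay's):

```python
# Sketch: only apps named in the static RayService config contribute to
# cluster readiness; dynamically-created apps are ignored, since their
# health is tracked separately by our model controller.

def cluster_ready(app_statuses: dict[str, str], static_apps: set[str]) -> bool:
    """app_statuses: app name -> status reported by the Serve dashboard API."""
    return all(
        status == "RUNNING"
        for name, status in app_statuses.items()
        if name in static_apps  # skip dynamic apps entirely
    )

statuses = {
    "http_ingress": "RUNNING",
    "model_controller": "RUNNING",
    "model_xyz": "DEPLOY_FAILED",  # a malformed user-uploaded model
}
# A broken dynamic app no longer marks the whole cluster unready:
assert cluster_ready(statuses, {"http_ingress", "model_controller"})
```

With this filter, a permanently unhealthy dynamic app can no longer trip the unhealthy/unready blocks linked above.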
I believe it was `ServiceUnhealthySecondThreshold`, but also just the readiness check (though that didn’t take the cluster down, it just would have prevented rolling out new clusters).
Would you mind sharing the use cases of dynamic Serve apps?
Our use case is that we want to load models from our model registry into Serve apps (1:1), then call them dynamically from our router ingress. We don’t know how many or what types of models (model types are 1:1 with `Deployment` types), etc. beforehand, so AFAIK this can’t be expressed through the static serve config on the CR, even if we had a separate controller service updating it, since we’re doing things like passing `.bind` parameters created at runtime.
But even if we could express it statically, we wouldn’t want to, because we also need KubeRay to not care about these dynamic apps, because their health or readiness does not indicate overall cluster health or readiness - we track that ourselves in our model controller and use its health/readiness to aggregate readiness information from the dynamic apps in a more flexible way. As an example, if a user of our platform uploads a malformed model that we fail to load, we want to still consider the Ray cluster and services as healthy/ready, because only that particular model is bad.
After discussing with @edoakes and @kevin85421 on Slack, here are the proposed changes:
- Add a new configuration option `update_serve_config_in_place` to the `RayService` CRD. When `true` (the default), KubeRay does what it does currently. When `false`, if the `serveConfigV2` field changes, KubeRay will instead roll an entirely new `RayCluster` to apply those changes. This resolves our Problem 1. I can contribute this change in KubeRay.
- @edoakes suggested that Serve could have an option in the config API that determines whether it will touch applications created by `serve.run`. Whether a Serve app was created via `serve.run` would also be available in the Serve status API. @edoakes offered to do this change (I'm not very familiar with the Serve internals, so it would be harder for me to contribute there, though I'll note that it might be better to have an enum that describes the "source" of the app). Once that Serve change is made, KubeRay can have another new configuration option that makes it (A) not touch applications created by `serve.run` when it sends config updates, and (B) not consider applications created by `serve.run` when determining readiness. This would solve Problem 2. I can contribute this change in KubeRay.
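The "source" enum idea might look something like this (purely hypothetical sketch; none of these names exist in Serve's API today):

```python
from enum import Enum

class AppSource(Enum):
    """Hypothetical marker for how a Serve app was created."""
    DECLARATIVE = "declarative"  # via the REST config API (KubeRay's PUT)
    IMPERATIVE = "imperative"    # via serve.run(...)

def apps_to_manage(app_sources: dict[str, AppSource]) -> set[str]:
    # KubeRay would only update/delete and health-check these apps,
    # leaving imperatively-created ones alone.
    return {n for n, s in app_sources.items() if s is AppSource.DECLARATIVE}

sources = {
    "http_ingress": AppSource.DECLARATIVE,
    "model_a": AppSource.IMPERATIVE,
}
assert apps_to_manage(sources) == {"http_ingress"}
```

An enum (rather than a boolean "created by serve.run" flag) leaves room for additional sources later without another status-API change.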
TBD: if we do solution 2, do we even need solution 1? I haven't thought it through yet, but it seems that if KubeRay can be configured to never touch dynamic apps, that would resolve both problems in one go.
TBD: for solution 2, what will the exact new API in Serve be so that KubeRay can uptake it?