[Data] [LLM] Allow vLLM deployments to be shared by sequential processors #52277
Description
Allow sequential Ray Data processor steps to optionally reuse an existing vLLM deployment.
Use case
Sequential batch inference (e.g., using the output of one LLM completion as the prompt for a second LLM request) is easy to express with the Ray Data LLM API by defining multiple processors. Often, however, the only thing that changes between sequential steps is the prompt. In that case it would be ideal to reuse the existing vLLM deployment rather than create a new instance. Because Ray Data currently manages mapper state and resources for each stage separately, this is not supported. The main cost is resources: every sequential processing step requires its own dedicated vLLM deployment, so workloads with many sequential steps become much more expensive than necessary.
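The resource issue can be illustrated with a minimal, self-contained sketch. `EngineStub`, `run_stage`, and the instance accounting below are hypothetical stand-ins for a vLLM deployment and a processor stage, not Ray APIs; the point is only to show that sharing one engine halves the number of deployments for two stages:

```python
class EngineStub:
    """Stand-in for a vLLM deployment; counts how many engines get created."""
    instances = 0

    def __init__(self):
        EngineStub.instances += 1

    def generate(self, prompt):
        return f"completion({prompt})"


def run_stage(rows, prompt_fn, engine=None):
    """Stand-in for one processor stage: builds an engine unless one is shared."""
    engine = engine or EngineStub()
    return [engine.generate(prompt_fn(r)) for r in rows], engine


rows = ["a", "b"]

# Current behavior: each sequential stage builds its own deployment.
out1, _ = run_stage(rows, lambda r: f"summarize: {r}")
out2, _ = run_stage(out1, lambda r: f"translate: {r}")
assert EngineStub.instances == 2  # two stages -> two deployments

# Desired behavior: the second stage reuses the first stage's deployment.
EngineStub.instances = 0
out1, shared = run_stage(rows, lambda r: f"summarize: {r}")
out2, _ = run_stage(out1, lambda r: f"translate: {r}", engine=shared)
assert EngineStub.instances == 1  # one deployment serves both stages
```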
I am planning to work around this by deploying the model through Ray Serve and using HttpRequestProcessorConfig instead.
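A rough sketch of that workaround, assuming a model already deployed behind an OpenAI-compatible HTTP endpoint (the URL, header contents, and preprocess/postprocess field names below are illustrative placeholders, not values from this issue):

```python
from ray.data.llm import HttpRequestProcessorConfig, build_llm_processor

# Both sequential steps hit the same Ray Serve deployment over HTTP,
# so no extra vLLM replicas are created per processor stage.
config = HttpRequestProcessorConfig(
    url="http://localhost:8000/v1/chat/completions",  # placeholder Serve endpoint
    headers={"Authorization": "Bearer PLACEHOLDER"},
    qps=1,
)

step_one = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        payload=dict(
            model="my-model",  # placeholder model name
            messages=[{"role": "user", "content": f"summarize: {row['text']}"}],
        ),
    ),
    postprocess=lambda row: dict(summary=row["http_response"]),
)

step_two = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        payload=dict(
            model="my-model",
            messages=[{"role": "user", "content": f"translate: {row['summary']}"}],
        ),
    ),
    postprocess=lambda row: dict(translation=row["http_response"]),
)
```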