[Data] [LLM] Allow vLLM deployments to be shared by sequential processors #52277
Description
Allow sequential Ray Data processor steps to optionally reuse an existing vLLM deployment.
Use case
Sequential batch inference (e.g., using the output of one LLM completion as the prompt for a second LLM request) is easy to express with the Ray Data LLM API by defining multiple processors. Often, however, the only thing that changes between sequential steps is the prompt. In that case it would be ideal to reuse the existing vLLM deployment rather than create a new instance. Because Ray Data currently manages mapper state and resources for each stage separately, this is not supported. The main cost is resources: every sequential processing step requires its own dedicated vLLM deployment, so workloads with many sequential steps become much more expensive than necessary.
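The resource issue can be illustrated with a minimal, self-contained sketch. `EngineStub`, `run_stage`, and the instance accounting below are hypothetical stand-ins for a vLLM deployment and a processor stage, not Ray APIs; the point is only to show that sharing one engine halves the number of deployments for two stages:

```python
class EngineStub:
    """Stand-in for a vLLM deployment; counts how many engines get created."""
    instances = 0

    def __init__(self):
        EngineStub.instances += 1

    def generate(self, prompt):
        return f"completion({prompt})"


def run_stage(rows, prompt_fn, engine=None):
    """Stand-in for one processor stage: builds an engine unless one is shared."""
    engine = engine or EngineStub()
    return [engine.generate(prompt_fn(r)) for r in rows], engine


rows = ["a", "b"]

# Current behavior: each sequential stage builds its own deployment.
out1, _ = run_stage(rows, lambda r: f"summarize: {r}")
out2, _ = run_stage(out1, lambda r: f"translate: {r}")
assert EngineStub.instances == 2  # two stages -> two deployments

# Desired behavior: the second stage reuses the first stage's deployment.
EngineStub.instances = 0
out1, shared = run_stage(rows, lambda r: f"summarize: {r}")
out2, _ = run_stage(out1, lambda r: f"translate: {r}", engine=shared)
assert EngineStub.instances == 1  # one deployment serves both stages
```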
I am planning to work around this by deploying the model through Ray Serve and using HttpRequestProcessorConfig instead.
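A rough sketch of that workaround, assuming a model already deployed behind an OpenAI-compatible HTTP endpoint (the URL, header contents, and preprocess/postprocess field names below are illustrative placeholders, not values from this issue):

```python
from ray.data.llm import HttpRequestProcessorConfig, build_llm_processor

# Both sequential steps hit the same Ray Serve deployment over HTTP,
# so no extra vLLM replicas are created per processor stage.
config = HttpRequestProcessorConfig(
    url="http://localhost:8000/v1/chat/completions",  # placeholder Serve endpoint
    headers={"Authorization": "Bearer PLACEHOLDER"},
    qps=1,
)

step_one = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        payload=dict(
            model="my-model",  # placeholder model name
            messages=[{"role": "user", "content": f"summarize: {row['text']}"}],
        ),
    ),
    postprocess=lambda row: dict(summary=row["http_response"]),
)

step_two = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        payload=dict(
            model="my-model",
            messages=[{"role": "user", "content": f"translate: {row['summary']}"}],
        ),
    ),
    postprocess=lambda row: dict(translation=row["http_response"]),
)
```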