[Data][LLM] Extend Ray’s Batch Processing to Support Video Inputs for Multimodal Models

### Description

Currently, Ray’s multimodal/vision model support for batch processing is limited to images only.

🔗 [Documentation reference](https://docs.ray.io/en/latest/data/working-with-llms.html#batch-inference-with-vision-language-model-vlm)

However, libraries like vLLM have multimodal support for video:
- https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file#inference-locally
- https://docs.vllm.ai/en/latest/features/multimodal_inputs.html#video-inputs
- https://github.com/vllm-project/vllm/blob/main/vllm/multimodal/utils.py#L277-L297

###  Problem
Today, if a user wants to batch process videos in Ray they would need to manually download and preprocess videos into frames and feed them as individual images. This preparation and preprocessing should be handled by Ray.

### Proposal
Extend the existing [PrepareImageStage](https://github.com/ray-project/ray/blob/master/python/ray/llm/_internal/batch/stages/prepare_image_stage.py) to handle additional media types, making it a more generalised PrepareMediaStage. (Alternative, could be adding a PrepareVideoStage)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Data][LLM] Extend Ray’s Batch Processing to Support Video Inputs for Multimodal Models #56424

Description

Problem

Proposal

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Data][LLM] Extend Ray’s Batch Processing to Support Video Inputs for Multimodal Models #56424

Description

Description

Problem

Proposal

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions