[Data][LLM] Extend Ray’s Batch Processing to Support Video Inputs for Multimodal Models #56424

@GuyStone

Description

Currently, Ray’s batch processing support for multimodal/vision models is limited to images.

🔗 Documentation reference

However, libraries like vLLM have multimodal support for video:

Problem

Today, a user who wants to batch process videos in Ray has to manually download the videos, preprocess them into frames, and feed the frames in as individual images. Ray should handle this preparation and preprocessing.
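To make the manual workaround concrete, here is a minimal sketch of the frame-selection step a user has to write themselves today. It assumes uniform sampling, which is only one possible strategy. `sample_frame_indices` is a hypothetical helper, not part of Ray or vLLM, and the actual download and decoding of the video are left out.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick up to `num_frames` evenly spaced frame indices from a video.

    Hypothetical helper for illustration only; not a Ray or vLLM API.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # The video is shorter than the requested sample: keep every frame.
        return list(range(total_frames))
    # Split the video into `num_frames` equal-width segments and take
    # the frame at the midpoint of each segment.
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]
```

Each selected frame would then be decoded and passed to the model as a separate image, which is the per-user boilerplate this issue asks Ray to absorb.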

Proposal

Extend the existing PrepareImageStage to handle additional media types, making it a more generalised PrepareMediaStage. (Alternatively, add a separate PrepareVideoStage.)
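One way to picture the generalised stage is a dispatcher that routes each input to a handler for its media type. The sketch below is purely illustrative: `prepare_media`, the two handler functions, and the extension-based type detection are all assumptions made for this example, not Ray's actual PrepareImageStage interface.

```python
import mimetypes
from typing import Callable, Dict


def prepare_image(url: str) -> dict:
    # Placeholder: a real stage would download and decode the image here.
    return {"type": "image", "source": url}


def prepare_video(url: str) -> dict:
    # Placeholder: a real stage would download the video and extract frames.
    return {"type": "video", "source": url}


# Hypothetical registry mapping a top-level MIME type to its handler.
HANDLERS: Dict[str, Callable[[str], dict]] = {
    "image": prepare_image,
    "video": prepare_video,
}


def prepare_media(url: str) -> dict:
    """Route a media URL to the handler for its type (guessed from the extension)."""
    mime, _ = mimetypes.guess_type(url)
    if mime is None:
        raise ValueError(f"Cannot determine media type for {url!r}")
    kind = mime.split("/")[0]  # e.g. "video/mp4" -> "video"
    handler = HANDLERS.get(kind)
    if handler is None:
        raise ValueError(f"Unsupported media type {mime!r} for {url!r}")
    return handler(url)
```

With a registry like this, supporting another media type means adding one handler rather than a new stage. The PrepareVideoStage alternative would instead keep image and video preparation as separate stages.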
