[Data] multimodal vision batch inference skips the image preparation step #56125
Description
What happened + What you expected to happen
When using multimodal/vision models in batch processing, image inputs (passed as a URL or a data: URI) are not prepared unless the messages also include a prompt whose content is a plain string (i.e. "content": "You are an assistant" rather than "content": [...]).
The same input, with and without the system prompt, leads to inconsistent image preparation behaviour:
messages=[
    # {"role": "system", "content": "You are an assistant"},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
            },
            {
                "type": "text",
                "text": f"Can you describe this image in {row['id']} words?",
            },
        ],
    },
],
With a system prompt whose content is a plain string (i.e. 'content': 'You are an assistant' rather than a content array) → the model can see the image, responses are reasonable, and the row's messages are preserved and include image_sizes:
{'resp': ' In the image we can see there are three people and a dog and there is a text on it.',
'messages': [{'role': 'system', 'content': 'You are an assistant'},
{'role': 'user',
'content': [{'type': 'image',
'image': 'https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9'},
{'type': 'text', 'text': 'Can you describe this image in 1 words?'}]}],
'image_sizes': [[640, 640]],
'prompt': '<|im_start|>System: <end_of_utterance>\nUser:<image>Can you describe this image in 1 words?<end_of_utterance>\nAssistant:',
...
Without a system prompt → the model outputs nonsensical responses (larger models such as google/gemma-3-4b-it will report that they cannot see an image), image_sizes is missing, and the messages content is mutated (each content part gains 'text': None or 'image': None). You can even pass bad data as the image content (e.g. inconsistent types or an invalid URL) and it does not fail:
{'resp': ' The image depicts a scene from a historical or fictional narrative, likely involving a battle or conflict between two groups of people. The background is blurred, focusing attention on the foreground, which is a group of people standing in a group. The people are dressed',
'messages': [{'content': [{'image': 'https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9',
'text': None,
'type': 'image'},
{'image': None,
'text': 'Can you describe this image in 1 words?',
'type': 'text'}],
'role': 'user'}],
'prompt': '<|im_start|>User:<image>Can you describe this image in 1 words?<end_of_utterance>\nAssistant:',
...
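The mutation above is easy to detect programmatically: without the system prompt, each content part comes back with an extra None-valued key injected. A minimal check, assuming only that the returned messages follow the shape shown in the output dump (the helper name is illustrative, not part of Ray):

```python
# Sketch: flag content parts that carry injected None-valued keys,
# i.e. {"type": "text", ...} that gained "image": None or vice versa.
def mutated_parts(messages):
    """Return the content parts that contain any None-valued key."""
    bad = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain-string content is left untouched
        for part in content:
            if any(v is None for v in part.values()):
                bad.append(part)
    return bad

# Shape taken from the "without a system prompt" output above.
returned = [{"role": "user",
             "content": [{"type": "image",
                          "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
                          "text": None},
                         {"type": "text",
                          "text": "Can you describe this image in 1 words?",
                          "image": None}]}]
print(len(mutated_parts(returned)))  # → 2, both parts were mutated
```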
Versions / Dependencies
python 3.11
ray==2.49.0
vllm==0.10.0
Reproduction script
!pip install "ray==2.49.0"
!pip install "vllm==0.10.0"

import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(
    model_source="HuggingFaceTB/SmolVLM-256M-Instruct",
    task_type="generate",
    engine_kwargs=dict(
        # Skip CUDA graph capturing to reduce startup time.
        enforce_eager=True,
        # CI uses T4 GPU which does not support bfloat16.
        dtype="half",
    ),
    # CI uses T4 GPU which is not supported by vLLM v1 FlashAttn.
    runtime_env=dict(
        env_vars=dict(
            VLLM_USE_V1="0",
        ),
    ),
    apply_chat_template=True,
    has_image=True,
    tokenize=True,
    detokenize=True,
    batch_size=16,
    accelerator_type="T4",
    concurrency=1,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[
            # {"role": "system", "content": "You are an assistant"},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
                    },
                    {
                        "type": "text",
                        "text": f"Can you describe this image in {row['id']} words?",
                    },
                ],
            },
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=50,
        ),
    ),
    postprocess=lambda row: {
        "resp": row["generated_text"],
        **row,
    },
)

ds = ray.data.range(3)
ds = ds.map(lambda x: {"id": x["id"], "val": x["id"] + 5})
ds = processor(ds)
ds = ds.materialize()
outs = ds.take_all()
Issue Severity
High: It blocks me from completing my task.
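Until the preparation step is fixed, a workaround consistent with the behaviour above is to always prepend a system message whose content is a plain string, since that is the case where image preparation runs correctly. A minimal sketch (the helper name is illustrative, not part of the Ray API):

```python
# Sketch of a workaround: ensure every request carries a string-content
# system message so the multimodal image preparation step is triggered.
def with_system_prompt(user_content, system_text="You are an assistant"):
    """Prepend a system message whose content is a plain string."""
    return [
        {"role": "system", "content": system_text},  # string, not a list
        {"role": "user", "content": user_content},
    ]

messages = with_system_prompt([
    {"type": "image",
     "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9"},
    {"type": "text", "text": "Can you describe this image in 1 words?"},
])
```

This list can be returned from the `preprocess` lambda in place of the hand-built `messages` in the reproduction script.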