[Data] multimodal vision batch inference skips the image preparation step #56125
Description
What happened + What you expected to happen
When using multimodal/vision models in batch processing, image inputs (passed as a URL or a data: URI) are not prepared unless the messages also include a prompt whose content is a plain string (i.e. "content": "You are an assistant" rather than "content": [...]).
The same input, with and without the system prompt, leads to inconsistent image preparation behaviour:
messages=[
    # {"role": "system", "content": "You are an assistant"},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
            },
            {
                "type": "text",
                "text": f"Can you describe this image in {row['id']} words?",
            },
        ],
    },
],
With a system prompt whose content is a plain string (i.e. 'content': 'You are an assistant' rather than a content array) → the model can see the image, responses are reasonable, and the row's messages are preserved and include image_sizes:
{'resp': ' In the image we can see there are three people and a dog and there is a text on it.',
'messages': [{'role': 'system', 'content': 'You are an assistant'},
{'role': 'user',
'content': [{'type': 'image',
'image': 'https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9'},
{'type': 'text', 'text': 'Can you describe this image in 1 words?'}]}],
'image_sizes': [[640, 640]],
'prompt': '<|im_start|>System: <end_of_utterance>\nUser:<image>Can you describe this image in 1 words?<end_of_utterance>\nAssistant:',
...
Without a system prompt → the model outputs nonsensical responses (larger models such as google/gemma-3-4b-it will report that they cannot see an image), image_sizes is missing, and the messages content is mutated (each content part gains 'text': None or 'image': None). You can even pass bad data as the image content (e.g. inconsistent types or an invalid URL) and it does not fail:
{'resp': ' The image depicts a scene from a historical or fictional narrative, likely involving a battle or conflict between two groups of people. The background is blurred, focusing attention on the foreground, which is a group of people standing in a group. The people are dressed',
'messages': [{'content': [{'image': 'https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9',
'text': None,
'type': 'image'},
{'image': None,
'text': 'Can you describe this image in 1 words?',
'type': 'text'}],
'role': 'user'}],
'prompt': '<|im_start|>User:<image>Can you describe this image in 1 words?<end_of_utterance>\nAssistant:',
...
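The mutation above is easy to detect programmatically: without the system prompt, each content part comes back with an extra None-valued key injected. A minimal check, assuming only that the returned messages follow the shape shown in the output dump (the helper name is illustrative, not part of Ray):

```python
# Sketch: flag content parts that carry injected None-valued keys,
# i.e. {"type": "text", ...} that gained "image": None or vice versa.
def mutated_parts(messages):
    """Return the content parts that contain any None-valued key."""
    bad = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain-string content is left untouched
        for part in content:
            if any(v is None for v in part.values()):
                bad.append(part)
    return bad

# Shape taken from the "without a system prompt" output above.
returned = [{"role": "user",
             "content": [{"type": "image",
                          "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
                          "text": None},
                         {"type": "text",
                          "text": "Can you describe this image in 1 words?",
                          "image": None}]}]
print(len(mutated_parts(returned)))  # → 2, both parts were mutated
```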
Versions / Dependencies
python 3.11
ray==2.49.0
vllm==0.10.0
Reproduction script
!pip install "ray==2.49.0"
!pip install "vllm==0.10.0"

import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig

config = vLLMEngineProcessorConfig(
    model_source="HuggingFaceTB/SmolVLM-256M-Instruct",
    task_type="generate",
    engine_kwargs=dict(
        # Skip CUDA graph capturing to reduce startup time.
        enforce_eager=True,
        # CI uses T4 GPU which does not support bfloat16.
        dtype="half",
    ),
    # CI uses T4 GPU which is not supported by vLLM v1 FlashAttn.
    runtime_env=dict(
        env_vars=dict(
            VLLM_USE_V1="0",
        ),
    ),
    apply_chat_template=True,
    has_image=True,
    tokenize=True,
    detokenize=True,
    batch_size=16,
    accelerator_type="T4",
    concurrency=1,
)

processor = build_llm_processor(
    config,
    preprocess=lambda row: dict(
        messages=[
            # {"role": "system", "content": "You are an assistant"},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
                    },
                    {
                        "type": "text",
                        "text": f"Can you describe this image in {row['id']} words?",
                    },
                ],
            },
        ],
        sampling_params=dict(
            temperature=0.3,
            max_tokens=50,
        ),
    ),
    postprocess=lambda row: {
        "resp": row["generated_text"],
        **row,
    },
)

ds = ray.data.range(3)
ds = ds.map(lambda x: {"id": x["id"], "val": x["id"] + 5})
ds = processor(ds)
ds = ds.materialize()
outs = ds.take_all()
Issue Severity
High: It blocks me from completing my task.
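Until the preparation step is fixed, a workaround consistent with the behaviour above is to always prepend a system message whose content is a plain string, since that is the case where image preparation runs correctly. A minimal sketch (the helper name is illustrative, not part of the Ray API):

```python
# Sketch of a workaround: ensure every request carries a string-content
# system message so the multimodal image preparation step is triggered.
def with_system_prompt(user_content, system_text="You are an assistant"):
    """Prepend a system message whose content is a plain string."""
    return [
        {"role": "system", "content": system_text},  # string, not a list
        {"role": "user", "content": user_content},
    ]

messages = with_system_prompt([
    {"type": "image",
     "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9"},
    {"type": "text", "text": "Can you describe this image in 1 words?"},
])
```

This list can be returned from the `preprocess` lambda in place of the hand-built `messages` in the reproduction script.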