[Data.llm] Fix multimodal image extraction when no system prompt is present by nrghosh · Pull Request #56435 · ray-project/ray

nrghosh · 2025-09-10T23:44:23Z

Added .tolist() conversion just like in ChatTemplateStage to handle both PyArrow and Python objects consistently.

Problem:

- When messages have mixed content types (a system prompt with string content plus a user message with list content), Ray uses pickle serialization, which yields native Python objects.
- When messages have uniform content types (only user messages with list content), Ray uses PyArrow serialization, which yields ListValue/StructValue objects.
- The PrepareImageStage.extract_image_info() method has a hardcoded isinstance(message["content"], list) check that only works with Python lists, not PyArrow objects, so it silently skips all image extraction in the uniform case.
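The failure mode can be shown without Ray. In this minimal sketch, `ArrowLikeList` is a hypothetical stand-in for any list-like wrapper that is not a Python `list` (it is not a PyArrow class), and `extract_image_urls_strict` is an illustrative function, not the actual body of `PrepareImageStage.extract_image_info()`. The URL is a placeholder.

```python
class ArrowLikeList:
    """Hypothetical stand-in for a list-like wrapper that is not a Python list."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        return iter(self._items)

    def tolist(self):
        return list(self._items)


def extract_image_urls_strict(messages):
    """Collect image URLs, but only from content that is a real Python list."""
    urls = []
    for message in messages:
        content = message["content"]
        if not isinstance(content, list):
            # Strings are skipped here, and so is any non-list wrapper.
            continue
        for item in content:
            if item.get("type") == "image":
                urls.append(item["image"])
    return urls


content = [
    {"type": "image", "image": "https://example.com/cat.png"},
    {"type": "text", "text": "Describe this image."},
]

as_python = [{"role": "user", "content": content}]
as_wrapper = [{"role": "user", "content": ArrowLikeList(content)}]

print(extract_image_urls_strict(as_python))   # ['https://example.com/cat.png']
print(extract_image_urls_strict(as_wrapper))  # []
```

The second call returns an empty list without raising, which matches the "silently skip" behavior described above.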

Fix:

Add a .tolist() conversion to handle PyArrow objects the same way ChatTemplateStage does. This gives consistent image extraction and handling regardless of the serialization method (system prompt or no system prompt).
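A minimal sketch of that normalization, assuming the wrapped content exposes a `.tolist()` method as the description says. `ListLike`, `normalize_content`, and `extract_image_urls` are hypothetical names for illustration only; the real change lives in `PrepareImageStage`.

```python
class ListLike:
    """Hypothetical wrapper exposing .tolist(), standing in for a non-list container."""

    def __init__(self, items):
        self._items = list(items)

    def tolist(self):
        return list(self._items)


def normalize_content(content):
    """Return content.tolist() when available, otherwise the content unchanged."""
    if hasattr(content, "tolist"):
        return content.tolist()
    return content


def extract_image_urls(messages):
    """Collect image URLs from list content after normalizing the container type."""
    urls = []
    for message in messages:
        content = normalize_content(message["content"])
        if not isinstance(content, list):
            continue  # plain-string content, e.g. a system prompt
        for item in content:
            if item.get("type") == "image":
                urls.append(item["image"])
    return urls


parts = [
    {"type": "image", "image": "https://example.com/cat.png"},
    {"type": "text", "text": "Describe this image."},
]
with_prompt = [
    {"role": "system", "content": "You are an assistant"},
    {"role": "user", "content": parts},
]
without_prompt = [{"role": "user", "content": ListLike(parts)}]

# Both shapes now yield the same images.
assert extract_image_urls(with_prompt) == extract_image_urls(without_prompt)
```

Checking for the method with `hasattr` rather than for a specific class keeps the helper independent of which container type the serialization path happens to produce.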

Why are these changes needed?

PrepareImageStage was failing to extract images when messages had uniform content types (no system prompt), because Ray Data uses PyArrow serialization instead of pickle in that case, and isinstance(pyarrow_obj, list) returns False.
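To tell which of the two cases a given `messages` value falls into, a small helper can classify the content types. This is only a sketch: `content_kinds` and `has_uniform_content` are hypothetical names, and the helper inspects the input shape only. It does not query how Ray Data actually serializes the row.

```python
def content_kinds(messages):
    """Return the set of Python type names used for message content."""
    return {type(message["content"]).__name__ for message in messages}


def has_uniform_content(messages):
    """True when every message's content has the same Python type."""
    return len(content_kinds(messages)) == 1


user_parts = [
    {"type": "image", "image": "https://example.com/cat.png"},
    {"type": "text", "text": "Describe this image."},
]

with_prompt = [
    {"role": "system", "content": "You are an assistant"},  # str content
    {"role": "user", "content": user_parts},                # list content
]
without_prompt = [{"role": "user", "content": user_parts}]  # list content only

print(sorted(content_kinds(with_prompt)), has_uniform_content(with_prompt))
print(sorted(content_kinds(without_prompt)), has_uniform_content(without_prompt))
```

Per the description above, the uniform case (no system prompt) is the one that hit the bug.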

Related issue number

Fixes #56125

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Reproduction Script (based on user repro)

Differences

- ray/vllm versions (slightly different)
- GPU: L4 instead of T4 (which also means VLLM_USE_V1="1")

#!/usr/bin/env python3
"""
Repro for multimodal vision batch inference bug reported in #56125.

Environment:
ray                                2.49.1
vllm                               0.10.1.1

Usage:
python reproduce_multimodal_bug.py

Reproduces
1. WITH system prompt - should work (image_sizes present, reasonable responses)
2. WITHOUT system prompt - should fail (image_sizes missing, nonsensical responses)

"""

import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig


def test_with_system_prompt():
    """Test case WITH system prompt - should work correctly."""
    print("=" * 60)
    print("TESTING WITH SYSTEM PROMPT (should work)")
    print("=" * 60)
    
    config = vLLMEngineProcessorConfig(
        model_source="HuggingFaceTB/SmolVLM-256M-Instruct",
        task_type="generate",
        engine_kwargs=dict(
            # Skip CUDA graph capturing to reduce startup time.
            enforce_eager=True,
            dtype="half",
        ),
        apply_chat_template=True,
        has_image=True,
        tokenize=True,
        detokenize=True,
        batch_size=16,
        accelerator_type="L4",
        concurrency=1,
    )

    processor = build_llm_processor(
        config,
        preprocess=lambda row: dict(
            messages=[
                {"role": "system", "content": "You are an assistant"},  # STRING content
                {
                    "role": "user",
                    "content": [  # LIST content - creates mixed content types
                        {
                            "type": "image",
                            "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
                        },
                        {
                            "type": "text",
                            "text": f"Can you describe this image in {row['id']} words?",
                        },
                    ],
                },
            ],
            sampling_params=dict(
                temperature=0.3,
                max_tokens=50,
            ),
        ),
        postprocess=lambda row: {
            "resp": row["generated_text"],
            **row,
        },
    )

    ds = ray.data.range(3)
    ds = ds.map(lambda x: {"id": x["id"], "val": x["id"] + 5})
    ds = processor(ds)
    ds = ds.materialize()
    outs = ds.take_all()
    
    print(f"Results: {len(outs)} items")
    for i, out in enumerate(outs):
        print(f"\nResult {i}:")
        print(f"  image_sizes: {out.get('image_sizes', 'MISSING')}")
        print(f"  messages preserved: {len(out.get('messages', []))} messages")
        if 'messages' in out and len(out['messages']) > 1:
            user_content = out['messages'][1]['content']
            print(f"  user content type: {type(user_content)}")
            if hasattr(user_content, '__len__') and len(user_content) > 0:
                first_item = user_content[0]
                print(f"  first content item: {first_item}")
        print(f"  response: {out['resp'][:100]}...")
    
    return outs


def test_without_system_prompt():
    """Test case WITHOUT system prompt - should fail (before fix)."""
    print("\n" + "=" * 60)
    print("TESTING WITHOUT SYSTEM PROMPT (should fail before fix)")
    print("=" * 60)
    
    config = vLLMEngineProcessorConfig(
        model_source="HuggingFaceTB/SmolVLM-256M-Instruct",
        task_type="generate",
        engine_kwargs=dict(
            # Skip CUDA graph capturing to reduce startup time.
            enforce_eager=True,
            dtype="half",
        ),
        apply_chat_template=True,
        has_image=True,
        tokenize=True,
        detokenize=True,
        batch_size=16,
        accelerator_type="L4",
        concurrency=1,
    )

    processor = build_llm_processor(
        config,
        preprocess=lambda row: dict(
            messages=[
                # NO system prompt - only user message with LIST content
                {
                    "role": "user",
                    "content": [  # Only LIST content - uniform content types
                        {
                            "type": "image",
                            "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
                        },
                        {
                            "type": "text",
                            "text": f"Can you describe this image in {row['id']} words?",
                        },
                    ],
                },
            ],
            sampling_params=dict(
                temperature=0.3,
                max_tokens=50,
            ),
        ),
        postprocess=lambda row: {
            "resp": row["generated_text"],
            **row,
        },
    )

    ds = ray.data.range(3)
    ds = ds.map(lambda x: {"id": x["id"], "val": x["id"] + 5})
    ds = processor(ds)
    ds = ds.materialize()
    outs = ds.take_all()
    
    print(f"Results: {len(outs)} items")
    for i, out in enumerate(outs):
        print(f"\nResult {i}:")
        print(f"  image_sizes: {out.get('image_sizes', 'MISSING')}")
        print(f"  messages preserved: {len(out.get('messages', []))} messages")
        if 'messages' in out and len(out['messages']) > 0:
            user_content = out['messages'][0]['content']
            print(f"  user content type: {type(user_content)}")
            if hasattr(user_content, '__len__') and len(user_content) > 0:
                first_item = user_content[0]
                print(f"  first content item: {first_item}")
        print(f"  response: {out['resp'][:100]}...")
    
    return outs


def analyze_results(with_system, without_system):
    """Analyze and compare the results."""
    print("\n" + "=" * 60)
    print("ANALYSIS")
    print("=" * 60)
    
    with_result = with_system[0]
    without_result = without_system[0]
    
    print("\nWITH system prompt:")
    print(f"   image_sizes present: {'image_sizes' in with_result}")
    print(f"   image_sizes value: {with_result.get('image_sizes', 'MISSING')}")
    print(f"   response preview: {with_result['resp'][:80]}...")
    
    print("\nWITHOUT system prompt:")
    print(f"   image_sizes present: {'image_sizes' in without_result}")
    print(f"   image_sizes value: {without_result.get('image_sizes', 'MISSING')}")
    print(f"   response preview: {without_result['resp'][:80]}...")
    
    # Check if bug is reproduced
    bug_reproduced = (
        'image_sizes' in with_result and 
        'image_sizes' not in without_result
    )
    
    print(f"\nBug reproduction: {'CONFIRMED' if bug_reproduced else 'NOT REPRODUCED'}")
    
    if bug_reproduced:
        print("   The bug is reproduced exactly as reported:")
        print("   - WITH system prompt: images work, image_sizes present")
        print("   - WITHOUT system prompt: images fail, image_sizes missing")
        print("   - This confirms our PyArrow vs Python object theory")
    else:
        print("   The bug was not reproduced - either:")
        print("   - The fix is already applied and working")
        print("   - Our theory needs refinement")
    
    return bug_reproduced


def main():
    """Run the full reproduction test."""
    print("REPRODUCING MULTIMODAL VISION BATCH INFERENCE BUG")
    print("Issue #56125: [Data] multimodal vision batch inference skips the image preparation step")
    
    try:
        # Initialize Ray
        ray.init(ignore_reinit_error=True)
        
        # Run both test cases
        print("\nRunning test cases...")
        with_system_results = test_with_system_prompt()
        without_system_results = test_without_system_prompt()
        
        # Analyze results
        bug_reproduced = analyze_results(with_system_results, without_system_results)
        
        print("\nCONCLUSION:")
        if bug_reproduced:
            print("   The bug has been reproduced!")
            print("   Can now apply the fix and run this script again to verify it works.")
        else:
            print("   Bug was not reproduced - the fix may be already applied.")
            
    except Exception as e:
        print(f"Error during reproduction: {e}")
        import traceback
        traceback.print_exc()
    finally:
        ray.shutdown()


if __name__ == "__main__":
    main()

kouroshHakha

Awesome. Can you add a unit test as well that covers these failure cases (uniform and non-uniform types)?

nrghosh

> Awesome. Can you add a unit test as well that covers these failure cases (uniform and non-uniform types)?

Yes, I will re-add tests to ensure metadata doesn't get lost and that the request is parsed and passed through in both cases (prompt and no prompt), without actually evaluating the model output.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

GuyStone · 2025-09-12T21:17:40Z

Yay amazing, thank you @nrghosh for investigating and fixing this! 🙏

nrghosh

fixed + added tests cc @kouroshHakha

kouroshHakha · 2025-09-12T22:29:34Z

python/ray/llm/tests/batch/cpu/stages/test_prepare_image_stage.py

            pass


+# Test that image extraction works consistently with both uniform content types


Can you consolidate these tests into one parametrized test so that maintenance is simpler?
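One way such a consolidated test could look is a single test body driven by a table of cases. This is a sketch only, not the test that was merged: `FakeArray`, `extract_image_urls`, and the case labels are hypothetical, and the stdlib `unittest.subTest` loop stands in for `pytest.mark.parametrize` so the example runs without extra dependencies.

```python
import unittest

IMAGE_URL = "https://example.com/cat.png"  # placeholder URL, not from the PR


class FakeArray:
    """Hypothetical non-list container exposing .tolist()."""

    def __init__(self, items):
        self._items = list(items)

    def tolist(self):
        return list(self._items)


def extract_image_urls(messages):
    """Illustrative extraction with the .tolist() normalization applied."""
    urls = []
    for message in messages:
        content = message["content"]
        if hasattr(content, "tolist"):
            content = content.tolist()
        if isinstance(content, list):
            urls.extend(p["image"] for p in content if p.get("type") == "image")
    return urls


USER_PARTS = [
    {"type": "image", "image": IMAGE_URL},
    {"type": "text", "text": "Describe this image."},
]

# One (label, messages) entry per scenario. With pytest these entries would
# typically be passed to pytest.mark.parametrize instead of a subTest loop.
CASES = [
    ("mixed_with_system_prompt", [
        {"role": "system", "content": "You are an assistant"},
        {"role": "user", "content": USER_PARTS},
    ]),
    ("uniform_without_system_prompt", [
        {"role": "user", "content": FakeArray(USER_PARTS)},
    ]),
]


class ImageExtractionTest(unittest.TestCase):
    def test_image_found_in_every_case(self):
        for label, messages in CASES:
            with self.subTest(case=label):
                self.assertEqual(extract_image_urls(messages), [IMAGE_URL])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ImageExtractionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Adding a new scenario then means adding one entry to `CASES` rather than writing another test function.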

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

kouroshHakha

LGTM

nrghosh requested a review from a team September 10, 2025 23:44

nrghosh self-assigned this Sep 10, 2025

nrghosh mentioned this pull request Sep 10, 2025

[Data] multimodal vision batch inference skips the image preparation step #56125

Closed

nrghosh added the labels data (Ray Data-related issues), llm, and go (add ONLY when ready to merge, run all tests) Sep 10, 2025

nrghosh marked this pull request as ready for review September 10, 2025 23:55

nrghosh requested a review from a team as a code owner September 10, 2025 23:55

kouroshHakha reviewed Sep 11, 2025

nrghosh commented Sep 11, 2025

Add unit tests for uniform/non-uniform types

f6a8c9d

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

nrghosh requested a review from kouroshHakha September 11, 2025 21:58

nrghosh commented Sep 11, 2025

nrghosh requested a review from a team September 12, 2025 21:38

nrghosh commented Sep 12, 2025

kouroshHakha reviewed Sep 12, 2025

Parametrize unit tests

474bac0

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

kouroshHakha approved these changes Sep 13, 2025

kouroshHakha merged commit 1028dcc into ray-project:master Sep 13, 2025
5 checks passed

nrghosh deleted the multimodal-fix-data-serialization branch September 13, 2025 00:17
