[Data.llm] Fix multimodal image extraction when no system prompt is present #56435

Merged
kouroshHakha merged 3 commits into ray-project:master from nrghosh:multimodal-fix-data-serialization
Sep 13, 2025

Conversation

@nrghosh nrghosh commented Sep 10, 2025

Added .tolist() conversion just like in ChatTemplateStage to handle both PyArrow and Python objects consistently.

Problem:

  • When messages have mixed content types (a system prompt with string content plus a user message with list content), Ray uses pickle serialization, which yields native Python objects.
  • When messages have uniform content types (only user messages with list content), Ray uses PyArrow serialization, which yields ListValue/StructValue objects.
  • PrepareImageStage.extract_image_info() has a hardcoded isinstance(message["content"], list) check that only matches Python lists, not PyArrow objects, so it silently skips all image extraction in the uniform case.
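
The failure mode described above can be sketched without Ray or PyArrow. This is a minimal illustration under stated assumptions: `ArrowLikeList` is a hypothetical stand-in for the list-like value that arrives on the PyArrow path (it is not a real PyArrow class), `extract_images_strict` is a reduced imitation of the hardcoded check rather than the actual body of `PrepareImageStage.extract_image_info()`, and the `example.com` URL is a placeholder.

```python
from typing import Any, Dict, List


class ArrowLikeList:
    """Hypothetical stand-in for a list-like value that is not a Python list."""

    def __init__(self, items: List[Any]) -> None:
        self._items = list(items)

    def tolist(self) -> List[Any]:
        return list(self._items)


def extract_images_strict(messages: List[Dict[str, Any]]) -> List[str]:
    """Illustrative pre-fix behavior: only plain Python lists are inspected."""
    images: List[str] = []
    for message in messages:
        content = message["content"]
        if not isinstance(content, list):
            # String content is skipped on purpose, but any list-like
            # wrapper is skipped here as well, without raising an error.
            continue
        for part in content:
            if part.get("type") == "image":
                images.append(part["image"])
    return images


parts = [
    {"type": "image", "image": "https://example.com/cat.png"},
    {"type": "text", "text": "Describe this image."},
]

# Simulated pickle path: content is a plain list, so the image is found.
python_messages = [{"role": "user", "content": parts}]
# Simulated Arrow path: same data, different container type.
arrow_messages = [{"role": "user", "content": ArrowLikeList(parts)}]

print(extract_images_strict(python_messages))
print(extract_images_strict(arrow_messages))
```

The second call returns an empty list instead of failing, which matches the "silently skip" behavior reported in the issue: nothing downstream is told that an image was expected.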

Fix:

  • Add a .tolist() conversion to handle PyArrow objects the same way ChatTemplateStage does, so that images are extracted and handled consistently regardless of serialization method (with or without a system prompt).
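
The idea behind the fix can be sketched as a normalization step applied before the type check. This is a sketch under assumptions, not the merged code: `ArrowLikeList`, `to_python`, and `extract_images` are hypothetical names introduced here, the `hasattr(value, "tolist")` guard is one plausible way to decide when to convert, and the `example.com` URL is a placeholder.

```python
from typing import Any, Dict, List


class ArrowLikeList:
    """Hypothetical stand-in for a list-like value that is not a Python list."""

    def __init__(self, items: List[Any]) -> None:
        self._items = list(items)

    def tolist(self) -> List[Any]:
        return list(self._items)


def to_python(value: Any) -> Any:
    """Return value.tolist() if the value exposes it, else the value unchanged."""
    return value.tolist() if hasattr(value, "tolist") else value


def extract_images(messages: List[Dict[str, Any]]) -> List[str]:
    """Collect image references after normalizing each message's content."""
    images: List[str] = []
    for message in messages:
        content = to_python(message["content"])
        if not isinstance(content, list):
            continue  # plain string content, e.g. a system prompt
        for part in content:
            if part.get("type") == "image":
                images.append(part["image"])
    return images


image_part = {"type": "image", "image": "https://example.com/cat.png"}
text_part = {"type": "text", "text": "Describe this image."}

# Mixed content types: string system prompt plus list user content.
with_system = [
    {"role": "system", "content": "You are an assistant"},
    {"role": "user", "content": [image_part, text_part]},
]
# Uniform content types, simulated as arriving in a non-list container.
without_system = [
    {"role": "user", "content": ArrowLikeList([image_part, text_part])},
]

print(extract_images(with_system))
print(extract_images(without_system))
```

Normalizing first and checking the type second keeps a single code path for both serialization outcomes, so the extraction logic does not need to know which one produced the row.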

Why are these changes needed?

PrepareImageStage was failing to extract images when messages had uniform content types (no system prompt), because Ray Data uses PyArrow serialization instead of pickle in that case, and isinstance(pyarrow_obj, list) evaluates to False.

Related issue number

Fixes #56125

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Reproduction Script (based on user repro)

Differences

  • Ray/vLLM versions (slightly different)
  • GPU: L4 instead of T4 (which also means VLLM_USE_V1="1")
#!/usr/bin/env python3
"""
Repro for multimodal vision batch inference bug reported in #56125.

Environment:
ray                                2.49.1
vllm                               0.10.1.1

Usage:
python reproduce_multimodal_bug.py

Reproduces
1. WITH system prompt - should work (image_sizes present, reasonable responses)
2. WITHOUT system prompt - should fail (image_sizes missing, nonsensical responses)

"""

import ray
from ray.data.llm import build_llm_processor, vLLMEngineProcessorConfig


def test_with_system_prompt():
    """Test case WITH system prompt - should work correctly."""
    print("=" * 60)
    print("TESTING WITH SYSTEM PROMPT (should work)")
    print("=" * 60)
    
    config = vLLMEngineProcessorConfig(
        model_source="HuggingFaceTB/SmolVLM-256M-Instruct",
        task_type="generate",
        engine_kwargs=dict(
            # Skip CUDA graph capturing to reduce startup time.
            enforce_eager=True,
            dtype="half",
        ),
        apply_chat_template=True,
        has_image=True,
        tokenize=True,
        detokenize=True,
        batch_size=16,
        accelerator_type="L4",
        concurrency=1,
    )

    processor = build_llm_processor(
        config,
        preprocess=lambda row: dict(
            messages=[
                {"role": "system", "content": "You are an assistant"},  # STRING content
                {
                    "role": "user",
                    "content": [  # LIST content - creates mixed content types
                        {
                            "type": "image",
                            "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
                        },
                        {
                            "type": "text",
                            "text": f"Can you describe this image in {row['id']} words?",
                        },
                    ],
                },
            ],
            sampling_params=dict(
                temperature=0.3,
                max_tokens=50,
            ),
        ),
        postprocess=lambda row: {
            "resp": row["generated_text"],
            **row,
        },
    )

    ds = ray.data.range(3)
    ds = ds.map(lambda x: {"id": x["id"], "val": x["id"] + 5})
    ds = processor(ds)
    ds = ds.materialize()
    outs = ds.take_all()
    
    print(f"Results: {len(outs)} items")
    for i, out in enumerate(outs):
        print(f"\nResult {i}:")
        print(f"  image_sizes: {out.get('image_sizes', 'MISSING')}")
        print(f"  messages preserved: {len(out.get('messages', []))} messages")
        if 'messages' in out and len(out['messages']) > 1:
            user_content = out['messages'][1]['content']
            print(f"  user content type: {type(user_content)}")
            if hasattr(user_content, '__len__') and len(user_content) > 0:
                first_item = user_content[0]
                print(f"  first content item: {first_item}")
        print(f"  response: {out['resp'][:100]}...")
    
    return outs


def test_without_system_prompt():
    """Test case WITHOUT system prompt - should fail (before fix)."""
    print("\n" + "=" * 60)
    print("TESTING WITHOUT SYSTEM PROMPT (should fail before fix)")
    print("=" * 60)
    
    config = vLLMEngineProcessorConfig(
        model_source="HuggingFaceTB/SmolVLM-256M-Instruct",
        task_type="generate",
        engine_kwargs=dict(
            # Skip CUDA graph capturing to reduce startup time.
            enforce_eager=True,
            dtype="half",
        ),
        apply_chat_template=True,
        has_image=True,
        tokenize=True,
        detokenize=True,
        batch_size=16,
        accelerator_type="L4",
        concurrency=1,
    )

    processor = build_llm_processor(
        config,
        preprocess=lambda row: dict(
            messages=[
                # NO system prompt - only user message with LIST content
                {
                    "role": "user",
                    "content": [  # Only LIST content - uniform content types
                        {
                            "type": "image",
                            "image": "https://i.scdn.co/image/ab67616d0000b27331efa547b5c271f00bcde9b9",
                        },
                        {
                            "type": "text",
                            "text": f"Can you describe this image in {row['id']} words?",
                        },
                    ],
                },
            ],
            sampling_params=dict(
                temperature=0.3,
                max_tokens=50,
            ),
        ),
        postprocess=lambda row: {
            "resp": row["generated_text"],
            **row,
        },
    )

    ds = ray.data.range(3)
    ds = ds.map(lambda x: {"id": x["id"], "val": x["id"] + 5})
    ds = processor(ds)
    ds = ds.materialize()
    outs = ds.take_all()
    
    print(f"Results: {len(outs)} items")
    for i, out in enumerate(outs):
        print(f"\nResult {i}:")
        print(f"  image_sizes: {out.get('image_sizes', 'MISSING')}")
        print(f"  messages preserved: {len(out.get('messages', []))} messages")
        if 'messages' in out and len(out['messages']) > 0:
            user_content = out['messages'][0]['content']
            print(f"  user content type: {type(user_content)}")
            if hasattr(user_content, '__len__') and len(user_content) > 0:
                first_item = user_content[0]
                print(f"  first content item: {first_item}")
        print(f"  response: {out['resp'][:100]}...")
    
    return outs


def analyze_results(with_system, without_system):
    """Analyze and compare the results."""
    print("\n" + "=" * 60)
    print("ANALYSIS")
    print("=" * 60)
    
    with_result = with_system[0]
    without_result = without_system[0]
    
    print("\nWITH system prompt:")
    print(f"   image_sizes present: {'image_sizes' in with_result}")
    print(f"   image_sizes value: {with_result.get('image_sizes', 'MISSING')}")
    print(f"   response preview: {with_result['resp'][:80]}...")
    
    print("\nWITHOUT system prompt:")
    print(f"   image_sizes present: {'image_sizes' in without_result}")
    print(f"   image_sizes value: {without_result.get('image_sizes', 'MISSING')}")
    print(f"   response preview: {without_result['resp'][:80]}...")
    
    # Check if bug is reproduced
    bug_reproduced = (
        'image_sizes' in with_result and 
        'image_sizes' not in without_result
    )
    
    print(f"\nBug reproduction: {'CONFIRMED' if bug_reproduced else 'NOT REPRODUCED'}")
    
    if bug_reproduced:
        print("   The bug is reproduced exactly as reported:")
        print("   - WITH system prompt: images work, image_sizes present")
        print("   - WITHOUT system prompt: images fail, image_sizes missing")
        print("   - This confirms our PyArrow vs Python object theory")
    else:
        print("   The bug was not reproduced - either:")
        print("   - The fix is already applied and working")
        print("   - Our theory needs refinement")
    
    return bug_reproduced


def main():
    """Run the full reproduction test."""
    print("REPRODUCING MULTIMODAL VISION BATCH INFERENCE BUG")
    print("Issue #56125: [Data] multimodal vision batch inference skips the image preparation step")
    
    try:
        # Initialize Ray
        ray.init(ignore_reinit_error=True)
        
        # Run both test cases
        print("\nRunning test cases...")
        with_system_results = test_with_system_prompt()
        without_system_results = test_without_system_prompt()
        
        # Analyze results
        bug_reproduced = analyze_results(with_system_results, without_system_results)
        
        print("\nCONCLUSION:")
        if bug_reproduced:
            print("   The bug has been reproduced!")
            print("   Can now apply the fix and run this script again to verify it works.")
        else:
            print("   Bug was not reproduced - the fix may be already applied.")
            
    except Exception as e:
        print(f"Error during reproduction: {e}")
        import traceback
        traceback.print_exc()
    finally:
        ray.shutdown()


if __name__ == "__main__":
    main()


…resent

PrepareImageStage was failing to extract images when messages had uniform
content types (no system prompt), because Ray Data uses PyArrow serialization
instead of pickle, and isinstance(pyarrow_obj, list) returns False

Added .tolist() conversion like ChatTemplateStage to handle both PyArrow
and Python objects consistently.

Fixes ray-project#56125

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
@nrghosh nrghosh requested a review from a team September 10, 2025 23:44
@nrghosh nrghosh self-assigned this Sep 10, 2025
@nrghosh nrghosh added the data (Ray Data-related issues), llm, and go (add ONLY when ready to merge, run all tests) labels Sep 10, 2025
@nrghosh nrghosh marked this pull request as ready for review September 10, 2025 23:55
@nrghosh nrghosh requested a review from a team as a code owner September 10, 2025 23:55

@kouroshHakha kouroshHakha left a comment

Awesome. Can you also add a unit test that covers these failure cases (uniform and non-uniform content types)?

@nrghosh nrghosh left a comment

Awesome. Can you also add a unit test that covers these failure cases (uniform and non-uniform content types)?

Yes, I will re-add tests to ensure that metadata isn't lost and that the request is parsed and passed through in both cases (with and without a system prompt), without actually evaluating the model output.

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

@nrghosh nrghosh left a comment

fixed + added tests

GuyStone commented Sep 12, 2025

Yay amazing, thank you @nrghosh for investigating and fixing this! 🙏

@nrghosh nrghosh requested a review from a team September 12, 2025 21:38

@nrghosh nrghosh left a comment

fixed + added tests cc @kouroshHakha

pass


# Test that image extraction works consistently with both uniform content types

Can you consolidate these tests into a single parametrized test so that maintenance is simpler?
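
A consolidated, table-driven test along these lines might look like the sketch below. It is not the test that was merged: it uses the standard library's `unittest` with `subTest` so that it stays self-contained, whereas a pytest-based suite would more likely express the same table with `pytest.mark.parametrize`. `ArrowLikeList`, `extract_images`, `CASES`, and `run_suite` are hypothetical names, and the `example.com` URL is a placeholder.

```python
import unittest
from typing import Any, Dict, List


class ArrowLikeList:
    """Hypothetical stand-in for a list-like value that is not a Python list."""

    def __init__(self, items: List[Any]) -> None:
        self._items = list(items)

    def tolist(self) -> List[Any]:
        return list(self._items)


def extract_images(messages: List[Dict[str, Any]]) -> List[str]:
    """Illustrative extraction that normalizes content before the type check."""
    images: List[str] = []
    for message in messages:
        content = message["content"]
        if hasattr(content, "tolist"):
            content = content.tolist()
        if not isinstance(content, list):
            continue
        images.extend(p["image"] for p in content if p.get("type") == "image")
    return images


IMAGE = {"type": "image", "image": "https://example.com/cat.png"}
TEXT = {"type": "text", "text": "Describe this image."}
SYSTEM = {"role": "system", "content": "You are an assistant"}

# One row per scenario: (name, messages, expected number of images).
CASES = [
    ("mixed_types_python_list", [SYSTEM, {"role": "user", "content": [IMAGE, TEXT]}], 1),
    ("uniform_types_python_list", [{"role": "user", "content": [IMAGE, TEXT]}], 1),
    ("uniform_types_list_like", [{"role": "user", "content": ArrowLikeList([IMAGE, TEXT])}], 1),
    ("text_only", [{"role": "user", "content": "hello"}], 0),
]


class ImageExtractionTest(unittest.TestCase):
    def test_extraction_is_container_agnostic(self) -> None:
        for name, messages, expected_count in CASES:
            with self.subTest(case=name):
                self.assertEqual(len(extract_images(messages)), expected_count)


def run_suite() -> unittest.TestResult:
    """Load and run the test case, returning the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ImageExtractionTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

Keeping the scenarios in one table means a new serialization edge case becomes one more row rather than one more test function, which is the maintenance benefit the comment asks for.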

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

@kouroshHakha kouroshHakha left a comment

LGTM

@kouroshHakha kouroshHakha merged commit 1028dcc into ray-project:master Sep 13, 2025
5 checks passed
@nrghosh nrghosh deleted the multimodal-fix-data-serialization branch September 13, 2025 00:17
ZacAttack pushed a commit to ZacAttack/ray that referenced this pull request Sep 24, 2025
…resent (ray-project#56435)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: zac <zac@anyscale.com>
marcostephan pushed a commit to marcostephan/ray that referenced this pull request Sep 24, 2025
…resent (ray-project#56435)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Marco Stephan <marco@magic.dev>
dstrodtman pushed a commit to dstrodtman/ray that referenced this pull request Oct 6, 2025
…resent (ray-project#56435)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
…resent (ray-project#56435)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…resent (ray-project#56435)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

Labels

data (Ray Data-related issues), go (add ONLY when ready to merge, run all tests), llm

Development

Successfully merging this pull request may close these issues.

[Data] multimodal vision batch inference skips the image preparation step

3 participants