[Bug] computer_use multimodal tool message causes 400 error on providers that don't support multimodal tool content (e.g. Xiaomi MiMo) #27344

@mmahao

Description

When using computer_use with a vision-capable model (e.g. xiaomi/mimo-v2.5), the tool captures a screenshot and returns a _multimodal dict whose content is a list containing both text and image_url parts. This list is then set as the content of the role: "tool" message sent to the API.

However, MiMo's API does not accept list-type content in tool messages; it requires content to be a string for role: "tool". The request therefore fails with a 400 error:

Error code: 400 - {'error': {'code': '400', 'message': 'Param Incorrect', 'param': 'text is not set', 'type': ''}}
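
For illustration, the two message shapes involved can be sketched as below. This is a minimal sketch, not the exact payload Hermes sends: the tool_call_id, the summary text, and the truncated data URL are placeholders, and has_list_content is a hypothetical helper.

```python
# Shape that triggers the 400 on MiMo: "content" is a list of parts.
# All values here are placeholders for illustration.
multimodal_tool_message = {
    "role": "tool",
    "tool_call_id": "call_0",
    "content": [
        {"type": "text", "text": "Screenshot captured"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}

# Shape that string-only providers expect: "content" is a plain string.
string_tool_message = {
    "role": "tool",
    "tool_call_id": "call_0",
    "content": "Screenshot captured",
}


def has_list_content(message: dict) -> bool:
    """Hypothetical helper: True if the message carries list-type content."""
    return isinstance(message.get("content"), list)
```

A check like has_list_content(multimodal_tool_message) could be used in a debug log to confirm which shape is being sent before the request goes out.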

Root Cause

In run_agent.py, _tool_result_content_for_active_model() (lines 9621-9659) checks _model_supports_vision() to decide whether to pass the multimodal content through or fall back to a text summary.

_model_supports_vision() (lines 9479-9497) uses agent.models_dev.get_model_capabilities(), which checks modalities.input from models.dev. For mimo-v2.5, modalities.input = ['text', 'image', 'audio', 'video'], so supports_vision returns True.

The bug: _model_supports_vision() checks if the model supports images in user messages, but doesn't check if it supports images in tool messages. These are different things:

  • Most OpenAI-compatible providers support images in user messages (via content as a list)
  • But many providers (including MiMo) require tool message content to be a string, not a list

The OpenAI Chat Completions spec defines tool message content as a string (or an array of text parts), not image parts. Some providers and models accept multimodal tool messages as an extension (Anthropic, GPT-4o), but MiMo does not.
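
As a rough illustration of what "string content" means for a provider like MiMo, list-type content can be collapsed into a single string. The sketch below is an assumption, not the existing text_summary logic in run_agent.py: flatten_tool_content and the "[image omitted]" placeholder are hypothetical.

```python
def flatten_tool_content(content):
    """Hypothetical helper: collapse OpenAI-style content parts into a string.

    Text parts are kept; image parts are replaced by a short placeholder so
    the model still learns that an image existed.
    """
    if isinstance(content, str):
        return content  # already in the shape string-only providers accept
    pieces = []
    for part in content:
        part_type = part.get("type")
        if part_type == "text":
            pieces.append(part.get("text", ""))
        elif part_type == "image_url":
            pieces.append("[image omitted]")
    return "\n".join(pieces)
```

The trade-off is that the model loses the screenshot itself, so this only helps when the text part carries enough information (for example, a list of detected UI elements).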

Reproduction

  1. Configure Hermes with xiaomi/mimo-v2.5 as the main model
  2. Call computer_use(action='capture', mode='som')
  3. The tool returns _multimodal content with an image
  4. _tool_result_content_for_active_model returns the content list (because supports_vision=True)
  5. The tool message with list content is sent to the MiMo API
  6. The MiMo API returns 400: text is not set

Suggested Fix

Add a provider/model-level flag for supports_multimodal_tool_content (or similar) that controls whether multimodal content is allowed in tool messages specifically. Providers that don't support it should always receive string content (the text_summary fallback).
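
A minimal sketch of such a flag, assuming a simple profile object. ProviderProfile, its fields, and tool_content_for are illustrative names only and do not reflect the actual provider profile schema in Hermes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderProfile:
    """Hypothetical provider profile; field names are illustrative."""
    name: str
    supports_vision: bool = False
    # Separate from supports_vision: images in role "tool" messages.
    supports_multimodal_tool_content: bool = False


def tool_content_for(profile, multimodal_parts, text_summary):
    """Return list content only when the provider accepts it in tool messages."""
    if profile.supports_vision and profile.supports_multimodal_tool_content:
        return multimodal_parts
    return text_summary


# MiMo accepts images in user messages but not in tool messages.
xiaomi = ProviderProfile(name="xiaomi", supports_vision=True)
```

With this profile, tool_content_for(xiaomi, parts, summary) returns the summary string, while a profile that sets both flags would receive the list.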

Possible approaches:

  1. Provider-specific flag: Add supports_multimodal_tool_content = False to the xiaomi provider profile
  2. Conservative default: Only use multimodal tool content for providers known to support it (Anthropic, OpenAI), and use text summary for all others
  3. Fallback with retry: Send multimodal content first; if it fails, retry with text summary (but this wastes a round-trip)

Option 2 is the safest and most backward-compatible.
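
Option 2 could look roughly like the sketch below. The allowlist contents follow the providers named in this issue; the constant, provider_of, select_tool_content, and the assumption that the provider is the prefix before "/" in the model id are all hypothetical and not taken from run_agent.py.

```python
# Providers assumed to accept list content in tool messages (per this issue).
MULTIMODAL_TOOL_CONTENT_PROVIDERS = frozenset({"anthropic", "openai"})


def provider_of(model_id: str) -> str:
    """Assumes ids look like 'provider/model', e.g. 'xiaomi/mimo-v2.5'."""
    return model_id.split("/", 1)[0].lower()


def select_tool_content(model_id, multimodal_parts, text_summary, model_supports_vision):
    """Conservative default: list content only for allowlisted providers."""
    allowed = provider_of(model_id) in MULTIMODAL_TOOL_CONTENT_PROVIDERS
    if model_supports_vision and allowed:
        return multimodal_parts
    return text_summary
```

Under this scheme a newly added provider gets the text summary until someone confirms it accepts multimodal tool messages and adds it to the allowlist, which avoids the 400 at the cost of withholding images from providers that might have handled them.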

Metadata

Labels

  • P2: Medium, degraded but workaround exists
  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • provider/xiaomi: Xiaomi MiLM
  • type/bug: Something isn't working
