Description
When using computer_use with a vision-capable model (e.g. xiaomi/mimo-v2.5), the tool captures a screenshot and returns a _multimodal dict with content as a list containing both text and image_url parts. This list is then set as the content of the role: "tool" message sent to the API.
However, MiMo's API does not accept list-type content in tool messages — it requires content to be a string for role: "tool". This causes a 400 error:
Error code: 400 - {'error': {'code': '400', 'message': 'Param Incorrect', 'param': 'text is not set', 'type': ''}}
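The difference between the two payload shapes can be illustrated with a minimal sketch. The tool_call_id, the summary text, and the truncated data URL below are placeholders invented for illustration, and tool_content_is_string is a hypothetical helper, not something from run_agent.py.

```python
# Shape that triggers the 400 on MiMo: list-type content in a tool message.
# tool_call_id and the image payload are illustrative placeholders.
rejected_message = {
    "role": "tool",
    "tool_call_id": "call_123",
    "content": [
        {"type": "text", "text": "Captured screenshot (som mode)"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}

# Shape that MiMo is reported to require: plain string content.
accepted_message = {
    "role": "tool",
    "tool_call_id": "call_123",
    "content": "Captured screenshot (som mode)",
}


def tool_content_is_string(message: dict) -> bool:
    """Hypothetical check: True when a tool message carries string content."""
    return message.get("role") == "tool" and isinstance(message.get("content"), str)


print(tool_content_is_string(rejected_message))  # False
print(tool_content_is_string(accepted_message))  # True
```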
Root Cause
In run_agent.py, _tool_result_content_for_active_model() (lines 9621-9659) calls _model_supports_vision() to decide whether to pass the multimodal content through or fall back to the text summary.
_model_supports_vision() (lines 9479-9497) uses agent.models_dev.get_model_capabilities(), which checks modalities.input from models.dev. For mimo-v2.5, modalities.input = ['text', 'image', 'audio', 'video'], so supports_vision is True.
The bug: _model_supports_vision() checks if the model supports images in user messages, but doesn't check if it supports images in tool messages. These are different things:
- Most OpenAI-compatible providers support images in user messages (via content as a list)
- But many providers (including MiMo) require tool message content to be a string, not a list
The OpenAI API spec says tool message content should be a string. Some providers extend this to support multimodal tool messages (Anthropic, GPT-4o), but MiMo does not.
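The gap described above can be sketched as follows. This is a simplified reconstruction for illustration, not the actual code in run_agent.py: the function names, the result dict keys (_multimodal, content, text_summary), and the way modalities are passed in are all assumptions.

```python
from typing import Any


def model_supports_vision(input_modalities: list[str]) -> bool:
    """Assumed stand-in for the capability check: image input anywhere counts."""
    return "image" in input_modalities


def tool_result_content(result: dict[str, Any], input_modalities: list[str]) -> Any:
    """Assumed stand-in for the current decision: vision implies list content."""
    if result.get("_multimodal") and model_supports_vision(input_modalities):
        # No check here for whether the provider accepts lists in tool messages.
        return result["content"]
    return result.get("text_summary", "")


# Modalities reported for mimo-v2.5 in the issue.
mimo_modalities = ["text", "image", "audio", "video"]

# Illustrative tool result; the values are placeholders.
result = {
    "_multimodal": True,
    "content": [
        {"type": "text", "text": "Captured screenshot"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
    "text_summary": "Captured screenshot",
}

content = tool_result_content(result, mimo_modalities)
print(isinstance(content, list))  # True: list content is what MiMo rejects
```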
Reproduction
- Configure Hermes with xiaomi/mimo-v2.5 as the main model
- Call computer_use(action='capture', mode='som')
- The tool returns _multimodal content with image
- _tool_result_content_for_active_model returns the content list (because supports_vision=True)
- The tool message with list content is sent to MiMo API
- MiMo API returns 400: text is not set
Suggested Fix
Add a provider/model-level flag for supports_multimodal_tool_content (or similar) that controls whether multimodal content is allowed in tool messages specifically. Providers that don't support it should always receive string content (the text_summary fallback).
Possible approaches:
1. Provider-specific flag: Add supports_multimodal_tool_content = False to the xiaomi provider profile
2. Conservative default: Only use multimodal tool content for providers known to support it (Anthropic, OpenAI), and use text summary for all others
3. Fallback with retry: Send multimodal content first; if it fails, retry with text summary (but this wastes a round-trip)
Option 2 is the safest and most backward-compatible.
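A minimal sketch of the conservative default might look like the following. The allowlist contents, the function names, the flatten_to_text helper, and the result dict shape are assumptions made for illustration; the real provider profiles and the signature of _tool_result_content_for_active_model may differ.

```python
from typing import Any, Union

# Hypothetical allowlist: providers assumed to accept list content in tool messages.
PROVIDERS_WITH_MULTIMODAL_TOOL_CONTENT = frozenset({"anthropic", "openai"})


def supports_multimodal_tool_content(provider: str) -> bool:
    """Conservative default: unknown providers get string content."""
    return provider.lower() in PROVIDERS_WITH_MULTIMODAL_TOOL_CONTENT


def flatten_to_text(parts: list[Any]) -> str:
    """Join the text parts of a multimodal list and drop everything else."""
    texts = [
        part.get("text", "")
        for part in parts
        if isinstance(part, dict) and part.get("type") == "text"
    ]
    return "\n".join(text for text in texts if text)


def tool_result_content(
    result: dict[str, Any], provider: str, model_supports_vision: bool
) -> Union[str, list[Any]]:
    """Return list content only when both the model and the provider allow it."""
    content = result.get("content")
    if not isinstance(content, list):
        return "" if content is None else str(content)
    if model_supports_vision and supports_multimodal_tool_content(provider):
        return content
    # Prefer the tool's own summary; otherwise fall back to the text parts.
    return result.get("text_summary") or flatten_to_text(content)


result = {
    "_multimodal": True,
    "content": [
        {"type": "text", "text": "Captured screenshot"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
    "text_summary": "Screenshot captured in som mode",
}

print(type(tool_result_content(result, "xiaomi", True)).__name__)     # str
print(type(tool_result_content(result, "anthropic", True)).__name__)  # list
```

With this shape, adding a provider to the allowlist is the only change needed once it is confirmed to accept multimodal tool messages, and every other provider keeps receiving a string.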
Related
- thinking parameter - defaults to thinking-enabled, wasting tokens #27325 (MiMo thinking parameter bug): that issue is about reasoning_content being stripped, which is a different but related MiMo compatibility issue.
- The _tool_result_content_for_active_model method already has the right fallback logic for non-vision models (lines 9640-9659), but it doesn't apply to vision-capable models that don't support multimodal tool messages.