[Bug] computer_use multimodal tool message causes 400 error on providers that don't support multimodal tool content (e.g. Xiaomi MiMo)

## Description

When using `computer_use` with a vision-capable model (e.g. `xiaomi/mimo-v2.5`), the tool captures a screenshot and returns a `_multimodal` dict with `content` as a list containing both `text` and `image_url` parts. This list is then set as the `content` of the `role: "tool"` message sent to the API.

However, **MiMo's API does not accept list-type content in tool messages**: it requires `content` to be a string for `role: "tool"`. The request fails with a 400 error:

```
Error code: 400 - {'error': {'code': '400', 'message': 'Param Incorrect', 'param': 'text is not set', 'type': ''}}
```

## Root Cause

In `run_agent.py`, `_tool_result_content_for_active_model()` (lines 9621-9659) calls `_model_supports_vision()` to decide whether to pass the multimodal content through or fall back to the text summary.

`_model_supports_vision()` (lines 9479-9497) uses `agent.models_dev.get_model_capabilities()`, which reads `modalities.input` from models.dev. For `mimo-v2.5`, `modalities.input = ['text', 'image', 'audio', 'video']`, so `supports_vision` is `True`.

**The bug**: `_model_supports_vision()` checks whether the model accepts images in **user messages**, but not whether it accepts images in **tool messages**. These are different things:

- Most OpenAI-compatible providers support images in user messages (via `content` as a list)
- But many providers (including MiMo) require tool message `content` to be a **string**, not a list

The OpenAI API spec describes tool message content as text. Some providers and models go beyond this and accept multimodal tool messages (Anthropic, GPT-4o), but MiMo does not.
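To make the distinction concrete, here is a minimal sketch of the two tool message shapes and a check that flags the problematic one. The helper name `find_list_tool_messages`, the `tool_call_id` value, and the placeholder image URL are hypothetical and not taken from `run_agent.py`.

```python
from typing import Any

# A tool result as a plain string: the form a string-only provider accepts.
string_tool_message: dict[str, Any] = {
    "role": "tool",
    "tool_call_id": "call_1",  # hypothetical id
    "content": "Captured screenshot.",
}

# A tool result as a list of parts: the form that is rejected with a 400.
list_tool_message: dict[str, Any] = {
    "role": "tool",
    "tool_call_id": "call_1",  # hypothetical id
    "content": [
        {"type": "text", "text": "Captured screenshot."},
        # Placeholder data URL, not a real image.
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}


def find_list_tool_messages(messages: list[dict[str, Any]]) -> list[int]:
    """Return the indexes of tool messages whose content is a list."""
    return [
        index
        for index, message in enumerate(messages)
        if message.get("role") == "tool"
        and isinstance(message.get("content"), list)
    ]
```

A user message with list content is not flagged by this check, which mirrors the point above: list content is fine in user messages and only a problem in tool messages.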

## Reproduction

1. Configure Hermes with `xiaomi/mimo-v2.5` as the main model
2. Call `computer_use(action='capture', mode='som')`
3. The tool returns `_multimodal` content with an image
4. `_tool_result_content_for_active_model` returns the content list (because `supports_vision=True`)
5. The tool message with list content is sent to MiMo API
6. MiMo API returns 400: `text is not set`

## Suggested Fix

Add a provider/model-level flag for `supports_multimodal_tool_content` (or similar) that controls whether multimodal content is allowed in tool messages specifically. Providers that don't support it should always receive string content (the `text_summary` fallback).

Possible approaches:
1. **Provider-specific flag**: Add `supports_multimodal_tool_content = False` to the xiaomi provider profile
2. **Conservative default**: Only use multimodal tool content for providers known to support it (Anthropic, OpenAI), and use text summary for all others
3. **Fallback with retry**: Send multimodal content first; if it fails, retry with text summary (but this wastes a round-trip)

Option 2 is the safest and most backward-compatible.
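A rough sketch of what option 2 could look like, assuming the `_multimodal` dict carries a `content` list and a `text_summary` string. The function name, the allowlist constant, and the provider identifiers are hypothetical; the real change would live in `_tool_result_content_for_active_model()`.

```python
from typing import Any, Union

# Hypothetical allowlist: providers assumed to accept image parts in tool messages.
MULTIMODAL_TOOL_CONTENT_PROVIDERS = frozenset({"anthropic", "openai"})


def tool_content_for_provider(
    provider: str,
    supports_vision: bool,
    multimodal_result: dict[str, Any],
) -> Union[str, list[dict[str, Any]]]:
    """Pick the tool message content for a provider (conservative default)."""
    content = multimodal_result.get("content")
    summary = multimodal_result.get("text_summary")

    # Pass the list through only for vision models on allowlisted providers.
    if (
        supports_vision
        and provider.lower() in MULTIMODAL_TOOL_CONTENT_PROVIDERS
        and isinstance(content, list)
    ):
        return content

    # Everyone else gets a string, preferring the text summary.
    if isinstance(summary, str) and summary:
        return summary

    # Last resort: join whatever text parts the list contains.
    if isinstance(content, list):
        return "\n".join(
            part.get("text", "")
            for part in content
            if isinstance(part, dict) and part.get("type") == "text"
        )
    return "" if content is None else str(content)
```

With this gate, `tool_content_for_provider("xiaomi", True, result)` returns the summary string even though the model reports vision support, while an allowlisted provider still receives the list.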

## Related

- A related MiMo compatibility issue was previously reported as GitHub issue #27325 (MiMo thinking parameter bug). That issue is about `reasoning_content` being stripped, so it is a different problem from this one.
- The `_tool_result_content_for_active_model` method already has the right fallback logic for non-vision models (lines 9640-9659), but it doesn't apply to vision-capable models that don't support multimodal tool messages.

[Bug] computer_use multimodal tool message causes 400 error on providers that don't support multimodal tool content (e.g. Xiaomi MiMo) #27344
