Bug Description
Non-vision models (including models on custom providers that are not in the models.dev registry) receive multipart tool results [{type:text,...},{type:image_url,...}] from vision-capable tools (computer_use, vision_analyze). The API rejects these with HTTP 400:
{"code": "400", "message": "Param Incorrect", "param": "`text` is not set"}
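For reference, the rejected request carries a tool message roughly like the one below. This is a minimal sketch assuming the OpenAI chat-completions message shape. The tool_call_id, the data URL and the helper parts_missing_text are illustrative. The idea that the server expects a text field on every part is inferred from the error message, not from documented MiMo behavior.

```python
# Illustrative tool message in OpenAI chat-completions shape (placeholder values).
tool_message = {
    "role": "tool",
    "tool_call_id": "call_123",  # hypothetical ID
    "content": [
        {"type": "text", "text": "Screenshot captured."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ],
}


def parts_missing_text(content):
    """Return indices of content parts that have no string "text" field.

    Assumption: a text-only server validates every part as a text part, so any
    image_url part trips a "`text` is not set" style error.
    """
    if isinstance(content, str):
        return []
    return [i for i, part in enumerate(content) if not isinstance(part.get("text"), str)]


print(parts_missing_text(tool_message["content"]))  # [1]
```

Only the image_url part lacks a text field, which lines up with the "param" value in the 400 response.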
Root Cause
In run_agent.py, both the concurrent (~line 11102) and sequential (~line 11521) tool-result paths unconditionally pass through multipart content without checking _model_supports_vision():
```python
# Current code: no vision check
_tool_content = (
    function_result["content"]
    if _is_multimodal_tool_result(function_result)
    else function_result
)
```
The code comment says "Text-only servers that reject images are handled by the adaptive _vision_supported recovery in the API retry loop", but that recovery does not recognize the 'text' is not set error format that Xiaomi MiMo and similar OpenAI-compatible APIs return.
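One way the retry loop could cover this error format is sketched below. It is a minimal, hypothetical example: is_text_not_set_error, downgrade_multimodal_tool_messages and send_with_text_fallback are illustrative names, not existing Hermes Agent functions, and send stands in for whatever call returns an HTTP status and a response body.

```python
import json


def is_text_not_set_error(status, body):
    """Return True for a 400 whose JSON body reports "`text` is not set" (MiMo-style)."""
    if status != 400:
        return False
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(payload, dict):
        return False
    param = str(payload.get("param", ""))
    return "text" in param and "not set" in param


def downgrade_multimodal_tool_messages(messages):
    """Return a copy of messages with list-valued tool content flattened to text."""
    downgraded = []
    for msg in messages:
        content = msg.get("content")
        if msg.get("role") == "tool" and isinstance(content, list):
            lines = [p.get("text", "") for p in content if p.get("type") == "text"]
            images = sum(1 for p in content if p.get("type") == "image_url")
            if images:
                lines.append(f"[{images} image(s) omitted: model does not accept images]")
            msg = {**msg, "content": "\n".join(lines)}
        downgraded.append(msg)
    return downgraded


def send_with_text_fallback(send, messages):
    """Call send(messages) -> (status, body); retry once with text-only tool results."""
    status, body = send(messages)
    if is_text_not_set_error(status, body):
        status, body = send(downgrade_multimodal_tool_messages(messages))
    return status, body
```

A single retry keeps the fallback bounded. A real implementation would presumably also remember the downgrade for the rest of the session instead of paying for one failed request per turn.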
Affected scenarios
- Custom providers (e.g. custom:xiaomi-tp): get_model_capabilities() returns None because models.dev doesn't know the custom provider → _model_supports_vision() always returns False
- Known non-vision models (mimo-v2-flash, mimo-v2-pro, deepseek-v4-flash): correctly identified as non-vision, but still receive multipart content
Expected Behavior
Non-vision models should receive plain-text tool results. The vision routing should convert multipart content to a text summary before sending it to the API:
```python
if _is_multimodal_tool_result(function_result):
    if self._model_supports_vision():
        _tool_content = function_result["content"]  # multipart for vision models
    else:
        _tool_content = _multimodal_text_summary(function_result)  # text-only fallback
else:
    _tool_content = function_result
```
This check needs to be applied in both:
- _execute_tool_calls_concurrent (~line 11102)
- _execute_tool_calls_sequential (~line 11521)
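Moving the expected behavior into one helper would let both call sites share a single decision. The sketch below is hypothetical. is_multimodal_tool_result, multimodal_text_summary and select_tool_content stand in for the real _is_multimodal_tool_result, _multimodal_text_summary and the inline logic. It assumes that a multimodal result is a dict whose content is a list of OpenAI-style parts with at least one image_url part, and that any other result passes through unchanged.

```python
def is_multimodal_tool_result(result):
    """Assumed shape: {"content": [{"type": "text", ...}, {"type": "image_url", ...}]}."""
    return (
        isinstance(result, dict)
        and isinstance(result.get("content"), list)
        and any(isinstance(p, dict) and p.get("type") == "image_url" for p in result["content"])
    )


def multimodal_text_summary(result):
    """Keep the text parts and replace each image with a short placeholder line."""
    lines = []
    for part in result["content"]:
        if part.get("type") == "text":
            lines.append(part.get("text", ""))
        elif part.get("type") == "image_url":
            lines.append("[image omitted: current model does not support vision]")
    return "\n".join(lines)


def select_tool_content(result, supports_vision):
    """Single decision point intended for both the concurrent and sequential paths."""
    if is_multimodal_tool_result(result):
        return result["content"] if supports_vision else multimodal_text_summary(result)
    return result
```

With a helper like this, each path would reduce to something like _tool_content = select_tool_content(function_result, self._model_supports_vision()). The two sites could then no longer drift apart.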
Broader Suggestion
The _model_supports_vision() check relies on the models.dev registry, which doesn't cover custom providers. Consider one of the following:
- A config-level supports_vision flag per provider/model entry
- Runtime negotiation (send multipart and fall back to text on a 400, but recognize this specific error format)
- Auto-detection by sending a probe image on the first call
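The first option could layer an explicit flag over the registry lookup. This is a minimal sketch that assumes a hypothetical config mapping keyed by provider and then model. It also assumes a registry_lookup callable that, like get_model_capabilities(), returns None for models it does not know. The default used for unknown models is a policy choice, not something the registry decides.

```python
from typing import Callable, Mapping, Optional


def resolve_supports_vision(
    provider: str,
    model: str,
    config: Mapping[str, Mapping[str, Mapping[str, object]]],
    registry_lookup: Callable[[str, str], Optional[Mapping[str, object]]],
    default: bool = False,
) -> bool:
    """Explicit config flag wins, then the registry, then a conservative default."""
    entry = config.get(provider, {}).get(model, {})
    if "supports_vision" in entry:
        return bool(entry["supports_vision"])
    caps = registry_lookup(provider, model)
    if caps is not None and "supports_vision" in caps:
        return bool(caps["supports_vision"])
    return default
```

Keeping False as the default for unknown models preserves today's behavior. With the routing fix in place, custom providers would receive text summaries until a flag says otherwise.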
Environment
- macOS 14.8.7, Hermes Agent (current main)
- Providers affected: xiaomi (mimo-v2-flash), custom:xiaomi-tp (mimo-v2-omni)
- Provider NOT affected: xiaomi-tp (mimo-v2.5-pro; supports_vision=False, but no vision tools used)
Workaround
User-level patch: add the _model_supports_vision() check before both content-unpacking sites (already applied locally). This requires a session restart (.pyc cleanup + /resume).
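The .pyc cleanup step can be scripted. The sketch below assumes you call clear_bytecode (a hypothetical helper) on the Hermes Agent checkout yourself. It deletes only __pycache__ directories and stray .pyc files, so Python regenerates bytecode from the patched sources on the next start.

```python
import shutil
from pathlib import Path


def clear_bytecode(root):
    """Delete __pycache__ dirs and loose .pyc files under root; return how many were removed."""
    removed = 0
    root = Path(root)
    # Collect first so deleting directories does not disturb the ongoing walk.
    for cache_dir in list(root.rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed += 1
    for pyc in list(root.rglob("*.pyc")):
        pyc.unlink()
        removed += 1
    return removed
```

After running it, restart the session and use /resume as described above.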