Problem: The computer_use tool captures screenshots correctly but cannot describe their visual content when the main model (e.g., MiniMax/M2) lacks vision capability. Screenshots are returned as base64 images in tool results, but _tool_result_content_for_active_model() in run_agent.py:3327 checks _model_supports_vision() on the main model only — it does not route to auxiliary.vision.
Current flow:
computer_use captures screenshot → returns as _multimodal tool result →
fed back to main model (MiniMax/M2) → _model_supports_vision() returns False →
error: \"computer_use returned screenshot/image content, but the active model/provider does not support image input\"
auxiliary.vision only applies to the vision_analyze tool, not computer_use. The computer_use tool results are always processed by the main model, regardless of auxiliary.vision config.
Reproduction:
- Set
model.default = MiniMax/M2, model.provider = minimax
- Configure
auxiliary.vision.provider = openrouter, auxiliary.vision.model = nvidia/nemotron-nano-12b-v2-vl:free
- Use computer_use with action=capture — screenshot captured successfully
- Error returned: main model does not support image input
Proposed fix:
Patch _tool_result_content_for_active_model() (or add a routing check in run_agent.py) so that when:
- Tool name is
computer_use
- Result has image content (
_content_has_image_parts() returns True)
- Main model does NOT support vision (
_model_supports_vision() returns False)
- auxiliary.vision is configured
Then route the screenshot base64 through resolve_vision_provider_client() instead of returning an error.
Alternative workaround for users: Use browser_vision instead, which correctly routes through auxiliary.vision. Or manually use computer_use capture + send base64 to OpenRouter VL model separately.
Affected area: run_agent.py — tool result handling for multimodal results from non-vision main models.
Problem: The
computer_usetool captures screenshots correctly but cannot describe their visual content when the main model (e.g., MiniMax/M2) lacks vision capability. Screenshots are returned as base64 images in tool results, but_tool_result_content_for_active_model()inrun_agent.py:3327checks_model_supports_vision()on the main model only — it does not route toauxiliary.vision.Current flow:
auxiliary.vision only applies to the
vision_analyzetool, notcomputer_use. Thecomputer_usetool results are always processed by the main model, regardless of auxiliary.vision config.Reproduction:
model.default = MiniMax/M2,model.provider = minimaxauxiliary.vision.provider = openrouter,auxiliary.vision.model = nvidia/nemotron-nano-12b-v2-vl:freeProposed fix:
Patch
_tool_result_content_for_active_model()(or add a routing check inrun_agent.py) so that when:computer_use_content_has_image_parts()returns True)_model_supports_vision()returns False)Then route the screenshot base64 through
resolve_vision_provider_client()instead of returning an error.Alternative workaround for users: Use
browser_visioninstead, which correctly routes throughauxiliary.vision. Or manually use computer_use capture + send base64 to OpenRouter VL model separately.Affected area:
run_agent.py— tool result handling for multimodal results from non-vision main models.