Skip to content

[Feature] computer_use: route screenshots through auxiliary.vision when main model lacks vision #29407

@ErnestHysa

Description

@ErnestHysa

Problem: The computer_use tool captures screenshots correctly but cannot describe their visual content when the main model (e.g., MiniMax/M2) lacks vision capability. Screenshots are returned as base64 images in tool results, but _tool_result_content_for_active_model() in run_agent.py:3327 checks _model_supports_vision() on the main model only — it does not route to auxiliary.vision.

Current flow:

computer_use captures screenshot → returns as _multimodal tool result →
fed back to main model (MiniMax/M2) → _model_supports_vision() returns False →
error: \"computer_use returned screenshot/image content, but the active model/provider does not support image input\"

auxiliary.vision only applies to the vision_analyze tool, not computer_use. The computer_use tool results are always processed by the main model, regardless of auxiliary.vision config.

Reproduction:

  1. Set model.default = MiniMax/M2, model.provider = minimax
  2. Configure auxiliary.vision.provider = openrouter, auxiliary.vision.model = nvidia/nemotron-nano-12b-v2-vl:free
  3. Use computer_use with action=capture — screenshot captured successfully
  4. Error returned: main model does not support image input

Proposed fix:
Patch _tool_result_content_for_active_model() (or add a routing check in run_agent.py) so that when:

  • Tool name is computer_use
  • Result has image content (_content_has_image_parts() returns True)
  • Main model does NOT support vision (_model_supports_vision() returns False)
  • auxiliary.vision is configured

Then route the screenshot base64 through resolve_vision_provider_client() instead of returning an error.

Alternative workaround for users: Use browser_vision instead, which correctly routes through auxiliary.vision. Or manually use computer_use capture + send base64 to OpenRouter VL model separately.

Affected area: run_agent.py — tool result handling for multimodal results from non-vision main models.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existscomp/agentCore agent loop, run_agent.py, prompt builderduplicateThis issue or pull request already existstool/visionVision analysis and image generationtype/featureNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions