computer_use (cua-driver backend) is too fragile and breaks auxiliary vision routing #32766

@NivOO5

Summary

The computer_use tool (cua-driver backend) makes overly strong assumptions about the responses from the underlying driver. When list_windows returns no windows (e.g. due to is_on_screen filtering), or returns inconsistent data, the entire tool fails hard with no usable output.

This particularly breaks the ability to use auxiliary vision models with computer_use.

Reproduction

  1. Use a text-only main model (e.g. DeepSeek, GLM, local models, etc.).
  2. Configure an auxiliary vision model (auxiliary.vision).
  3. Enable the computer_use toolset.
  4. Attempt to use computer_use with mode=vision, or allow the agent to use desktop control.

Current Behavior

  • capture(...) always calls list_windows with {"on_screen_only": true} (see tools/computer_use/cua_backend.py:367).
  • If the driver returns zero windows (common when is_on_screen is false for every window), capture(...) immediately returns an empty 0x0 result with no image data.
  • There is no fallback path, no retry without the on_screen_only filter, and no client-side best-effort logic.
  • list_apps has similar parsing fragility and can return malformed results.
  • When the MCP connection to cua-driver has issues, the backend can be left in a broken state.

Impact on Auxiliary Vision

This is especially damaging for users running text-only models who rely on auxiliary.vision.

  • mode=vision captures are supposed to return raw screenshot data so the auxiliary vision model can analyze the screen.
  • Because the capture fails before any image is produced, the auxiliary vision model never receives any data.
  • As a result, it is currently not possible to use auxiliary vision models effectively with computer_use.

Expected Behavior

The integration should be resilient:

  • Fall back gracefully when on_screen_only returns no results.
  • Still produce usable (if lower quality) output when the driver behaves sub-optimally.
  • Support auxiliary vision workflows even when the driver’s on-screen detection is imperfect.
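
A minimal sketch of the fallback behavior described above, not the project's actual code. The `call_tool` callable, the `bounds`/`width`/`height` keys, and the returned dict shape are all assumptions made for illustration; only the `list_windows` call and its `on_screen_only` argument come from this report.

```python
from typing import Any, Callable, Dict, List

# Hypothetical shape: call_tool(name, args) -> list of window dicts.
CallTool = Callable[[str, Dict[str, Any]], List[Dict[str, Any]]]


def _has_area(window: Dict[str, Any]) -> bool:
    """Client-side filter: keep windows with a positive width and height."""
    bounds = window.get("bounds") or {}  # "bounds" is an assumed key
    try:
        return float(bounds.get("width", 0)) > 0 and float(bounds.get("height", 0)) > 0
    except (AttributeError, TypeError, ValueError):
        return False  # Malformed geometry: treat the window as unusable.


def list_windows_with_fallback(call_tool: CallTool) -> Dict[str, Any]:
    """Ask for on-screen windows first; retry unfiltered if none come back."""
    windows = call_tool("list_windows", {"on_screen_only": True}) or []
    if windows:
        return {"windows": windows, "degraded": False, "note": ""}

    # Fallback: drop the driver-side filter and filter locally instead.
    everything = call_tool("list_windows", {"on_screen_only": False}) or []
    usable = [w for w in everything if isinstance(w, dict) and _has_area(w)]
    note = (
        "driver reported no on-screen windows; "
        f"using {len(usable)} of {len(everything)} unfiltered windows"
    )
    return {"windows": usable, "degraded": True, "note": note}
```

The `degraded` flag and `note` string are one way to tell the agent that the driver, rather than the screen, is the limiting factor, so a capture can still proceed with lower-quality window data.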

Relevant Code

  • tools/computer_use/cua_backend.py:
    • capture() (~366–393): Hard dependency on on_screen_only: true
    • list_apps() (~627–642): Fragile structured/text fallback parsing
    • MCP session handling in _CuaDriverSession
  • tools/computer_use/tool.py
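
As an illustration of what more tolerant structured/text parsing for list_apps() might look like, here is a hedged sketch. The payload shapes it accepts (a dict with an `apps` key, a bare list, JSON text, or one app name per line) and the `name` field are assumptions; the driver's real response format is not documented in this report.

```python
import json
from typing import Any, Dict, List


def parse_apps(payload: Any) -> List[Dict[str, Any]]:
    """Normalize a list_apps-style payload into a list of dicts with a "name".

    Malformed entries are skipped rather than raising.
    """
    # Text payloads: try JSON first, then fall back to one app name per line.
    if isinstance(payload, str):
        try:
            payload = json.loads(payload)
        except json.JSONDecodeError:
            return [{"name": line.strip()} for line in payload.splitlines() if line.strip()]

    # Structured payloads: accept {"apps": [...]} or a bare list.
    if isinstance(payload, dict):
        payload = payload.get("apps", [])
    if not isinstance(payload, list):
        return []

    apps: List[Dict[str, Any]] = []
    for entry in payload:
        if isinstance(entry, str) and entry.strip():
            apps.append({"name": entry.strip()})
        elif isinstance(entry, dict):
            name = entry.get("name")
            if isinstance(name, str) and name.strip():
                apps.append(entry)
    return apps
```

Returning an empty list for unrecognized input keeps the caller's contract simple: it always receives a list and can decide on its own how to report "no apps found".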

Impact

  • On affected systems, computer_use becomes largely unusable.
  • Setups that pair a text-only model with auxiliary.vision lose desktop control capabilities entirely.
  • The feature is unreliable for anyone who depends on real computer use, not just users hitting edge cases in the driver.

Suggested Improvements

  1. Add a fallback in capture(): if on_screen_only: true returns nothing, retry without the filter and do client-side filtering.
  2. Make list_apps() more robust when parsing driver responses.
  3. Add basic health checks and recovery for the cua-driver MCP connection.
  4. Consider a "best effort" capture mode that is more tolerant of imperfect driver output.
  5. Improve error messages so users (and agents) understand when the driver is the limiting factor.
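
Items 3 and 5 could be approached with a thin wrapper that reconnects once and raises a descriptive error. This is a sketch under stated assumptions: `ResilientSession`, `DriverUnavailable`, and the `factory` object with `call(name, args)` and `close()` are hypothetical stand-ins, not the API of _CuaDriverSession or of any MCP client library.

```python
from typing import Any, Callable, Dict, Optional


class DriverUnavailable(RuntimeError):
    """Raised when the driver cannot be reached even after a reconnect."""


class ResilientSession:
    """Wraps a session factory and reconnects once when a call fails.

    `factory` is a hypothetical callable returning an object with
    `call(name, args)` and `close()`; it stands in for the real session.
    """

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._session: Any = None

    def _reset(self) -> None:
        """Drop the current session so the next call starts from a clean state."""
        session, self._session = self._session, None
        if session is not None:
            try:
                session.close()
            except Exception:
                pass  # The session is already broken; closing is best effort.

    def call(self, name: str, args: Dict[str, Any]) -> Any:
        last_error: Optional[Exception] = None
        for _attempt in range(2):  # One normal attempt plus one after reconnect.
            try:
                if self._session is None:
                    self._session = self._factory()
                return self._session.call(name, args)
            except OSError as exc:  # Covers ConnectionError and TimeoutError.
                last_error = exc
                self._reset()
        raise DriverUnavailable(
            f"cua-driver call {name!r} failed after reconnect: {last_error}"
        ) from last_error
```

Discarding the session on every failure avoids leaving the backend in a half-connected state, and the final error names the driver call that failed so the agent can report the driver as the limiting factor.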

Additional Context

This was discovered while debugging real-world failures combining text-only models, auxiliary vision routing, and the cua-driver backend. The current design assumes the driver will reliably report on-screen windows, which does not always hold.

Labels: P2 (Medium: degraded but workaround exists); comp/tools (Tool registry, model_tools, toolsets); tool/vision (Vision analysis and image generation); type/bug (Something isn't working)
