computer_use (cua-driver backend) is too fragile and breaks auxiliary vision routing
Summary
The computer_use tool (cua-driver backend) makes overly strong assumptions about the responses from the underlying driver. When list_windows returns no windows (e.g. due to is_on_screen filtering), or returns inconsistent data, the entire tool fails hard with no usable output.
This particularly breaks the ability to use auxiliary vision models with computer_use.
Reproduction
- Use a text-only main model (e.g. DeepSeek, GLM, local models, etc.).
- Configure an auxiliary vision model (auxiliary.vision).
- Enable the computer_use toolset.
- Attempt to use computer_use with mode=vision, or allow the agent to use desktop control.
Current Behavior
- capture(...) always calls list_windows with {"on_screen_only": true} (see tools/computer_use/cua_backend.py:367).
- If the driver returns zero windows (common when is_on_screen is false for everything), it immediately returns an empty 0x0 result with no image data.
- There is no fallback path, no retry without the on_screen_only filter, and no client-side best-effort logic.
- list_apps has similar parsing fragility and can return malformed results.
- When the MCP connection to cua-driver has issues, the backend can be left in a broken state.
Impact on Auxiliary Vision
This is especially damaging for users running text-only models who rely on auxiliary.vision.
- mode=vision captures are supposed to return raw screenshot data so the auxiliary vision model can analyze the screen.
- Because the capture fails before any image is produced, the auxiliary vision model never receives any data.
- As a result, it is currently not possible to use auxiliary vision models effectively with computer_use.
Expected Behavior
The integration should be resilient:
- Fall back gracefully when on_screen_only returns no results.
- Still produce usable (if lower quality) output when the driver behaves sub-optimally.
- Support auxiliary vision workflows even when the driver’s on-screen detection is imperfect.
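A minimal sketch of what the fallback could look like, not the actual cua_backend.py code. It assumes a hypothetical list_windows callable that takes the argument dict and returns a list of window dicts. The field names bounds, width, height and minimized are also assumptions for illustration, since the driver's real response shape is not shown in this issue.

```python
from typing import Any, Callable, Dict, List, Tuple

Window = Dict[str, Any]
# Hypothetical stand-in for the MCP call to the driver's list_windows tool.
ListWindows = Callable[[Dict[str, Any]], List[Window]]


def looks_visible(window: Window) -> bool:
    """Client-side heuristic (assumed field names): non-zero size, not minimized."""
    bounds = window.get("bounds") or {}
    width = bounds.get("width", 0) or 0
    height = bounds.get("height", 0) or 0
    return width > 0 and height > 0 and not window.get("minimized", False)


def list_windows_with_fallback(list_windows: ListWindows) -> Tuple[List[Window], bool]:
    """Return (windows, degraded); degraded is True when the fallback path ran."""
    windows = list_windows({"on_screen_only": True})
    if windows:
        return windows, False
    # The driver reported nothing on screen: retry unfiltered, then filter locally.
    candidates = list_windows({"on_screen_only": False})
    visible = [w for w in candidates if looks_visible(w)]
    # If the heuristic rejects everything, prefer all candidates over an empty result.
    return (visible or candidates), True
```

The degraded flag lets the caller note in the tool output that the driver's on-screen detection was bypassed, so the agent knows the capture is best-effort.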
Relevant Code
- tools/computer_use/cua_backend.py:
  - capture() (~366–393): Hard dependency on on_screen_only: true
  - list_apps() (~627–642): Fragile structured/text fallback parsing
  - MCP session handling in _CuaDriverSession
- tools/computer_use/tool.py
Impact
- On affected systems, computer_use becomes largely unusable.
- Text-only models + auxiliary.vision lose desktop control capabilities entirely.
- The feature is unreliable for anyone who depends on real computer use, not just users hitting edge cases in the driver.
Suggested Improvements
- Add a fallback in capture(): if on_screen_only: true returns nothing, retry without the filter and do client-side filtering.
- Make list_apps() more robust when parsing driver responses.
- Add basic health checks and recovery for the cua-driver MCP connection.
- Consider a "best effort" capture mode that is more tolerant of imperfect driver output.
- Improve error messages so users (and agents) understand when the driver is the limiting factor.
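As an illustration of the more robust list_apps() parsing suggested above, here is a rough sketch that tries structured data first, then JSON inside text blocks, then plain lines, and never raises on malformed input. It assumes the tool result has already been converted to a plain dict with structuredContent and content fields. The apps and name keys are hypothetical, and the existing parser may be organized quite differently.

```python
import json
from typing import Any, Dict, List


def parse_apps(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Best-effort extraction of app entries from a tool result dict."""
    # 1) Prefer structured data when it has the expected shape.
    structured = result.get("structuredContent")
    if isinstance(structured, dict):
        apps = structured.get("apps")  # hypothetical key
        if isinstance(apps, list):
            return [a for a in apps if isinstance(a, dict)]

    # 2) Otherwise walk the text blocks and salvage what we can.
    parsed: List[Dict[str, Any]] = []
    for block in result.get("content") or []:
        if not isinstance(block, dict) or block.get("type") != "text":
            continue
        text = block.get("text")
        if not isinstance(text, str):
            continue
        try:
            data = json.loads(text)
        except ValueError:
            # 3) Not JSON: treat each non-empty line as an app name.
            parsed.extend(
                {"name": line.strip()} for line in text.splitlines() if line.strip()
            )
            continue
        if isinstance(data, dict):
            data = data.get("apps", [])
        if isinstance(data, list):
            parsed.extend(a for a in data if isinstance(a, dict))
    return parsed
```

Returning an empty list instead of raising keeps the caller in control: it can report that the driver output was unusable rather than crashing mid-call.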
Additional Context
This was discovered while debugging real-world failures combining text-only models, auxiliary vision routing, and the cua-driver backend. The current design assumes the driver will reliably report on-screen windows, which does not always hold.