# computer_use (cua-driver backend) is too fragile and breaks auxiliary vision routing

## Summary

The `computer_use` tool (cua-driver backend) assumes that the underlying driver always returns well-formed, non-empty responses. When `list_windows` returns no windows (e.g. due to `is_on_screen` filtering) or returns inconsistent data, the entire tool fails hard with no usable output.

This particularly breaks the ability to use **auxiliary vision models** with `computer_use`.

## Reproduction

1. Use a text-only main model (e.g. DeepSeek, GLM, or a local model).
2. Configure an auxiliary vision model (`auxiliary.vision`).
3. Enable the `computer_use` toolset.
4. Attempt to use `computer_use` with `mode=vision`, or allow the agent to use desktop control.

## Current Behavior

- `capture(...)` always calls `list_windows` with `{"on_screen_only": true}` (see `tools/computer_use/cua_backend.py:367`).
- If the driver returns zero windows (common when `is_on_screen` is false for every window), `capture()` immediately returns an empty `0x0` result with no image data.
- There is no fallback path, no retry without the `on_screen_only` filter, and no client-side best-effort logic.
- `list_apps` has similar parsing fragility and can return malformed results.
- When the MCP connection to cua-driver has issues, the backend can be left in a broken state.
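One possible shape for recovering from the broken-session case in the last bullet is sketched below. It is not the actual `_CuaDriverSession` code: `RecoveringSession`, the `connect` factory, and the synchronous `call_tool` method are hypothetical names chosen for illustration, and the real session is likely async.

```python
from typing import Any, Callable, Dict, Optional


class RecoveringSession:
    """Hypothetical wrapper: rebuilds the driver session once after a connection failure."""

    def __init__(self, connect: Callable[[], Any]) -> None:
        # `connect` is an assumed factory returning an object with call_tool(tool, args).
        self._connect = connect
        self._session: Optional[Any] = None

    def call(self, tool: str, args: Dict[str, Any]) -> Any:
        for attempt in range(2):
            if self._session is None:
                self._session = self._connect()
            try:
                return self._session.call_tool(tool, args)
            except (OSError, EOFError):
                # Drop the broken session so the next attempt reconnects
                # instead of reusing a dead connection.
                self._session = None
                if attempt == 1:
                    raise
```

The wrapper retries only once, so a driver that is genuinely down still surfaces an error rather than looping.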

### Impact on Auxiliary Vision

This is especially damaging for users running text-only models who rely on `auxiliary.vision`.

- `mode=vision` captures are supposed to return raw screenshot data so the auxiliary vision model can analyze the screen.
- Because the capture fails before any image is produced, the auxiliary vision model never receives any data.
- As a result, **it is currently not possible to use auxiliary vision models effectively with `computer_use`**.
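To make this failure visible instead of silent, the capture result could be validated before it is routed to the auxiliary vision model. The sketch below assumes a result object with `width`, `height`, and a base64 image field; `Capture`, `image_b64`, `CaptureUnavailable`, and `require_image` are hypothetical names, not taken from `cua_backend.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Capture:
    """Hypothetical stand-in for whatever capture() returns."""
    width: int
    height: int
    image_b64: Optional[str] = None


class CaptureUnavailable(RuntimeError):
    """Raised when a capture carries no image the vision model could analyze."""


def require_image(capture: Capture) -> str:
    # An empty 0x0 result is reported as an error rather than passed along.
    if capture.width <= 0 or capture.height <= 0 or not capture.image_b64:
        raise CaptureUnavailable(
            f"cua-driver returned no image ({capture.width}x{capture.height}); "
            "the driver may have reported zero on-screen windows"
        )
    return capture.image_b64
```

An explicit error of this kind also gives the agent something actionable to report, which an empty result does not.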

## Expected Behavior

The integration should be resilient:

- Fall back gracefully when `on_screen_only` returns no results.
- Still produce usable (if lower quality) output when the driver behaves sub-optimally.
- Support auxiliary vision workflows even when the driver’s on-screen detection is imperfect.
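As a rough illustration of the first point, the fallback could look like the following. This is a sketch under assumptions, not the backend's real interface: `list_windows` is modeled as a plain callable taking an argument dict, and the `bounds`, `width`, and `height` keys used for client-side filtering are guesses at the window payload.

```python
from typing import Any, Callable, Dict, List, Tuple

Window = Dict[str, Any]
# Assumed shape: a callable that forwards arguments to the driver's list_windows.
ListWindows = Callable[[Dict[str, Any]], List[Window]]


def _has_area(window: Window) -> bool:
    # Hypothetical payload layout: {"bounds": {"width": ..., "height": ...}}.
    bounds = window.get("bounds")
    if not isinstance(bounds, dict):
        return False
    return (bounds.get("width") or 0) > 0 and (bounds.get("height") or 0) > 0


def list_windows_with_fallback(list_windows: ListWindows) -> Tuple[List[Window], bool]:
    """Return (windows, degraded); degraded is True when the filter was dropped."""
    windows = list_windows({"on_screen_only": True})
    if windows:
        return windows, False

    # The driver reported nothing on screen: retry without the filter.
    all_windows = list_windows({})
    # Client-side best effort: prefer windows with a non-empty area,
    # but return everything rather than nothing if none qualify.
    visible = [w for w in all_windows if _has_area(w)]
    return (visible or all_windows), True
```

The `degraded` flag lets the caller note in its output that the result came from the unfiltered path and may include off-screen windows.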

## Relevant Code

- `tools/computer_use/cua_backend.py`:
  - `capture()` (~366–393): Hard dependency on `on_screen_only: true`
  - `list_apps()` (~627–642): Fragile structured/text fallback parsing
  - MCP session handling in `_CuaDriverSession`
- `tools/computer_use/tool.py`

## Impact

- On affected systems, `computer_use` becomes largely unusable.
- Users pairing a text-only model with `auxiliary.vision` lose desktop control entirely.
- The feature is unreliable for anyone who depends on real computer use, not just users hitting edge cases in the driver.

## Suggested Improvements

1. Add a fallback in `capture()`: if `on_screen_only: true` returns nothing, retry without the filter and do client-side filtering.
2. Make `list_apps()` more robust when parsing driver responses.
3. Add basic health checks and recovery for the cua-driver MCP connection.
4. Consider a "best effort" capture mode that is more tolerant of imperfect driver output.
5. Improve error messages so users (and agents) understand when the driver is the limiting factor.
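For item 2, a more tolerant parser might accept either a structured payload or text and skip entries it cannot interpret. The sketch below is illustrative only: the `apps` and `name` keys, the JSON-in-text case, and the one-name-per-line text format are all assumptions about what the driver returns, and `parse_apps` is a hypothetical helper.

```python
import json
from typing import Any, Dict, List, Optional


def parse_apps(structured: Any, text: Optional[str]) -> List[Dict[str, Any]]:
    """Best-effort extraction of app entries; malformed input yields fewer results, not an exception."""
    candidates: Any = structured
    if candidates is None and text:
        try:
            candidates = json.loads(text)
        except ValueError:
            # Assumed plain-text fallback: one app name per non-empty line.
            candidates = [line.strip() for line in text.splitlines() if line.strip()]

    # Assumed wrapper shape: {"apps": [...]}.
    if isinstance(candidates, dict):
        candidates = candidates.get("apps", [])
    if not isinstance(candidates, list):
        return []

    apps: List[Dict[str, Any]] = []
    for entry in candidates:
        if isinstance(entry, str) and entry.strip():
            apps.append({"name": entry.strip()})
        elif isinstance(entry, dict):
            name = entry.get("name")
            if isinstance(name, str) and name.strip():
                apps.append(entry)
        # Anything else is skipped rather than propagated as a malformed result.
    return apps
```

Skipping unusable entries keeps one bad record from invalidating the whole response, which is the failure mode described under Current Behavior.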

## Additional Context

This was discovered while debugging real-world failures combining text-only models, auxiliary vision routing, and the cua-driver backend. The current design assumes the driver will reliably report on-screen windows, which does not always hold.

computer_use (cua-driver backend) is too fragile and breaks auxiliary vision routing #32766
