feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (salvage #14817 + #15328) #16936

Closed
teknium1 wants to merge 4 commits into main from hermes/hermes-c8604b32

Conversation

@teknium1

Contributor

Summary

Ships computer_use (cua-driver backend) in one shot: the foundation (#14817), the contributor follow-up (#15328), and a safety net for non-Anthropic providers that receive _multimodal tool results, with regression-guard hardening added on top.

Closes #14817, closes #15328.

Commit sequence (rebase-merge preserves per-commit authorship)

  1. @teknium1: feat(computer-use): cua-driver backend, universal any-model schema (was PR #14817)

    • tools/computer_use/ package, universal OpenAI function-calling schema, SOM captures so any tool-capable model can drive the desktop.
  2. @ddupont808: feat(computer-use): background focus-safe backend - set_value, structured windows, MIME detection (was PR #15328 "feat(computer-use): complete cua-driver integration with passing integration tests", commit 1)

    • Rewires capture() to list_windows + get_window_state, sticky (pid, window_id), type_text_chars routing, set_value for backgrounded AXPopUpButton / HTML <select>, JPEG MIME detection, regex helpers.
  3. @ddupont808: fix(computer-use): unwrap _multimodal tool results to content list for non-Anthropic providers (was PR #15328 "feat(computer-use): complete cua-driver integration with passing integration tests", commit 2)

    • Tool-message builder was passing the raw {_multimodal: True, content: [...]} envelope as the content field, which OpenAI-compatible APIs reject. Now unwraps to the OpenAI-style content-parts list at both the parallel and sequential tool-msg build sites.
    • Adds a _vision_supported adaptive fallback: on first image-rejection error (e.g. "Only 'text' content type is supported"), the agent strips images from history and retries text-only for the rest of the _run() call.
  4. @teknium1: fix(computer-use): harden image-rejection fallback + AUTHOR_MAP

    • _strip_images_from_messages() no longer deletes tool-role messages — replaces their content with a text placeholder to preserve tool_call_id linkage (otherwise providers 400 with "tool_calls without matching tool response").
    • Phrase list expanded (image content / multimodal input / vision input / model does not support image).
    • 4xx-only gate so transient 5xx/timeout errors never get misclassified as "server rejected images".
    • 14 regression tests across two classes:
      • TestStripImagesPreservesAlternation — verifies tool_call_id linkage, content-type coverage, synthetic-message deletion rules, non-dict handling.
      • TestImageRejectionPhraseIsolation — proves our phrase list does NOT false-match on image_too_large, context overflow, or rate-limit error bodies (so those route to the correct existing handlers).
    • AUTHOR_MAP: 3820588+ddupont808@users.noreply.github.com → ddupont808.
    • tests/tools/test_registry.py updated to include tools.computer_use_tool in the builtin-set.

Interaction with native multimodal routing (#16506)

The native-vision feature already handles the common cases proactively:

  • Non-vision models → _prepare_messages_for_non_vision_model substitutes images with cached text BEFORE the API call
  • Anthropic size limit → _try_shrink_image_parts_in_messages reactive shrink
  • Classified image_too_large → routed to shrink handler via error_classifier

Our _vision_supported fallback is a fourth-layer net for models whose capability detection claims vision support but whose specific deployment rejects images at runtime (mlx-lm, misconfigured proxies, text-only endpoints). The three existing paths run first; ours only fires on 4xx + phrase match.

Phrase isolation is test-guarded: TestImageRejectionPhraseIsolation asserts the phrase list does not false-match on any known image_too_large / context_overflow / rate_limit body.
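
The "4xx + phrase match" gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in run_agent.py: the function name `looks_like_image_rejection` and the exact strings in `IMAGE_REJECTION_PHRASES` are assumptions based on the wordings quoted in this PR.

```python
# Assumed phrase list, based on the wordings quoted in this PR description.
IMAGE_REJECTION_PHRASES = (
    "only 'text' content type is supported",
    "image content",
    "multimodal input",
    "vision input",
    "model does not support image",
)


def looks_like_image_rejection(status_code, error_body):
    """Return True only for a 4xx response whose body matches a known phrase."""
    if status_code is None or not 400 <= status_code < 500:
        # 5xx responses and timeouts are transient; never treat them as
        # "the server rejected images".
        return False
    body = (error_body or "").lower()
    return any(phrase in body for phrase in IMAGE_REJECTION_PHRASES)
```

Under these assumptions, an `image_too_large` or rate-limit body does not match, so it stays with its existing handler, and a matching phrase on a 503 is ignored.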

Validation

| Check | Result |
| --- | --- |
| tests/run_agent/test_image_rejection_fallback.py (new) | 14/14 pass |
| tests/tools/test_computer_use.py | 57/57 pass |
| tests/tools/test_registry.py | 11/11 pass (with new entry) |
| tests/agent/test_image_routing.py + test_vision_aware_preprocessing.py + test_image_shrink_recovery.py + test_compressor_image_tokens.py (native vision feature) | 63/63 pass |
| Broader agent + tools suite (tests/agent/, tests/tools/) | 6008 pass, 2 pre-existing failures on origin/main unrelated to this PR (test_custom_base_url, test_read_text_file_redacts_sensitive_content, test_custom_endpoint_uses_codex_wrapper) |
| E2E import smoke test | _strip_images_from_messages preserves tool_call_id under all tested scenarios; computer_use tool registered |

Caveats

  • Requires cua-driver ≥ 0.0.4 for set_value and structuredContent on list_windows/launch_app.
  • drag is not implemented — cua-driver does not expose a drag tool.
  • Image-rejection detection is best-effort English phrase matching; locale-translated or heavily reworded upstream errors will bypass the guard and fall through to the normal error handler. Phrase list is extended when new wordings appear in the wild.

teknium1 and others added 4 commits April 28, 2026 02:13
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.

- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
  CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
  Actions: capture (som/vision/ax), click, double_click, right_click,
  middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
  `content: [text, image_url]` parts) that flows through
  handle_function_call into the tool message. Anthropic adapter converts
  into native `tool_result` image blocks; OpenAI-compatible providers
  get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
  recent screenshots carry real image data; older ones become text
  placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
  their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
  instead of its base64 char length (~1MB would have registered as
  ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
  is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
  and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
  through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
  (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
  force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
  workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
  entries.
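
The screenshot-eviction rule from the list above (only the 3 most recent screenshots keep real image data) could look roughly like this. It is a sketch on OpenAI-style content parts; the real logic lives in convert_messages_to_anthropic, and the names `evict_old_screenshots` and `EVICTED_TEXT` are hypothetical.

```python
KEEP_RECENT_IMAGES = 3  # from the commit message
EVICTED_TEXT = "[screenshot evicted to save tokens]"  # hypothetical placeholder


def evict_old_screenshots(messages, keep=KEEP_RECENT_IMAGES):
    """Replace all but the `keep` newest image parts with text placeholders."""
    remaining = keep
    result = []
    # Walk newest-to-oldest so the most recent images are the ones kept.
    for msg in reversed(messages):
        content = msg.get("content")
        if isinstance(content, list):
            new_parts = []
            for part in reversed(content):
                if isinstance(part, dict) and part.get("type") == "image_url":
                    if remaining > 0:
                        remaining -= 1
                    else:
                        part = {"type": "text", "text": EVICTED_TEXT}
                new_parts.append(part)
            # Build a new message so the caller's history is not mutated.
            msg = {**msg, "content": new_parts[::-1]}
        result.append(msg)
    return result[::-1]
```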

44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.
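
To illustrate the image-aware token estimation mentioned above, here is a minimal sketch. The flat 1500-token cost comes from the commit message; the 4-characters-per-token text heuristic and the function name `estimate_content_tokens` are assumptions.

```python
IMAGE_TOKEN_COST = 1500  # flat per-image estimate from the commit message
CHARS_PER_TOKEN = 4      # assumed rough heuristic for text


def estimate_content_tokens(content):
    """Estimate tokens for a string or an OpenAI-style content-parts list."""
    if isinstance(content, str):
        return len(content) // CHARS_PER_TOKEN
    total = 0
    for part in content or []:
        if not isinstance(part, dict):
            continue
        if part.get("type") == "image_url":
            # Count the image once instead of measuring its base64 payload.
            total += IMAGE_TOKEN_COST
        else:
            total += len(str(part.get("text", ""))) // CHARS_PER_TOKEN
    return total
```

With this heuristic, a part carrying about 1 MB of base64 costs 1500 tokens rather than the roughly 250K it would register if measured by character length.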

469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.

- `model_tools.py` provider-gating: the tool is available to every
  provider. Providers without multi-part tool message support will see
  text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
  client-side eviction + compressor pruning cover the same cost ceiling
  without a beta header.

- macOS only. cua-driver uses private SkyLight SPIs
  (SLEventPostToPid, SLPSPostEventRecordTo,
  _AXObserverAddNotificationAndCheckRemote) that can break on any macOS
  update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
  prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
…ured windows, MIME detection

Extends the cua-driver computer-use backend to drive backgrounded macOS
windows without stealing keyboard or mouse focus from the foreground app.
All changes target the cua-driver MCP backend and the shared dispatcher.

## cua_backend.py

**Window-aware capture**: capture() now calls list_windows + get_window_state
instead of the removed capture tool. Prefers structuredContent.windows
(MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back
to regex-parsed text for older builds. Stores the selected (pid, window_id)
as sticky context so subsequent action calls do not need a redundant round-trip.

**Action routing**: click/scroll/type_text/key all carry the sticky pid
(and window_id for element-indexed clicks). type_text routes through
type_text_chars (individual key events) rather than AX attribute write --
WebKit AXTextFields reject attribute writes from backgrounded processes.

**Key parsing**: _parse_key_combo splits cmd+s-style strings into
(key, [modifiers]) and routes to hotkey (modifier present) or
press_key (bare key) -- cua-driver actual tool names.
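
A rough sketch of the key-combo split described above. The modifier set and the function names are assumptions; only the `hotkey` / `press_key` tool names come from the commit message.

```python
MODIFIERS = {"cmd", "ctrl", "alt", "shift"}  # assumed modifier names


def parse_key_combo(combo):
    """Split a 'cmd+s'-style string into (key, [modifiers])."""
    parts = [p.strip().lower() for p in combo.split("+") if p.strip()]
    modifiers = [p for p in parts if p in MODIFIERS]
    keys = [p for p in parts if p not in MODIFIERS]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one non-modifier key in {combo!r}")
    return keys[0], modifiers


def route_key(combo):
    """Pick the tool name: hotkey when modifiers are present, else press_key."""
    key, modifiers = parse_key_combo(combo)
    tool = "hotkey" if modifiers else "press_key"
    return tool, key, modifiers
```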

**set_value method**: new set_value(value, element) calls the cua-driver
set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari,
AXPress opens the native macOS popup which closes immediately when the app is
non-frontmost; set_value AX-presses the matching child option directly
(no menu required, no focus steal).

**focus_app**: reimplemented as a pure window-selector (enumerates
list_windows, sets sticky pid/window_id) without ever raising the window
or stealing focus.

**list_apps**: fixed tool name from listApps to list_apps; handles plain-text
response via regex when structured data is absent.

**Structured-content extraction**: _extract_tool_result now surfaces
structuredContent from MCP results, enabling the list_windows window array
without text parsing.

**Helpers**: _parse_windows_from_text, _parse_elements_from_tree,
_split_tree_text, _parse_key_combo extracted as module-level functions.

## schema.py

Added set_value to the action enum with a description explaining when to
prefer it over click (select/popup elements, sliders, no focus steal).
Added value field for set_value payloads.

## tool.py

Routed set_value action through _dispatch to backend.set_value.
Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated).
Fixed MIME-type detection in _capture_response: cua-driver may return
JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png)
rather than hardcoding image/png.
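
The magic-byte check works because a JPEG file starts with the bytes FF D8 FF, which base64-encode to `/9j/`. A minimal sketch, with hypothetical function names:

```python
import base64


def detect_image_mime(b64_data):
    """Return image/jpeg for base64 data with a JPEG header, else image/png."""
    # b"\xff\xd8\xff" encodes to "/9j/" in base64.
    return "image/jpeg" if b64_data.startswith("/9j/") else "image/png"


def to_data_url(raw_bytes):
    """Encode raw image bytes as a data URL with the detected MIME type."""
    b64 = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{detect_image_mime(b64)};base64,{b64}"
```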

## agent/display.py + run_agent.py

Guard _detect_tool_failure and result-preview logic against non-string
function_result values: multimodal tool results (dicts with _multimodal=True)
are not string-sliceable; treat them as successes and fall back to str()
for length/preview.
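
The non-string guard described above might be sketched like this. The function names and the "starts with error" failure heuristic are assumptions for illustration; the actual checks in agent/display.py and run_agent.py may differ.

```python
def detect_tool_failure(function_result):
    """Treat non-string results (e.g. multimodal dicts) as successes."""
    if not isinstance(function_result, str):
        return False
    # Assumed heuristic: a textual result that begins with "error" failed.
    return function_result.lstrip().lower().startswith("error")


def result_preview(function_result, limit=80):
    """Return (length, preview), falling back to str() for non-strings."""
    if isinstance(function_result, str):
        text = function_result
    else:
        text = str(function_result)
    return len(text), text[:limit]
```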
…r non-Anthropic providers

Tool handlers (e.g. computer_use capture) return a _multimodal envelope
dict when a screenshot is attached. The tool-message builder was passing
this raw dict as the `content` field of role:tool messages, which is an
illegal format — OpenAI-compatible APIs expect a string or a content-parts
list, not a plain Python dict, and would reject it with a 400/422 error.

Fix: unwrap _multimodal results to their `content` list
([{type:text,...},{type:image_url,...}]) in both the parallel and
sequential tool-call paths. The Anthropic adapter already handles content
lists natively; vision-capable OpenAI-compatible servers (mlx-vlm,
GPT-4o, etc.) accept image_url parts in tool messages directly.
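
The unwrap step can be sketched as below. This is an illustration under assumptions, not the code at the two build sites: `tool_message_content` and `build_tool_message` are hypothetical names.

```python
def tool_message_content(function_result):
    """Return a value that is legal as the content of a role:tool message."""
    if isinstance(function_result, dict) and function_result.get("_multimodal"):
        # Unwrap the envelope to its OpenAI-style content-parts list.
        return function_result.get("content", [])
    if isinstance(function_result, str):
        return function_result
    return str(function_result)


def build_tool_message(tool_call_id, function_result):
    """Build the tool message for one tool call."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": tool_message_content(function_result),
    }
```

Routing both the parallel and the sequential path through one helper like this would keep the two build sites from drifting apart.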

Also add a _vision_supported adaptive fallback: on first image-rejection
error ("Only 'text' content type is supported." etc.) the agent strips all
image parts from the message history and retries with text only, so
text-only endpoints degrade gracefully without crashing the session.
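
The adaptive fallback amounts to a sticky flag around the API call. A minimal sketch, assuming a hypothetical `ImageRejectedError` exception and injected `send` / `strip_images` callables (none of these names are from the PR):

```python
class ImageRejectedError(Exception):
    """Hypothetical: raised when the provider refuses image content."""


class VisionFallback:
    def __init__(self, send, strip_images):
        self.send = send                  # callable(messages) -> response
        self.strip_images = strip_images  # callable(messages) -> text-only copy
        self.vision_supported = True      # flips to False on first rejection

    def call(self, messages):
        if not self.vision_supported:
            # Already known to be text-only: skip the doomed first attempt.
            return self.send(self.strip_images(messages))
        try:
            return self.send(messages)
        except ImageRejectedError:
            self.vision_supported = False
            return self.send(self.strip_images(messages))
```

Once the flag flips, later calls on the same object go straight to the text-only path, which matches the "for the rest of the _run() call" behaviour described in the summary.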

Follow-up to #15328's vision-unsupported retry branch in run_agent.py.

_strip_images_from_messages() previously deleted any message whose content
was entirely images. That's fine for synthetic user messages injected for
attachment delivery, but it breaks providers for tool-role messages — the
paired tool_call_id on the preceding assistant message ends up unmatched,
which OpenAI-compatible APIs reject with HTTP 400.

Fix: tool-role messages whose content becomes empty are replaced with a
plaintext placeholder that preserves the tool_call_id linkage. Only
non-tool messages are dropped. Added 10 tests covering the role-alternation
invariants + image-type coverage.
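
The placeholder rule can be sketched as follows. This is an illustration only: the set of image part types and the placeholder text are assumptions, and the real _strip_images_from_messages() may handle more cases.

```python
IMAGE_PART_TYPES = {"image_url", "image", "input_image"}  # assumed part types
PLACEHOLDER = "[image removed: provider does not accept image content]"


def strip_images_from_messages(messages):
    """Drop image parts; keep tool messages so tool_call_id stays matched."""
    cleaned = []
    for msg in messages:
        content = msg.get("content") if isinstance(msg, dict) else None
        if not isinstance(content, list):
            cleaned.append(msg)  # strings, None and non-dict entries pass through
            continue
        kept = [
            part for part in content
            if not (isinstance(part, dict) and part.get("type") in IMAGE_PART_TYPES)
        ]
        if kept:
            cleaned.append({**msg, "content": kept})
        elif msg.get("role") == "tool":
            # Deleting this message would orphan the assistant's tool call.
            cleaned.append({**msg, "content": PLACEHOLDER})
        # Image-only messages with any other role are dropped.
    return cleaned
```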

Image-rejection detector: expanded phrase list (image content not
supported / multimodal input / vision input / model does not support
image) and gated on 4xx status so transient 5xx errors never get
misinterpreted as 'server said no to images'. Detection is documented as
best-effort English phrase matching.

AUTHOR_MAP: mapped 3820588+ddupont808@users.noreply.github.com to
ddupont808 so release notes attribute the salvage correctly.
Comment thread run_agent.py
  logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s")
- logging.debug(f"Tool result ({len(function_result)} chars): {function_result}")
+ _log_result = _multimodal_text_summary(function_result)
+ logging.debug(f"Tool result ({len(_log_result)} chars): {_log_result}")
@teknium1

teknium1 commented May 8, 2026

Contributor Author

Shipped via #21967 — re-salvaged onto current main after this branch went stale. All four commits cherry-picked with authorship preserved (2× @teknium1 + 2× @ddupont808). Merged via rebase so per-commit attribution lands in git log.

Labels

  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • comp/tools: Tool registry, model_tools, toolsets
  • P3: Low (cosmetic, nice to have)
  • type/feature: New feature or request

4 participants