feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (re-salvage #16936) #21967
Merged
Background macOS desktop control via cua-driver MCP: it does NOT steal the user's cursor or keyboard focus, and it works with any tool-capable model. Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices.

- `tools/computer_use/` package: swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block: injected when the toolset is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md`: universal (model-agnostic) workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog entries.

44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites.

- `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919`: deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header.
- macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions: the post-setup prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.
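The two cost controls described in this commit message (keeping real image data only for the 3 most recent screenshots, and counting each image as a flat 1500 tokens) can be sketched roughly as follows. This is an illustration, not the code in `convert_messages_to_anthropic` or the token estimator: the function names, the placeholder wording, and the 4-characters-per-token heuristic are all assumptions. Only the constants 3 and 1500 come from the text above.

```python
import copy
from typing import Any

MAX_LIVE_SCREENSHOTS = 3   # from the commit message
TOKENS_PER_IMAGE = 1500    # flat per-image cost, from the commit message
CHARS_PER_TOKEN = 4        # assumption: rough heuristic for plain text

Message = dict[str, Any]


def _is_image_part(part: Any) -> bool:
    # OpenAI-style content part carrying an image.
    return isinstance(part, dict) and part.get("type") == "image_url"


def evict_old_screenshots(messages: list[Message],
                          keep: int = MAX_LIVE_SCREENSHOTS) -> list[Message]:
    """Return a copy in which only the `keep` newest image parts keep real data."""
    out = copy.deepcopy(messages)
    seen = 0
    # Walk newest-first so the most recent screenshots are the ones that survive.
    for msg in reversed(out):
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for i in range(len(content) - 1, -1, -1):
            if _is_image_part(content[i]):
                seen += 1
                if seen > keep:
                    # Hypothetical placeholder wording.
                    content[i] = {"type": "text", "text": "[screenshot removed]"}
    return out


def estimate_tokens(messages: list[Message]) -> int:
    """Estimate tokens, charging a flat cost per image instead of base64 length."""
    total = 0
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            total += len(content) // CHARS_PER_TOKEN
        elif isinstance(content, list):
            for part in content:
                if _is_image_part(part):
                    total += TOKENS_PER_IMAGE
                elif isinstance(part, dict):
                    total += len(str(part.get("text", ""))) // CHARS_PER_TOKEN
    return total
```

Under the assumed 4-characters-per-token heuristic, a 1,000,000-character base64 payload would otherwise count as about 250K tokens, which matches the figure quoted in the commit message.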
…ured windows, MIME detection

Extends the cua-driver computer-use backend to drive backgrounded macOS windows without stealing keyboard or mouse focus from the foreground app. All changes target the cua-driver MCP backend and the shared dispatcher.

## cua_backend.py

**Window-aware capture**: capture() now calls list_windows + get_window_state instead of the removed capture tool. Prefers structuredContent.windows (MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back to regex-parsed text for older builds. Stores the selected (pid, window_id) as sticky context so subsequent action calls do not need a redundant round-trip.

**Action routing**: click/scroll/type_text/key all carry the sticky pid (and window_id for element-indexed clicks). type_text routes through type_text_chars (individual key events) rather than AX attribute write -- WebKit AXTextFields reject attribute writes from backgrounded processes.

**Key parsing**: _parse_key_combo splits cmd+s-style strings into (key, [modifiers]) and routes to hotkey (modifier present) or press_key (bare key) -- cua-driver's actual tool names.

**set_value method**: new set_value(value, element) calls the cua-driver set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari, AXPress opens the native macOS popup which closes immediately when the app is non-frontmost; set_value AX-presses the matching child option directly (no menu required, no focus steal).

**focus_app**: reimplemented as a pure window-selector (enumerates list_windows, sets sticky pid/window_id) without ever raising the window or stealing focus.

**list_apps**: fixed tool name from listApps to list_apps; handles plain-text response via regex when structured data is absent.

**Structured-content extraction**: _extract_tool_result now surfaces structuredContent from MCP results, enabling the list_windows window array without text parsing.
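A minimal sketch of the key-parsing step described under **Key parsing**, assuming a small illustrative modifier set. The tool names `hotkey` and `press_key` come from the commit message; the argument shapes, the modifier names, and the function names here are assumptions and may not match what `_parse_key_combo` or cua-driver actually use.

```python
# Assumption: illustrative modifier names, not cua-driver's authoritative list.
MODIFIERS = {"cmd", "ctrl", "alt", "shift", "fn"}


def parse_key_combo(combo: str) -> tuple[str, list[str]]:
    """Split a 'cmd+s'-style string into (key, [modifiers])."""
    parts = [p.strip().lower() for p in combo.split("+") if p.strip()]
    if not parts:
        raise ValueError("empty key combo")
    *prefix, key = parts
    unknown = [p for p in prefix if p not in MODIFIERS]
    if unknown:
        raise ValueError(f"unknown modifier(s): {unknown}")
    return key, prefix


def route_key(combo: str) -> tuple[str, dict]:
    """Pick the tool name: 'hotkey' when modifiers are present, else 'press_key'."""
    key, modifiers = parse_key_combo(combo)
    if modifiers:
        # Hypothetical argument shape for the hotkey tool.
        return "hotkey", {"key": key, "modifiers": modifiers}
    # Hypothetical argument shape for the press_key tool.
    return "press_key", {"key": key}
```

This sketch does not handle a literal `+` key; a real parser would need an escape or a named key for it.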
**Helpers**: _parse_windows_from_text, _parse_elements_from_tree, _split_tree_text, _parse_key_combo extracted as module-level functions.

## schema.py

Added set_value to the action enum with a description explaining when to prefer it over click (select/popup elements, sliders, no focus steal). Added value field for set_value payloads.

## tool.py

Routed set_value action through _dispatch to backend.set_value. Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated). Fixed MIME-type detection in _capture_response: cua-driver may return JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png) rather than hardcoding image/png.

## agent/display.py + run_agent.py

Guard _detect_tool_failure and result-preview logic against non-string function_result values: multimodal tool results (dicts with _multimodal=True) are not string-sliceable; treat them as successes and fall back to str() for length/preview.
…r non-Anthropic providers
Tool handlers (e.g. computer_use capture) return a _multimodal envelope
dict when a screenshot is attached. The tool-message builder was passing
this raw dict as the `content` field of role:tool messages, which is an
illegal format — OpenAI-compatible APIs expect a string or a content-parts
list, not a plain Python dict, and would reject it with a 400/422 error.
Fix: unwrap _multimodal results to their `content` list
([{type:text,...},{type:image_url,...}]) in both the parallel and
sequential tool-call paths. The Anthropic adapter already handles content
lists natively; vision-capable OpenAI-compatible servers (mlx-vlm,
GPT-4o, etc.) accept image_url parts in tool messages directly.
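A rough sketch of the unwrap described above, assuming the envelope shape `{"_multimodal": True, "content": [...]}`. The helper and builder names are illustrative and are not copied from run_agent.py. Falling back to `str()` for other non-string results is an assumption.

```python
from typing import Any


def is_multimodal_tool_result(result: Any) -> bool:
    """True for the envelope dict a tool handler returns with a screenshot."""
    return (
        isinstance(result, dict)
        and result.get("_multimodal") is True
        and isinstance(result.get("content"), list)
    )


def build_tool_message(tool_call_id: str, function_name: str,
                       result: Any) -> dict[str, Any]:
    """Build a role:tool message whose content is a string or a parts list."""
    if is_multimodal_tool_result(result):
        # Unwrap: pass the content-parts list, never the raw envelope dict.
        content: Any = result["content"]
    elif isinstance(result, str):
        content = result
    else:
        content = str(result)  # assumption: stringify anything else
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "name": function_name,
        "content": content,
    }
```

Applying the same builder at both the parallel and sequential call sites keeps the two paths from drifting apart.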
Also add a _vision_supported adaptive fallback: on first image-rejection
error ("Only 'text' content type is supported." etc.) the agent strips all
image parts from the message history and retries with text only, so
text-only endpoints degrade gracefully without crashing the session.
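The image-rejection check behind this fallback can be sketched as best-effort phrase matching on the provider's error text. The phrases below are the ones quoted in this PR's commit messages; the function name and signature are hypothetical, and the 4xx gate reflects the later hardening commit rather than this one.

```python
from typing import Optional

# Phrases quoted in the commit messages; matching is best-effort and English-only.
IMAGE_REJECTION_PHRASES = (
    "only 'text' content type is supported",
    "image content not supported",
    "multimodal input",
    "vision input",
    "model does not support image",
)


def is_image_rejection(status_code: Optional[int], message: str) -> bool:
    """Return True when a client error looks like 'this endpoint rejects images'."""
    # Only 4xx responses qualify, so transient 5xx errors are never misread.
    if status_code is None or not 400 <= status_code < 500:
        return False
    lowered = message.lower()
    return any(phrase in lowered for phrase in IMAGE_REJECTION_PHRASES)
```

A caller would typically flip a per-turn flag such as `_vision_supported` to `False` on the first match, strip images from the history, and retry.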
Follow-up to #15328's vision-unsupported retry branch in run_agent.py.

_strip_images_from_messages() previously deleted any message whose content was entirely images. That's fine for synthetic user messages injected for attachment delivery, but for tool-role messages it breaks the request: the paired tool_call_id on the preceding assistant message ends up unmatched, which OpenAI-compatible APIs reject with HTTP 400.

Fix: tool-role messages whose content becomes empty are replaced with a plaintext placeholder that preserves the tool_call_id linkage. Only non-tool messages are dropped. Added 10 tests covering the role-alternation invariants + image-type coverage.

Image-rejection detector: expanded phrase list (image content not supported / multimodal input / vision input / model does not support image) and gated on 4xx status so transient 5xx errors never get misinterpreted as 'server said no to images'. Detection is documented as best-effort English phrase matching.

AUTHOR_MAP: mapped 3820588+ddupont808@users.noreply.github.com to ddupont808 so release notes attribute the salvage correctly.
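The linkage-preserving strip can be illustrated with a short sketch. It assumes OpenAI-style content parts; the placeholder wording and the set of image part types are assumptions, and the function is a stand-in rather than the actual `_strip_images_from_messages()`.

```python
from typing import Any

# Hypothetical placeholder text; the real wording may differ.
PLACEHOLDER = "[image removed: endpoint does not accept image input]"
# Assumption: part types treated as images.
IMAGE_PART_TYPES = ("image_url", "image")


def _is_image_part(part: Any) -> bool:
    return isinstance(part, dict) and part.get("type") in IMAGE_PART_TYPES


def strip_images_from_messages(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Remove image parts while keeping every tool_call_id answered."""
    cleaned: list[dict[str, Any]] = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            cleaned.append(msg)  # plain strings / None pass through untouched
            continue
        kept = [p for p in content if not _is_image_part(p)]
        if kept:
            cleaned.append({**msg, "content": kept})
        elif msg.get("role") == "tool":
            # Keep the message so the assistant's tool_call still has a reply.
            cleaned.append({**msg, "content": PLACEHOLDER})
        # Image-only non-tool messages are dropped.
    return cleaned
```

Dropping the tool message instead would leave the preceding assistant `tool_calls` entry without a matching response, which is the HTTP 400 case described above.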
🔎 Lint report:

| Rule | Count |
|---|---|
| invalid-argument-type | 4 |
| unresolved-attribute | 3 |
| unresolved-import | 3 |
| invalid-assignment | 1 |

First entries
tools/computer_use/cua_backend.py:263: [unresolved-attribute] unresolved-attribute: Attribute `call_tool` is not defined on `None` in union `None | Unknown`
tools/computer_use/cua_backend.py:621: [unresolved-attribute] unresolved-attribute: Attribute `lower` is not defined on `int` in union `Any | int`
tools/computer_use/cua_backend.py:436: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `str`, found `Any | int`
tools/computer_use/cua_backend.py:669: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `tuple[int, int, int, int]`, found `tuple[int, ...]`
tools/computer_use/cua_backend.py:219: [unresolved-import] unresolved-import: Cannot resolve imported module `mcp.client.stdio`
tools/computer_use/tool.py:394: [unresolved-attribute] unresolved-attribute: Object of type `ComputerUseBackend` has no attribute `set_value`
run_agent.py:10717: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["content"]` on object of type `str`
agent/anthropic_adapter.py:1577: [invalid-assignment] invalid-assignment: Object of type `Unknown | list[object]` is not assignable to `list[dict[str, Any]] | None`
run_agent.py:10710: [invalid-argument-type] invalid-argument-type: Argument to function `_append_subdir_hint_to_multimodal` is incorrect: Expected `dict[str, Any]`, found `str | Unknown`
tools/computer_use/cua_backend.py:218: [unresolved-import] unresolved-import: Cannot resolve imported module `mcp`
tests/tools/test_computer_use.py:11: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
✅ Fixed issues (1):

| Rule | Count |
|---|---|
| invalid-assignment | 1 |

First entries
agent/anthropic_adapter.py:1537: [invalid-assignment] invalid-assignment: Invalid subscript assignment with key of type `Literal["cache_control"]` and value of type `dict[Unknown, Unknown]` on object of type `dict[str, str]`
Unchanged: 4069 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s")
logging.debug(f"Tool result ({len(function_result)} chars): {function_result}")
_log_result = _multimodal_text_summary(function_result)
logging.debug(f"Tool result ({len(_log_result)} chars): {_log_result}")
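The snippet above routes the logged value through `_multimodal_text_summary` so that a dict-shaped result is never passed to `len()` or interpolated as a huge base64 blob. Its body is not shown on this page; one plausible shape, assuming the envelope carries an optional `text_summary` field plus OpenAI-style text parts, might be:

```python
from typing import Any


def multimodal_text_summary(result: Any) -> str:
    """Return a loggable string for any tool result (hypothetical stand-in)."""
    if isinstance(result, str):
        return result
    if isinstance(result, dict) and result.get("_multimodal"):
        summary = result.get("text_summary")
        if isinstance(summary, str) and summary:
            return summary
        # Fall back to concatenating text parts; image parts are skipped.
        parts = result.get("content") or []
        texts = [
            p.get("text", "")
            for p in parts
            if isinstance(p, dict) and p.get("type") == "text"
        ]
        return "\n".join(t for t in texts if t) or "[multimodal result]"
    return str(result)
```

The function name here deliberately drops the leading underscore to signal that it is a sketch, not the helper from run_agent.py.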
This was referenced May 8, 2026
## Summary

Ships `computer_use` (cua-driver backend) on fresh main: a re-salvage of PR #16936 after it went stale. Supersedes #4562, #14817, #16936. Credit @ddupont808 (#15328) and @0xbyt4 (#3816).

## What it does

Background macOS desktop control via cua-driver (no cursor-steal, no focus-steal, no Space switch) with a universal OpenAI function-calling schema so any tool-capable model can drive the desktop, not just Anthropic.

## Commit sequence (rebase-merge preserves per-commit authorship)

- @teknium1: `feat(computer-use): cua-driver backend, universal any-model schema`
  - `tools/computer_use/` package, universal schema, SOM captures.
- @ddupont808: `feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection`
  - `capture()` → `list_windows` + `get_window_state`, sticky `(pid, window_id)`, `type_text_chars` routing, `set_value` for backgrounded `AXPopUpButton` / HTML `<select>`, JPEG MIME detection, regex helpers.
- @ddupont808: `fix(computer-use): unwrap _multimodal tool results to content list for non-Anthropic providers`
  - … `{_multimodal: True, content: [...]}` envelopes to OpenAI-style content-parts at parallel + sequential build sites.
  - `_vision_supported` adaptive fallback: on first image-rejection error, strips images from history and retries text-only for the rest of `_run()`.
- @teknium1: `fix(computer-use): harden image-rejection fallback + AUTHOR_MAP`
  - `_strip_images_from_messages()` preserves `tool_call_id` linkage (replaces tool-role content with text placeholder instead of deleting; otherwise providers 400 on "tool_calls without matching tool response").
  - … (`TestStripImagesPreservesAlternation`, `TestImageRejectionPhraseIsolation`).
  - … `ddupont808`.

## Conflict resolution (vs stale PR #16936 branch)

Rebased onto current main; 4 conflicts resolved by keeping both sides:

- `agent/context_compressor.py`: kept HEAD's broader `not isinstance(content, str)` guard (already catches multimodal dicts), added comment about `_multimodal` envelopes.
- `toolsets.py`: `_HERMES_CORE_TOOLS` now has both kanban entries (HEAD) and `computer_use` (PR).
- `run_agent.py` (3 sites): kept HEAD's `"name": function_name` field in tool_msg + PR's `_tool_content` unwrap; kept both `_tool_guardrails.reset_for_turn()` (HEAD) and `_vision_supported = True` (PR) turn resets.
- `website/docs/reference/toolsets-reference.md`: both rows preserved, HEAD's richer `image_gen` description kept.

## Interaction with native multimodal routing (#16506)

Native-vision handles common cases proactively. The `_vision_supported` fallback is a 4th-layer net for models whose capability detection claims vision but whose specific deployment rejects images at runtime (mlx-lm, misconfigured proxies, text-only endpoints). Fires only on 4xx + phrase match. Phrase isolation is test-guarded.

## Validation

- `tests/tools/test_computer_use.py`
- `tests/run_agent/test_image_rejection_fallback.py`
- `tests/tools/test_registry.py`
- … (`test_image_routing`, `test_compressor_image_tokens`, `test_vision_resolved_args`)
- … `computer_use` registered in `_HERMES_CORE_TOOLS`, package imports, `_is_multimodal_tool_result` + `_strip_images_from_messages` + `_vision_supported` helpers present in run_agent

## Caveats

- … `set_value` and `structuredContent` on `list_windows` / `launch_app`.
- `drag` is not implemented: cua-driver does not expose a drag tool.

## Closes

Closes #16936 (re-salvage onto current main)
Closes #14817 (superseded by #16936 which this supersedes)
Closes #4562 (original macOS-only Anthropic-only version, superseded)