feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (salvage #14817 + #15328) #16936
Closed
teknium1 wants to merge 4 commits into
Background macOS desktop control via cua-driver MCP: it does NOT steal the user's cursor or keyboard focus, and it works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices.

- `tools/computer_use/` package: swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block, injected when the toolset is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md`: universal (model-agnostic) workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog entries.

44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites.

- `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919`: deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header.
- macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions; the post-setup prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.
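The screenshot-eviction rule in the list above (only the 3 most recent screenshots keep real image data) can be sketched roughly as below. This is a minimal illustration, not the code in `convert_messages_to_anthropic`: the function name `evict_old_images`, the placeholder wording, and the choice to operate on OpenAI-style `image_url` parts are all assumptions made for the example.

```python
from copy import deepcopy

# The commit message says the 3 most recent screenshots keep real image data.
MAX_LIVE_SCREENSHOTS = 3

# Hypothetical placeholder text; the real wording is not given in the source.
PLACEHOLDER = "[screenshot evicted to save tokens]"


def evict_old_images(messages, keep=MAX_LIVE_SCREENSHOTS):
    """Return a copy of `messages` in which only the newest `keep` image
    parts survive; older image parts become short text placeholders."""
    out = deepcopy(messages)  # never mutate the caller's history
    remaining = keep
    # Walk newest-to-oldest so the most recent images are the ones kept.
    for msg in reversed(out):
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain-string messages carry no image parts
        for i in range(len(content) - 1, -1, -1):
            part = content[i]
            if isinstance(part, dict) and part.get("type") == "image_url":
                if remaining > 0:
                    remaining -= 1
                else:
                    content[i] = {"type": "text", "text": PLACEHOLDER}
    return out
```

Working on a copy keeps the stored history intact, so the same eviction can be re-applied on every turn as new screenshots arrive.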
…ured windows, MIME detection

Extends the cua-driver computer-use backend to drive backgrounded macOS windows without stealing keyboard or mouse focus from the foreground app. All changes target the cua-driver MCP backend and the shared dispatcher.

## cua_backend.py

**Window-aware capture**: capture() now calls list_windows + get_window_state instead of the removed capture tool. Prefers structuredContent.windows (MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back to regex-parsed text for older builds. Stores the selected (pid, window_id) as sticky context so subsequent action calls do not need a redundant round-trip.

**Action routing**: click/scroll/type_text/key all carry the sticky pid (and window_id for element-indexed clicks). type_text routes through type_text_chars (individual key events) rather than AX attribute write -- WebKit AXTextFields reject attribute writes from backgrounded processes.

**Key parsing**: _parse_key_combo splits cmd+s-style strings into (key, [modifiers]) and routes to hotkey (modifier present) or press_key (bare key) -- cua-driver actual tool names.

**set_value method**: new set_value(value, element) calls the cua-driver set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari, AXPress opens the native macOS popup which closes immediately when the app is non-frontmost; set_value AX-presses the matching child option directly (no menu required, no focus steal).

**focus_app**: reimplemented as a pure window-selector (enumerates list_windows, sets sticky pid/window_id) without ever raising the window or stealing focus.

**list_apps**: fixed tool name from listApps to list_apps; handles plain-text response via regex when structured data is absent.

**Structured-content extraction**: _extract_tool_result now surfaces structuredContent from MCP results, enabling the list_windows window array without text parsing.
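The key-parsing rule described above (split a `cmd+s`-style string into a key plus modifiers, then pick `hotkey` or `press_key`) might look something like the following. It is only a sketch under assumptions: the names `parse_key_combo` and `route_key` are illustrative stand-ins for `_parse_key_combo`, and lower-casing the parts is a choice made for this example, not something the commit message states.

```python
def parse_key_combo(combo: str):
    """Split a 'cmd+shift+z'-style string into (key, [modifiers]).

    Assumption for this sketch: the last '+'-separated token is the key and
    everything before it is a modifier. A literal '+' key is not handled.
    """
    parts = [p.strip().lower() for p in combo.split("+") if p.strip()]
    if not parts:
        raise ValueError(f"empty key combo: {combo!r}")
    return parts[-1], parts[:-1]


def route_key(combo: str):
    """Choose the tool name by whether any modifier is present."""
    key, modifiers = parse_key_combo(combo)
    tool = "hotkey" if modifiers else "press_key"
    return tool, key, modifiers
```

For instance, `route_key("cmd+s")` yields `("hotkey", "s", ["cmd"])`, while a bare `route_key("Return")` yields `("press_key", "return", [])`.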
**Helpers**: _parse_windows_from_text, _parse_elements_from_tree, _split_tree_text, _parse_key_combo extracted as module-level functions.

## schema.py

Added set_value to the action enum with a description explaining when to prefer it over click (select/popup elements, sliders, no focus steal). Added value field for set_value payloads.

## tool.py

Routed set_value action through _dispatch to backend.set_value. Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated). Fixed MIME-type detection in _capture_response: cua-driver may return JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png) rather than hardcoding image/png.

## agent/display.py + run_agent.py

Guard _detect_tool_failure and result-preview logic against non-string function_result values: multimodal tool results (dicts with _multimodal=True) are not string-sliceable; treat them as successes and fall back to str() for length/preview.
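The MIME detection described under tool.py relies on the fact that a JPEG file begins with the bytes `FF D8 FF`, which base64-encode to `/9j/`. A minimal sketch of that check follows; the helper names `detect_image_mime` and `to_data_url` are hypothetical and not taken from `_capture_response`.

```python
def detect_image_mime(b64_data: str) -> str:
    """Guess the MIME type of a base64-encoded screenshot.

    JPEG data starts with bytes FF D8 FF, which encode to '/9j/'.
    Anything else is treated as PNG, matching the rule in the commit message.
    """
    return "image/jpeg" if b64_data.startswith("/9j/") else "image/png"


def to_data_url(b64_data: str) -> str:
    """Build a data URL suitable for an OpenAI-style image_url part."""
    return f"data:{detect_image_mime(b64_data)};base64,{b64_data}"
```

Checking the encoded prefix avoids decoding the whole payload just to read three bytes.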
…r non-Anthropic providers
Tool handlers (e.g. computer_use capture) return a _multimodal envelope
dict when a screenshot is attached. The tool-message builder was passing
this raw dict as the `content` field of role:tool messages, which is an
illegal format — OpenAI-compatible APIs expect a string or a content-parts
list, not a plain Python dict, and would reject it with a 400/422 error.
Fix: unwrap _multimodal results to their `content` list
([{type:text,...},{type:image_url,...}]) in both the parallel and
sequential tool-call paths. The Anthropic adapter already handles content
lists natively; vision-capable OpenAI-compatible servers (mlx-vlm,
GPT-4o, etc.) accept image_url parts in tool messages directly.
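The unwrapping step described above can be illustrated with a small sketch. The function names `tool_message_content` and `build_tool_message` are assumptions for this example; only the `_multimodal` flag and the `content` list come from the commit message.

```python
def tool_message_content(function_result):
    """Return a value that is legal as `content` on a role:tool message.

    A _multimodal envelope is unwrapped to its content-parts list;
    anything else is passed through as a string.
    """
    if isinstance(function_result, dict) and function_result.get("_multimodal"):
        return function_result["content"]
    if isinstance(function_result, str):
        return function_result
    return str(function_result)


def build_tool_message(tool_call_id, function_result):
    """Assemble the tool message, keeping the tool_call_id linkage."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": tool_message_content(function_result),
    }
```

Routing both the parallel and sequential paths through one helper like this would keep the two build sites from drifting apart.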
Also add a _vision_supported adaptive fallback: on first image-rejection
error ("Only 'text' content type is supported." etc.) the agent strips all
image parts from the message history and retries with text only, so
text-only endpoints degrade gracefully without crashing the session.
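The adaptive fallback above amounts to a one-way switch: after the first image rejection, the rest of the run goes text-only. The sketch below shows that flow under several assumptions: `ProviderError`, `send`, `state`, and `drop_image_parts` are hypothetical stand-ins, and the single matched phrase is the one quoted in the commit message.

```python
class ProviderError(Exception):
    """Hypothetical stand-in for whatever error the API client raises."""


def drop_image_parts(messages):
    """Return new messages with image_url parts removed from list content."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            kept = [
                p for p in content
                if not (isinstance(p, dict) and p.get("type") == "image_url")
            ]
            # Keep the message; fall back to a text stub if nothing remains.
            msg = {**msg, "content": kept or "[image omitted]"}
        cleaned.append(msg)
    return cleaned


def call_with_vision_fallback(send, messages, state):
    """Call `send(messages)`; on the first image rejection, remember that
    vision is unsupported and retry (and keep sending) text-only."""
    if not state.get("vision_supported", True):
        return send(drop_image_parts(messages))
    try:
        return send(messages)
    except ProviderError as exc:
        if "only 'text' content type is supported" not in str(exc).lower():
            raise  # unrelated error: leave it to other handlers
        state["vision_supported"] = False
        return send(drop_image_parts(messages))
```

Storing the flag in `state` means the failed round-trip is paid only once per run rather than on every turn.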
Follow-up to #15328's vision-unsupported retry branch in run_agent.py.

_strip_images_from_messages() previously deleted any message whose content was entirely images. That's fine for synthetic user messages injected for attachment delivery, but it breaks providers for tool-role messages: the paired tool_call_id on the preceding assistant message ends up unmatched, which OpenAI-compatible APIs reject with HTTP 400.

Fix: tool-role messages whose content becomes empty are replaced with a plaintext placeholder that preserves the tool_call_id linkage. Only non-tool messages are dropped. Added 10 tests covering the role-alternation invariants + image-type coverage.

Image-rejection detector: expanded phrase list (image content not supported / multimodal input / vision input / model does not support image) and gated on 4xx status so transient 5xx errors never get misinterpreted as 'server said no to images'. Detection is documented as best-effort English phrase matching.

AUTHOR_MAP: mapped 3820588+ddupont808@users.noreply.github.com to ddupont808 so release notes attribute the salvage correctly.
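The two rules in this commit message (keep tool-role messages so `tool_call_id` stays matched, and treat an error as an image rejection only on a 4xx status plus a phrase match) can be sketched as below. This is an illustration, not the code in run_agent.py: the function names, the placeholder text, and the `IMAGE_TYPES` set are assumptions, while the phrases are the ones quoted in the commit messages.

```python
# Assumed set of part types that count as images for this sketch.
IMAGE_TYPES = {"image_url", "image", "input_image"}

# Phrases quoted in the commit messages; matching is best-effort English.
REJECTION_PHRASES = (
    "only 'text' content type is supported",
    "image content not supported",
    "multimodal input",
    "vision input",
    "model does not support image",
)

PLACEHOLDER = "[image removed: endpoint does not accept images]"  # hypothetical


def strip_images_from_messages(messages):
    """Remove image parts; drop emptied non-tool messages, but keep
    tool-role messages with a text placeholder so tool_call_id still pairs."""
    cleaned = []
    for msg in messages:
        content = msg.get("content") if isinstance(msg, dict) else None
        if not isinstance(content, list):
            cleaned.append(msg)  # strings and non-dict entries pass through
            continue
        kept = [
            p for p in content
            if not (isinstance(p, dict) and p.get("type") in IMAGE_TYPES)
        ]
        if kept:
            cleaned.append({**msg, "content": kept})
        elif msg.get("role") == "tool":
            cleaned.append({**msg, "content": PLACEHOLDER})
        # Otherwise: an image-only non-tool message is dropped.
    return cleaned


def is_image_rejection(status_code: int, body: str) -> bool:
    """True only for a 4xx response whose body matches a known phrase."""
    if not 400 <= status_code < 500:
        return False  # 5xx and others are never read as "no images"
    lowered = body.lower()
    return any(phrase in lowered for phrase in REJECTION_PHRASES)
```

Returning new dicts instead of editing in place leaves the original history available if the caller needs it.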
    logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s")
    logging.debug(f"Tool result ({len(function_result)} chars): {function_result}")
    _log_result = _multimodal_text_summary(function_result)
    logging.debug(f"Tool result ({len(_log_result)} chars): {_log_result}")
Shipped via #21967, re-salvaged onto current main after this branch went stale. All four commits cherry-picked with authorship preserved (2× @teknium1 + 2× @ddupont808). Merged via rebase so per-commit attribution lands in git log.
Summary
Ships computer_use (cua-driver backend) in one shot: foundation #14817 + contributor follow-up #15328 + a safety net for non-Anthropic providers that receive `_multimodal` tool results. Built on top of these with regression-guard hardening.

Closes #14817, closes #15328.
Commit sequence (rebase-merge preserves per-commit authorship)

1. @teknium1: `feat(computer-use): cua-driver backend, universal any-model schema` (was PR #14817)
   - `tools/computer_use/` package, universal OpenAI function-calling schema, SOM captures so any tool-capable model can drive the desktop.
2. @ddupont808: `feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection` (was PR #15328 commit 1)
   - `capture()` to `list_windows` + `get_window_state`, sticky `(pid, window_id)`, `type_text_chars` routing, `set_value` for backgrounded `AXPopUpButton` / HTML `<select>`, JPEG MIME detection, regex helpers.
3. @ddupont808: `fix(computer-use): unwrap _multimodal tool results to content list for non-Anthropic providers` (was PR #15328 commit 2)
   - The tool-message builder passed the raw `{_multimodal: True, content: [...]}` envelope as the `content` field, which OpenAI-compatible APIs reject. Now unwraps to the OpenAI-style content-parts list at both the parallel and sequential tool-msg build sites.
   - `_vision_supported` adaptive fallback: on first image-rejection error (e.g. "Only 'text' content type is supported"), the agent strips images from history and retries text-only for the rest of the `_run()` call.
4. @teknium1: `fix(computer-use): harden image-rejection fallback + AUTHOR_MAP`
   - `_strip_images_from_messages()` no longer deletes `tool`-role messages; it replaces their content with a text placeholder to preserve `tool_call_id` linkage (otherwise providers 400 with "tool_calls without matching tool response").
   - `TestStripImagesPreservesAlternation`: verifies tool_call_id linkage, content-type coverage, synthetic-message deletion rules, non-dict handling.
   - `TestImageRejectionPhraseIsolation`: proves our phrase list does NOT false-match on `image_too_large`, context overflow, or rate-limit error bodies (so those route to the correct existing handlers).
   - AUTHOR_MAP: `3820588+ddupont808@users.noreply.github.com` → `ddupont808`.
   - `tests/tools/test_registry.py` updated to include `tools.computer_use_tool` in the builtin-set.

Interaction with native multimodal routing (#16506)
The native-vision feature already handles the common cases proactively:

- `_prepare_messages_for_non_vision_model` substitutes images with cached text BEFORE the API call
- `_try_shrink_image_parts_in_messages` reactive shrink
- `image_too_large` → routed to shrink handler via error_classifier

Our `_vision_supported` fallback is a fourth-layer net for models whose capability detection claims vision support but whose specific deployment rejects images at runtime (mlx-lm, misconfigured proxies, text-only endpoints). The three existing paths run first; ours only fires on 4xx + phrase match.

Phrase isolation is test-guarded: `TestImageRejectionPhraseIsolation` asserts the phrase list does not false-match on any known `image_too_large` / `context_overflow` / `rate_limit` body.

Validation
- `tests/run_agent/test_image_rejection_fallback.py` (new)
- `tests/tools/test_computer_use.py`
- `tests/tools/test_registry.py`
- `tests/agent/test_image_routing.py` + `test_vision_aware_preprocessing.py` + `test_image_shrink_recovery.py` + `test_compressor_image_tokens.py` (native vision feature)
- `origin/main` unrelated to this PR (`test_custom_base_url`, `test_read_text_file_redacts_sensitive_content`, `test_custom_endpoint_uses_codex_wrapper`)
- `_strip_images_from_messages` preserves `tool_call_id` under all tested scenarios; `computer_use` tool registered

Caveats
- `set_value` and `structuredContent` on `list_windows` / `launch_app`.
- `drag` is not implemented; cua-driver does not expose a drag tool.