feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (salvage #14817 + #15328) by teknium1 · Pull Request #16936 · NousResearch/hermes-agent

teknium1 · 2026-04-28T09:13:34Z

Summary

Ships computer_use (cua-driver backend) in one shot: the foundation (#14817), the contributor follow-up (#15328), and a safety net for non-Anthropic providers that receive _multimodal tool results, with regression-guard hardening on top.

Closes #14817, closes #15328.

Commit sequence (rebase-merge preserves per-commit authorship)

@teknium1 — feat(computer-use): cua-driver backend, universal any-model schema (was PR #14817)
- tools/computer_use/ package, universal OpenAI function-calling schema, SOM captures so any tool-capable model can drive the desktop.
@ddupont808 — feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection (was PR #15328 commit 1)
- Rewires capture() to list_windows + get_window_state, sticky (pid, window_id), type_text_chars routing, set_value for backgrounded AXPopUpButton / HTML <select>, JPEG MIME detection, regex helpers.
@ddupont808 — fix(computer-use): unwrap _multimodal tool results to content list for non-Anthropic providers (was PR #15328 commit 2)
- Tool-message builder was passing the raw {_multimodal: True, content: [...]} envelope as the content field, which OpenAI-compatible APIs reject. Now unwraps to the OpenAI-style content-parts list at both the parallel and sequential tool-msg build sites.
- Adds a _vision_supported adaptive fallback: on first image-rejection error (e.g. "Only 'text' content type is supported"), the agent strips images from history and retries text-only for the rest of the _run() call.
@teknium1 — fix(computer-use): harden image-rejection fallback + AUTHOR_MAP
- _strip_images_from_messages() no longer deletes tool-role messages — replaces their content with a text placeholder to preserve tool_call_id linkage (otherwise providers 400 with "tool_calls without matching tool response").
- Phrase list expanded (image content / multimodal input / vision input / model does not support image).
- 4xx-only gate so transient 5xx/timeout errors never get misclassified as "server rejected images".
- 14 regression tests across two classes:
  - TestStripImagesPreservesAlternation — verifies tool_call_id linkage, content-type coverage, synthetic-message deletion rules, non-dict handling.
  - TestImageRejectionPhraseIsolation — proves our phrase list does NOT false-match on image_too_large, context overflow, or rate-limit error bodies (so those route to the correct existing handlers).
- AUTHOR_MAP: 3820588+ddupont808@users.noreply.github.com → ddupont808.
- tests/tools/test_registry.py updated to include tools.computer_use_tool in the builtin-set.

Interaction with native multimodal routing (#16506)

The native-vision feature already handles the common cases proactively:

- Non-vision models → _prepare_messages_for_non_vision_model substitutes images with cached text BEFORE the API call
- Anthropic size limit → _try_shrink_image_parts_in_messages reactive shrink
- Classified image_too_large → routed to shrink handler via error_classifier

Our _vision_supported fallback is a fourth-layer net for models whose capability detection claims vision support but whose specific deployment rejects images at runtime (mlx-lm, misconfigured proxies, text-only endpoints). The three existing paths run first; ours only fires on 4xx + phrase match.

Phrase isolation is test-guarded: TestImageRejectionPhraseIsolation asserts the phrase list does not false-match on any known image_too_large / context_overflow / rate_limit body.
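
The gate described above (4xx status plus phrase match) can be sketched roughly as follows. This is a minimal illustration, not the code in run_agent.py: the function name `looks_like_image_rejection` and the exact contents of `IMAGE_REJECTION_PHRASES` are assumptions, built from the phrases quoted in this PR.

```python
from typing import Optional

# Assumed phrase list: the first entry is the error quoted in this PR, the
# rest are the expanded phrases it mentions. The real list may differ.
IMAGE_REJECTION_PHRASES = (
    "only 'text' content type is supported",
    "image content",
    "multimodal input",
    "vision input",
    "model does not support image",
)


def looks_like_image_rejection(status_code: Optional[int], body: str) -> bool:
    """Return True only for a 4xx response whose body matches a known phrase."""
    # 4xx-only gate: 5xx responses, timeouts, and errors without a status
    # code are never treated as "the server rejected images".
    if status_code is None or not (400 <= status_code < 500):
        return False
    lowered = body.lower()
    return any(phrase in lowered for phrase in IMAGE_REJECTION_PHRASES)
```

Under these assumptions a 503 carrying a matching phrase is ignored, and a 400 `image_too_large` body does not match, so it can still reach the shrink handler.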

Validation

| Check | Result |
| --- | --- |
| `tests/run_agent/test_image_rejection_fallback.py` (new) | 14/14 pass |
| `tests/tools/test_computer_use.py` | 57/57 pass |
| `tests/tools/test_registry.py` | 11/11 pass (with new entry) |
| `tests/agent/test_image_routing.py` + `test_vision_aware_preprocessing.py` + `test_image_shrink_recovery.py` + `test_compressor_image_tokens.py` (native vision feature) | 63/63 pass |
| Broader agent + tools suite (tests/agent/, tests/tools/) | 6008 pass, 2 pre-existing failures on `origin/main` unrelated to this PR (`test_custom_base_url`, `test_read_text_file_redacts_sensitive_content`, `test_custom_endpoint_uses_codex_wrapper`) |
| E2E import smoke test | `_strip_images_from_messages` preserves `tool_call_id` under all tested scenarios; `computer_use` tool registered |

Caveats

- Requires cua-driver ≥ 0.0.4 for set_value and structuredContent on list_windows/launch_app.
- drag is not implemented: cua-driver does not expose a drag tool.
- Image-rejection detection is best-effort English phrase matching; locale-translated or heavily reworded upstream errors will bypass the guard and fall through to the normal error handler. The phrase list is extended when new wordings appear in the wild.

@0xbyt4

Background macOS desktop control via cua-driver MCP: does NOT steal the user's cursor or keyboard focus, works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices.

- `tools/computer_use/` package: swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block, injected when the toolset is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md`: universal (model-agnostic) workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog entries.

44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites.

- `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919`: deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header.
- macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions; the post-setup prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.
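
The screenshot eviction and flat per-image token estimate described in this commit can be illustrated with a short sketch. It operates on OpenAI-style content parts rather than inside convert_messages_to_anthropic, and every name here (`estimate_tokens`, `evict_old_images`, the placeholder text, the 4-chars-per-token heuristic) is an assumption for illustration; only the numbers 3 and 1500 come from the commit message.

```python
from typing import Any, Dict, List

IMAGE_TOKEN_COST = 1500   # flat per-image estimate quoted in the commit message
KEEP_RECENT_IMAGES = 3    # number of screenshots that keep real image data
CHARS_PER_TOKEN = 4       # assumed text heuristic: 1_000_000 chars -> 250_000 tokens

Message = Dict[str, Any]


def _is_image(part: Any) -> bool:
    return isinstance(part, dict) and part.get("type") == "image_url"


def estimate_tokens(messages: List[Message]) -> int:
    """Count text by length and every image as a flat IMAGE_TOKEN_COST."""
    total = 0
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            total += len(content) // CHARS_PER_TOKEN
        elif isinstance(content, list):
            for part in content:
                if _is_image(part):
                    total += IMAGE_TOKEN_COST
                elif isinstance(part, dict):
                    total += len(str(part.get("text", ""))) // CHARS_PER_TOKEN
    return total


def evict_old_images(messages: List[Message], keep: int = KEEP_RECENT_IMAGES) -> List[Message]:
    """Keep the `keep` most recent images; replace older ones with text parts."""
    remaining = keep
    out: List[Message] = []
    # Walk newest to oldest so the most recent screenshots are kept first.
    for msg in reversed(messages):
        content = msg.get("content")
        if isinstance(content, list):
            new_parts = []
            for part in reversed(content):
                if _is_image(part) and remaining > 0:
                    remaining -= 1
                    new_parts.append(part)
                elif _is_image(part):
                    new_parts.append({"type": "text", "text": "[screenshot removed]"})
                else:
                    new_parts.append(part)
            new_parts.reverse()
            msg = {**msg, "content": new_parts}  # copy, do not mutate the input
        out.append(msg)
    out.reverse()
    return out
```

With five screenshot-bearing tool messages, this sketch leaves image data only on the last three, and the estimate no longer scales with base64 length.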

…ured windows, MIME detection

Extends the cua-driver computer-use backend to drive backgrounded macOS windows without stealing keyboard or mouse focus from the foreground app. All changes target the cua-driver MCP backend and the shared dispatcher.

## cua_backend.py

**Window-aware capture**: capture() now calls list_windows + get_window_state instead of the removed capture tool. Prefers structuredContent.windows (MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back to regex-parsed text for older builds. Stores the selected (pid, window_id) as sticky context so subsequent action calls do not need a redundant round-trip.

**Action routing**: click/scroll/type_text/key all carry the sticky pid (and window_id for element-indexed clicks). type_text routes through type_text_chars (individual key events) rather than AX attribute write -- WebKit AXTextFields reject attribute writes from backgrounded processes.

**Key parsing**: _parse_key_combo splits cmd+s-style strings into (key, [modifiers]) and routes to hotkey (modifier present) or press_key (bare key) -- cua-driver actual tool names.

**set_value method**: new set_value(value, element) calls the cua-driver set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari, AXPress opens the native macOS popup which closes immediately when the app is non-frontmost; set_value AX-presses the matching child option directly (no menu required, no focus steal).

**focus_app**: reimplemented as a pure window-selector (enumerates list_windows, sets sticky pid/window_id) without ever raising the window or stealing focus.

**list_apps**: fixed tool name from listApps to list_apps; handles plain-text response via regex when structured data is absent.

**Structured-content extraction**: _extract_tool_result now surfaces structuredContent from MCP results, enabling the list_windows window array without text parsing.

**Helpers**: _parse_windows_from_text, _parse_elements_from_tree, _split_tree_text, _parse_key_combo extracted as module-level functions.

## schema.py

Added set_value to the action enum with a description explaining when to prefer it over click (select/popup elements, sliders, no focus steal). Added value field for set_value payloads.

## tool.py

Routed set_value action through _dispatch to backend.set_value. Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated). Fixed MIME-type detection in _capture_response: cua-driver may return JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png) rather than hardcoding image/png.

## agent/display.py + run_agent.py

Guard _detect_tool_failure and result-preview logic against non-string function_result values: multimodal tool results (dicts with _multimodal=True) are not string-sliceable; treat them as successes and fall back to str() for length/preview.
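
Two of the small helpers described in this commit, key-combo parsing and MIME detection from base64 magic bytes, can be approximated as below. This is a sketch under stated assumptions rather than the code in cua_backend.py or tool.py: the modifier set, the function names, and the handling of malformed input are all hypothetical. Only the `hotkey` / `press_key` tool names and the `/9j/` prefix rule come from the commit message.

```python
from typing import List, Tuple

# Assumed modifier names; the real helper may accept more aliases.
MODIFIERS = {"cmd", "ctrl", "alt", "shift"}


def parse_key_combo(combo: str) -> Tuple[str, List[str]]:
    """Split a string such as "cmd+s" into (key, [modifiers])."""
    parts = [p.strip().lower() for p in combo.split("+") if p.strip()]
    modifiers = [p for p in parts if p in MODIFIERS]
    keys = [p for p in parts if p not in MODIFIERS]
    # If several non-modifier keys are given, this sketch keeps the last one.
    key = keys[-1] if keys else ""
    return key, modifiers


def pick_key_tool(combo: str) -> str:
    """Route to hotkey when a modifier is present, otherwise press_key."""
    _, modifiers = parse_key_combo(combo)
    return "hotkey" if modifiers else "press_key"


def detect_image_mime(b64_data: str) -> str:
    """JPEG data starts with bytes FF D8 FF, which base64-encode to "/9j/"."""
    return "image/jpeg" if b64_data.startswith("/9j/") else "image/png"
```

For example, `parse_key_combo("cmd+shift+s")` yields `("s", ["cmd", "shift"])` and routes to `hotkey`, while a bare `"return"` routes to `press_key`.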

…r non-Anthropic providers

Tool handlers (e.g. computer_use capture) return a _multimodal envelope dict when a screenshot is attached. The tool-message builder was passing this raw dict as the `content` field of role:tool messages, which is an illegal format: OpenAI-compatible APIs expect a string or a content-parts list, not a plain Python dict, and would reject it with a 400/422 error.

Fix: unwrap _multimodal results to their `content` list ([{type:text,...},{type:image_url,...}]) in both the parallel and sequential tool-call paths. The Anthropic adapter already handles content lists natively; vision-capable OpenAI-compatible servers (mlx-vlm, GPT-4o, etc.) accept image_url parts in tool messages directly.

Also add a _vision_supported adaptive fallback: on first image-rejection error ("Only 'text' content type is supported." etc.) the agent strips all image parts from the message history and retries with text only, so text-only endpoints degrade gracefully without crashing the session.
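
The unwrap step can be pictured with the following sketch. The helper names `tool_message_content` and `build_tool_message` are hypothetical, and the JSON fallback for other non-string results is an assumption; only the envelope shape (`_multimodal` flag plus a `content` parts list) is taken from the commit message.

```python
import json
from typing import Any, Dict, List, Union

ToolContent = Union[str, List[Dict[str, Any]]]


def tool_message_content(function_result: Any) -> ToolContent:
    """Return a value that is legal as the content of a role:tool message."""
    # Multimodal envelope: pass the OpenAI-style parts list, not the wrapper dict.
    if isinstance(function_result, dict) and function_result.get("_multimodal"):
        return list(function_result.get("content") or [])
    if isinstance(function_result, str):
        return function_result
    # Assumed fallback for any other non-string result.
    return json.dumps(function_result, default=str)


def build_tool_message(tool_call_id: str, function_result: Any) -> Dict[str, Any]:
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": tool_message_content(function_result),
    }
```

Routing both the parallel and sequential build sites through one helper of this kind would keep the two paths from drifting apart, though the PR does not say whether the actual fix is shared or duplicated.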

Follow-up to #15328's vision-unsupported retry branch in run_agent.py.

_strip_images_from_messages() previously deleted any message whose content was entirely images. That's fine for synthetic user messages injected for attachment delivery, but it breaks providers for tool-role messages: the paired tool_call_id on the preceding assistant message ends up unmatched, which OpenAI-compatible APIs reject with HTTP 400.

Fix: tool-role messages whose content becomes empty are replaced with a plaintext placeholder that preserves the tool_call_id linkage. Only non-tool messages are dropped. Added 10 tests covering the role-alternation invariants + image-type coverage.

Image-rejection detector: expanded phrase list (image content not supported / multimodal input / vision input / model does not support image) and gated on 4xx status so transient 5xx errors never get misinterpreted as 'server said no to images'. Detection is documented as best-effort English phrase matching.

AUTHOR_MAP: mapped 3820588+ddupont808@users.noreply.github.com to ddupont808 so release notes attribute the salvage correctly.
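
The placeholder rule for tool-role messages can be sketched as follows. This is not the body of `_strip_images_from_messages()`: the function name `strip_images`, the placeholder wording, the set of image part types, and the pass-through of non-dict entries are assumptions made for illustration.

```python
from typing import Any, List

# Assumed placeholder wording and image part types.
PLACEHOLDER = "[image removed: endpoint does not accept image input]"
IMAGE_TYPES = {"image_url", "image", "input_image"}


def _is_image_part(part: Any) -> bool:
    return isinstance(part, dict) and part.get("type") in IMAGE_TYPES


def strip_images(messages: List[Any]) -> List[Any]:
    """Remove image parts while keeping every tool message in place."""
    out: List[Any] = []
    for msg in messages:
        content = msg.get("content") if isinstance(msg, dict) else None
        if not isinstance(content, list):
            out.append(msg)  # string content, None, or non-dict entry: unchanged
            continue
        kept = [part for part in content if not _is_image_part(part)]
        if kept:
            out.append({**msg, "content": kept})
        elif msg.get("role") == "tool":
            # Keep the message so its tool_call_id still answers the
            # assistant's tool call; only the payload becomes text.
            out.append({**msg, "content": PLACEHOLDER})
        # Image-only non-tool messages (e.g. synthetic user messages) are dropped.
    return out
```

The point of the tool-role branch is that each assistant `tool_calls` entry must still be followed by a tool message carrying the same `tool_call_id`, even after all of its image content is gone.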

                logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s")
-                logging.debug(f"Tool result ({len(function_result)} chars): {function_result}")
+                _log_result = _multimodal_text_summary(function_result)
+                logging.debug(f"Tool result ({len(_log_result)} chars): {_log_result}")


teknium1 · 2026-05-08T18:07:54Z

Shipped via #21967 — re-salvaged onto current main after this branch went stale. All four commits cherry-picked with authorship preserved (2× @teknium1 + 2× @ddupont808). Merged via rebase so per-commit attribution lands in git log.

teknium1 and others added 4 commits April 28, 2026 02:13

github-advanced-security AI found potential problems Apr 28, 2026

alt-glitch added the labels type/feature (New feature or request), P3 (Low: cosmetic, nice to have), comp/tools (Tool registry, model_tools, toolsets), and comp/agent (Core agent loop, run_agent.py, prompt builder) on Apr 28, 2026

ddupont808 mentioned this pull request May 1, 2026

feat(computer-use): launch_app action, urls param, hidden-app capture fix (requires cua-driver ≥ 0.1.0) #18519

Open

3 tasks

teknium1 mentioned this pull request May 8, 2026

feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (re-salvage #16936) #21967

Merged

teknium1 closed this in #21967 May 8, 2026

teknium1 wants to merge 4 commits into main from hermes/hermes-c8604b32


4 participants
