feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (re-salvage #16936) by teknium1 · Pull Request #21967 · NousResearch/hermes-agent

teknium1 · 2026-05-08T16:31:05Z

Summary

Ships computer_use (cua-driver backend) on fresh main — re-salvage of PR #16936 after it went stale. Supersedes #4562, #14817, #16936. Credit @ddupont808 (#15328) and @0xbyt4 (#3816).

What it does

Background macOS desktop control via cua-driver — no cursor-steal, no focus-steal, no Space switch — with a universal OpenAI function-calling schema so any tool-capable model can drive the desktop, not just Anthropic.

Commit sequence (rebase-merge preserves per-commit authorship)

@teknium1 — feat(computer-use): cua-driver backend, universal any-model schema
- tools/computer_use/ package, universal schema, SOM captures.
@ddupont808 — feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection
- capture() → list_windows + get_window_state, sticky (pid, window_id), type_text_chars routing, set_value for backgrounded AXPopUpButton / HTML <select>, JPEG MIME detection, regex helpers.
@ddupont808 — fix(computer-use): unwrap _multimodal tool results to content list for non-Anthropic providers
- Tool-message builder unwraps {_multimodal: True, content: [...]} envelopes to OpenAI-style content-parts at parallel + sequential build sites.
- Adds _vision_supported adaptive fallback: on first image-rejection error, strips images from history and retries text-only for the rest of _run().
@teknium1 — fix(computer-use): harden image-rejection fallback + AUTHOR_MAP
- _strip_images_from_messages() preserves tool_call_id linkage (replaces tool-role content with text placeholder instead of deleting — otherwise providers 400 on "tool_calls without matching tool response").
- Phrase list expanded; 4xx-only gate so transient 5xx/timeout aren't misclassified.
- 14 regression tests (TestStripImagesPreservesAlternation, TestImageRejectionPhraseIsolation).
- AUTHOR_MAP entry for ddupont808.

Conflict resolution (vs stale PR #16936 branch)

Rebased onto current main; 4 conflicts resolved by keeping both sides:

agent/context_compressor.py — kept HEAD's broader not isinstance(content, str) guard (already catches multimodal dicts), added comment about _multimodal envelopes.
toolsets.py — _HERMES_CORE_TOOLS now has both kanban entries (HEAD) and computer_use (PR).
run_agent.py (3 sites) — kept HEAD's "name": function_name field in tool_msg + PR's _tool_content unwrap; kept both _tool_guardrails.reset_for_turn() (HEAD) and _vision_supported = True (PR) turn resets.
website/docs/reference/toolsets-reference.md — both rows preserved, HEAD's richer image_gen description kept.

Interaction with native multimodal routing (#16506)

Native-vision handles common cases proactively. _vision_supported fallback is a 4th-layer net for models whose capability detection claims vision but whose specific deployment rejects images at runtime (mlx-lm, misconfigured proxies, text-only endpoints). Fires only on 4xx + phrase match. Phrase isolation is test-guarded.
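The "4xx + phrase match" gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in run_agent.py: the function name `looks_like_image_rejection` and the constant `IMAGE_REJECTION_PHRASES` are hypothetical, and the phrase list is seeded only from the wordings quoted in this PR.

```python
from typing import Optional

# Hypothetical phrase list, taken from the wordings quoted in this PR.
# The real list in run_agent.py may differ.
IMAGE_REJECTION_PHRASES = (
    "only 'text' content type is supported",
    "image content not supported",
    "multimodal input",
    "vision input",
    "model does not support image",
)


def looks_like_image_rejection(status_code: Optional[int], message: str) -> bool:
    """Return True only for a 4xx error whose text matches a known phrase."""
    # Gate on 4xx first so transient 5xx errors and timeouts (no status)
    # are never classified as "the server refused images".
    if status_code is None or not 400 <= status_code < 500:
        return False
    lowered = message.lower()
    return any(phrase in lowered for phrase in IMAGE_REJECTION_PHRASES)
```

Checking the status code before the phrases keeps the fallback narrow: a 503 whose body happens to mention images still goes to the normal error handler.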

Validation

| Check | Result |
| --- | --- |
| `tests/tools/test_computer_use.py` | 57/57 pass |
| `tests/run_agent/test_image_rejection_fallback.py` | 14/14 pass |
| `tests/tools/test_registry.py` | 18/18 pass |
| Adjacent vision suites (`test_image_routing`, `test_compressor_image_tokens`, `test_vision_resolved_args`) | 45/45 pass |
| E2E: `computer_use` registered in `_HERMES_CORE_TOOLS`, package imports, `_is_multimodal_tool_result` + `_strip_images_from_messages` + `_vision_supported` helpers present in run_agent | pass |
| Syntax check on run_agent.py / toolsets.py / context_compressor.py / model_tools.py / cli.py | pass |

Caveats

Requires cua-driver ≥ 0.0.4 for set_value and structuredContent on list_windows/launch_app.
drag is not implemented — cua-driver does not expose a drag tool.
Image-rejection detection is best-effort English phrase matching; locale-translated wordings will bypass the guard and fall through to the normal error handler.

Closes

Closes #16936 (re-salvage onto current main)
Closes #14817 (superseded by #16936 which this supersedes)
Closes #4562 (original macOS-only Anthropic-only version, superseded)

@0xbyt4

Background macOS desktop control via cua-driver MCP. It does NOT steal the user's cursor or keyboard focus, and it works with any tool-capable model. Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices.

- `tools/computer_use/` package: swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block, injected when the toolset is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md`: universal (model-agnostic) workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog entries.

44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites.

- `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919`: deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header.
- macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions; the post-setup prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.
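To illustrate the image-aware token estimation mentioned in this commit, here is a minimal sketch. It assumes OpenAI-style content parts and a rough 4-characters-per-token heuristic for text (consistent with the "~1MB would have registered as ~250K tokens" remark); the names `estimate_content_tokens`, `IMAGE_TOKEN_COST`, and `CHARS_PER_TOKEN` are hypothetical and not taken from the repository.

```python
from typing import Any

IMAGE_TOKEN_COST = 1500  # flat per-image cost stated in the commit message
CHARS_PER_TOKEN = 4      # assumed rough heuristic for plain text


def estimate_content_tokens(content: Any) -> int:
    """Estimate tokens for a message's content without counting base64 bytes."""
    if isinstance(content, str):
        return len(content) // CHARS_PER_TOKEN
    if isinstance(content, list):
        total = 0
        for part in content:
            if isinstance(part, dict) and part.get("type") == "image_url":
                # Charge a flat cost instead of the length of the data URL.
                total += IMAGE_TOKEN_COST
            elif isinstance(part, dict):
                total += len(str(part.get("text", ""))) // CHARS_PER_TOKEN
            else:
                total += len(str(part)) // CHARS_PER_TOKEN
        return total
    # Fallback for unexpected shapes: measure the string form.
    return len(str(content)) // CHARS_PER_TOKEN
```

With this shape, a message carrying a 1 MB base64 screenshot plus a 40-character caption is estimated at 1510 tokens rather than roughly 250,000.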

…ured windows, MIME detection

Extends the cua-driver computer-use backend to drive backgrounded macOS windows without stealing keyboard or mouse focus from the foreground app. All changes target the cua-driver MCP backend and the shared dispatcher.

## cua_backend.py

**Window-aware capture**: capture() now calls list_windows + get_window_state instead of the removed capture tool. Prefers structuredContent.windows (MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back to regex-parsed text for older builds. Stores the selected (pid, window_id) as sticky context so subsequent action calls do not need a redundant round-trip.

**Action routing**: click/scroll/type_text/key all carry the sticky pid (and window_id for element-indexed clicks). type_text routes through type_text_chars (individual key events) rather than AX attribute write -- WebKit AXTextFields reject attribute writes from backgrounded processes.

**Key parsing**: _parse_key_combo splits cmd+s-style strings into (key, [modifiers]) and routes to hotkey (modifier present) or press_key (bare key) -- cua-driver actual tool names.

**set_value method**: new set_value(value, element) calls the cua-driver set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari, AXPress opens the native macOS popup which closes immediately when the app is non-frontmost; set_value AX-presses the matching child option directly (no menu required, no focus steal).

**focus_app**: reimplemented as a pure window-selector (enumerates list_windows, sets sticky pid/window_id) without ever raising the window or stealing focus.

**list_apps**: fixed tool name from listApps to list_apps; handles plain-text response via regex when structured data is absent.

**Structured-content extraction**: _extract_tool_result now surfaces structuredContent from MCP results, enabling the list_windows window array without text parsing.
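The key-parsing behaviour described above (split a combo, then pick hotkey or press_key) might look roughly like this. It is a sketch under assumptions: the functions `parse_key_combo` and `tool_for_combo` are hypothetical stand-ins for the private helper, the last `+`-separated token is assumed to be the key, and a literal `+` key is not handled.

```python
from typing import List, Tuple


def parse_key_combo(combo: str) -> Tuple[str, List[str]]:
    """Split a 'cmd+shift+s'-style string into (key, [modifiers])."""
    parts = [p.strip().lower() for p in combo.split("+") if p.strip()]
    if not parts:
        raise ValueError(f"empty key combo: {combo!r}")
    # Assumption: the final token is the key, everything before it a modifier.
    return parts[-1], parts[:-1]


def tool_for_combo(combo: str) -> str:
    """Choose the cua-driver tool name: hotkey with modifiers, else press_key."""
    _key, modifiers = parse_key_combo(combo)
    return "hotkey" if modifiers else "press_key"
```

For example, `parse_key_combo("cmd+s")` yields `("s", ["cmd"])` and routes to `hotkey`, while a bare `"return"` routes to `press_key`.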
**Helpers**: _parse_windows_from_text, _parse_elements_from_tree, _split_tree_text, _parse_key_combo extracted as module-level functions.

## schema.py

Added set_value to the action enum with a description explaining when to prefer it over click (select/popup elements, sliders, no focus steal). Added value field for set_value payloads.

## tool.py

Routed set_value action through _dispatch to backend.set_value. Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated). Fixed MIME-type detection in _capture_response: cua-driver may return JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png) rather than hardcoding image/png.

## agent/display.py + run_agent.py

Guard _detect_tool_failure and result-preview logic against non-string function_result values: multimodal tool results (dicts with _multimodal=True) are not string-sliceable; treat them as successes and fall back to str() for length/preview.
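The MIME detection rule described for tool.py can be sketched as below. The function name `detect_image_mime` is hypothetical; the rule itself (a `/9j/` prefix means JPEG, anything else is treated as PNG) is the one stated in the commit message. It works because the JPEG magic bytes `FF D8 FF` always encode to `/9j/` in base64.

```python
import base64


def detect_image_mime(b64_data: str) -> str:
    """Guess the MIME type of a base64-encoded screenshot from its prefix."""
    # JPEG files start with FF D8 FF, which base64-encodes to "/9j/".
    return "image/jpeg" if b64_data.startswith("/9j/") else "image/png"


# Small demonstration with synthetic headers (not real images).
jpeg_b64 = base64.b64encode(b"\xff\xd8\xff\xe0" + b"\x00" * 8).decode("ascii")
png_b64 = base64.b64encode(b"\x89PNG\r\n\x1a\n").decode("ascii")
print(detect_image_mime(jpeg_b64))  # image/jpeg
print(detect_image_mime(png_b64))   # image/png
```

Note that the fallback is deliberately coarse: any non-JPEG payload is labelled PNG, matching the "else image/png" wording above.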

…r non-Anthropic providers

Tool handlers (e.g. computer_use capture) return a _multimodal envelope dict when a screenshot is attached. The tool-message builder was passing this raw dict as the `content` field of role:tool messages, which is an illegal format: OpenAI-compatible APIs expect a string or a content-parts list, not a plain Python dict, and would reject it with a 400/422 error.

Fix: unwrap _multimodal results to their `content` list ([{type:text,...},{type:image_url,...}]) in both the parallel and sequential tool-call paths. The Anthropic adapter already handles content lists natively; vision-capable OpenAI-compatible servers (mlx-vlm, GPT-4o, etc.) accept image_url parts in tool messages directly.

Also add a _vision_supported adaptive fallback: on first image-rejection error ("Only 'text' content type is supported." etc.) the agent strips all image parts from the message history and retries with text only, so text-only endpoints degrade gracefully without crashing the session.
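A rough sketch of the unwrap step described in this commit follows. It is illustrative only: `is_multimodal_tool_result` and `build_tool_message` are hypothetical names rather than the helpers in run_agent.py, and the message shape simply follows the OpenAI-style tool message with the `name` field mentioned in the conflict-resolution notes.

```python
from typing import Any, Dict


def is_multimodal_tool_result(result: Any) -> bool:
    """True for envelopes shaped like {"_multimodal": True, "content": [...]}."""
    return (
        isinstance(result, dict)
        and result.get("_multimodal") is True
        and isinstance(result.get("content"), list)
    )


def build_tool_message(tool_call_id: str, function_name: str, result: Any) -> Dict[str, Any]:
    """Build a role:tool message whose content is a string or a parts list."""
    # Unwrap the envelope so the provider never sees the raw Python dict.
    content = result["content"] if is_multimodal_tool_result(result) else result
    if not isinstance(content, (str, list)):
        content = str(content)
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "name": function_name,
        "content": content,
    }
```

Plain string results pass through unchanged, so the same builder could serve both the parallel and the sequential tool-call paths.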

Follow-up to #15328's vision-unsupported retry branch in run_agent.py.

_strip_images_from_messages() previously deleted any message whose content was entirely images. That's fine for synthetic user messages injected for attachment delivery, but it breaks providers for tool-role messages: the paired tool_call_id on the preceding assistant message ends up unmatched, which OpenAI-compatible APIs reject with HTTP 400.

Fix: tool-role messages whose content becomes empty are replaced with a plaintext placeholder that preserves the tool_call_id linkage. Only non-tool messages are dropped. Added 10 tests covering the role-alternation invariants + image-type coverage.

Image-rejection detector: expanded phrase list (image content not supported / multimodal input / vision input / model does not support image) and gated on 4xx status so transient 5xx errors never get misinterpreted as 'server said no to images'. Detection is documented as best-effort English phrase matching.

AUTHOR_MAP: mapped 3820588+ddupont808@users.noreply.github.com to ddupont808 so release notes attribute the salvage correctly.
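The placeholder behaviour described in this commit can be sketched as follows. This is a simplified illustration, not the repository's `_strip_images_from_messages()`: the set of image part types and the placeholder text are assumptions made for the example.

```python
from typing import Any, Dict, List

# Assumed set of part types treated as images; the real helper may cover others.
IMAGE_PART_TYPES = {"image_url", "image", "input_image"}
# Hypothetical placeholder wording.
PLACEHOLDER = "[image removed: endpoint does not accept image input]"


def strip_images_from_messages(messages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Remove image parts while keeping every tool_call_id answered."""
    cleaned: List[Dict[str, Any]] = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            cleaned.append(msg)  # plain strings and None pass through untouched
            continue
        kept = [
            part for part in content
            if not (isinstance(part, dict) and part.get("type") in IMAGE_PART_TYPES)
        ]
        if kept:
            cleaned.append({**msg, "content": kept})
        elif msg.get("role") == "tool":
            # Keep the message so the assistant's tool_call still has a reply.
            cleaned.append({**msg, "content": PLACEHOLDER})
        # Non-tool messages that were entirely images are dropped.
    return cleaned
```

Copying each message with `{**msg, ...}` leaves the original history unmodified, which makes it easy to compare before and after in a test.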

github-actions · 2026-05-08T16:32:11Z

🔎 Lint report: `hermes/hermes-c164f8cb` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7779 on HEAD, 7766 on base (🆕 +13)

🆕 New issues (11):

| Rule | Count |
| --- | --- |
| `invalid-argument-type` | 4 |
| `unresolved-attribute` | 3 |
| `unresolved-import` | 3 |
| `invalid-assignment` | 1 |

First entries

tools/computer_use/cua_backend.py:263: [unresolved-attribute] unresolved-attribute: Attribute `call_tool` is not defined on `None` in union `None | Unknown`
tools/computer_use/cua_backend.py:621: [unresolved-attribute] unresolved-attribute: Attribute `lower` is not defined on `int` in union `Any | int`
tools/computer_use/cua_backend.py:436: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `str`, found `Any | int`
tools/computer_use/cua_backend.py:669: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `tuple[int, int, int, int]`, found `tuple[int, ...]`
tools/computer_use/cua_backend.py:219: [unresolved-import] unresolved-import: Cannot resolve imported module `mcp.client.stdio`
tools/computer_use/tool.py:394: [unresolved-attribute] unresolved-attribute: Object of type `ComputerUseBackend` has no attribute `set_value`
run_agent.py:10717: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["content"]` on object of type `str`
agent/anthropic_adapter.py:1577: [invalid-assignment] invalid-assignment: Object of type `Unknown | list[object]` is not assignable to `list[dict[str, Any]] | None`
run_agent.py:10710: [invalid-argument-type] invalid-argument-type: Argument to function `_append_subdir_hint_to_multimodal` is incorrect: Expected `dict[str, Any]`, found `str | Unknown`
tools/computer_use/cua_backend.py:218: [unresolved-import] unresolved-import: Cannot resolve imported module `mcp`
tests/tools/test_computer_use.py:11: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`

✅ Fixed issues (1):

| Rule | Count |
| --- | --- |
| `invalid-assignment` | 1 |

First entries

agent/anthropic_adapter.py:1537: [invalid-assignment] invalid-assignment: Invalid subscript assignment with key of type `Literal["cache_control"]` and value of type `dict[Unknown, Unknown]` on object of type `dict[str, str]`

Unchanged: 4069 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

                logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s")
-                logging.debug(f"Tool result ({len(function_result)} chars): {function_result}")
+                _log_result = _multimodal_text_summary(function_result)
+                logging.debug(f"Tool result ({len(_log_result)} chars): {_log_result}")


teknium1 and others added 4 commits May 8, 2026 09:28

github-advanced-security AI found potential problems May 8, 2026

docs(computer-use): add to sidebar nav under Media and Web

4d76562

teknium1 merged commit a735b72 into main May 8, 2026
13 of 15 checks passed

teknium1 deleted the hermes/hermes-c164f8cb branch May 8, 2026 18:07

BrewTestBot mentioned this pull request May 16, 2026

hermes-agent 2026.5.16 Homebrew/homebrew-core#283141

Merged

1 task

github-actions Bot mentioned this pull request May 17, 2026

chore: bump NousResearch/hermes-agent version from v2026.5.7 to v2026.5.16 Docker-Hub-sirmark/docker-hermes-agent#6

Merged

teknium1 merged 5 commits into main from hermes/hermes-c164f8cb

3 participants
