
feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (re-salvage #16936) #21967

Merged
teknium1 merged 5 commits into main from hermes/hermes-c164f8cb on May 8, 2026

@teknium1 teknium1 commented May 8, 2026

Contributor

Summary

Ships computer_use (cua-driver backend) on fresh main — re-salvage of PR #16936 after it went stale. Supersedes #4562, #14817, #16936. Credit @ddupont808 (#15328) and @0xbyt4 (#3816).

What it does

Background macOS desktop control via cua-driver — no cursor-steal, no focus-steal, no Space switch — with a universal OpenAI function-calling schema so any tool-capable model can drive the desktop, not just Anthropic.

Commit sequence (rebase-merge preserves per-commit authorship)

  1. @teknium1: feat(computer-use): cua-driver backend, universal any-model schema

    • tools/computer_use/ package, universal schema, SOM captures.
  2. @ddupont808: feat(computer-use): background focus-safe backend — set_value, structured windows, MIME detection

    • capture() now calls list_windows + get_window_state, sticky (pid, window_id), type_text_chars routing, set_value for backgrounded AXPopUpButton / HTML <select>, JPEG MIME detection, regex helpers.
  3. @ddupont808: fix(computer-use): unwrap _multimodal tool results to content list for non-Anthropic providers

    • Tool-message builder unwraps {_multimodal: True, content: [...]} envelopes to OpenAI-style content-parts at parallel + sequential build sites.
    • Adds _vision_supported adaptive fallback: on first image-rejection error, strips images from history and retries text-only for the rest of _run().
  4. @teknium1: fix(computer-use): harden image-rejection fallback + AUTHOR_MAP

    • _strip_images_from_messages() preserves tool_call_id linkage (replaces tool-role content with text placeholder instead of deleting — otherwise providers 400 on "tool_calls without matching tool response").
    • Phrase list expanded; 4xx-only gate so transient 5xx/timeout aren't misclassified.
    • 14 regression tests (TestStripImagesPreservesAlternation, TestImageRejectionPhraseIsolation).
    • AUTHOR_MAP entry for ddupont808.

Conflict resolution (vs stale PR #16936 branch)

Rebased onto current main; 4 conflicts resolved by keeping both sides:

  • agent/context_compressor.py — kept HEAD's broader not isinstance(content, str) guard (already catches multimodal dicts), added comment about _multimodal envelopes.
  • toolsets.py: _HERMES_CORE_TOOLS now has both kanban entries (HEAD) and computer_use (PR).
  • run_agent.py (3 sites) — kept HEAD's "name": function_name field in tool_msg + PR's _tool_content unwrap; kept both _tool_guardrails.reset_for_turn() (HEAD) and _vision_supported = True (PR) turn resets.
  • website/docs/reference/toolsets-reference.md — both rows preserved, HEAD's richer image_gen description kept.

Interaction with native multimodal routing (#16506)

Native-vision handles common cases proactively. _vision_supported fallback is a 4th-layer net for models whose capability detection claims vision but whose specific deployment rejects images at runtime (mlx-lm, misconfigured proxies, text-only endpoints). Fires only on 4xx + phrase match. Phrase isolation is test-guarded.

Validation

| Check | Result |
| --- | --- |
| tests/tools/test_computer_use.py | 57/57 pass |
| tests/run_agent/test_image_rejection_fallback.py | 14/14 pass |
| tests/tools/test_registry.py | 18/18 pass |
| Adjacent vision suites (test_image_routing, test_compressor_image_tokens, test_vision_resolved_args) | 45/45 pass |
| E2E: computer_use registered in _HERMES_CORE_TOOLS, package imports, _is_multimodal_tool_result + _strip_images_from_messages + _vision_supported helpers present in run_agent | pass |
| Syntax check on run_agent.py / toolsets.py / context_compressor.py / model_tools.py / cli.py | pass |

Caveats

  • Requires cua-driver ≥ 0.0.4 for set_value and structuredContent on list_windows/launch_app.
  • drag is not implemented — cua-driver does not expose a drag tool.
  • Image-rejection detection is best-effort English phrase matching; locale-translated wordings will bypass the guard and fall through to the normal error handler.

Closes

Closes #16936 (re-salvage onto current main)
Closes #14817 (superseded by #16936 which this supersedes)
Closes #4562 (original macOS-only Anthropic-only version, superseded)

teknium1 and others added 4 commits May 8, 2026 09:28
Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.

- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
  CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
  Actions: capture (som/vision/ax), click, double_click, right_click,
  middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
  `content: [text, image_url]` parts) that flows through
  handle_function_call into the tool message. Anthropic adapter converts
  into native `tool_result` image blocks; OpenAI-compatible providers
  get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
  recent screenshots carry real image data; older ones become text
  placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
  their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
  instead of its base64 char length (~1MB would have registered as
  ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
  is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
  and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
  through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
  (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
  force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
  workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
  entries.

44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.

469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.
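
As a rough illustration of the image-aware token estimation described in the commit body (a flat 1500 tokens per image instead of the base64 length), the sketch below assumes a 4-characters-per-token heuristic, which is what the "~1MB ... ~250K tokens" figure implies. The function name `estimate_content_tokens` is hypothetical and is not taken from the repository.

```python
from typing import Any, Union

IMAGE_TOKEN_COST = 1500  # flat per-image cost stated in the commit message
CHARS_PER_TOKEN = 4      # assumed heuristic, implied by "~1MB -> ~250K tokens"


def estimate_content_tokens(content: Union[str, list[Any], None]) -> int:
    """Estimate tokens for a message `content` that is either a string
    or a list of OpenAI-style content parts."""
    if content is None:
        return 0
    if isinstance(content, str):
        return len(content) // CHARS_PER_TOKEN
    total = 0
    for part in content:
        if isinstance(part, dict) and part.get("type") == "image_url":
            # Do not count the base64 payload; charge a flat cost instead.
            total += IMAGE_TOKEN_COST
        elif isinstance(part, dict):
            total += len(str(part.get("text", ""))) // CHARS_PER_TOKEN
        else:
            total += len(str(part)) // CHARS_PER_TOKEN
    return total
```

With this shape, a 1 MB base64 screenshot contributes 1500 tokens to the estimate rather than roughly 250K.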

- `model_tools.py` provider-gating: the tool is available to every
  provider. Providers without multi-part tool message support will see
  text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
  client-side eviction + compressor pruning cover the same cost ceiling
  without a beta header.

- macOS only. cua-driver uses private SkyLight SPIs
  (SLEventPostToPid, SLPSPostEventRecordTo,
  _AXObserverAddNotificationAndCheckRemote) that can break on any macOS
  update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
  prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
…ured windows, MIME detection

Extends the cua-driver computer-use backend to drive backgrounded macOS
windows without stealing keyboard or mouse focus from the foreground app.
All changes target the cua-driver MCP backend and the shared dispatcher.

## cua_backend.py

**Window-aware capture**: capture() now calls list_windows + get_window_state
instead of the removed capture tool. Prefers structuredContent.windows
(MCP 2024-11-05+ cua-driver) for zero-parse window enumeration; falls back
to regex-parsed text for older builds. Stores the selected (pid, window_id)
as sticky context so subsequent action calls do not need a redundant round-trip.

**Action routing**: click/scroll/type_text/key all carry the sticky pid
(and window_id for element-indexed clicks). type_text routes through
type_text_chars (individual key events) rather than AX attribute write --
WebKit AXTextFields reject attribute writes from backgrounded processes.

**Key parsing**: _parse_key_combo splits cmd+s-style strings into
(key, [modifiers]) and routes to hotkey (modifier present) or
press_key (bare key) -- cua-driver actual tool names.
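
A minimal sketch of the key-combo split described above, assuming `+` is the only separator and that every token before the last one is a modifier. `hotkey` and `press_key` are the cua-driver tool names mentioned in this commit; the function names `parse_key_combo` and `route_key` are illustrative, and the real `_parse_key_combo` may normalise keys differently.

```python
def parse_key_combo(combo: str) -> tuple[str, list[str]]:
    """Split a 'cmd+s'-style string into (key, [modifiers])."""
    parts = [p.strip().lower() for p in combo.split("+") if p.strip()]
    if not parts:
        raise ValueError(f"empty key combo: {combo!r}")
    # Assumption: the final token is the key, everything before it is a modifier.
    return parts[-1], parts[:-1]


def route_key(combo: str) -> str:
    """Pick the tool name to call for a given combo."""
    _key, modifiers = parse_key_combo(combo)
    return "hotkey" if modifiers else "press_key"
```

For example, `parse_key_combo("cmd+s")` yields `("s", ["cmd"])` and routes to `hotkey`, while a bare `"escape"` routes to `press_key`.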

**set_value method**: new set_value(value, element) calls the cua-driver
set_value MCP tool. For AXPopUpButton / HTML select in a backgrounded Safari,
AXPress opens the native macOS popup which closes immediately when the app is
non-frontmost; set_value AX-presses the matching child option directly
(no menu required, no focus steal).

**focus_app**: reimplemented as a pure window-selector (enumerates
list_windows, sets sticky pid/window_id) without ever raising the window
or stealing focus.

**list_apps**: fixed tool name from listApps to list_apps; handles plain-text
response via regex when structured data is absent.

**Structured-content extraction**: _extract_tool_result now surfaces
structuredContent from MCP results, enabling the list_windows window array
without text parsing.

**Helpers**: _parse_windows_from_text, _parse_elements_from_tree,
_split_tree_text, _parse_key_combo extracted as module-level functions.

## schema.py

Added set_value to the action enum with a description explaining when to
prefer it over click (select/popup elements, sliders, no focus steal).
Added value field for set_value payloads.

## tool.py

Routed set_value action through _dispatch to backend.set_value.
Added set_value to _DESTRUCTIVE_ACTIONS (approval-gated).
Fixed MIME-type detection in _capture_response: cua-driver may return
JPEG; detect from base64 magic bytes (/9j/ -> image/jpeg, else image/png)
rather than hardcoding image/png.
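
The magic-byte check works because JPEG files begin with the bytes FF D8 FF, which always base64-encode to the prefix `/9j/`. A small sketch of that detection follows; the helper names `detect_image_mime` and `to_data_url` are hypothetical, not the ones used in `_capture_response`.

```python
import base64


def detect_image_mime(b64_data: str) -> str:
    """Return image/jpeg when the base64 payload starts with the JPEG
    magic prefix, otherwise fall back to image/png."""
    return "image/jpeg" if b64_data.lstrip().startswith("/9j/") else "image/png"


def to_data_url(raw: bytes) -> str:
    """Build a data URL whose MIME type matches the actual image bytes."""
    b64 = base64.b64encode(raw).decode("ascii")
    return f"data:{detect_image_mime(b64)};base64,{b64}"
```

Anything that is not recognised as JPEG is treated as PNG, mirroring the "else image/png" rule in the commit message.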

## agent/display.py + run_agent.py

Guard _detect_tool_failure and result-preview logic against non-string
function_result values: multimodal tool results (dicts with _multimodal=True)
are not string-sliceable; treat them as successes and fall back to str()
for length/preview.
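
One way to picture the guard: reduce any tool result to a string before slicing or measuring it. The sketch below is an assumption-laden illustration, not the code in agent/display.py. It assumes the envelope may carry a `text_summary` string (the field name appears elsewhere in this PR) and otherwise joins the text parts; the function names are hypothetical.

```python
from typing import Any


def multimodal_text_summary(result: Any) -> str:
    """Return a string view of a tool result that is safe to slice and log."""
    if isinstance(result, str):
        return result
    if isinstance(result, dict) and result.get("_multimodal"):
        summary = result.get("text_summary")  # assumed optional field
        if isinstance(summary, str):
            return summary
        texts = [
            p.get("text", "")
            for p in result.get("content", [])
            if isinstance(p, dict) and p.get("type") == "text"
        ]
        return "\n".join(texts) or "[multimodal tool result]"
    return str(result)


def result_preview(result: Any, limit: int = 100) -> str:
    """Truncate the string view for display without touching image data."""
    text = multimodal_text_summary(result)
    return text if len(text) <= limit else text[:limit] + "..."
```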
…r non-Anthropic providers

Tool handlers (e.g. computer_use capture) return a _multimodal envelope
dict when a screenshot is attached. The tool-message builder was passing
this raw dict as the `content` field of role:tool messages, which is an
illegal format — OpenAI-compatible APIs expect a string or a content-parts
list, not a plain Python dict, and would reject it with a 400/422 error.

Fix: unwrap _multimodal results to their `content` list
([{type:text,...},{type:image_url,...}]) in both the parallel and
sequential tool-call paths. The Anthropic adapter already handles content
lists natively; vision-capable OpenAI-compatible servers (mlx-vlm,
GPT-4o, etc.) accept image_url parts in tool messages directly.
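
The unwrap step can be sketched as below. This is an illustration under stated assumptions rather than the actual run_agent.py code: the helper names are hypothetical, and the message shape (`role`, `tool_call_id`, `name`, `content`) follows the OpenAI-style tool message described in this PR.

```python
from typing import Any, Union


def is_multimodal_tool_result(result: Any) -> bool:
    """True for {_multimodal: True, content: [...]} envelopes."""
    return (
        isinstance(result, dict)
        and result.get("_multimodal") is True
        and isinstance(result.get("content"), list)
    )


def tool_message_content(result: Any) -> Union[str, list[Any]]:
    """Unwrap an envelope to its content-parts list; stringify anything else."""
    if is_multimodal_tool_result(result):
        return result["content"]
    return result if isinstance(result, str) else str(result)


def build_tool_message(tool_call_id: str, function_name: str, result: Any) -> dict[str, Any]:
    """Build a role:tool message whose content is a string or a parts list,
    never a raw envelope dict."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "name": function_name,
        "content": tool_message_content(result),
    }
```

Using one helper at both the parallel and sequential build sites keeps the two paths from drifting apart.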

Also add a _vision_supported adaptive fallback: on first image-rejection
error ("Only 'text' content type is supported." etc.) the agent strips all
image parts from the message history and retries with text only, so
text-only endpoints degrade gracefully without crashing the session.
Follow-up to #15328's vision-unsupported retry branch in run_agent.py.

_strip_images_from_messages() previously deleted any message whose content
was entirely images. That's fine for synthetic user messages injected for
attachment delivery, but it breaks providers for tool-role messages — the
paired tool_call_id on the preceding assistant message ends up unmatched,
which OpenAI-compatible APIs reject with HTTP 400.

Fix: tool-role messages whose content becomes empty are replaced with a
plaintext placeholder that preserves the tool_call_id linkage. Only
non-tool messages are dropped. Added 10 tests covering the role-alternation
invariants + image-type coverage.
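
A simplified sketch of the behaviour described above: image parts are removed, tool-role messages that end up empty get a text placeholder so their tool_call_id stays matched, and other image-only messages are dropped. The set of image part types and the placeholder wording are assumptions; the real `_strip_images_from_messages()` may differ.

```python
from typing import Any

IMAGE_PART_TYPES = {"image_url", "image", "input_image"}  # assumed set of part types
TOOL_PLACEHOLDER = "[image removed: endpoint does not accept image input]"  # hypothetical


def strip_images_from_messages(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    cleaned: list[dict[str, Any]] = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            cleaned.append(msg)  # plain strings and None pass through untouched
            continue
        kept = [
            p for p in content
            if not (isinstance(p, dict) and p.get("type") in IMAGE_PART_TYPES)
        ]
        if kept:
            cleaned.append({**msg, "content": kept})
        elif msg.get("role") == "tool":
            # Keep the message so the assistant's tool_call_id still has a response.
            cleaned.append({**msg, "content": TOOL_PLACEHOLDER})
        # Non-tool messages that were images only are dropped.
    return cleaned
```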

Image-rejection detector: expanded phrase list (image content not
supported / multimodal input / vision input / model does not support
image) and gated on 4xx status so transient 5xx errors never get
misinterpreted as 'server said no to images'. Detection is documented as
best-effort English phrase matching.
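
The 4xx gate plus phrase match might look like the following. The phrases are the ones quoted in these commit messages; the function name and the exact matching rule (case-insensitive substring) are assumptions made for illustration.

```python
from typing import Optional

# Phrases quoted in the commit messages, lower-cased for matching.
IMAGE_REJECTION_PHRASES = (
    "only 'text' content type is supported",
    "image content not supported",
    "multimodal input",
    "vision input",
    "model does not support image",
)


def is_image_rejection(status_code: Optional[int], error_message: str) -> bool:
    """Best-effort check: a 4xx status AND a known English rejection phrase."""
    if status_code is None or not 400 <= status_code < 500:
        return False  # 5xx and timeouts are treated as transient, never as rejection
    text = error_message.lower()
    return any(phrase in text for phrase in IMAGE_REJECTION_PHRASES)
```

Requiring both conditions means a 503 whose body happens to mention images does not disable vision for the rest of the run.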

AUTHOR_MAP: mapped 3820588+ddupont808@users.noreply.github.com to
ddupont808 so release notes attribute the salvage correctly.

github-actions Bot commented May 8, 2026

Contributor

🔎 Lint report: hermes/hermes-c164f8cb vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7779 on HEAD, 7766 on base (🆕 +13)

🆕 New issues (11):

| Rule | Count |
| --- | --- |
| invalid-argument-type | 4 |
| unresolved-attribute | 3 |
| unresolved-import | 3 |
| invalid-assignment | 1 |

First entries
tools/computer_use/cua_backend.py:263: [unresolved-attribute] unresolved-attribute: Attribute `call_tool` is not defined on `None` in union `None | Unknown`
tools/computer_use/cua_backend.py:621: [unresolved-attribute] unresolved-attribute: Attribute `lower` is not defined on `int` in union `Any | int`
tools/computer_use/cua_backend.py:436: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `str`, found `Any | int`
tools/computer_use/cua_backend.py:669: [invalid-argument-type] invalid-argument-type: Argument is incorrect: Expected `tuple[int, int, int, int]`, found `tuple[int, ...]`
tools/computer_use/cua_backend.py:219: [unresolved-import] unresolved-import: Cannot resolve imported module `mcp.client.stdio`
tools/computer_use/tool.py:394: [unresolved-attribute] unresolved-attribute: Object of type `ComputerUseBackend` has no attribute `set_value`
run_agent.py:10717: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["content"]` on object of type `str`
agent/anthropic_adapter.py:1577: [invalid-assignment] invalid-assignment: Object of type `Unknown | list[object]` is not assignable to `list[dict[str, Any]] | None`
run_agent.py:10710: [invalid-argument-type] invalid-argument-type: Argument to function `_append_subdir_hint_to_multimodal` is incorrect: Expected `dict[str, Any]`, found `str | Unknown`
tools/computer_use/cua_backend.py:218: [unresolved-import] unresolved-import: Cannot resolve imported module `mcp`
tests/tools/test_computer_use.py:11: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`

✅ Fixed issues (1):

| Rule | Count |
| --- | --- |
| invalid-assignment | 1 |

First entries
agent/anthropic_adapter.py:1537: [invalid-assignment] invalid-assignment: Invalid subscript assignment with key of type `Literal["cache_control"]` and value of type `dict[Unknown, Unknown]` on object of type `dict[str, str]`

Unchanged: 4069 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Comment thread run_agent.py
logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s")
logging.debug(f"Tool result ({len(function_result)} chars): {function_result}")
_log_result = _multimodal_text_summary(function_result)
logging.debug(f"Tool result ({len(_log_result)} chars): {_log_result}")


3 participants