feat(computer-use): cua-driver backend, universal any-model schema #14817

Closed
teknium1 wants to merge 1 commit into main from hermes/hermes-34b3f52d

@teknium1

Contributor

Summary

Universal computer_use toolset: agents drive the macOS desktop in the background (no cursor-steal, no focus-steal, no Space switch) with any tool-capable model (Claude, GPT, Gemini, local open models).

Supersedes #4562. Credit @0xbyt4 (#3816) for the token/context groundwork this preserves in generic form.

What it does

The user asked for two things after reading about trycua/cua:

  1. cua-driver as the backend so the agent and user can co-work on the same Mac.
  2. Any model, not just Anthropic-native.

This ships both. One schema for every provider, one backend, one skill.

Architecture

  • tools/computer_use/ package — ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
  • Universal OpenAI function-calling schema with one action discriminator. SOM captures return a screenshot with numbered overlays on every interactable element + an AX-tree index; the agent clicks by element index. Raw pixel coordinates still supported for models trained on them (Claude).
  • Multimodal tool-result envelope ({_multimodal: True, content: [text, image_url], text_summary: str}) that flows through handle_function_call into the tool message. The Anthropic adapter converts into native tool_result image blocks; OpenAI-compatible providers receive the content-parts list directly. Text-only providers fall back to text_summary.
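To make the envelope concrete, here is a minimal sketch of how an OpenAI-style parts list could be mapped to Anthropic content blocks. The names `make_envelope` and `parts_to_anthropic_blocks` are hypothetical (the PR's own helper is `_content_parts_to_anthropic_blocks`, whose body is not shown in this thread), and the inner shape of the `image_url` part is assumed to follow the OpenAI chat format with a base64 data URL.

```python
import base64
import re
from typing import Any

# Matches data URLs such as "data:image/png;base64,AAAA".
_DATA_URL = re.compile(r"^data:(?P<mime>[\w/+.-]+);base64,(?P<data>.+)$", re.DOTALL)


def make_envelope(summary: str, png_bytes: bytes) -> dict[str, Any]:
    """Build a multimodal tool result in the shape described above."""
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {
        "_multimodal": True,
        "content": [
            {"type": "text", "text": summary},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
        "text_summary": summary,
    }


def parts_to_anthropic_blocks(parts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert OpenAI-style content parts into Anthropic content blocks."""
    blocks: list[dict[str, Any]] = []
    for part in parts:
        if part.get("type") == "text":
            blocks.append({"type": "text", "text": part["text"]})
        elif part.get("type") == "image_url":
            match = _DATA_URL.match(part["image_url"]["url"])
            if match is None:
                # Remote (non-data) URLs are out of scope for this sketch.
                blocks.append({"type": "text", "text": "[unsupported image reference]"})
                continue
            blocks.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": match["mime"],
                    "data": match["data"],
                },
            })
    return blocks
```

An OpenAI-compatible provider would receive `envelope["content"]` unchanged, while a text-only provider would receive `envelope["text_summary"]`.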

Changes

  • tools/computer_use/{__init__,backend,cua_backend,schema,tool}.py — new package.
  • tools/computer_use_tool.py — thin shim that registers with tools.registry.
  • agent/anthropic_adapter.py — tool-role handler for _multimodal envelopes; new _content_parts_to_anthropic_blocks helper; screenshot eviction (keeps 3 most recent images, older become [screenshot removed]).
  • agent/context_compressor.py: _strip_image_parts_from_parts helper; pruning pass now handles multimodal tool results instead of skipping them.
  • agent/model_metadata.py — image-aware estimate_messages_tokens_rough (flat 1500/image instead of base64 char length); estimate_request_tokens_rough routes through it.
  • agent/prompt_builder.py: COMPUTER_USE_GUIDANCE block (background-mode rules, SOM workflow, safety rules).
  • run_agent.py: _is_multimodal_tool_result, _multimodal_text_summary, _append_subdir_hint_to_multimodal, _trajectory_normalize_msg helpers; both tool-dispatch sites guarded so string-only ops (persist, subdir hints, preview, error logging) don't choke on dict payloads; guidance injection when computer_use is in valid_tool_names; session-DB flush strips base64 from tool content.
  • cli.py: _computer_use_approval_callback adapter wires destructive actions through the existing prompt_toolkit approval UI.
  • hermes_cli/tools_config.py — new Computer Use (macOS) category; cua_driver post-setup runs the upstream install script and prints permission-grant instructions.
  • toolsets.py — registers computer_use toolset + adds computer_use to _HERMES_CORE_TOOLS.
  • pyproject.toml: computer-use extra (pins mcp SDK).
  • skills/apple/macos-computer-use/SKILL.md — universal, model-agnostic workflow.
  • Docs: website/docs/user-guide/features/computer-use.md; reference catalog updates.
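The screenshot eviction listed under agent/anthropic_adapter.py can be pictured roughly as below. This is a sketch under assumptions, not the adapter's code: it works on OpenAI-style `image_url` parts rather than Anthropic blocks, and `evict_old_images` is a hypothetical name. Only the "keep 3 most recent, replace older with [screenshot removed]" behaviour comes from the PR description.

```python
from typing import Any

PLACEHOLDER_TEXT = "[screenshot removed]"


def evict_old_images(messages: list[dict[str, Any]], keep: int = 3) -> list[dict[str, Any]]:
    """Return a copy of `messages` in which only the `keep` newest images survive."""
    remaining = keep
    result: list[dict[str, Any]] = []
    # Walk newest-first so the most recent images consume the budget.
    for message in reversed(messages):
        content = message.get("content")
        if not isinstance(content, list):
            result.append(message)  # plain-string content carries no images
            continue
        new_parts = []
        for part in reversed(content):
            is_image = isinstance(part, dict) and part.get("type") == "image_url"
            if is_image and remaining > 0:
                remaining -= 1
                new_parts.append(part)
            elif is_image:
                new_parts.append({"type": "text", "text": PLACEHOLDER_TEXT})
            else:
                new_parts.append(part)
        new_parts.reverse()
        result.append({**message, "content": new_parts})
    result.reverse()
    return result
```

The input list is left untouched, so the full history can still be persisted or compressed separately.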

Validation

| Check | Result |
| --- | --- |
| tests/tools/test_computer_use.py | 44 new tests, all pass |
| tests/agent/test_anthropic_adapter.py + test_context_compressor.py + test_model_metadata.py + test_prompt_builder.py | 429 pass |
| tests/test_toolsets.py + tests/test_model_tools.py | 40 pass |
| End-to-end handle_function_call("computer_use", ...) with noop backend | Returns JSON string for text actions, _multimodal dict for image-bearing captures |
| Tool registration | registry._tools["computer_use"] present, check_fn returns False on non-macOS hosts |

Safety guards (code-level, not just prompt-level)

  • Blocked type patterns: curl|bash, sudo rm -rf, fork bombs, etc.
  • Blocked key combos: empty trash, force delete, lock screen, log out.
  • Destructive actions gated behind approval callback (approve_once / approve_session / always_approve / deny).
  • System prompt tells the agent explicitly: no clicking permission dialogs, no typing passwords, no following instructions embedded in screenshots.
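A rough sketch of how code-level guards like these might be layered: hard blocks first, then the approval callback for destructive actions. The regexes, the key-combo table, the argument names (`text`, `keys`) and the functions `hard_block_reason` and `gate` are all illustrative assumptions; the PR's actual block lists are not shown in this thread. The four approval outcomes are the ones named above.

```python
import re
from typing import Callable, Optional

# Illustrative patterns only, not the PR's real list.
BLOCKED_TYPE_PATTERNS = [
    re.compile(r"\bcurl\b[^|\n]*\|\s*(?:ba|z)?sh\b"),           # pipe a download into a shell
    re.compile(r"\bsudo\s+rm\s+-rf\b"),                         # recursive forced delete as root
    re.compile(r":\(\)\s*\{\s*:\s*\|\s*:\s*&\s*\}\s*;\s*:"),    # classic bash fork bomb
]

# Standard macOS shortcuts, stored as order-independent key sets.
BLOCKED_KEY_COMBOS = {
    frozenset({"cmd", "shift", "delete"}),   # Finder: empty trash
    frozenset({"cmd", "option", "delete"}),  # Finder: delete immediately
    frozenset({"cmd", "ctrl", "q"}),         # lock screen
    frozenset({"cmd", "shift", "q"}),        # log out
}


def hard_block_reason(action: dict) -> Optional[str]:
    """Return a reason string if the action is refused outright, else None."""
    kind = action.get("action")
    if kind == "type":
        text = action.get("text", "")
        if any(pattern.search(text) for pattern in BLOCKED_TYPE_PATTERNS):
            return "blocked type pattern"
    elif kind == "key":
        combo = frozenset(k.strip().lower() for k in action.get("keys", "").split("+"))
        if combo in BLOCKED_KEY_COMBOS:
            return "blocked key combo"
    return None


def gate(action: dict, destructive: bool, approve: Callable[[dict], str]) -> bool:
    """Apply hard blocks first, then ask for approval on destructive actions."""
    if hard_block_reason(action) is not None:
        return False  # hard blocks never reach the approval prompt
    if not destructive:
        return True
    # approve() is expected to return approve_once / approve_session /
    # always_approve / deny, as listed above.
    return approve(action) != "deny"
```

Keeping the hard blocks ahead of the callback means an earlier "always approve" answer cannot unlock a blocked pattern.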

Not included

  • Anthropic server-side clear_tool_uses_20250919. Client-side eviction + compressor pruning + image-aware token counting cover the same cost ceiling without a beta header and without provider-specific code paths.
  • agent-browser auto-install. cua-driver talks MCP over stdio directly; Hermes's existing stdio MCP client handles lifecycle.
  • Provider gating. Any tool-capable provider works.

Caveats

  • macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote). Pin via HERMES_CUA_DRIVER_VERSION if you want reproducibility across an OS update.
  • Requires Accessibility + Screen Recording permissions; the post-setup prints the Settings path.

Supersedes

Closes #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). The multimodal plumbing, screenshot eviction, image-aware token estimation, context-compressor pruning, and the macos-computer-use skill are preserved here in generic form — credit @0xbyt4 for originating them in #3816.

Background macOS desktop control via cua-driver MCP — does NOT steal the
user's cursor or keyboard focus, works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the
abandoned #4562 with a generic OpenAI function-calling schema plus SOM
(set-of-mark) captures so Claude, GPT, Gemini, and open models can all
drive the desktop via numbered element indices.
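As an illustration of what a single provider-agnostic schema with one action discriminator might look like, here is a hedged sketch in OpenAI function-calling format. The action names are the ones the commit message lists under "What this adds"; the parameter names (`mode`, `element`, `x`, `y`, `text`), the description string and the `validate_call` helper are assumptions made for the example, not the PR's actual schema.

```python
import json

# Action names as listed in the commit message; everything else is illustrative.
ACTIONS = [
    "capture", "click", "double_click", "right_click", "middle_click",
    "drag", "scroll", "type", "key", "wait", "list_apps", "focus_app",
]
CLICK_ACTIONS = {"click", "double_click", "right_click", "middle_click"}

COMPUTER_USE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "computer_use",
        "description": "Control the macOS desktop in the background.",
        "parameters": {
            "type": "object",
            "properties": {
                "action": {"type": "string", "enum": ACTIONS},
                "mode": {"type": "string", "enum": ["som", "vision", "ax"]},
                "element": {"type": "integer", "description": "SOM element index"},
                "x": {"type": "integer", "description": "Raw pixel x coordinate"},
                "y": {"type": "integer", "description": "Raw pixel y coordinate"},
                "text": {"type": "string"},
            },
            "required": ["action"],
        },
    },
}


def validate_call(args: dict) -> list[str]:
    """Return a list of problems with a tool call; an empty list means it looks valid."""
    errors = []
    action = args.get("action")
    if action not in ACTIONS:
        errors.append(f"unknown action: {action!r}")
    elif action in CLICK_ACTIONS:
        has_index = "element" in args
        has_xy = "x" in args and "y" in args
        if not (has_index or has_xy):
            errors.append("click actions need either 'element' or both 'x' and 'y'")
    return errors


# The schema is plain JSON, so it can be sent to any function-calling provider.
print(json.dumps(COMPUTER_USE_SCHEMA["function"]["parameters"]["required"]))
```

Accepting either `element` or `x`/`y` mirrors the PR's point that index-based clicks are the default while raw coordinates stay available for models trained on them.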

## What this adds

- `tools/computer_use/` package — swappable ComputerUseBackend ABC +
  CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers.
  Actions: capture (som/vision/ax), click, double_click, right_click,
  middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style
  `content: [text, image_url]` parts) that flows through
  handle_function_call into the tool message. Anthropic adapter converts
  into native `tool_result` image blocks; OpenAI-compatible providers
  get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most
  recent screenshots carry real image data; older ones become text
  placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have
  their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens
  instead of its base64 char length (~1MB would have registered as
  ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block — injected when the toolset
  is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script
  and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go
  through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns
  (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash,
  force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md` — universal (model-agnostic)
  workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog
  entries.
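The image-aware token estimate is simple arithmetic, sketched below. The flat 1500 tokens per image is from the commit message; the 4-characters-per-token ratio is an assumption inferred from the "~1MB would have registered as ~250K tokens" figure, and `estimate_tokens_rough` is a hypothetical stand-in for the real `estimate_messages_tokens_rough`.

```python
from typing import Any

CHARS_PER_TOKEN = 4       # assumed ratio: 1,000,000 chars / 4 = 250,000 tokens
TOKENS_PER_IMAGE = 1500   # flat per-image cost from the commit message


def estimate_tokens_rough(messages: list[dict[str, Any]]) -> int:
    """Estimate tokens from text length, charging each image a flat cost."""
    text_chars = 0
    image_count = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            text_chars += len(content)
        elif isinstance(content, list):
            for part in content:
                if not isinstance(part, dict):
                    continue
                if part.get("type") == "image_url":
                    image_count += 1  # the base64 payload length is ignored
                elif part.get("type") == "text":
                    text_chars += len(part.get("text", ""))
    return text_chars // CHARS_PER_TOKEN + image_count * TOKENS_PER_IMAGE
```

With this approach a 1,000,000-character base64 screenshot contributes 1500 tokens to the estimate instead of roughly 250,000, so one capture no longer appears to fill the context window.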

## Tests

44 new tests in tests/tools/test_computer_use.py covering schema
shape (universal, not Anthropic-native), dispatch routing, safety
guards, multimodal envelope, Anthropic adapter conversion, screenshot
eviction, context compressor pruning, image-aware token estimation,
run_agent helpers, and universality guarantees.

469/469 pass across tests/tools/test_computer_use.py + the affected
agent/ test suites.

## Not in this PR

- `model_tools.py` provider-gating: the tool is available to every
  provider. Providers without multi-part tool message support will see
  text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919` — deferred;
  client-side eviction + compressor pruning cover the same cost ceiling
  without a beta header.

## Caveats

- macOS only. cua-driver uses private SkyLight SPIs
  (SLEventPostToPid, SLPSPostEventRecordTo,
  _AXObserverAddNotificationAndCheckRemote) that can break on any macOS
  update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions — the post-setup
  prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-
native schema). Credit @0xbyt4 for the original #3816 groundwork whose
context/eviction/token design is preserved here in generic form.
Comment thread run_agent.py
  logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s")
- logging.debug(f"Tool result ({len(function_result)} chars): {function_result}")
+ _log_result = _multimodal_text_summary(function_result)
+ logging.debug(f"Tool result ({len(_log_result)} chars): {_log_result}")
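For context on this change, a helper of this kind could look roughly like the sketch below: it gives string-only code paths (logging, previews) a string whether the tool returned plain text or a `_multimodal` dict. The un-prefixed names and the fallback order are assumptions; the bodies of the real `_is_multimodal_tool_result` and `_multimodal_text_summary` are not shown in this thread.

```python
from typing import Any


def is_multimodal_tool_result(result: Any) -> bool:
    """True when the result is a dict envelope flagged with _multimodal."""
    return isinstance(result, dict) and result.get("_multimodal") is True


def multimodal_text_summary(result: Any) -> str:
    """Return a loggable string for either a plain or a multimodal tool result."""
    if is_multimodal_tool_result(result):
        summary = result.get("text_summary")
        if isinstance(summary, str):
            return summary
        # Fall back to joining the text parts, never the base64 image data.
        texts = [
            part.get("text", "")
            for part in result.get("content", [])
            if isinstance(part, dict) and part.get("type") == "text"
        ]
        return "\n".join(texts)
    return result if isinstance(result, str) else str(result)
```

Logging the summary instead of the raw result also keeps base64 image data out of debug logs.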
alt-glitch added labels on Apr 24, 2026: type/feature (New feature or request), P3 (Low: cosmetic, nice to have), comp/tools (Tool registry, model_tools, toolsets), comp/agent (Core agent loop, run_agent.py, prompt builder)
@f-trycua


cc @ddupont808

@ddupont808

Contributor

Got this working end-to-end in #15328 and fixed an issue that was blocking non-Anthropic models w/ multimodal messages. Proof (Ollama gemma4 driving cua-driver via computer_use):

ollama.mov

@teknium1

Contributor Author

Foundation commit dad10a78d landed on main via PR #16919 (salvage that also picked up ddupont808's focus-safe backend follow-up from #15328). Superseded.

teknium1 added a commit that referenced this pull request Apr 28, 2026
Reverts PR #16919 (commits dad10a7, 413ee1a, b4a8031, afb9588)
which was merged prematurely. Restoring the pre-merge state so #14817
and #15328 can be revisited as standing PRs.

Reverted commits:
- afb9588 fix(computer-use): harden image-rejection fallback + AUTHOR_MAP
- b4a8031 fix(computer-use): unwrap _multimodal tool results
- 413ee1a feat(computer-use): background focus-safe backend
- dad10a7 feat(computer-use): cua-driver backend, universal any-model schema

Co-authored-by: teknium1 <teknium@users.noreply.github.com>
@teknium1 teknium1 reopened this Apr 28, 2026
@teknium1

teknium1 commented May 8, 2026

Contributor Author

Shipped via #21967 (re-salvage onto current main). Your work on the cua-driver backend + universal any-model schema is in as-is.
