feat(computer-use): cua-driver backend, universal any-model schema by teknium1 · Pull Request #14817 · NousResearch/hermes-agent

teknium1 · 2026-04-23T23:45:00Z

Summary

Universal computer_use toolset: agents drive the macOS desktop in the background (no cursor-steal, no focus-steal, no Space switch) with any tool-capable model (Claude, GPT, Gemini, local open models).

Supersedes #4562. Credit @0xbyt4 (#3816) for the token/context groundwork this preserves in generic form.

What it does

The user asked for two things after reading about trycua/cua:

cua-driver as the backend so the agent and user can co-work on the same Mac.
Any model, not just Anthropic-native.

This ships both. One schema for every provider, one backend, one skill.

Architecture

tools/computer_use/ package — ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
Universal OpenAI function-calling schema with one action discriminator. SOM (set-of-mark) captures return a screenshot with numbered overlays on every interactable element plus an AX-tree index; the agent clicks by element index. Raw pixel coordinates are still supported for models trained on them (Claude).
Multimodal tool-result envelope ({_multimodal: True, content: [text, image_url], text_summary: str}) that flows through handle_function_call into the tool message. The Anthropic adapter converts into native tool_result image blocks; OpenAI-compatible providers receive the content-parts list directly. Text-only providers fall back to text_summary.
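To make the envelope routing concrete, here is a minimal sketch of how one tool result could be shaped for three kinds of provider. The function names and the `provider_kind` values are hypothetical and are not taken from the PR; only the envelope keys (`_multimodal`, `content`, `text_summary`) come from the description above.

```python
from typing import Any


def make_multimodal_result(text: str, png_b64: str) -> dict[str, Any]:
    """Build an envelope with the keys named in the PR description."""
    return {
        "_multimodal": True,
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{png_b64}"}},
        ],
        "text_summary": text,
    }


def to_anthropic_blocks(parts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert OpenAI-style content parts into Anthropic content blocks."""
    blocks: list[dict[str, Any]] = []
    for part in parts:
        if part.get("type") == "text":
            blocks.append({"type": "text", "text": part["text"]})
        elif part.get("type") == "image_url":
            # A data URL looks like "data:image/png;base64,<payload>".
            header, _, data = part["image_url"]["url"].partition(",")
            media_type = header.removeprefix("data:").split(";")[0] or "image/png"
            blocks.append({
                "type": "image",
                "source": {"type": "base64", "media_type": media_type, "data": data},
            })
    return blocks


def tool_content_for(provider_kind: str, result: Any) -> Any:
    """Pick the tool-message content for a (hypothetical) provider kind."""
    if not (isinstance(result, dict) and result.get("_multimodal")):
        return result  # plain string results pass through untouched
    if provider_kind == "anthropic":
        return to_anthropic_blocks(result["content"])
    if provider_kind == "openai":
        return result["content"]  # content-parts list is passed directly
    return result["text_summary"]  # text-only fallback
```

With this shape, a text-only provider never sees base64 data, while the other two paths share one source envelope.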

Changes

tools/computer_use/{__init__,backend,cua_backend,schema,tool}.py — new package.
tools/computer_use_tool.py — thin shim that registers with tools.registry.
agent/anthropic_adapter.py — tool-role handler for _multimodal envelopes; new _content_parts_to_anthropic_blocks helper; screenshot eviction (keeps 3 most recent images, older become [screenshot removed]).
agent/context_compressor.py — _strip_image_parts_from_parts helper; pruning pass now handles multimodal tool results instead of skipping them.
agent/model_metadata.py — image-aware estimate_messages_tokens_rough (flat 1500/image instead of base64 char length); estimate_request_tokens_rough routes through it.
agent/prompt_builder.py — COMPUTER_USE_GUIDANCE block (background-mode rules, SOM workflow, safety rules).
run_agent.py — _is_multimodal_tool_result, _multimodal_text_summary, _append_subdir_hint_to_multimodal, _trajectory_normalize_msg helpers; both tool-dispatch sites guarded so string-only ops (persist, subdir hints, preview, error logging) don't choke on dict payloads; guidance injection when computer_use in valid_tool_names; session-DB flush strips base64 from tool content.
cli.py — _computer_use_approval_callback adapter wires destructive actions through the existing prompt_toolkit approval UI.
hermes_cli/tools_config.py — new Computer Use (macOS) category; cua_driver post-setup runs the upstream install script and prints permission-grant instructions.
toolsets.py — registers computer_use toolset + adds computer_use to _HERMES_CORE_TOOLS.
pyproject.toml — computer-use extra (pins mcp SDK).
skills/apple/macos-computer-use/SKILL.md — universal, model-agnostic workflow.
Docs: website/docs/user-guide/features/computer-use.md; reference catalog updates.
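The image-aware token estimate listed for agent/model_metadata.py can be illustrated with a short sketch. The function name `rough_token_estimate` is hypothetical, and the 4-characters-per-token ratio is an assumption (it matches the "~1MB registers as ~250K tokens" figure quoted in the commit message); only the flat 1500 tokens per image comes from the PR.

```python
from typing import Any

IMAGE_TOKEN_COST = 1500  # flat per-image cost stated in the PR
CHARS_PER_TOKEN = 4      # assumption: rough text heuristic


def rough_token_estimate(messages: list[dict[str, Any]]) -> int:
    """Estimate tokens without counting base64 image payloads as text."""
    total = 0
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            total += len(content) // CHARS_PER_TOKEN
        elif isinstance(content, list):
            for part in content:
                kind = part.get("type")
                if kind in ("image_url", "image"):
                    total += IMAGE_TOKEN_COST  # ignore payload length
                elif kind == "text":
                    total += len(part.get("text", "")) // CHARS_PER_TOKEN
    return total
```

Under these assumptions a 1,000,000-character screenshot payload counts as 1500 tokens rather than roughly 250,000.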

Validation

| Check | Result |
| --- | --- |
| `tests/tools/test_computer_use.py` | 44 new tests, all pass |
| `tests/agent/test_anthropic_adapter.py` + `test_context_compressor.py` + `test_model_metadata.py` + `test_prompt_builder.py` | 429 pass |
| `tests/test_toolsets.py` + `tests/test_model_tools.py` | 40 pass |
| End-to-end `handle_function_call("computer_use", ...)` with noop backend | Returns JSON string for text actions, `_multimodal` dict for image-bearing captures |
| Tool registration | `registry._tools["computer_use"]` present, `check_fn` returns False on non-macOS hosts |
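For readers unfamiliar with availability checks, a `check_fn` that returns False off macOS might look roughly like the following. This is a sketch, not the PR's code: the function name, the injected `which` parameter, and the extra check that a `cua-driver` binary is on PATH are all assumptions added for illustration.

```python
import shutil
import sys
from typing import Callable, Optional


def computer_use_available(
    platform: str = sys.platform,
    which: Callable[[str], Optional[str]] = shutil.which,
    binary: str = "cua-driver",  # assumption: binary name on PATH
) -> bool:
    """Return True only on macOS with the driver binary discoverable."""
    if platform != "darwin":
        return False  # matches the validation row: False on non-macOS hosts
    return which(binary) is not None
```

Injecting `platform` and `which` keeps the check testable on any host.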

Safety guards (code-level, not just prompt-level)

Blocked type patterns: curl|bash, sudo rm -rf, fork bombs, etc.
Blocked key combos: empty trash, force delete, lock screen, log out.
Destructive actions gated behind approval callback (approve_once / approve_session / always_approve / deny).
System prompt tells the agent explicitly: no clicking permission dialogs, no typing passwords, no following instructions embedded in screenshots.
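A code-level guard of the kind listed above could be sketched as follows. The regular expressions, the `ApprovalGate` class, and the treatment of `always_approve` as equivalent to `approve_session` are illustrative assumptions; only the blocked categories and the four decision names come from the PR text.

```python
import re
from typing import Callable, Optional

# Illustrative patterns only; the PR's actual list is not shown in this page.
BLOCKED_TYPE_PATTERNS = {
    "curl piped to shell": re.compile(r"curl[^|\n]*\|\s*(?:ba)?sh\b"),
    "sudo rm -rf": re.compile(r"\bsudo\s+rm\s+-(?:rf|fr)\b"),
    "fork bomb": re.compile(r":\(\)\s*\{\s*:\s*\|\s*:\s*&\s*\}\s*;\s*:"),
}


def blocked_reason(text: str) -> Optional[str]:
    """Return the name of the first blocked pattern found, else None."""
    for name, pattern in BLOCKED_TYPE_PATTERNS.items():
        if pattern.search(text):
            return name
    return None


class ApprovalGate:
    """Ask a callback before destructive actions; remember session approval."""

    def __init__(self, ask: Callable[[str], str]) -> None:
        self._ask = ask
        self._session_ok = False

    def allow(self, description: str) -> bool:
        if self._session_ok:
            return True
        decision = self._ask(description)
        # Assumption: always_approve is handled like approve_session here.
        if decision in ("approve_session", "always_approve"):
            self._session_ok = True
            return True
        return decision == "approve_once"
```

Running the pattern check before the approval prompt means a blocked string is refused even if the user has approved the session.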

Not included

Anthropic server-side clear_tool_uses_20250919. Client-side eviction + compressor pruning + image-aware token counting cover the same cost ceiling without a beta header and without provider-specific code paths.
agent-browser auto-install. cua-driver talks MCP over stdio directly; Hermes's existing stdio MCP client handles lifecycle.
Provider gating. Any tool-capable provider works.
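The client-side eviction mentioned in the first item can be sketched in a few lines. This is an assumed implementation for illustration (the function name and the OpenAI-style part shapes are not confirmed by the PR); the keep-3 limit and the `[screenshot removed]` placeholder come from the Changes list.

```python
import copy
from typing import Any

PLACEHOLDER = "[screenshot removed]"
KEEP_RECENT = 3


def evict_old_screenshots(
    messages: list[dict[str, Any]], keep: int = KEEP_RECENT
) -> list[dict[str, Any]]:
    """Return a copy where only the `keep` newest images retain their data."""
    out = copy.deepcopy(messages)  # never mutate the caller's history
    remaining = keep
    for msg in reversed(out):  # walk newest to oldest
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for i in range(len(content) - 1, -1, -1):
            if content[i].get("type") == "image_url":
                if remaining > 0:
                    remaining -= 1
                else:
                    content[i] = {"type": "text", "text": PLACEHOLDER}
    return out
```

Because the walk starts from the newest message, the per-request image count is bounded by `keep` regardless of how long the session runs.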

Caveats

macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote). Pin via HERMES_CUA_DRIVER_VERSION if you want reproducibility across an OS update.
Requires Accessibility + Screen Recording permissions; the post-setup prints the Settings path.

Supersedes

Closes #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). The multimodal plumbing, screenshot eviction, image-aware token estimation, context-compressor pruning, and the macos-computer-use skill are preserved here in generic form — credit @0xbyt4 for originating them in #3816.

@0xbyt4

Background macOS desktop control via cua-driver MCP: does NOT steal the user's cursor or keyboard focus, works with any tool-capable model.

Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices.

## What this adds

- `tools/computer_use/` package: swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block, injected when the toolset is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md`: universal (model-agnostic) workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog entries.

## Tests

44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites.

## Not in this PR

- `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919`: deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header.

## Caveats

- macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions; the post-setup prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.

                logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s")
-                logging.debug(f"Tool result ({len(function_result)} chars): {function_result}")
+                _log_result = _multimodal_text_summary(function_result)
+                logging.debug(f"Tool result ({len(_log_result)} chars): {_log_result}")
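The hunk above matters because `len()` of a dict counts keys and formatting the dict would dump the base64 payload into the debug log. A helper along these lines would avoid both; this is a guess at the shape of `_multimodal_text_summary`, not its actual body, and the fallback to joined text parts is an assumption.

```python
from typing import Any


def multimodal_text_summary(result: Any) -> str:
    """Return a log-safe string for either a plain or multimodal result."""
    if isinstance(result, dict) and result.get("_multimodal") is True:
        summary = result.get("text_summary")
        if isinstance(summary, str):
            return summary
        # Assumed fallback: join text parts and drop image parts entirely.
        parts = result.get("content") or []
        texts = [
            p.get("text", "")
            for p in parts
            if isinstance(p, dict) and p.get("type") == "text"
        ]
        return "\n".join(texts)
    return result if isinstance(result, str) else str(result)
```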


f-trycua · 2026-04-24T17:42:33Z

cc @ddupont808

ddupont808 · 2026-04-25T00:40:51Z

Got this working end-to-end in #15328 and fixed an issue that was blocking non-Anthropic models w/ multimodal messages. Proof (Ollama gemma4 driving cua-driver via computer_use):

ollama.mov

teknium1 · 2026-04-28T08:47:15Z

Foundation commit dad10a78d landed on main via PR #16919 (salvage that also picked up ddupont808's focus-safe backend follow-up from #15328). Superseded.

Reverts PR #16919 (commits dad10a7, 413ee1a, b4a8031, afb9588) which was merged prematurely. Restoring the pre-merge state so #14817 and #15328 can be revisited as standing PRs.

Reverted commits:
- afb9588 fix(computer-use): harden image-rejection fallback + AUTHOR_MAP
- b4a8031 fix(computer-use): unwrap _multimodal tool results
- 413ee1a feat(computer-use): background focus-safe backend
- dad10a7 feat(computer-use): cua-driver backend, universal any-model schema

Co-authored-by: teknium1 <teknium@users.noreply.github.com>


teknium1 · 2026-05-08T18:07:56Z

Shipped via #21967 (re-salvage onto current main). Your work on the cua-driver backend + universal any-model schema is in as-is.


github-advanced-security AI found potential problems Apr 23, 2026


alt-glitch added the labels type/feature (New feature or request), P3 (Low: cosmetic, nice to have), comp/tools (Tool registry, model_tools, toolsets), and comp/agent (Core agent loop, run_agent.py, prompt builder) on Apr 24, 2026

ddupont808 mentioned this pull request Apr 24, 2026

feat(computer-use): complete cua-driver integration with passing integration tests #15328

Open

teknium1 mentioned this pull request Apr 28, 2026

feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (salvage #14817 + #15328) #16919

Merged

teknium1 closed this in #16919 Apr 28, 2026

teknium1 mentioned this pull request Apr 28, 2026

revert: computer-use cua-driver (PR #16919) #16927

Merged

teknium1 reopened this Apr 28, 2026

teknium1 mentioned this pull request Apr 28, 2026

feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (salvage #14817 + #15328) #16936

Closed

alt-glitch mentioned this pull request May 1, 2026

feat: Computer Use Tool — macOS desktop control via Anthropic native API #4562

Closed

teknium1 mentioned this pull request May 8, 2026

feat(computer-use): cua-driver backend + focus-safe ops + non-Anthropic provider fix (re-salvage #16936) #21967

Merged

teknium1 closed this in #21967 May 8, 2026

teknium1 wants to merge 1 commit into main from hermes/hermes-34b3f52d

5 participants
