feat(computer-use): cua-driver backend, universal any-model schema #14817
Closed
teknium1 wants to merge 1 commit into
Background macOS desktop control via cua-driver MCP. It does NOT steal the user's cursor or keyboard focus, and it works with any tool-capable model. Replaces the Anthropic-native `computer_20251124` approach from the abandoned #4562 with a generic OpenAI function-calling schema plus SOM (set-of-mark) captures so Claude, GPT, Gemini, and open models can all drive the desktop via numbered element indices.

## What this adds

- `tools/computer_use/` package: swappable ComputerUseBackend ABC + CuaDriverBackend (stdio MCP client to trycua/cua's cua-driver binary).
- Universal `computer_use` tool with one schema for all providers. Actions: capture (som/vision/ax), click, double_click, right_click, middle_click, drag, scroll, type, key, wait, list_apps, focus_app.
- Multimodal tool-result envelope (`_multimodal=True`, OpenAI-style `content: [text, image_url]` parts) that flows through handle_function_call into the tool message. Anthropic adapter converts into native `tool_result` image blocks; OpenAI-compatible providers get the parts list directly.
- Image eviction in convert_messages_to_anthropic: only the 3 most recent screenshots carry real image data; older ones become text placeholders to cap per-turn token cost.
- Context compressor image pruning: old multimodal tool results have their image parts stripped instead of being skipped.
- Image-aware token estimation: each image counts as a flat 1500 tokens instead of its base64 char length (~1MB would have registered as ~250K tokens before).
- COMPUTER_USE_GUIDANCE system-prompt block, injected when the toolset is active.
- Session DB persistence strips base64 from multimodal tool messages.
- Trajectory saver normalises multimodal messages to text-only.
- `hermes tools` post-setup installs cua-driver via the upstream script and prints permission-grant instructions.
- CLI approval callback wired so destructive computer_use actions go through the same prompt_toolkit approval dialog as terminal commands.
- Hard safety guards at the tool level: blocked type patterns (curl|bash, sudo rm -rf, fork bomb), blocked key combos (empty trash, force delete, lock screen, log out).
- Skill `apple/macos-computer-use/SKILL.md`: universal (model-agnostic) workflow guide.
- Docs: `user-guide/features/computer-use.md` plus reference catalog entries.

## Tests

44 new tests in tests/tools/test_computer_use.py covering schema shape (universal, not Anthropic-native), dispatch routing, safety guards, multimodal envelope, Anthropic adapter conversion, screenshot eviction, context compressor pruning, image-aware token estimation, run_agent helpers, and universality guarantees. 469/469 pass across tests/tools/test_computer_use.py + the affected agent/ test suites.

## Not in this PR

- `model_tools.py` provider-gating: the tool is available to every provider. Providers without multi-part tool message support will see text-only tool results (graceful degradation via `text_summary`).
- Anthropic server-side `clear_tool_uses_20250919`: deferred; client-side eviction + compressor pruning cover the same cost ceiling without a beta header.

## Caveats

- macOS only. cua-driver uses private SkyLight SPIs (SLEventPostToPid, SLPSPostEventRecordTo, _AXObserverAddNotificationAndCheckRemote) that can break on any macOS update. Pin with HERMES_CUA_DRIVER_VERSION.
- Requires Accessibility + Screen Recording permissions; the post-setup prints the Settings path.

Supersedes PR #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). Credit @0xbyt4 for the original #3816 groundwork whose context/eviction/token design is preserved here in generic form.
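The image-aware token estimate described in the commit message can be illustrated with a small worked example. This is only a sketch: the function names are hypothetical, and the 4-characters-per-token ratio is an assumed rough heuristic chosen because it matches the "~1MB would have registered as ~250K tokens" figure. The flat 1500 tokens per image comes from the commit message.

```python
from typing import Any

CHARS_PER_TOKEN = 4       # assumption: rough heuristic, not taken from the PR code
TOKENS_PER_IMAGE = 1500   # flat per-image cost stated in the commit message


def estimate_tokens_naive(messages: list[dict[str, Any]]) -> int:
    """Count every character, including base64 image payloads."""
    total_chars = 0
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, str):
            total_chars += len(content)
            continue
        for part in content:
            if part.get("type") == "image_url":
                total_chars += len(part["image_url"]["url"])
            else:
                total_chars += len(part.get("text", ""))
    return total_chars // CHARS_PER_TOKEN


def estimate_tokens_image_aware(messages: list[dict[str, Any]]) -> int:
    """Count text by length, but charge each image a flat token cost."""
    total_chars = 0
    images = 0
    for msg in messages:
        content = msg.get("content", "")
        if isinstance(content, str):
            total_chars += len(content)
            continue
        for part in content:
            if part.get("type") == "image_url":
                images += 1
            else:
                total_chars += len(part.get("text", ""))
    return total_chars // CHARS_PER_TOKEN + images * TOKENS_PER_IMAGE


# One tool message carrying a ~1 MB base64 screenshot (fake payload).
example_messages = [
    {
        "role": "tool",
        "content": [
            {"type": "text", "text": "screenshot"},
            {
                "type": "image_url",
                "image_url": {"url": "data:image/png;base64," + "A" * 1_000_000},
            },
        ],
    }
]

naive = estimate_tokens_naive(example_messages)
aware = estimate_tokens_image_aware(example_messages)
print(naive, aware)
```

With a single ~1 MB screenshot, the naive count lands around a quarter of a million tokens, while the image-aware count stays near the flat per-image cost.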
| logging.debug(f"Tool {function_name} completed in {tool_duration:.2f}s") | ||
| logging.debug(f"Tool result ({len(function_result)} chars): {function_result}") | ||
| _log_result = _multimodal_text_summary(function_result) | ||
| logging.debug(f"Tool result ({len(_log_result)} chars): {_log_result}") |
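The change above logs a text summary instead of the raw result, which matters once a tool result can be a dict carrying base64 image data. A minimal sketch of what such a summarising helper might look like, assuming the envelope shape described in this PR (`_multimodal`, `content`, `text_summary`); the function names here are hypothetical and this is not the PR's actual `_multimodal_text_summary`:

```python
from typing import Any


def is_multimodal_result(result: Any) -> bool:
    """True when the result looks like the multimodal envelope from this PR."""
    return isinstance(result, dict) and result.get("_multimodal") is True


def text_summary_for_log(result: Any) -> str:
    """Return a loggable string without dumping base64 image data."""
    if is_multimodal_result(result):
        summary = result.get("text_summary")
        if isinstance(summary, str):
            return summary
        # Fallback (assumption): join the text parts and ignore image parts.
        texts = [
            part.get("text", "")
            for part in result.get("content", [])
            if isinstance(part, dict) and part.get("type") == "text"
        ]
        return "\n".join(texts)
    return result if isinstance(result, str) else str(result)


envelope = {
    "_multimodal": True,
    "content": [
        {"type": "text", "text": "captured 12 elements"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
    ],
    "text_summary": "captured 12 elements",
}
print(text_summary_for_log(envelope))
print(text_summary_for_log("plain string result"))
```

Plain string results pass through unchanged, so existing string-only tools would log exactly as before.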
cc @dddupont808
Contributor
Got this working end-to-end in #15328 and fixed an issue that was blocking non-Anthropic models w/ multimodal messages. Proof (Ollama gemma4 driving cua-driver via ollama.mov
Contributor
Author
teknium1 added a commit that referenced this pull request on Apr 28, 2026
Reverts PR #16919 (commits dad10a7, 413ee1a, b4a8031, afb9588) which was merged prematurely. Restoring the pre-merge state so #14817 and #15328 can be revisited as standing PRs.

Reverted commits:
- afb9588 fix(computer-use): harden image-rejection fallback + AUTHOR_MAP
- b4a8031 fix(computer-use): unwrap _multimodal tool results
- 413ee1a feat(computer-use): background focus-safe backend
- dad10a7 feat(computer-use): cua-driver backend, universal any-model schema

Co-authored-by: teknium1 <teknium@users.noreply.github.com>
Contributor
Author
Shipped via #21967 (re-salvage onto current main). Your work on the cua-driver backend + universal any-model schema is in as-is.
Summary
Universal `computer_use` toolset: agents drive the macOS desktop in the background (no cursor-steal, no focus-steal, no Space switch) with any tool-capable model (Claude, GPT, Gemini, local open models).
Supersedes #4562. Credit @0xbyt4 (#3816) for the token/context groundwork this preserves in generic form.
What it does
The user asked for two things after reading about trycua/cua:
This ships both. One schema for every provider, one backend, one skill.
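To make "one schema for every provider" concrete, here is a sketch of what a generic OpenAI function-calling schema for the tool could look like. The action names and capture modes are taken from the commit message; the parameter names (`action`, `mode`, `element`, `text`) and the `validate_call` helper are hypothetical illustrations, not the contents of `tools/computer_use/schema.py`.

```python
import json

# Action names and capture modes as listed in the commit message.
ACTIONS = [
    "capture", "click", "double_click", "right_click", "middle_click",
    "drag", "scroll", "type", "key", "wait", "list_apps", "focus_app",
]
CAPTURE_MODES = ["som", "vision", "ax"]

# Hypothetical schema: parameter names are assumptions for illustration.
COMPUTER_USE_SCHEMA = {
    "type": "function",
    "function": {
        "name": "computer_use",
        "description": "Drive the macOS desktop in the background.",
        "parameters": {
            "type": "object",
            "properties": {
                "action": {"type": "string", "enum": ACTIONS},
                "mode": {"type": "string", "enum": CAPTURE_MODES},
                "element": {
                    "type": "integer",
                    "description": "Numbered element index from a SOM capture.",
                },
                "text": {"type": "string"},
            },
            "required": ["action"],
        },
    },
}


def validate_call(arguments_json: str) -> dict:
    """Parse a model's tool-call arguments and reject unknown actions."""
    args = json.loads(arguments_json)
    if args.get("action") not in ACTIONS:
        raise ValueError(f"unknown action: {args.get('action')!r}")
    return args


call = validate_call('{"action": "click", "element": 7}')
print(call)
```

Because the schema is plain JSON Schema inside a function definition, any provider that supports function calling can consume it without a provider-specific tool type.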
Architecture
- `tools/computer_use/` package: `ComputerUseBackend` ABC + `CuaDriverBackend` (stdio MCP client to `trycua/cua`'s `cua-driver` binary).
- Multimodal tool-result envelope (`{_multimodal: True, content: [text, image_url], text_summary: str}`) that flows through `handle_function_call` into the tool message. The Anthropic adapter converts it into native `tool_result` image blocks; OpenAI-compatible providers receive the content-parts list directly. Text-only providers fall back to `text_summary`.
Changes
- `tools/computer_use/{__init__,backend,cua_backend,schema,tool}.py`: new package.
- `tools/computer_use_tool.py`: thin shim that registers with `tools.registry`.
- `agent/anthropic_adapter.py`: tool-role handler for `_multimodal` envelopes; new `_content_parts_to_anthropic_blocks` helper; screenshot eviction (keeps 3 most recent images, older become `[screenshot removed]`).
- `agent/context_compressor.py`: `_strip_image_parts_from_parts` helper; pruning pass now handles multimodal tool results instead of skipping them.
- `agent/model_metadata.py`: image-aware `estimate_messages_tokens_rough` (flat 1500/image instead of base64 char length); `estimate_request_tokens_rough` routes through it.
- `agent/prompt_builder.py`: `COMPUTER_USE_GUIDANCE` block (background-mode rules, SOM workflow, safety rules).
- `run_agent.py`: `_is_multimodal_tool_result`, `_multimodal_text_summary`, `_append_subdir_hint_to_multimodal`, `_trajectory_normalize_msg` helpers; both tool-dispatch sites guarded so string-only ops (persist, subdir hints, preview, error logging) don't choke on dict payloads; guidance injection when `computer_use in valid_tool_names`; session-DB flush strips base64 from tool content.
- `cli.py`: `_computer_use_approval_callback` adapter wires destructive actions through the existing prompt_toolkit approval UI.
- `hermes_cli/tools_config.py`: new `Computer Use (macOS)` category; `cua_driver` post-setup runs the upstream install script and prints permission-grant instructions.
- `toolsets.py`: registers `computer_use` toolset + adds `computer_use` to `_HERMES_CORE_TOOLS`.
- `pyproject.toml`: `computer-use` extra (pins `mcp` SDK).
- `skills/apple/macos-computer-use/SKILL.md`: universal, model-agnostic workflow.
- `website/docs/user-guide/features/computer-use.md`; reference catalog updates.
Validation
- `tests/tools/test_computer_use.py`
- `tests/agent/test_anthropic_adapter.py` + `test_context_compressor.py` + `test_model_metadata.py` + `test_prompt_builder.py`
- `tests/test_toolsets.py` + `tests/test_model_tools.py`
- `handle_function_call("computer_use", ...)` with noop backend
- `_multimodal` dict for image-bearing captures
- `registry._tools["computer_use"]` present, `check_fn` returns False on non-macOS hosts
Safety guards (code-level, not just prompt-level)
- Blocked type patterns: `curl|bash`, `sudo rm -rf`, fork bombs, etc.
Not included
- Anthropic server-side `clear_tool_uses_20250919`. Client-side eviction + compressor pruning + image-aware token counting cover the same cost ceiling without a beta header and without provider-specific code paths.
- `agent-browser` auto-install. cua-driver talks MCP over stdio directly; Hermes's existing stdio MCP client handles lifecycle.
Caveats
- macOS only. cua-driver uses private SkyLight SPIs (`SLEventPostToPid`, `SLPSPostEventRecordTo`, `_AXObserverAddNotificationAndCheckRemote`). Pin via `HERMES_CUA_DRIVER_VERSION` if you want reproducibility across an OS update.
Supersedes
Closes #4562 (pyautogui/Quartz foreground backend, Anthropic-native schema). The multimodal plumbing, screenshot eviction, image-aware token estimation, context-compressor pruning, and the macos-computer-use skill are preserved here in generic form — credit @0xbyt4 for originating them in #3816.
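The screenshot eviction mentioned above (keep the 3 most recent images, replace older ones with `[screenshot removed]`) can be sketched as follows. This is an illustration under assumptions: it operates on OpenAI-style content parts rather than inside the Anthropic adapter where the PR places it, and the function name is hypothetical.

```python
import copy
from typing import Any

KEEP_RECENT_IMAGES = 3                 # number of screenshots kept, per the PR
PLACEHOLDER = "[screenshot removed]"   # placeholder text named in the PR


def evict_old_screenshots(
    messages: list[dict[str, Any]], keep: int = KEEP_RECENT_IMAGES
) -> list[dict[str, Any]]:
    """Return a copy where only the `keep` newest image parts survive."""
    out = copy.deepcopy(messages)
    remaining = keep
    # Walk newest-to-oldest so the most recent images are the ones kept.
    for msg in reversed(out):
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for i in range(len(content) - 1, -1, -1):
            part = content[i]
            if isinstance(part, dict) and part.get("type") == "image_url":
                if remaining > 0:
                    remaining -= 1
                else:
                    content[i] = {"type": "text", "text": PLACEHOLDER}
    return out


def _tool_msg(n: int) -> dict[str, Any]:
    """Build a fake tool message carrying one screenshot (illustrative only)."""
    return {
        "role": "tool",
        "content": [
            {"type": "text", "text": f"capture {n}"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,IMG{n}"}},
        ],
    }


history = [_tool_msg(n) for n in range(1, 6)]
trimmed = evict_old_screenshots(history)
print([m["content"][1]["type"] for m in trimmed])
```

Working on a deep copy leaves the stored history untouched, so eviction can be applied per request without losing the original images.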