feat: stage macOS computer-use progress #13308
Closed
glasses666 wants to merge 25 commits into
- capture frontmost macOS window metadata via Swift helper
- target screencapture at a specific window id with full-screen fallback
- expose window metadata through the computer-use adapter and harden approval matching
- cover the new behavior with focused tool and adapter tests
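The window-targeted capture with full-screen fallback described in this commit could look roughly like the sketch below. It is not the PR's code: `build_screencapture_argv`, `capture`, and the injectable `run` parameter are hypothetical names. The only real interface it relies on is the macOS `screencapture` CLI, whose `-x` flag suppresses the shutter sound and whose `-l <windowid>` flag captures a single window.

```python
import subprocess
from typing import Callable, List, Optional

Runner = Callable[[List[str]], int]  # takes argv, returns the exit code


def build_screencapture_argv(out_path: str, window_id: Optional[int] = None) -> List[str]:
    """Build the screencapture command line (hypothetical helper)."""
    argv = ["screencapture", "-x"]  # -x: do not play the shutter sound
    if window_id is not None:
        argv += ["-l", str(window_id)]  # -l: capture only this window id
    argv.append(out_path)
    return argv


def _run(argv: List[str]) -> int:
    # Default runner: execute the command and report its exit code.
    return subprocess.run(argv, check=False).returncode


def capture(out_path: str, window_id: Optional[int] = None, run: Runner = _run) -> str:
    """Try a window-scoped capture first, then fall back to the full screen.

    Returns "window" or "full_screen" so the caller knows which path succeeded.
    """
    if window_id is not None:
        if run(build_screencapture_argv(out_path, window_id)) == 0:
            return "window"
    if run(build_screencapture_argv(out_path)) == 0:
        return "full_screen"
    raise RuntimeError("screencapture failed for both window and full-screen capture")
```

Passing the runner as a parameter keeps the fallback logic testable on hosts without `screencapture`, which mirrors the mocked-subprocess style of testing mentioned elsewhere in this thread.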
- require approved active sessions for type_text and press_key
- reject ambiguous multi-session keyboard targeting without app_session_id
- add regression tests for unapproved, ambiguous, and inactive keyboard targets
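As a minimal sketch of the keyboard-targeting rules listed in this commit, assuming sessions are held in a dictionary keyed by `app_session_id`: the `AppSession` dataclass, `KeyboardTargetError`, and `resolve_keyboard_target` are illustrative names, not taken from the branch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class KeyboardTargetError(Exception):
    """Raised when keyboard input has no safe, unambiguous target."""


@dataclass
class AppSession:
    app_session_id: str
    app_name: str
    approved: bool
    active: bool


def resolve_keyboard_target(
    sessions: Dict[str, AppSession], app_session_id: Optional[str] = None
) -> AppSession:
    """Pick the session that type_text / press_key may send input to."""
    if app_session_id is not None:
        session = sessions.get(app_session_id)
        if session is None:
            raise KeyboardTargetError("unknown app_session_id")
    else:
        # Without an explicit id, exactly one approved active session must exist.
        candidates = [s for s in sessions.values() if s.approved and s.active]
        if len(candidates) > 1:
            raise KeyboardTargetError("ambiguous target: pass app_session_id")
        if not candidates:
            raise KeyboardTargetError("no approved active session")
        session = candidates[0]
    if not session.approved:
        raise KeyboardTargetError("session is not approved")
    if not session.active:
        raise KeyboardTargetError("session is not active")
    return session
```

The explicit-id path still checks approval and activity, so naming a revoked or inactive session does not bypass the gate.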
- persist the canonical approved app identity for localized frontmost sessions
- keep revoked localized sessions bound to the same app_session_id and block keyboard input
- add regression tests for localized app_session_id refresh and revocation
- have the Swift helper emit all visible layer-0 windows for the frontmost app
- choose frontmost metadata in Python while skipping only clearly fragmentary untitled candidates
- cover tiny fragments, background titled windows, and compact frontmost dialogs with regression tests
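One way to read the selection rule in this commit is sketched below. Several things are assumptions rather than facts from the branch: that the helper's window list arrives in front-to-back order, that each window is a dictionary with `title`, `width`, `height`, and `layer` keys, and that "clearly fragmentary" means untitled with a side shorter than a hypothetical `min_side` threshold.

```python
from typing import Any, Dict, List, Optional

Window = Dict[str, Any]


def pick_frontmost_window(windows: List[Window], min_side: int = 40) -> Optional[Window]:
    """Return the first plausible layer-0 window, scanning front to back.

    Only untitled windows with a very small side are skipped, so a compact
    untitled dialog in front still wins over a titled background window.
    """
    layer0 = [w for w in windows if w.get("layer", 0) == 0]
    for window in layer0:
        title = (window.get("title") or "").strip()
        tiny = window.get("width", 0) < min_side or window.get("height", 0) < min_side
        if not title and tiny:
            continue  # clearly fragmentary: untitled and tiny
        return window
    # Every candidate looked fragmentary; fall back to the frontmost one, if any.
    return layer0[0] if layer0 else None
```

Skipping on the conjunction (untitled and tiny) rather than on either condition alone is what keeps small titled windows and untitled dialogs eligible.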
243346f to e30a74d
7 tasks
Ataraksea pushed a commit to Ataraksea/hermes-agent that referenced this pull request on May 13, 2026
…Windows)

Three OS-specific tools (`computer_use_macos`, `computer_use_linux`, `computer_use_windows`) sharing one JSON schema and one set of action semantics, but with native backends per platform.

Complementary to the containerised proposal in NousResearch#15876 (which targets the "Hermes-runs-in-Docker" deployment shape) and the macOS-Anthropic-protocol work in NousResearch#4562 / NousResearch#13308. This PR owns the "Hermes runs natively on the host desktop, control any of the three majors with consistent abstraction" shape.

Architecture
============

* All three tools register at module top via `registry.register()` so the AST tool-discovery picks them up. `check_fn` returns False off the matching platform, when `HERMES_COMPUTER_USE_ENABLED` is unset, or when required deps are missing, so on a given host the model only sees the one tool it can actually use.
* `computer_use_common.py`: schema, `ActionRequest`, `ActionResult`, parameter validation, screen-bounds enforcement.
* `computer_use_safety.py`: env gate, kill-switch flag, JSONL action log under `$HERMES_HOME/logs/computer_use.jsonl`, screenshot redaction (PIL).
* `computer_use_grammar.py`: one parser, four targets. `Cmd+Shift+T` produces Quartz CGEvent flags+keycode on macOS, an `xdotool key` string on X11, ydotool input event codes on Wayland, and Win32 VK codes on Windows.
* `computer_use_macos.py`: Quartz `CGEvent` for input, `screencapture` CLI for capture, `CGWindowListCopyWindowInfo` for active window. pyobjc-framework-Quartz is the only new dep.
* `computer_use_linux.py`: runtime detection of X11 vs Wayland. X11 → `xdotool` + `scrot`/`import`. Wayland → `ydotool` + `grim` (wlroots) / `gnome-screenshot` / `spectacle`. Active-window queries via Sway IPC / hyprctl / xdotool depending on path.
* `computer_use_windows.py`: `ctypes` over `user32.SendInput` (modern path; avoids legacy `keybd_event`). DPI-aware on import. Screenshot via `mss` if installed, falls back to ctypes BitBlt + PIL otherwise.
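The "one parser, four targets" idea can be illustrated with a small sketch covering two of the four targets. Nothing here is taken from `computer_use_grammar.py`: `Chord`, `parse_chord`, `to_xdotool`, and `macos_flag_mask` are hypothetical names, and mapping Cmd to the X11 `super` modifier is an assumption. The flag constants mirror the Quartz `kCGEventFlagMask*` values so the example runs without pyobjc.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Accepted spellings mapped to one canonical modifier name.
_ALIASES = {
    "cmd": "cmd", "command": "cmd",
    "ctrl": "ctrl", "control": "ctrl",
    "shift": "shift",
    "alt": "alt", "option": "alt", "opt": "alt",
}
_ORDER = ("ctrl", "alt", "shift", "cmd")  # stable output order

# Same numeric values as Quartz kCGEventFlagMaskShift/Control/Alternate/Command.
_MAC_FLAGS = {"shift": 0x20000, "ctrl": 0x40000, "alt": 0x80000, "cmd": 0x100000}
# Assumption: Cmd is sent as the "super" modifier on X11.
_XDOTOOL_MODS = {"ctrl": "ctrl", "alt": "alt", "shift": "shift", "cmd": "super"}


@dataclass(frozen=True)
class Chord:
    modifiers: FrozenSet[str]
    key: str


def parse_chord(text: str) -> Chord:
    """Parse a chord such as 'Cmd+Shift+T' into modifiers plus a final key."""
    parts = [p.strip() for p in text.split("+")]
    if any(not p for p in parts):
        raise ValueError(f"malformed chord: {text!r}")
    *mods, key = parts
    normalized = set()
    for mod in mods:
        name = _ALIASES.get(mod.lower())
        if name is None:
            raise ValueError(f"unknown modifier: {mod!r}")
        normalized.add(name)
    return Chord(frozenset(normalized), key)


def to_xdotool(chord: Chord) -> str:
    """Render the chord as the argument to `xdotool key`."""
    mods = [_XDOTOOL_MODS[m] for m in _ORDER if m in chord.modifiers]
    key = chord.key.lower() if len(chord.key) == 1 else chord.key
    return "+".join(mods + [key])


def macos_flag_mask(chord: Chord) -> int:
    """Combine modifier bits into a CGEvent flags value."""
    mask = 0
    for mod in chord.modifiers:
        mask |= _MAC_FLAGS[mod]
    return mask
```

Parsing once into a neutral `Chord` and rendering per backend keeps alias handling and validation in a single place; key-to-keycode tables for each platform are left out of this sketch.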
Skills
======

Per-OS skill teaches the model what's actually different on each host: Cmd-vs-Ctrl, Spotlight vs Win+S, X11 vs Wayland detection, UAC/UIPI, accessibility / screen-recording perm setup, etc. The common skill covers when to reach for `computer_use_*` at all (vs `browser_tool` / `terminal`) and the screenshot-first discipline.

Validation
==========

* 56/56 unit tests passing (mocked Quartz / subprocess / user32 across all three backends + grammar + safety).
* macOS backend integration-tested live on the author's MacBook: screen_size, cursor_position, get_active_window, screenshot (full), screenshot (region crop), screenshot (with redact), wait, off-screen click validation, type-without-text validation, unknown-action validation, env-off refusal. All 11/11 cases pass.
* Linux + Windows are unit-test-only at the moment; author has no Linux or Windows host immediately available for end-to-end validation. Honest framing in the eventual PR body.

Safety posture
==============

* `HERMES_COMPUTER_USE_ENABLED=true` required. Default: refused.
* Action allowlist + per-action validation (no off-screen, no >10K type strings, no >30s waits, no unknown actions).
* Process-global kill-switch flag (`set_kill_switch()`) checked before every action: engaged once, all subsequent actions refuse until cleared.
* JSONL audit log of every attempt (action, params minus image bytes, success bit, error if any).
* `screenshot` action accepts `redact_regions` to blank rectangles (password manager, MFA codes) before the image reaches the model.
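The safety posture above can be sketched as a single pre-flight check plus an audit writer. This is an illustration under stated assumptions, not the contents of `computer_use_safety.py`: `check_action`, `audit`, `clear_kill_switch`, the parameter names (`x`, `y`, `text`, `seconds`, `image`), and the exact action allowlist are hypothetical. Only `set_kill_switch()`, the `HERMES_COMPUTER_USE_ENABLED` variable, and the 10K / 30s limits come from the commit message.

```python
import json
import os
import threading
from typing import Any, Dict, Optional, Tuple

# Assumed allowlist, based on the actions named in the validation notes.
ALLOWED_ACTIONS = {"screen_size", "cursor_position", "get_active_window",
                   "screenshot", "click", "type", "wait"}
MAX_TYPE_CHARS = 10_000
MAX_WAIT_SECONDS = 30.0

_kill_switch = threading.Event()  # process-global flag


def set_kill_switch() -> None:
    _kill_switch.set()


def clear_kill_switch() -> None:  # hypothetical counterpart
    _kill_switch.clear()


def check_action(action: str, params: Dict[str, Any], screen: Tuple[int, int]) -> Optional[str]:
    """Return an error string if the action must be refused, else None."""
    if os.environ.get("HERMES_COMPUTER_USE_ENABLED", "").lower() != "true":
        return "computer use is disabled"
    if _kill_switch.is_set():
        return "kill switch engaged"
    if action not in ALLOWED_ACTIONS:
        return f"unknown action: {action}"
    if action == "click":
        x, y = params.get("x"), params.get("y")
        on_screen = (isinstance(x, int) and isinstance(y, int)
                     and 0 <= x < screen[0] and 0 <= y < screen[1])
        if not on_screen:
            return "click target is off-screen"
    if action == "type":
        text = params.get("text")
        if not text:
            return "type requires text"
        if len(text) > MAX_TYPE_CHARS:
            return "text exceeds 10K characters"
    if action == "wait" and float(params.get("seconds", 0)) > MAX_WAIT_SECONDS:
        return "wait exceeds 30 seconds"
    return None


def audit(log_path: str, action: str, params: Dict[str, Any], error: Optional[str]) -> None:
    """Append one JSONL record per attempt, leaving out image bytes."""
    record = {
        "action": action,
        "params": {k: v for k, v in params.items() if k != "image"},
        "success": error is None,
        "error": error,
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
```

A caller would run `check_action` first, execute the backend only when it returns `None`, and call `audit` in both cases so refused attempts are logged as well.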
Abd0r added a commit to Abd0r/hermes-agent that referenced this pull request on May 15, 2026
Author
Closing this stale draft branch; current work is being handled through the narrower gateway/compression PRs.
Summary
- `hermes-computer-use` adapter with app sessions, window metadata, approval-gated state, and pointer plumbing
- `hermes-computer-use` (`Allow Once`, `Session`, `Always`, `Deny`)

Why this matters
Hermes has had strong browser and terminal control for a while, but local desktop control on macOS was still the missing proof point.
This branch moves that boundary:
Grounded receipts
- docs/media/computer-use/lockscreen-password-ui.png
- docs/media/computer-use/unlocked-terminal-after-input.png
- docs/media/computer-use/textedit-window-state.png
- docs/media/computer-use/textedit-click-overlay.png
- docs/media/computer-use/textedit-scroll-state.png
- docs/media/computer-use/textedit-drag-selection.png
- 153 passed
- py_compile: passed on the touched Python files

Current boundary
This PR is the clean macOS computer-use slice against current main.

What is verified already:
- `get_app_state(app_name=...)` returns window-scoped state
- `click(...)` reaches the real local pointer backend and closes the target TextEdit window
- `scroll(...)` produces a visibly scrolled TextEdit viewport
- `drag(...)` produces a visible multi-line TextEdit selection

What is still honestly ahead: