feat: stage macOS computer-use progress #13308

Closed
glasses666 wants to merge 25 commits into NousResearch:main from glasses666:feat/computer-use-session-adapter

@glasses666 glasses666 commented Apr 21, 2026

Summary

  • add a real macOS hermes-computer-use adapter with app sessions, window metadata, approval-gated state, and pointer plumbing
  • add gateway-mediated app-access approvals for hermes-computer-use (Allow Once, Session, Always, Deny)
  • add Telegram inline approval buttons for generic app-access requests
  • add refreshed grounded macOS proof screenshots taken on the real desktop
  • replace the old purple pixel-art overlay with a more macOS-native white pointer treatment
  • verify real click success against TextEdit through both the local backend and the live MCP/chat path
  • verify real local scroll + drag receipts in the branch-local adapter/backend TextEdit smoke
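
To make the four approval choices concrete, here is a minimal sketch of how their scopes could differ. The `ApprovalStore` class, its method names, and the in-memory storage are hypothetical and are not taken from this branch; the only thing drawn from the PR is the set of choices (Allow Once, Session, Always, Deny).

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    ALLOW_ONCE = "allow_once"
    ALLOW_SESSION = "allow_session"
    ALLOW_ALWAYS = "allow_always"
    DENY = "deny"


@dataclass
class ApprovalStore:
    """Hypothetical in-memory store; a real one would persist `always`."""

    always: set = field(default_factory=set)      # app names
    session: dict = field(default_factory=dict)   # session_id -> set of app names
    once: set = field(default_factory=set)        # (session_id, app) single-use grants

    def record(self, session_id, app, decision):
        if decision is Decision.ALLOW_ALWAYS:
            self.always.add(app)
        elif decision is Decision.ALLOW_SESSION:
            self.session.setdefault(session_id, set()).add(app)
        elif decision is Decision.ALLOW_ONCE:
            self.once.add((session_id, app))
        # DENY records nothing, so the next request has to ask again.

    def is_allowed(self, session_id, app):
        if app in self.always or app in self.session.get(session_id, set()):
            return True
        if (session_id, app) in self.once:
            self.once.discard((session_id, app))  # consume the one-shot grant
            return True
        return False
```

In this sketch an Allow Once grant is consumed by the first access check, a Session grant stays bound to one session id, and only Always crosses sessions.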

Why this matters

Hermes has had strong browser and terminal control for a while, but local desktop control on macOS was still the missing proof point.

This branch moves that boundary:

  • desktop sessions are approval-gated instead of silently trusted
  • app state is window-scoped and inspectable
  • lockscreen handling is grounded
  • real pointer actions are no longer just mocked previews

Grounded receipts

  • docs/media/computer-use/lockscreen-password-ui.png
  • docs/media/computer-use/unlocked-terminal-after-input.png
  • docs/media/computer-use/textedit-window-state.png
  • docs/media/computer-use/textedit-click-overlay.png
  • docs/media/computer-use/textedit-scroll-state.png
  • docs/media/computer-use/textedit-drag-selection.png
  • focused tests: 153 passed
  • py_compile: passed on the touched Python files

Current boundary

This PR is the clean macOS computer-use slice against current main.

What is verified already:

  • Telegram approval flow is wired for app access
  • get_app_state(app_name=...) returns window-scoped state
  • lockscreen -> desktop recovery is proven with explicit user consent
  • live MCP click(...) reaches the real local pointer backend and closes the target TextEdit window
  • branch-local scroll(...) produces a visibly scrolled TextEdit viewport
  • branch-local drag(...) produces a visible multi-line TextEdit selection

What is still honestly ahead:

  • detached / non-disruptive cursor UX
  • broader session polish beyond the current verified slice
  • careful boundary-setting around permission dialogs and other macOS-sensitive surfaces
  • hot-reloading and re-verifying the newest scroll/drag slice through every live runtime path

- capture frontmost macOS window metadata via Swift helper
- target screencapture at a specific window id with full-screen fallback
- expose window metadata through the computer-use adapter and harden approval matching
- cover the new behavior with focused tool and adapter tests
- require approved active sessions for type_text and press_key
- reject ambiguous multi-session keyboard targeting without app_session_id
- add regression tests for unapproved, ambiguous, and inactive keyboard targets
- persist the canonical approved app identity for localized frontmost sessions
- keep revoked localized sessions bound to the same app_session_id and block keyboard input
- add regression tests for localized app_session_id refresh and revocation
- have the Swift helper emit all visible layer-0 windows for the frontmost app
- choose frontmost metadata in Python while skipping only clearly fragmentary untitled candidates
- cover tiny fragments, background titled windows, and compact frontmost dialogs with regression tests
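
The keyboard-targeting rules in the commit notes above (approved and active sessions only, no guessing between several sessions) can be sketched roughly as follows. `AppSession`, `KeyboardTargetError`, and `resolve_keyboard_target` are illustrative names assumed for this example, not the adapter's actual API; only `app_session_id` comes from the PR text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AppSession:
    app_session_id: str
    app_name: str
    approved: bool
    active: bool


class KeyboardTargetError(Exception):
    """Raised when type_text / press_key has no safe target."""


def resolve_keyboard_target(sessions, app_session_id: Optional[str] = None) -> AppSession:
    if app_session_id is not None:
        # Explicit targeting: the id must exist, then pass the checks below.
        matches = [s for s in sessions if s.app_session_id == app_session_id]
        if not matches:
            raise KeyboardTargetError(f"unknown app_session_id: {app_session_id}")
        target = matches[0]
    else:
        # Implicit targeting: only allowed when exactly one session qualifies.
        candidates = [s for s in sessions if s.approved and s.active]
        if len(candidates) > 1:
            raise KeyboardTargetError("ambiguous target: pass app_session_id")
        if not candidates:
            raise KeyboardTargetError("no approved active session")
        target = candidates[0]
    if not target.approved:
        raise KeyboardTargetError("session is not approved")
    if not target.active:
        raise KeyboardTargetError("session is not active")
    return target
```

A revoked session keeps its `app_session_id` in this model but has `approved=False`, so explicit targeting still finds it and then refuses input.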
@glasses666 glasses666 force-pushed the feat/computer-use-session-adapter branch from 243346f to e30a74d Compare April 21, 2026 05:02
@alt-glitch alt-glitch added type/feature New feature or request P3 Low — cosmetic, nice to have comp/agent Core agent loop, run_agent.py, prompt builder comp/gateway Gateway runner, session dispatch, delivery labels Apr 22, 2026
@alt-glitch alt-glitch added the platform/telegram Telegram bot adapter label Apr 22, 2026
@glasses666 glasses666 closed this Apr 22, 2026
@glasses666 glasses666 reopened this Apr 22, 2026
Ataraksea pushed a commit to Ataraksea/hermes-agent that referenced this pull request May 13, 2026
…Windows)

Three OS-specific tools — `computer_use_macos`, `computer_use_linux`,
`computer_use_windows` — sharing one JSON schema and one set of action
semantics, but with native backends per platform. Complementary to
the containerised proposal in NousResearch#15876 (which targets the
"Hermes-runs-in-Docker" deployment shape) and the macOS-Anthropic-protocol
work in NousResearch#4562 / NousResearch#13308. This PR owns the "Hermes runs natively on the
host desktop, control any of the three majors with consistent abstraction"
shape.

Architecture
============

* All three tools register at module top via `registry.register()` so
  the AST tool-discovery picks them up. `check_fn` returns False off the
  matching platform / when `HERMES_COMPUTER_USE_ENABLED` is unset / when
  required deps are missing — so on a given host the model only sees the
  one tool it can actually use.
* `computer_use_common.py` — schema, `ActionRequest`, `ActionResult`,
  parameter validation, screen-bounds enforcement.
* `computer_use_safety.py` — env gate, kill-switch flag, JSONL action
  log under `$HERMES_HOME/logs/computer_use.jsonl`, screenshot
  redaction (PIL).
* `computer_use_grammar.py` — one parser, four targets. `Cmd+Shift+T`
  produces Quartz CGEvent flags+keycode on macOS, `xdotool key`
  string on X11, ydotool input event codes on Wayland, Win32 VK
  codes on Windows.
* `computer_use_macos.py` — Quartz `CGEvent` for input,
  `screencapture` CLI for capture, `CGWindowListCopyWindowInfo` for
  active window. pyobjc-framework-Quartz is the only new dep.
* `computer_use_linux.py` — runtime detection of X11 vs Wayland.
  X11 → `xdotool` + `scrot`/`import`. Wayland → `ydotool` + `grim`
  (wlroots) / `gnome-screenshot` / `spectacle`. Active-window queries
  via Sway IPC / hyprctl / xdotool depending on path.
* `computer_use_windows.py` — `ctypes` over `user32.SendInput` (modern
  path; avoids legacy `keybd_event`). DPI-aware on import. Screenshot
  via `mss` if installed, falls back to ctypes BitBlt + PIL otherwise.
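
As an illustration of the "one parser, four targets" idea, the sketch below parses a combo once and renders it for two of the four targets. All names here (`KeyCombo`, `parse_combo`, `to_xdotool`, `to_win32_vk`) are hypothetical and not taken from `computer_use_grammar.py`; mapping `Cmd` to the `super` key on X11 and to the Windows key on Win32 is an assumption made for the example.

```python
from dataclasses import dataclass

# Accepted spellings (lowercased) mapped to canonical modifier names.
_MODIFIER_ALIASES = {
    "cmd": "cmd", "command": "cmd", "super": "cmd", "win": "cmd",
    "ctrl": "ctrl", "control": "ctrl",
    "shift": "shift",
    "alt": "alt", "option": "alt", "opt": "alt",
}

# Assumption: Cmd maps to "super" for xdotool and to VK_LWIN on Windows.
_XDOTOOL_MODS = {"cmd": "super", "ctrl": "ctrl", "shift": "shift", "alt": "alt"}
_WIN32_VK_MODS = {"cmd": 0x5B, "ctrl": 0x11, "shift": 0x10, "alt": 0x12}


@dataclass(frozen=True)
class KeyCombo:
    modifiers: tuple
    key: str


def parse_combo(text):
    """Parse 'Cmd+Shift+T' into canonical modifiers plus a final key."""
    parts = [p.strip() for p in text.split("+")]
    if any(not p for p in parts):
        raise ValueError(f"malformed key combo: {text!r}")
    *mods, key = parts
    canonical = []
    for mod in mods:
        name = _MODIFIER_ALIASES.get(mod.lower())
        if name is None:
            raise ValueError(f"unknown modifier: {mod!r}")
        if name not in canonical:
            canonical.append(name)
    return KeyCombo(tuple(canonical), key.lower())


def to_xdotool(combo):
    """Render as the argument string for `xdotool key`."""
    return "+".join([_XDOTOOL_MODS[m] for m in combo.modifiers] + [combo.key])


def to_win32_vk(combo):
    """Render as Win32 virtual-key codes (ASCII letters and digits only here)."""
    key = combo.key
    if len(key) != 1 or not key.isascii() or not key.isalnum():
        raise ValueError(f"unsupported key in this sketch: {key!r}")
    return [_WIN32_VK_MODS[m] for m in combo.modifiers] + [ord(key.upper())]
```

Keeping the parsed form platform-neutral means validation happens once, and each backend only needs a small rendering table.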

Skills
======

Per-OS skill teaches the model what's actually different on each host:
Cmd-vs-Ctrl, Spotlight vs Win+S, X11 vs Wayland detection, UAC/UIPI,
accessibility / screen-recording perm setup, etc. The common skill
covers when to reach for `computer_use_*` at all (vs `browser_tool` /
`terminal`) and the screenshot-first discipline.

Validation
==========

* 56/56 unit tests passing (mocked Quartz / subprocess / user32 across
  all three backends + grammar + safety).
* macOS backend integration-tested live on the author's MacBook:
  screen_size, cursor_position, get_active_window, screenshot (full),
  screenshot (region crop), screenshot (with redact), wait, off-screen
  click validation, type-without-text validation, unknown-action
  validation, env-off refusal — all 11/11 cases pass.
* Linux + Windows are unit-test-only at the moment; author has no Linux
  or Windows host immediately available for end-to-end validation.
  The eventual PR body will state this limitation plainly.

Safety posture
==============

* `HERMES_COMPUTER_USE_ENABLED=true` required. Default: refused.
* Action allowlist + per-action validation (no off-screen, no >10K
  type strings, no >30s waits, no unknown actions).
* Process-global kill-switch flag (`set_kill_switch()`) checked
  before every action — engaged once, all subsequent actions refuse
  until cleared.
* JSONL audit log of every attempt (action, params minus image bytes,
  success bit, error if any).
* `screenshot` action accepts `redact_regions` to blank rectangles
  (password manager, MFA codes) before the image reaches the model.
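
A rough sketch of how these checks could be ordered before an action runs. Only `HERMES_COMPUTER_USE_ENABLED` and `set_kill_switch()` are named in the commit message; `check_action`, `clear_kill_switch`, `audit_record`, the action subset, and reading "10K" as 10,000 characters are assumptions made for illustration.

```python
import json
import os
import threading

ALLOWED_ACTIONS = {"click", "type", "wait", "screenshot"}  # illustrative subset
MAX_TYPE_CHARS = 10_000   # assumption: "10K" read as 10,000 characters
MAX_WAIT_SECONDS = 30

_kill_switch = threading.Event()  # process-global flag


def set_kill_switch():
    _kill_switch.set()


def clear_kill_switch():  # hypothetical counterpart to set_kill_switch()
    _kill_switch.clear()


def check_action(action, params, screen_size, env=None):
    """Return None if the action may run, otherwise a refusal reason."""
    env = os.environ if env is None else env
    if env.get("HERMES_COMPUTER_USE_ENABLED") != "true":
        return "computer use is disabled"
    if _kill_switch.is_set():
        return "kill switch engaged"
    if action not in ALLOWED_ACTIONS:
        return f"unknown action: {action}"
    if action == "click":
        x, y = params.get("x"), params.get("y")
        width, height = screen_size
        on_screen = (isinstance(x, int) and isinstance(y, int)
                     and 0 <= x < width and 0 <= y < height)
        if not on_screen:
            return "click target is off-screen"
    if action == "type":
        text = params.get("text")
        if not text:
            return "type requires text"
        if len(text) > MAX_TYPE_CHARS:
            return "text exceeds length limit"
    if action == "wait":
        seconds = params.get("seconds")
        if not isinstance(seconds, (int, float)) or not 0 <= seconds <= MAX_WAIT_SECONDS:
            return "wait duration out of range"
    return None


def audit_record(action, params, error):
    """Build one JSONL line; raw image bytes are dropped from params."""
    safe = {k: v for k, v in params.items() if not isinstance(v, (bytes, bytearray))}
    return json.dumps({"action": action, "params": safe,
                       "success": error is None, "error": error})
```

Checking the env gate and kill switch before any per-action validation means a disabled or halted process refuses uniformly, whatever the request contains.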
Abd0r added a commit to Abd0r/hermes-agent that referenced this pull request May 15, 2026
@glasses666


Closing this stale draft branch; current work is being handled through the narrower gateway/compression PRs.

@glasses666 glasses666 closed this Jun 2, 2026