Summary
Tencent launched Marvis on May 20, 2026 — an OS-level AI assistant that embeds between the user and the operating system. After studying its architecture in depth, I identified 4 design patterns that could significantly improve Hermes without requiring a rewrite. This issue maps each pattern to Hermes' existing architecture with concrete implementation suggestions.
Note for international team: Marvis is a China-market product (QQ login required, Chinese UI). I've extracted all the relevant architecture details below so you don't need to register or download it. The value is in the design patterns, not the product itself.
What is Marvis? (Quick Context)
Marvis is not a chatbot. It's an AI middleware layer sitting between the user and the OS:
Traditional AI: User → Chat Window → Text Output
Marvis: User → Natural Language → OS APIs → Files / Settings / Apps / Hardware
Key stats (for context, not to copy):
- Built by Tencent's App Store team (PC ecosystem veterans)
- Windows + Mac + Android (iOS coming), cross-device sync
- Free tier: 10 million tokens/day (Tencent eats the cloud cost)
- Has a strategic partnership with Microsoft for Windows API access
- Launched May 20, 2026 — less than a week ago, already gaining traction
The 4 Design Patterns Hermes Should Adopt
🥇 Pattern 1: Pre-Configured Specialized Sub-Agents ("1+5 Architecture")
What Marvis does
Ships with 1 PM Agent (orchestrator) + 5 specialist agents pre-configured out of the box:
              User Natural Language
                        ↓
               ┌─────────────────┐
               │    PM Agent     │  Understands intent, decomposes tasks, parallel dispatches
               │ (Orchestrator)  │  Powered by Hunyuan + DeepSeek V4 (cloud)
               └────────┬────────┘
    ┌─────────┬─────────┼─────────┬─────────┐
    ↓         ↓         ↓         ↓         ↓
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ File  │ │System │ │  App  │ │Browser│ │Search │
│ Agent │ │ Agent │ │ Agent │ │ Agent │ │ Agent │
└───────┘ └───────┘ └───────┘ └───────┘ └───────┘
Each specialist has a fixed, narrow scope:
| Agent | Scope | Implementation |
| --- | --- | --- |
| File Agent | File search/read/write/convert, image content search, OCR | Local file index + semantic search |
| Computer Agent | System settings, hardware diagnostics, cleanup | Windows API direct calls (not simulated clicks) |
| App Agent | GUI control of desktop + Android apps | Visual recognition + simulated input |
| Browser Agent | Web interaction, data scraping, form filling | Browser takeover + DOM parsing |
| Search Agent | Web search + information aggregation | Search engine API calls |
The PM Agent uses a structured task dispatch protocol — not just forwarding the user's raw text, but packaging it with context, history, dependencies, and expected output schema.
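To make the dispatch envelope concrete, here is a minimal sketch of what such a structured task might look like. Marvis's actual protocol is not public, so the `TaskDispatch` class, its field names, and the `ready_tasks` helper are all hypothetical; they only illustrate the four ingredients named above (context, history, dependencies, expected output schema).

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class TaskDispatch:
    """Hypothetical envelope an orchestrator hands to a specialist agent."""
    goal: str                                           # what the specialist should achieve
    profile: str                                        # which specialist receives the task
    context: dict = field(default_factory=dict)         # facts the specialist needs
    history: list = field(default_factory=list)         # relevant prior turns
    depends_on: list = field(default_factory=list)      # ids of tasks that must finish first
    output_schema: dict = field(default_factory=dict)   # shape of the expected result

    def to_json(self) -> str:
        """Serialize the envelope so it can be passed to a sub-agent."""
        return json.dumps(asdict(self), ensure_ascii=False)


def ready_tasks(tasks: dict, done: set) -> list:
    """Return ids of unfinished tasks whose dependencies have all completed."""
    return sorted(
        task_id for task_id, task in tasks.items()
        if task_id not in done and set(task.depends_on) <= done
    )
```

Tasks returned together by `ready_tasks` have no unmet dependencies, so an orchestrator could dispatch them in parallel.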
Hermes current state
- delegate_task for spawning sub-agents
- toolsets for scoping sub-agent capabilities
Concrete proposal
Step 1 — Define agent profiles in config:
# ~/.hermes/agent_profiles.yaml
profiles:
  file-agent:
    description: "File search, read, write, convert, OCR"
    toolsets: [terminal, file, vision]
    system_prompt: |
      You are a file specialist. Your job is to locate, read, and process files.
      - Search by content, not just filename
      - For images: use vision tools to describe content
      - Always return absolute file paths in results
    preferred_model: "deepseek-v4-flash"  # cheaper for simple file ops
  system-agent:
    description: "System diagnostics, settings, cleanup"
    toolsets: [terminal]
    system_prompt: |
      You are a system operations specialist.
      - Diagnose system issues (disk, memory, network, processes)
      - NEVER run destructive commands (rm -rf, format, dd) without explicit confirmation
      - Prefer read-only diagnostics first
    risk_level: medium
  browser-agent:
    description: "Web interaction, scraping, form automation"
    toolsets: [browser]
    system_prompt: |
      You are a web interaction specialist.
      - Navigate, extract, fill forms, click buttons
      - Always return the URL you're on and what you found
  search-agent:
    description: "Web search and information aggregation"
    toolsets: [web]
    system_prompt: |
      You are a search and research specialist.
      - Search broadly first, then drill down
      - Always cite sources with URLs
      - Synthesize findings into a structured summary
Step 2 — Extend delegate_task to accept profile names:
# Current:
delegate_task(goal="find all invoices", toolsets=["terminal", "file"])
# Proposed:
delegate_task(goal="find all invoices", profile="file-agent")
# Resolves to: toolsets + system_prompt + model from agent_profiles.yaml
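The resolution step could look roughly like the sketch below. It is not Hermes code: `resolve_profile`, `ResolvedDelegation`, and the in-memory `AGENT_PROFILES` dict (standing in for the parsed YAML file) are hypothetical names, and the rule that explicit `toolsets` override the profile is an assumption made so the current call style keeps working.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for the parsed contents of agent_profiles.yaml (trimmed to two entries).
AGENT_PROFILES = {
    "file-agent": {
        "toolsets": ["terminal", "file", "vision"],
        "system_prompt": "You are a file specialist.",
        "preferred_model": "deepseek-v4-flash",
    },
    "search-agent": {
        "toolsets": ["web"],
        "system_prompt": "You are a search and research specialist.",
    },
}


@dataclass
class ResolvedDelegation:
    toolsets: list
    system_prompt: Optional[str]
    model: Optional[str]


def resolve_profile(profile=None, toolsets=None, default_model=None, profiles=AGENT_PROFILES):
    """Expand a profile name into toolsets, system prompt, and model."""
    if profile is None:
        # No profile given: behave like the current toolsets-only call.
        return ResolvedDelegation(list(toolsets or []), None, default_model)
    if profile not in profiles:
        raise KeyError(f"unknown agent profile: {profile!r}")
    spec = profiles[profile]
    return ResolvedDelegation(
        toolsets=list(toolsets or spec.get("toolsets", [])),  # explicit toolsets win
        system_prompt=spec.get("system_prompt"),
        model=spec.get("preferred_model", default_model),     # fall back to session model
    )
```

Raising on an unknown profile, rather than silently falling back, keeps a typo in a profile name from spawning an unscoped sub-agent.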
Step 3 — PM Agent auto-routing (future, optional):
The main Agent could auto-detect task type and route to the right specialist:
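# In run_agent.py, before delegation:
# Classifier labels are not profile keys, so map one to the other explicitly.
TASK_TYPE_TO_PROFILE = {"file_ops": "file-agent", "system_ops": "system-agent", "web_search": "search-agent"}
task_type = classify_task(user_message)  # "file_ops" | "system_ops" | "web_search" | ...
profile = TASK_TYPE_TO_PROFILE.get(task_type)
if profile in agent_profiles:
    delegate_task(goal=user_message, profile=profile)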
This is low-hanging fruit: it builds on the existing delegate_task and toolsets infrastructure and is tracked in #9459.
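As a starting point for `classify_task`, a keyword table may be enough before reaching for a model-based classifier. The sketch below is an assumption about how it could work, not an existing Hermes function: the `TASK_KEYWORDS` table and its entries are invented for illustration, and matching is by word prefix so that plurals such as "invoices" still hit.

```python
import re

# Hypothetical keyword table; the first matching label wins, so order matters.
TASK_KEYWORDS = [
    ("system_ops", ["disk", "memory", "cpu", "process", "network"]),
    ("file_ops", ["file", "folder", "invoice", "pdf", "rename"]),
    ("web_search", ["search", "latest", "news", "weather"]),
]


def classify_task(user_message: str):
    """Return a task label, or None when nothing matches."""
    words = re.findall(r"[a-z0-9]+", user_message.lower())
    for label, keywords in TASK_KEYWORDS:
        # Prefix match: "invoices" matches "invoice", but "profile" does not match "file".
        if any(word.startswith(keyword) for word in words for keyword in keywords):
            return label
    return None
```

Returning `None` for unmatched input lets the main agent keep handling anything the table does not recognize, which makes the router safe to ship with a small table.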
🥈 Pattern 2: Cloud-Local Auto Routing
What Marvis does
Marvis doesn't make users manually choose between cloud and local models. It auto-routes based on task characteristics:
User Input → PM Agent analyzes:
├─ "Organize my invoices" → Cloud intent understanding + Local file search
├─ "Review this contract's risk" → Pure local, no cloud upload (privacy)
└─ "What's the weather today" → Cloud search
The routing logic:
| Factor | Routes to Cloud | Routes to Local |
| --- | --- | --- |
| Task complexity | Multi-step planning, ambiguous intent | Simple, well-defined operations |
| Data sensitivity | Public information | Personal files, contracts, financial data |
| Connectivity requirement | Web search, API calls | File ops, system settings |
| Cost sensitivity | Complex reasoning needs big models | Simple tasks waste cloud tokens |
The key insight: Marvis does heavy pre-processing locally (file indexing, image OCR, text extraction) BEFORE sending anything to the cloud. This means cloud models get structured, pre-digested input instead of raw data — cutting token usage by 60-80%.
Hermes current state
- ✅ Supports multiple model providers
- ✅ /model command for manual switching
- ❌ One model per session — no runtime routing
- ❌ All tool output goes to the cloud model regardless of sensitivity
- ❌ No local pre-processing pipeline
Concrete proposal
Step 1 — Task classifier (lightweight, rule-based first):
# ~/.hermes/routing.yaml
routing:
  enabled: true
  default_model: "deepseek-v4-pro"  # cloud, for general use
  local_models:
    primary: "qwen2.5-7b-local"  # via ollama or similar
    fallback: "deepseek-v4-flash"  # cheap cloud if local unavailable
  rules:
    - name: "privacy-sensitive"
      patterns:
        keywords: ["contract", "financial", "passport", "password", "confidential", "NDA"]
        file_paths: ["~/Documents/finance/", "~/Desktop/tax/"]
      route_to: local
      reason: "contains sensitive data"
    - name: "simple-file-ops"
      patterns:
        intents: ["read_file", "list_directory", "find_file", "file_stats"]
      route_to: local
      reason: "simple operation, no cloud needed"
    - name: "web-search"
      patterns:
        intents: ["web_search", "browser_navigate"]
      route_to: cloud
      reason: "requires internet access"
    - name: "complex-reasoning"
      patterns:
        keywords: ["explain", "analyze", "compare", "refactor", "design"]
        min_complexity: 0.7  # heuristic score
      route_to: cloud
      reason: "needs large model reasoning"
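One way to evaluate rules of this shape is a first-match scan, sketched below. Everything here is an assumption rather than a spec: `match_rule` and `_under` are hypothetical helpers, the rules are shown as already-parsed dicts, the pattern types within a rule are treated as alternatives (any one can trigger it), and keywords match whole words so that "NDA" does not fire on "agenda". The `min_complexity` score is left out because the proposal does not define how it is computed.

```python
import os
import re

# Two rules in the shape of routing.yaml, already parsed into dicts.
RULES = [
    {
        "name": "privacy-sensitive",
        "patterns": {
            "keywords": ["contract", "financial", "passport", "password", "confidential", "NDA"],
            "file_paths": ["~/Documents/finance/", "~/Desktop/tax/"],
        },
        "route_to": "local",
    },
    {
        "name": "web-search",
        "patterns": {"intents": ["web_search", "browser_navigate"]},
        "route_to": "cloud",
    },
]


def _under(path: str, root: str) -> bool:
    """True when path is root or lies inside it, after expanding ~ and normalizing."""
    path = os.path.abspath(os.path.expanduser(path))
    root = os.path.abspath(os.path.expanduser(root))
    return path == root or path.startswith(root + os.sep)


def match_rule(message: str, intent=None, paths=(), rules=RULES):
    """Return the first rule matched by keyword, file path, or intent; None otherwise."""
    words = set(re.findall(r"[a-z0-9]+", message.lower()))
    for rule in rules:
        patterns = rule["patterns"]
        if any(keyword.lower() in words for keyword in patterns.get("keywords", [])):
            return rule
        if any(_under(p, root) for p in paths for root in patterns.get("file_paths", [])):
            return rule
        if intent is not None and intent in patterns.get("intents", []):
            return rule
    return None
```

Because the scan stops at the first hit, listing the privacy rule first means sensitive requests stay local even when a later rule would also match.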
Step 2 — Local pre-processing pipeline:
Before sending context to the cloud model, run a lightweight local model to:
- Extract key information from tool outputs (summarize large file contents)
- Classify task sensitivity
- Structure raw data (JSON-ify free text)
This mirrors what Hermes already does with context compression (#31684), but using a local model instead of just truncation.
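The sketch below shows where such a pipeline stage could sit, using regex redaction and head/tail truncation as a stand-in for the local model. That substitution is deliberate: a real implementation would replace the truncation step with a local summarization call. The function name `preprocess_tool_output`, the `SENSITIVE_PATTERNS` list, and the returned keys are all hypothetical.

```python
import re

# Hypothetical patterns for values that should not leave the machine.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{13,19}\b"),  # long digit runs (card-like numbers)
    re.compile(r"(?i)\b(password|passwd|secret|api[_-]?key)\s*[:=]\s*\S+"),
]


def preprocess_tool_output(raw: str, max_chars: int = 2000) -> dict:
    """Redact sensitive-looking values and shrink the text before any cloud upload."""
    redactions = 0
    text = raw
    for pattern in SENSITIVE_PATTERNS:
        text, count = pattern.subn("[REDACTED]", text)
        redactions += count
    truncated = len(text) > max_chars
    if truncated:
        # Crude placeholder for summarization: keep the head and the tail.
        half = max_chars // 2
        text = text[:half] + "\n...[truncated]...\n" + text[-half:]
    return {
        "text": text,
        "sensitivity": "high" if redactions else "low",
        "redactions": redactions,
        "truncated": truncated,
    }
```

The `sensitivity` field is the piece a router could consume: a "high" result is a signal to keep the rest of the turn on the local model.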
Step 3 — Integration point in run_agent.py:
# Pseudo-code for where routing would hook in:
def select_model(self, user_message: str, context: dict) -> str:
    if not self.routing_enabled:
        return self.default_model
    task_type = self.classify_task(user_message)
    sensitivity = self.assess_sensitivity(user_message, context)
    if sensitivity == "high" or task_type in ["simple_file_ops"]:
        return self.route_to_local()
    elif task_type in ["web_search", "complex_reasoning"]:
        return self.route_to_cloud()
    return self.default_model
🥉 Pattern 3: Desktop/GUI Agent Capabilities
What Marvis does
Marvis can see and control desktop applications, not just the terminal. Examples:
- "Open WeChat, find the last message from Mom, tell her I'll be late"
- "Open my stock trading app, check AAPL price, screenshot the chart"
- "Find the Windows setting that disables lock screen ads" (uses Windows API, not clicking around blindly)
Implementation: Windows API direct calls (via Microsoft partnership) + GUI visual recognition for apps without APIs.
Hermes current state
- vision_analyze for image understanding
Concrete proposal (phased)
Phase 1 — Screen Capture + Click/Type (low effort, WSL-compatible):
Add a desktop toolset with minimal primitives:
# New tools (Linux via xdotool/ydotool, Windows via existing Win API)
@register_tool(toolset="desktop", risk_level="medium")
def desktop_screenshot(region: str = "full") -> str:
    """Capture screen and return path for vision analysis.

    region: 'full' | 'active_window' | 'selection'
    """
    # Linux: import -window root /tmp/screenshot.png  (ImageMagick)
    # Windows: existing win32 API
    pass

@register_tool(toolset="desktop", risk_level="medium")
def desktop_click(x: int, y: int, button: str = "left") -> str:
    """Click at screen coordinates."""
    # Linux: xdotool mousemove X Y click 1
    # Windows: SetCursorPos + mouse_event
    pass

@register_tool(toolset="desktop", risk_level="medium")
def desktop_type(text: str) -> str:
    """Type text at current focus."""
    # Linux: xdotool type "..."
    pass

@register_tool(toolset="desktop", risk_level="low")
def desktop_list_windows() -> list:
    """List all open windows with titles and positions."""
    # Linux: wmctrl -l
    # Windows: EnumWindows
    pass
This alone would enable use cases like:
- "Take a screenshot of this error dialog and explain what's wrong"
- "Fill out this form in my browser" (Agent sees the screen, identifies fields, types)
- "Close all Slack windows"
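As an illustration of how small these primitives can be, here is a sketch of the Linux path for `desktop_list_windows`. It assumes an X11 session with the `wmctrl` binary installed; `parse_wmctrl` is a hypothetical helper, and the `@register_tool` decorator is omitted so the block stands alone. Plain `wmctrl -l` prints the window id, desktop number, client host, and title; window positions would need the `-G` option, which this sketch does not parse.

```python
import subprocess


def parse_wmctrl(output: str) -> list:
    """Parse `wmctrl -l` lines: window id, desktop number, host, then the title."""
    windows = []
    for line in output.splitlines():
        parts = line.split(None, 3)  # titles may contain spaces, so split at most 3 times
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        windows.append({
            "id": parts[0],
            "desktop": int(parts[1]),  # -1 marks a window shown on all desktops
            "host": parts[2],
            "title": parts[3] if len(parts) > 3 else "",
        })
    return windows


def desktop_list_windows() -> list:
    """List open windows on an X11 session; raises if wmctrl is missing or fails."""
    result = subprocess.run(["wmctrl", "-l"], capture_output=True, text=True, check=True)
    return parse_wmctrl(result.stdout)
```

Keeping the parser separate from the subprocess call means the parsing logic can be tested on captured output without a display server.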
Phase 2 — OS API Integration (higher effort, platform-specific):
# Dedicated platform adapters (similar to terminal backends)
@register_tool(toolset="desktop")
def system_setting_get(key: str) -> str:
    """Read a system setting by name."""
    # Windows: registry or Settings app API
    # Linux: gsettings / dconf
    # macOS: defaults read
    pass

@register_tool(toolset="desktop")
def file_search_semantic(query: str, path: str = "~") -> list:
    """Find files by content description, not just filename.

    'the PDF invoice from last month' → finds the right file
    """
    # Uses local embedding model to index files
    pass
WSL consideration: Since many Hermes users (especially developers on Windows) use WSL, the desktop tools should detect the environment and proxy through the Windows host when running in WSL. The existing wsl-windows-interop patterns in Hermes could be extended.
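Environment detection could be as simple as the sketch below. It relies on two things WSL is known to provide: the `WSL_DISTRO_NAME` environment variable and a kernel version string in `/proc/version` that contains "microsoft". The function names and the backend labels returned by `pick_desktop_backend` are hypothetical, and the parameters exist only so the logic can be exercised without a real WSL host.

```python
import os
import sys


def is_wsl(environ=None, proc_version_path="/proc/version") -> bool:
    """Best-effort WSL detection: environment variable first, then the kernel string."""
    environ = os.environ if environ is None else environ
    if environ.get("WSL_DISTRO_NAME"):
        return True
    try:
        with open(proc_version_path, encoding="utf-8", errors="replace") as fh:
            return "microsoft" in fh.read().lower()
    except OSError:
        return False  # no /proc (for example macOS or Windows), so not WSL


def pick_desktop_backend(platform=None, environ=None, proc_version_path="/proc/version") -> str:
    """Choose which adapter the desktop toolset should load (labels are illustrative)."""
    platform = sys.platform if platform is None else platform
    if platform == "win32":
        return "windows-native"
    if platform == "darwin":
        return "macos-native"
    if is_wsl(environ, proc_version_path):
        return "windows-via-wsl"  # proxy calls through the Windows host
    return "linux-native"
```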
4️⃣ Pattern 4: Tiered Security Classification
What Marvis does
Every operation is classified into 3 risk tiers with automatic enforcement:
| Tier | Examples | Handling |
| --- | --- | --- |
| 🟢 Low | Read files, search, display info | AI auto-executes |
| 🟡 Medium | Delete files, modify settings, install software | AI proposes plan → User must confirm → Execute |
| 🔴 High | Payments, transfers, auth changes, sudo rm -rf / | AI forbidden from executing. Must be done manually. |
This is NOT just a prompt-level suggestion — it's enforced at the execution layer. Medium-risk operations trigger a "hard check" (L2 hard inquiry) that cannot be bypassed by the AI.
Hermes current state
- ✅ Terminal confirmation dialogs for dangerous commands (implicit)
- ✅ requires_approval parameter in some tool calls
- ❌ No systematic risk classification across all tools
- ❌ Risk level is not visible to the Agent's reasoning
- ❌ Some dangerous operations can slip through by phrasing
Concrete proposal
Step 1 — Add risk_level to tool registration:
# In tools/registry.py or tool definitions
@register_tool(risk_level="low")
def read_file(path: str, ...): ...
@register_tool(risk_level="medium")
def delete_file(path: str): ...
@register_tool(risk_level="high", human_only=True)
def execute_payment(amount: float, ...): ...
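A decorator along these lines could back the registration calls above. This is a sketch, not the actual Hermes registry: it assumes the metadata lives in plain module-level dicts named `TOOL_RISK_LEVELS` and `TOOL_HUMAN_ONLY`, and `TOOL_TOOLSETS` and `VALID_RISK_LEVELS` are names invented here.

```python
TOOL_RISK_LEVELS = {}   # tool name -> "low" | "medium" | "high"
TOOL_HUMAN_ONLY = {}    # tool name -> True when the agent must never run it
TOOL_TOOLSETS = {}      # tool name -> toolset label, or None
VALID_RISK_LEVELS = ("low", "medium", "high")


def register_tool(risk_level="medium", human_only=False, toolset=None):
    """Decorator that records risk metadata under the decorated function's name."""
    if risk_level not in VALID_RISK_LEVELS:
        # Fail at import time so a typo cannot silently downgrade a tool's tier.
        raise ValueError(f"invalid risk_level: {risk_level!r}")

    def decorator(func):
        TOOL_RISK_LEVELS[func.__name__] = risk_level
        TOOL_HUMAN_ONLY[func.__name__] = human_only
        TOOL_TOOLSETS[func.__name__] = toolset
        return func  # the function itself is left unchanged

    return decorator


@register_tool(risk_level="low")
def read_file(path):
    with open(path, encoding="utf-8") as fh:
        return fh.read()


@register_tool(risk_level="high", human_only=True)
def execute_payment(amount):
    raise RuntimeError("must be performed by a human")
```

Defaulting `risk_level` to "medium" means a tool registered without a tier still goes through confirmation rather than auto-executing.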
Step 2 — Inject risk rules into system prompt:
# Auto-generated from tool registry, injected into every Agent session
RISK_RULES = """
## Operation Safety Tiers
When using tools, always respect these risk levels:
🟢 LOW risk (auto-execute): read_file, search, grep, list, stat, cat
→ Execute immediately, no confirmation needed
🟡 MEDIUM risk (confirm first): delete, move, write, install, systemctl, chmod
→ Before executing: explain what you're about to do and wait for confirmation
→ NEVER batch medium-risk operations without individual confirmation
🔴 HIGH risk (FORBIDDEN): payments, sudo destructive, auth changes, rm -rf /
→ DO NOT execute these under any circumstances
→ Tell the user to perform these operations themselves
"""
Step 3 — Enforce at execution layer (not just prompt):
# In model_tools.py handle_function_call()
def handle_function_call(tool_name: str, args: dict, task_id: str):
    # Unknown tools default to medium so they still require confirmation.
    risk = TOOL_RISK_LEVELS.get(tool_name, "medium")
    # High-risk and human-only tools are always refused, matching the tier table.
    if risk == "high" or TOOL_HUMAN_ONLY.get(tool_name, False):
        return {"error": "This operation requires human execution. Refusing."}
    if risk == "medium" and not confirmation_granted(tool_name, args, task_id):
        return {"requires_confirmation": True, "plan": f"About to run {tool_name} with {args}"}
    return execute_tool(tool_name, args)
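The `confirmation_granted` check is where the "cannot be bypassed by the AI" property has to live, so here is one possible shape for it. This is a sketch under assumptions: approvals are stored in process memory, `grant_confirmation` is a hypothetical function called only from the user-facing layer, and each approval is bound to the exact tool, arguments, and task and is consumed on first use.

```python
import hashlib
import json

_PENDING = set()  # fingerprints of calls the user has approved but not yet run


def _fingerprint(tool_name: str, args: dict, task_id: str) -> str:
    """Stable hash of one specific call; any change to the arguments changes it."""
    payload = json.dumps([task_id, tool_name, args], sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def grant_confirmation(tool_name: str, args: dict, task_id: str) -> None:
    """Called from the UI layer after the user approves one specific call."""
    _PENDING.add(_fingerprint(tool_name, args, task_id))


def confirmation_granted(tool_name: str, args: dict, task_id: str) -> bool:
    """Consume a matching approval; each approval authorizes exactly one call."""
    fingerprint = _fingerprint(tool_name, args, task_id)
    if fingerprint in _PENDING:
        _PENDING.discard(fingerprint)
        return True
    return False
```

Binding the approval to the argument hash means the agent cannot obtain confirmation for one path and then run the tool against another, and single-use consumption prevents one approval from covering a batch.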
What NOT to Copy from Marvis
These are weaknesses in Marvis's design that Hermes should avoid copying:
| Marvis weakness | Why Hermes should NOT adopt it |
| --- | --- |
| Closed ecosystem — cross-device relies on proprietary Tencent app engine | Hermes' MCP + A2A + multi-platform gateway is the right open approach |
| Non-extensible — users cannot create custom Skills/workflows | Hermes' Skill system is a core differentiator; keep it flexible |
| Model lock-in — local = Qwen only, cloud = Hunyuan/DS only | Hermes' provider-agnostic design is a major strength |
| No long-running autonomy — no cron, no idle loops, no overnight tasks | Hermes' cron system + background processes are already superior |
| Search quality issues — early users report poor retrieval accuracy | Hermes' multi-search backends (anysearch, browser, etc.) provide better flexibility |
Related Existing Issues
Why Now
Marvis launched 5 days ago. It's the first consumer product that proves:
- Multi-Agent architecture is not academic — 6 agents working in parallel, on consumer hardware, for non-technical users
- OS-level AI middleware is viable — users are willing to let AI control their desktop if safety guarantees are clear
- Cloud-local routing saves real money — Tencent's free 10M token/day is only sustainable because >60% of processing stays local
- GUI Agent is the natural extension of Code Agent — the jump from "AI that writes code" to "AI that operates your computer" is smaller than it seems
Hermes already has better infrastructure than Marvis in many ways (MCP, multi-model, cron, Skills, platforms gateway). What's missing is:
- Default configurations that make the infrastructure accessible (Pattern 1)
- Intelligence layer that routes between capabilities (Pattern 2)
- Desktop reach beyond the terminal (Pattern 3)
- Safety guardrails that make desktop control trustworthy (Pattern 4)
None of these require a rewrite — they're incremental additions to existing architecture.
Implementation Priority (my recommendation)
| Priority | Pattern | Effort | Impact | Dependency |
| --- | --- | --- | --- | --- |
| 🥇 P0 | Pattern 4: Security Tiers | ~200 LOC | Safety baseline for all other patterns | None |
| 🥈 P1 | Pattern 1: Pre-built Agent Profiles | ~500 LOC | Unlocks delegation UX | #9459 |
| 🥉 P2 | Pattern 2: Cloud-Local Routing | ~800 LOC | Cost optimization | Pattern 1 for routing targets |
| P3 | Pattern 3: Desktop GUI Agent | ~2000+ LOC | Major differentiation | Pattern 4 for safety |
Security first (Pattern 4), then pre-built agents (Pattern 1), then smart routing (Pattern 2), then desktop reach (Pattern 3). Each builds on the previous.
Researched and drafted by a Hermes user who also runs OpenClaw. Happy to help with testing or refining any of these ideas.