Overview
Hermes Agent communicates with users on messaging platforms almost entirely through plain text. The only exception is Discord's button-based approval UI for dangerous commands. Meanwhile, Telegram, Discord, Slack, and WhatsApp all offer rich interactive components (inline keyboards, button grids, select menus, action rows, carousels) that Hermes otherwise leaves unused.
This is a significant missed opportunity. Research into 30+ agent interfaces revealed a key theme: "Step Collapse" — reducing multi-turn text conversations into single structured interactions. Instead of asking "Which model do you want? Here are the options: 1. GPT-4, 2. Claude, 3. Gemini..." and waiting for a text reply, the agent should present a button grid and get a one-tap answer. Instead of describing a complex plan in prose, the agent should present a structured checklist with approve/modify/reject options.
This proposal is inspired by the A2UI (Agent-to-User Interface) protocol by Google/Thesys, Magentic-UI's co-planning model, and the broader Generative UI paradigm.
Research Findings
The Generative UI Paradigm
Three patterns identified in the research:
- Static GenUI: Pre-built UI components triggered by the agent (e.g., "show a model selector" → renders a known dropdown). Simplest, most secure.
- Declarative GenUI: Agent returns a JSON UI spec, frontend renders appropriate native components. More flexible.
- Open-ended GenUI: Agent returns full HTML/iframe. Maximum freedom, but security concerns.
For messaging platforms, Static GenUI is the right fit — the agent triggers predefined interactive components using platform-native APIs.
Platform Capabilities (Currently Unused)
Telegram:
- Inline keyboards (button grids under messages)
- Reply keyboards (custom keyboard replacing the default one)
- Callback queries (button tap handlers with data payloads)
- Inline mode (@ mention in any chat triggers agent suggestions)
- Web Apps (mini web apps inside Telegram)
- Polls (native poll creation)
- Reactions
Discord:
- Buttons (already used for approval, but only there)
- Select menus (dropdowns with single/multi select)
- Modals (popup forms with text inputs)
- Message components (action rows with mixed button/select)
- Slash command options (typed parameters with autocomplete)
- Threads (used, but could be used more strategically)
- Embeds with structured fields
Slack:
- Block Kit: sections, dividers, images, actions, inputs, modals
- Buttons, select menus, date pickers, time pickers
- Overflow menus
- Interactive modals with form inputs
- Workflow steps
WhatsApp:
- Interactive messages: buttons (up to 3), list messages (up to 10 sections)
- Reply buttons
- Template messages with quick replies
Execution Plans (Magentic-UI / Windsurf Inspired)
A particularly impactful application of structured UI: execution plans. Before the agent performs a complex multi-step task, it presents a structured plan:
📋 Execution Plan:
1. ☐ Read the existing test file
2. ☐ Analyze the function signatures
3. ☐ Generate test cases for each function
4. ☐ Run tests and fix failures
[▶ Execute] [✏️ Modify] [❌ Cancel]
This addresses a key UX concern: users often don't know what the agent is about to do until it's already doing it. Magentic-UI calls this "Co-Planning" and found it dramatically increased user trust and satisfaction.
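As a rough illustration of the plan shown above, the following sketch renders a list of steps into the same text layout. `PlanStep` and `render_plan` are hypothetical names introduced only for this example; nothing like them exists in Hermes today.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    """One line of an execution plan (hypothetical type for this sketch)."""
    text: str
    done: bool = False


def render_plan(steps, actions=("▶ Execute", "✏️ Modify", "❌ Cancel")):
    """Render steps as a numbered checklist followed by a row of action labels."""
    lines = ["📋 Execution Plan:"]
    for number, step in enumerate(steps, start=1):
        box = "☑" if step.done else "☐"
        lines.append(f"{number}. {box} {step.text}")
    # On a text-only surface the actions are only labels; a platform
    # adapter would turn them into real buttons.
    lines.append(" ".join(f"[{label}]" for label in actions))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_plan([
        PlanStep("Read the existing test file", done=True),
        PlanStep("Analyze the function signatures"),
    ]))
```

Re-rendering with `done=True` on a step is one simple way to show progress as steps complete.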
Current State in Hermes Agent
Interactive components used today:
- Discord: ExecApprovalView with Allow Once / Always Allow / Deny buttons (only for dangerous command approval)
- All other platforms: text-only interaction
Clarify tool: The clarify tool already supports multiple-choice questions, but renders them as numbered text lists, not native buttons. On Telegram, a clarify call with 4 choices sends text like "1. Option A\n2. Option B..." instead of an inline keyboard.
Relevant code:
gateway/platforms/base.py — PlatformAdapter base class has send_message but no send_interactive or send_components method
gateway/platforms/discord.py — Has ExecApprovalView showing the pattern works
tools/approval.py — Approval system that could benefit from native UI on all platforms
Implementation Plan
Skill vs. Tool Classification
This should be a core codebase change. It requires modifications to the platform adapters, the clarify tool, and potentially the approval system. It touches binary/event-driven platform APIs that can't be expressed as shell commands.
What We'd Need
- Base adapter extension: Add a send_interactive() method to PlatformAdapter with a platform-agnostic component model
- Platform-specific renderers: Each adapter translates abstract components to native platform elements
- Callback handling: Route button/menu interactions back to the agent
- Clarify tool upgrade: Use native components instead of text lists
- New present_plan tool or enhancement to todo tool for structured execution plans
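One possible shape for the base adapter extension is sketched below. It assumes a simplified `send_message(chat_id, text)` signature, which may not match the real method in gateway/platforms/base.py, and `RecordingAdapter` is a made-up test double. The idea is that `send_interactive()` falls back to a numbered text list by default, so adapters without native components keep working unchanged.

```python
import asyncio
from abc import ABC, abstractmethod


class PlatformAdapter(ABC):
    """Simplified stand-in for the real base class (signatures are assumptions)."""

    @abstractmethod
    async def send_message(self, chat_id: str, text: str) -> None:
        ...

    async def send_interactive(self, chat_id: str, prompt: str,
                               choices: list[tuple[str, str]]) -> None:
        """Send a prompt with (label, data) choices.

        Default behavior degrades to a numbered text list; platform
        adapters would override this to emit native buttons or menus.
        """
        lines = [prompt]
        lines += [f"{n}. {label}" for n, (label, _data) in enumerate(choices, start=1)]
        await self.send_message(chat_id, "\n".join(lines))


class RecordingAdapter(PlatformAdapter):
    """Test double that records outgoing messages instead of sending them."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    async def send_message(self, chat_id: str, text: str) -> None:
        self.sent.append((chat_id, text))


if __name__ == "__main__":
    adapter = RecordingAdapter()
    asyncio.run(adapter.send_interactive(
        "chat-1", "Which model?", [("GPT-4", "gpt4"), ("Claude", "claude")]))
    print(adapter.sent[0][1])
```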
Component Model (Platform-Agnostic)
# Abstract components that each platform renders natively
ButtonGrid(buttons=[Button(label="GPT-4", data="gpt4"), ...])
SelectMenu(options=[Option(label="High", value="high"), ...], placeholder="Choose effort")
Checklist(items=[CheckItem(text="Read tests", checked=False), ...], actions=["Execute", "Cancel"])
Confirmation(text="Delete 47 files?", confirm="Delete", deny="Cancel")
Poll(question="Which approach?", options=["A: Refactor", "B: Rewrite"])
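The constructors above are shorthand. A minimal way to make them concrete is a set of plain dataclasses, as sketched here. The class and field names follow the snippet above; the defaults and the choice of dataclasses are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Button:
    label: str  # text shown on the button
    data: str   # payload delivered back to the agent on tap


@dataclass
class ButtonGrid:
    buttons: list[Button]


@dataclass
class Option:
    label: str
    value: str


@dataclass
class SelectMenu:
    options: list[Option]
    placeholder: str = ""


@dataclass
class CheckItem:
    text: str
    checked: bool = False


@dataclass
class Checklist:
    items: list[CheckItem]
    actions: list[str]


@dataclass
class Confirmation:
    text: str
    confirm: str = "Confirm"
    deny: str = "Cancel"


@dataclass
class Poll:
    question: str
    options: list[str]


if __name__ == "__main__":
    print(Confirmation(text="Delete 47 files?", confirm="Delete"))
```

Because these are plain data, each adapter can inspect the component type and decide how to render it, or fall back to text.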
Phased Rollout
Phase 1: Clarify Tool + Approval Upgrade
- Modify clarify tool to emit structured choice data (not just text)
- Each platform adapter renders choices as native components:
- Telegram: InlineKeyboardMarkup with callback buttons
- Discord: View with Button components (extend existing pattern)
- Slack: Block Kit actions with buttons
- WhatsApp: Interactive button messages (up to 3) or list messages
- CLI: numbered list with keyboard input (current behavior, enhanced)
- Upgrade approval flow to use native buttons on Telegram and Slack (already works on Discord)
- Handle callback routing: platform receives button tap → resolves pending clarify/approval future
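The Telegram rendering and the callback routing in Phase 1 could look roughly like this sketch. The `inline_keyboard` / `text` / `callback_data` structure and the 64-byte limit on `callback_data` come from the Telegram Bot API. `PendingChoices`, the `request_id:index` payload format, and the helper names are assumptions made for this example only.

```python
import asyncio
import uuid


def telegram_inline_keyboard(choices, request_id, per_row=2):
    """Build a Bot API reply_markup dict with one callback button per choice."""
    rows = []
    for start in range(0, len(choices), per_row):
        row = []
        for index, label in enumerate(choices[start:start + per_row], start=start):
            data = f"{request_id}:{index}"
            # Telegram limits callback_data to 64 bytes.
            if len(data.encode("utf-8")) > 64:
                raise ValueError("callback_data exceeds 64 bytes")
            row.append({"text": label, "callback_data": data})
        rows.append(row)
    return {"inline_keyboard": rows}


class PendingChoices:
    """Maps request ids to futures awaited by a clarify/approval call."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Future] = {}

    def register(self) -> tuple[str, asyncio.Future]:
        request_id = uuid.uuid4().hex[:12]
        future = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
        return request_id, future

    def resolve(self, callback_data: str) -> bool:
        """Resolve the matching future; return False for unknown or stale taps."""
        request_id, _, index = callback_data.partition(":")
        if not index.isdigit():
            return False
        future = self._pending.pop(request_id, None)
        if future is None or future.done():
            return False
        future.set_result(int(index))
        return True
```

The tool side awaits the future returned by `register()`, and the adapter's callback handler calls `resolve()` with the tapped button's payload. A second tap on the same message is ignored because the request id has already been removed.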
Phase 2: Execution Plans & Structured Outputs
- New present_plan tool or enhancement to clarify for structured plans
- Agent can present a numbered plan with approve/modify/cancel actions
- Render as:
- Telegram: Message with inline keyboard (Execute / Modify / Cancel)
- Discord: Embed with action row buttons
- Slack: Block Kit with sections and action buttons
- CLI: Formatted plan with input prompt
- Track plan execution progress (update the plan message with ☑ as steps complete)
- Integrate with todo tool: present the todo list as interactive UI
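For the Slack rendering in Phase 2, a plan could be expressed as a Block Kit `section` plus an `actions` block, as in this sketch. The block and button structure follows Slack's Block Kit format. The function name, the `plan_*` action ids, and passing the plan id in `value` are assumptions for illustration.

```python
def slack_plan_blocks(title, steps, plan_id):
    """Return Block Kit blocks for a plan.

    `steps` is a list of (text, done) pairs; `plan_id` is echoed back in
    each button's value so the handler can find the pending plan.
    """
    body = "\n".join(
        f"{number}. {'☑' if done else '☐'} {text}"
        for number, (text, done) in enumerate(steps, start=1)
    )

    def button(label, action, style=None):
        element = {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            "action_id": f"plan_{action}",
            "value": plan_id,
        }
        if style is not None:
            element["style"] = style  # Block Kit accepts "primary" or "danger"
        return element

    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": f"*{title}*\n{body}"}},
        {"type": "actions", "elements": [
            button("Execute", "execute", "primary"),
            button("Modify", "modify"),
            button("Cancel", "cancel", "danger"),
        ]},
    ]
```

Calling the same function again with updated `done` flags yields the blocks for a progress update to the original message.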
Phase 3: Rich Agent Outputs
- Structured data display: tables rendered as platform-native formats
- Progress cards: show long-running task progress as updating embeds/messages
- Result summaries: present key findings as structured cards, not prose
- Quick action suggestions: after completing a task, offer "What's next?" buttons
- Telegram Web Apps: for complex interactions, open a mini web app inside Telegram
- Polling: use native polls for user preference collection
Pros & Cons
Pros
- Dramatic UX improvement — one tap instead of typing "option 2"
- Reduces conversation turns (Step Collapse)
- Increases user trust via visible execution plans
- Uses platform capabilities that are already built and free
- Makes the agent feel native to each platform, not like a text bot
- Low-risk: can be rolled out incrementally, starting with clarify
Cons / Risks
- Platform divergence: each platform has different component capabilities and limits
- Callback routing adds complexity to the gateway event loop
- WhatsApp has the most limited interactive components (3 buttons max)
- CLI doesn't have "buttons" — need graceful degradation to text
- Agent needs to learn when to use structured UI vs. prose (prompt engineering)
- Rate limits on interactive messages differ by platform
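One way to contain platform divergence is to have each adapter declare its limits and let a shared helper pick the richest rendering that fits. The sketch below is only illustrative: `choose_rendering` is a hypothetical helper, the 3-button WhatsApp limit is taken from the text above, and the list-row limit used in the example call is an assumption that should be checked against current platform documentation.

```python
from typing import Optional


def choose_rendering(num_choices: int, max_buttons: int,
                     max_list_rows: Optional[int] = None) -> str:
    """Pick 'buttons', 'list', or 'text' based on a platform's declared limits."""
    if num_choices <= max_buttons:
        return "buttons"
    if max_list_rows is not None and num_choices <= max_list_rows:
        return "list"
    # Nothing native fits: fall back to a numbered text list.
    return "text"


if __name__ == "__main__":
    # WhatsApp-like limits: 3 buttons, list rows assumed to be 10.
    for count in (2, 5, 12):
        print(count, choose_rendering(count, max_buttons=3, max_list_rows=10))
    # CLI: no buttons at all, so every prompt degrades to text.
    print(choose_rendering(2, max_buttons=0))
```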
Open Questions
- Should the agent decide when to use structured UI, or should specific tools always produce it?
- How do we handle platforms with limited component support (WhatsApp: 3 buttons max)?
- Should execution plans be opt-in (user requests them) or default for complex tasks?
- Should we support Telegram Web Apps for complex form inputs, or keep it simple with inline keyboards?
- How do we handle callback timeouts? (Telegram callbacks expire after ~30 seconds)
References
gateway/platforms/discord.py: ExecApprovalView (line 769+) as working pattern