Feature: Platform-Native Rich Interactions — Inline Keyboards, Execution Plans & Structured UI Components

## Overview

Hermes Agent communicates with users on messaging platforms almost entirely through plain text; the only exception is Discord's button-based approval UI for dangerous commands. Meanwhile, Telegram, Discord, Slack, and WhatsApp all offer rich interactive components (inline keyboards, button grids, select menus, action rows, carousels) that otherwise go unused.

This is a significant missed opportunity. Research into 30+ agent interfaces revealed a key theme: **"Step Collapse"** — reducing multi-turn text conversations into single structured interactions. Instead of asking "Which model do you want? Here are the options: 1. GPT-4, 2. Claude, 3. Gemini..." and waiting for a text reply, the agent should present a button grid and get a one-tap answer. Instead of describing a complex plan in prose, the agent should present a structured checklist with approve/modify/reject options.

This proposal draws on the A2UI (Agent-to-User Interface) protocol by Google/Thesys, Magentic-UI's co-planning model, and the broader Generative UI paradigm.

---

## Research Findings

### The Generative UI Paradigm

The research identified three patterns:

1. **Static GenUI**: Pre-built UI components triggered by the agent (e.g., "show a model selector" → renders a known dropdown). Simplest, most secure.
2. **Declarative GenUI**: Agent returns a JSON UI spec, frontend renders appropriate native components. More flexible.
3. **Open-ended GenUI**: Agent returns full HTML/iframe. Maximum freedom, but security concerns.

For messaging platforms, **Static GenUI** is the right fit — the agent triggers predefined interactive components using platform-native APIs.
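As a rough illustration of the Static GenUI pattern, the sketch below keeps a registry of predefined components that the agent can only reference by name. The registry, the `model_selector` entry, and the spec dict shape are all hypothetical; nothing like this exists in Hermes today.

```python
from typing import Callable, Dict

# Hypothetical registry: component name -> builder returning an abstract spec.
COMPONENT_REGISTRY: Dict[str, Callable[[], dict]] = {}


def register(name: str):
    """Decorator that adds a builder to the registry under `name`."""
    def decorator(builder):
        COMPONENT_REGISTRY[name] = builder
        return builder
    return decorator


@register("model_selector")
def model_selector() -> dict:
    # A predefined component; the agent supplies no markup of its own.
    return {
        "kind": "button_grid",
        "prompt": "Which model do you want?",
        "choices": ["GPT-4", "Claude", "Gemini"],
    }


def render_static(name: str) -> dict:
    """Resolve a component by name; unknown names are rejected, never improvised."""
    if name not in COMPONENT_REGISTRY:
        raise ValueError(f"unknown component: {name!r}")
    return COMPONENT_REGISTRY[name]()
```

Because the agent can only pick from registered names, the attack surface stays the same as for plain text, which is the security argument for the static variant.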

### Platform Capabilities (Currently Unused)

**Telegram**:
- Inline keyboards (button grids under messages)
- Reply keyboards (custom keyboard replacing the default one)
- Callback queries (button tap handlers with data payloads)
- Inline mode (typing `@botname` in any chat triggers agent suggestions)
- Web Apps (mini web apps inside Telegram)
- Polls (native poll creation)
- Reactions
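For reference, an inline keyboard is just a nested list of button objects passed as `reply_markup` to `sendMessage`. The helper below is a minimal sketch of building that structure; the function name and the `(label, callback_data)` tuple convention are assumptions made for illustration, while the `inline_keyboard`, `text`, and `callback_data` fields and the 1 to 64 byte limit on `callback_data` come from the Bot API.

```python
from typing import Dict, List, Tuple


def build_inline_keyboard(choices: List[Tuple[str, str]], columns: int = 2) -> Dict[str, list]:
    """Build a Telegram `reply_markup` dict from (label, callback_data) pairs."""
    buttons = []
    for label, data in choices:
        # The Bot API limits callback_data to 1-64 bytes.
        if not 1 <= len(data.encode("utf-8")) <= 64:
            raise ValueError(f"callback_data must be 1-64 bytes: {data!r}")
        buttons.append({"text": label, "callback_data": data})
    # Chunk the flat button list into rows of `columns` buttons each.
    rows = [buttons[i:i + columns] for i in range(0, len(buttons), columns)]
    return {"inline_keyboard": rows}
```

The tight `callback_data` budget means the payload should carry a short token or ID rather than the full choice text.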

**Discord**:
- Buttons (already used for approval, but only there)
- Select menus (dropdowns with single/multi select)
- Modals (popup forms with text inputs)
- Message components (up to five action rows per message; each row holds either up to five buttons or a single select menu, not a mix)
- Slash command options (typed parameters with autocomplete)
- Threads (used, but could be used more strategically)
- Embeds with structured fields
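At the API level, Discord buttons are plain JSON: an action row (component type `1`) wrapping buttons (type `2`). The sketch below builds that raw `components` array, assuming the classic layout of five rows of five buttons; the helper name and tuple convention are hypothetical, and a real adapter would more likely use the `View`/`Button` classes of its Discord library.

```python
from typing import List, Tuple

ACTION_ROW = 1        # Discord component type: action row
BUTTON = 2            # Discord component type: button
STYLE_SECONDARY = 2   # grey button style


def build_discord_components(choices: List[Tuple[str, str]]) -> List[dict]:
    """Return the raw `components` array for (label, custom_id) pairs."""
    if len(choices) > 25:
        raise ValueError("5 action rows x 5 buttons allows at most 25 choices")
    rows = []
    for start in range(0, len(choices), 5):
        rows.append({
            "type": ACTION_ROW,
            "components": [
                {"type": BUTTON, "style": STYLE_SECONDARY, "label": label, "custom_id": custom_id}
                for label, custom_id in choices[start:start + 5]
            ],
        })
    return rows
```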

**Slack**:
- Block Kit: sections, dividers, images, actions, inputs, modals
- Buttons, select menus, date pickers, time pickers
- Overflow menus
- Interactive modals with form inputs
- Workflow steps
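A Block Kit message is a list of block dicts. The following sketch pairs a `section` block (the prompt) with an `actions` block of buttons; the function name and the `choice_<n>` scheme for `action_id` are illustrative assumptions, while the block and element field names follow Block Kit.

```python
from typing import List, Tuple


def build_slack_blocks(prompt: str, choices: List[Tuple[str, str]]) -> List[dict]:
    """Return Block Kit blocks: a prompt section followed by a row of buttons."""
    if len(choices) > 25:
        raise ValueError("an actions block holds at most 25 elements")
    buttons = [
        {
            "type": "button",
            "text": {"type": "plain_text", "text": label},
            # action_id must be unique within the message (naming scheme assumed here).
            "action_id": f"choice_{index}",
            "value": value,
        }
        for index, (label, value) in enumerate(choices)
    ]
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": prompt}},
        {"type": "actions", "elements": buttons},
    ]
```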

**WhatsApp**:
- Interactive messages: buttons (up to 3), list messages (up to 10 rows in total, grouped into sections)
- Reply buttons
- Template messages with quick replies
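Because WhatsApp is the most constrained platform, a renderer has to choose between two message shapes. The sketch below builds the `interactive` object of a Cloud API message: reply buttons for up to 3 choices, a single-section list for up to 10. The helper name, the `menu_label` default, and the "Options" section title are assumptions; the per-field length limits that WhatsApp also enforces are not checked here.

```python
from typing import List, Tuple


def build_whatsapp_interactive(prompt: str, choices: List[Tuple[str, str]],
                               menu_label: str = "Choose") -> dict:
    """Return the `interactive` object for (label, id) pairs."""
    if not choices:
        raise ValueError("at least one choice is required")
    if len(choices) <= 3:
        # Up to 3 choices fit as reply buttons.
        buttons = [{"type": "reply", "reply": {"id": value, "title": label}}
                   for label, value in choices]
        return {"type": "button", "body": {"text": prompt}, "action": {"buttons": buttons}}
    if len(choices) <= 10:
        # 4-10 choices fall back to a list message with one section.
        rows = [{"id": value, "title": label} for label, value in choices]
        return {
            "type": "list",
            "body": {"text": prompt},
            "action": {"button": menu_label, "sections": [{"title": "Options", "rows": rows}]},
        }
    raise ValueError("more than 10 choices cannot be shown natively; fall back to text")
```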

### Execution Plans (Magentic-UI / Windsurf Inspired)

A particularly impactful application of structured UI: **execution plans**. Before the agent performs a complex multi-step task, it presents a structured plan:

```
📋 Execution Plan:
1. ☐ Read the existing test file
2. ☐ Analyze the function signatures
3. ☐ Generate test cases for each function
4. ☐ Run tests and fix failures

[▶ Execute] [✏️ Modify] [❌ Cancel]
```

This addresses a key UX concern: users often don't know what the agent is about to do until it's already doing it. Magentic-UI calls this "Co-Planning" and found it dramatically increased user trust and satisfaction.

---

## Current State in Hermes Agent

**Interactive components used today**:
- Discord: `ExecApprovalView` with Allow Once / Always Allow / Deny buttons (only for dangerous command approval)
- All other platforms: text-only interaction

**Clarify tool**: The `clarify` tool already supports multiple-choice questions, but renders them as numbered text lists, not native buttons. On Telegram, a clarify call with 4 choices sends text like "1. Option A\n2. Option B..." instead of an inline keyboard.

**Relevant code**:
- `gateway/platforms/base.py` — `PlatformAdapter` base class has `send_message` but no `send_interactive` or `send_components` method
- `gateway/platforms/discord.py` — Has `ExecApprovalView` showing the pattern works
- `tools/approval.py` — Approval system that could benefit from native UI on all platforms

---

## Implementation Plan

### Skill vs. Tool Classification

This should be a **core codebase change**, not a skill. It requires modifications to the platform adapters, the clarify tool, and potentially the approval system, and it depends on event-driven platform APIs (button callbacks, interaction events) that can't be expressed as shell commands.

### What We'd Need

1. **Base adapter extension**: Add `send_interactive()` method to `PlatformAdapter` with a platform-agnostic component model
2. **Platform-specific renderers**: Each adapter translates abstract components to native platform elements
3. **Callback handling**: Route button/menu interactions back to the agent
4. **Clarify tool upgrade**: Use native components instead of text lists
5. **New `present_plan` tool** or enhancement to `todo` tool for structured execution plans
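One way the base adapter extension could look is sketched below: `send_interactive()` gets a default implementation that degrades to a numbered text list, so adapters without native components keep working and the others override it. The class here is a stand-in, not the real `PlatformAdapter`; the `send_message` signature, the `Choice` alias, and `RecordingAdapter` are assumptions for illustration.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import List, Tuple

Choice = Tuple[str, str]  # (label, value); hypothetical convention


class PlatformAdapter(ABC):
    """Stand-in for the real base class; only the method shape matters here."""

    @abstractmethod
    async def send_message(self, chat_id: str, text: str) -> None:
        ...

    async def send_interactive(self, chat_id: str, prompt: str, choices: List[Choice]) -> None:
        """Default: degrade to a numbered list. Adapters with native components override this."""
        lines = [prompt] + [f"{i}. {label}" for i, (label, _value) in enumerate(choices, start=1)]
        await self.send_message(chat_id, "\n".join(lines))


class RecordingAdapter(PlatformAdapter):
    """Toy adapter that records outgoing text instead of calling a platform API."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    async def send_message(self, chat_id: str, text: str) -> None:
        self.sent.append((chat_id, text))
```

Putting the fallback in the base class means the clarify tool can always call `send_interactive()` without checking which platform it is talking to.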

### Component Model (Platform-Agnostic)

```python
# Abstract components that each platform renders natively
ButtonGrid(buttons=[Button(label="GPT-4", data="gpt4"), ...])
SelectMenu(options=[Option(label="High", value="high"), ...], placeholder="Choose effort")
Checklist(items=[CheckItem(text="Read tests", checked=False), ...], actions=["Execute", "Cancel"])
Confirmation(text="Delete 47 files?", confirm="Delete", deny="Cancel")
Poll(question="Which approach?", options=["A: Refactor", "B: Rewrite"])
```
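The constructors above could be backed by plain dataclasses. The definitions below are one possible sketch: the field names are taken from the example calls, while the defaults and the `columns` field are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Button:
    label: str
    data: str  # opaque payload returned when the button is tapped


@dataclass(frozen=True)
class ButtonGrid:
    buttons: List[Button]
    columns: int = 2  # assumed layout hint; platforms may ignore it


@dataclass(frozen=True)
class Option:
    label: str
    value: str


@dataclass(frozen=True)
class SelectMenu:
    options: List[Option]
    placeholder: str = ""


@dataclass(frozen=True)
class CheckItem:
    text: str
    checked: bool = False


@dataclass(frozen=True)
class Checklist:
    items: List[CheckItem]
    actions: List[str] = field(default_factory=lambda: ["Execute", "Cancel"])


@dataclass(frozen=True)
class Confirmation:
    text: str
    confirm: str = "Confirm"
    deny: str = "Cancel"


@dataclass(frozen=True)
class Poll:
    question: str
    options: List[str]
```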

### Phased Rollout

**Phase 1: Clarify Tool + Approval Upgrade**
- Modify `clarify` tool to emit structured choice data (not just text)
- Each platform adapter renders choices as native components:
  - Telegram: `InlineKeyboardMarkup` with callback buttons
  - Discord: `View` with `Button` components (extend existing pattern)
  - Slack: Block Kit `actions` with buttons
  - WhatsApp: Interactive button messages (up to 3) or list messages
  - CLI: numbered list with keyboard input (current behavior, enhanced)
- Upgrade approval flow to use native buttons on Telegram and Slack (already works on Discord)
- Handle callback routing: platform receives button tap → resolves pending clarify/approval future
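The callback routing step can be sketched as a small registry of `asyncio` futures: the tool registers a pending question and awaits its future, and the platform's button handler resolves it. `PendingInteractions` and the `<token>:<value>` payload format are hypothetical choices, not existing Hermes code.

```python
import asyncio
import uuid
from typing import Dict, Tuple


class PendingInteractions:
    """Maps short tokens to futures awaited by the clarify/approval tools."""

    def __init__(self) -> None:
        self._pending: Dict[str, asyncio.Future] = {}

    def create(self) -> Tuple[str, asyncio.Future]:
        """Register a pending question; must be called from a running event loop."""
        token = uuid.uuid4().hex[:12]  # short, to leave room in small callback payloads
        future = asyncio.get_running_loop().create_future()
        self._pending[token] = future
        return token, future

    def resolve(self, callback_data: str) -> bool:
        """Handle a button tap whose payload is '<token>:<value>'.

        Returns False for malformed payloads and for unknown or already-answered tokens.
        """
        token, separator, value = callback_data.partition(":")
        if not separator:
            return False
        future = self._pending.pop(token, None)
        if future is None or future.done():
            return False
        future.set_result(value)
        return True
```

Popping the token on first use makes repeated taps on the same button harmless: the second tap simply returns `False`.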

**Phase 2: Execution Plans & Structured Outputs**
- New `present_plan` tool, or an enhancement to the `clarify` tool, for structured plans
- Agent can present a numbered plan with approve/modify/cancel actions
- Render as:
  - Telegram: Message with inline keyboard (Execute / Modify / Cancel)
  - Discord: Embed with action row buttons
  - Slack: Block Kit with sections and action buttons
  - CLI: Formatted plan with input prompt
- Track plan execution progress (update the plan message with ☑ as steps complete)
- Integrate with `todo` tool — present the todo list as interactive UI
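Progress tracking amounts to re-rendering the plan text and editing the original message in place. The sketch below assumes hypothetical `PlanStep` and `Plan` classes; the payload helper returns arguments for Telegram's `editMessageText` method (`chat_id`, `message_id`, `text`), and the other adapters would need their own edit call.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlanStep:
    text: str
    done: bool = False


@dataclass
class Plan:
    steps: List[PlanStep]
    title: str = "Execution Plan"

    def complete(self, index: int) -> None:
        """Mark the step at zero-based `index` as finished."""
        self.steps[index].done = True

    def render(self) -> str:
        """Render the plan as text, with a checked box for finished steps."""
        lines = [f"📋 {self.title}:"]
        for number, step in enumerate(self.steps, start=1):
            box = "☑" if step.done else "☐"
            lines.append(f"{number}. {box} {step.text}")
        return "\n".join(lines)


def telegram_edit_payload(chat_id: int, message_id: int, plan: Plan) -> dict:
    """Arguments for editMessageText, which rewrites an already-sent message."""
    return {"chat_id": chat_id, "message_id": message_id, "text": plan.render()}
```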

**Phase 3: Rich Agent Outputs**
- Structured data display: tables rendered as platform-native formats
- Progress cards: show long-running task progress as updating embeds/messages
- Result summaries: present key findings as structured cards, not prose
- Quick action suggestions: after completing a task, offer "What's next?" buttons
- Telegram Web Apps: for complex interactions, open a mini web app inside Telegram
- Polling: use native polls for user preference collection
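For structured data display, a lowest-common-denominator fallback is an aligned text table that each adapter can wrap in its own monospace formatting. The helper below is only a sketch of that fallback; it assumes every row has as many cells as there are headers, and the function name is hypothetical.

```python
from typing import List, Sequence


def render_table(headers: Sequence[str], rows: List[Sequence[object]]) -> str:
    """Render an aligned plain-text table; callers wrap it in a monospace block."""
    cells = [[str(h) for h in headers]] + [[str(c) for c in row] for row in rows]
    # Each column is as wide as its longest cell.
    widths = [max(len(row[i]) for row in cells) for i in range(len(headers))]

    def format_row(row: List[str]) -> str:
        return " | ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()

    separator = "-+-".join("-" * width for width in widths)
    return "\n".join([format_row(cells[0]), separator] + [format_row(r) for r in cells[1:]])
```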

---

## Pros & Cons

### Pros
- Dramatic UX improvement — one tap instead of typing "option 2"
- Reduces conversation turns (Step Collapse)
- Increases user trust via visible execution plans
- Uses platform capabilities that are already built and free
- Makes the agent feel native to each platform, not like a text bot
- Low-risk: can be rolled out incrementally, starting with clarify

### Cons / Risks
- Platform divergence: each platform has different component capabilities and limits
- Callback routing adds complexity to the gateway event loop
- WhatsApp has the most limited interactive components (3 buttons max)
- CLI doesn't have "buttons" — need graceful degradation to text
- Agent needs to learn when to use structured UI vs. prose (prompt engineering)
- Rate limits on interactive messages differ by platform
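Platform divergence could be handled with a small capability table consulted before rendering. The sketch below is illustrative only: the table and function are hypothetical, WhatsApp's limits are the ones discussed in this issue, the value of 25 for Discord and Slack reflects their per-message button layout limits, and `None` for Telegram means no limit is enforced in this sketch rather than that none exists.

```python
from typing import Dict, Optional

# Max number of choices per native rendering kind (assumed values, see lead-in).
PLATFORM_LIMITS: Dict[str, Dict[str, Optional[int]]] = {
    "telegram": {"buttons": None},
    "discord": {"buttons": 25},
    "slack": {"buttons": 25},
    "whatsapp": {"buttons": 3, "list": 10},
    "cli": {},  # no native components: always text
}


def pick_rendering(platform: str, n_choices: int) -> str:
    """Return 'buttons', 'list', or 'text' for a question with `n_choices` options."""
    if n_choices < 1:
        raise ValueError("n_choices must be at least 1")
    limits = PLATFORM_LIMITS.get(platform, {})  # unknown platforms degrade to text
    for kind in ("buttons", "list"):
        if kind in limits and (limits[kind] is None or n_choices <= limits[kind]):
            return kind
    return "text"
```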

---

## Open Questions

- Should the agent decide when to use structured UI, or should specific tools always produce it?
- How do we handle platforms with limited component support (WhatsApp: 3 buttons max)?
- Should execution plans be opt-in (user requests them) or default for complex tasks?
- Should we support Telegram Web Apps for complex form inputs, or keep it simple with inline keyboards?
- How do we handle callback timeouts? (Telegram callbacks expire after ~30 seconds)
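Whatever the per-platform expiry turns out to be, the waiting side can bound its own wait independently. A minimal sketch, assuming the pending answer is an `asyncio` future and that a `default` value (or a fall back to a text prompt) is acceptable on timeout; the function name and parameters are hypothetical.

```python
import asyncio
from typing import Optional


async def wait_for_choice(future: "asyncio.Future[str]", timeout: float,
                          default: Optional[str] = None) -> Optional[str]:
    """Wait up to `timeout` seconds for a button tap; return `default` if none arrives.

    asyncio.wait_for cancels the future on timeout, so a late tap cannot
    resolve a question the agent has already moved past.
    """
    try:
        return await asyncio.wait_for(future, timeout)
    except asyncio.TimeoutError:
        return default
```

On timeout the caller could re-ask the question as plain text, which keeps the conversation moving even if the interactive message has gone stale.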

---

## References

- [A2UI Protocol (Google/Thesys)](https://github.com/thesys-ai/a2ui) — Structured UI generation for agents
- [Magentic-UI (Microsoft)](https://github.com/microsoft/magentic-ui) — Co-planning and action guards
- [Telegram Bot API - Inline Keyboards](https://core.telegram.org/bots/api#inlinekeyboardmarkup)
- [Discord Components](https://discord.com/developers/docs/interactions/message-components)
- [Slack Block Kit](https://api.slack.com/block-kit)
- [WhatsApp Interactive Messages](https://developers.facebook.com/docs/whatsapp/guides/interactive-messages)
- [Generative UI paradigm discussion](https://www.ag-ui.com/concepts/generative-ui)
- Existing Hermes code: `ExecApprovalView` in `gateway/platforms/discord.py` (line 769+), a working pattern to extend
- Related Hermes issues: #345 (Message Coalescing), #322 (Interactive tool selection)
