Feature: Platform-Native Rich Interactions — Inline Keyboards, Execution Plans & Structured UI Components #503

@teknium1

Overview

Hermes Agent communicates with users on messaging platforms almost entirely through plain text. The only exception is Discord's button-based approval UI for dangerous commands. Meanwhile, Telegram, Discord, Slack, and WhatsApp all offer rich interactive components — inline keyboards, button grids, select menus, action rows, carousels — that go completely unused.

This is a significant missed opportunity. Research into 30+ agent interfaces revealed a key theme: "Step Collapse" — reducing multi-turn text conversations into single structured interactions. Instead of asking "Which model do you want? Here are the options: 1. GPT-4, 2. Claude, 3. Gemini..." and waiting for a text reply, the agent should present a button grid and get a one-tap answer. Instead of describing a complex plan in prose, the agent should present a structured checklist with approve/modify/reject options.

This proposal was inspired by the A2UI (Agent-to-User Interface) protocol by Google/Thesys, Magentic-UI's co-planning model, and the broader Generative UI paradigm.


Research Findings

The Generative UI Paradigm

Three patterns identified in the research:

  1. Static GenUI: Pre-built UI components triggered by the agent (e.g., "show a model selector" → renders a known dropdown). Simplest, most secure.
  2. Declarative GenUI: Agent returns a JSON UI spec, frontend renders appropriate native components. More flexible.
  3. Open-ended GenUI: Agent returns full HTML/iframe. Maximum freedom, but security concerns.

For messaging platforms, Static GenUI is the right fit — the agent triggers predefined interactive components using platform-native APIs.

Platform Capabilities (Currently Unused)

Telegram:

  • Inline keyboards (button grids under messages)
  • Reply keyboards (custom keyboard replacing the default one)
  • Callback queries (button tap handlers with data payloads)
  • Inline mode (typing @botname followed by a query in any chat triggers agent suggestions)
  • Web Apps (mini web apps inside Telegram)
  • Polls (native poll creation)
  • Reactions
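
As a rough illustration of the first three items, the sketch below builds the `reply_markup` object that the Bot API's sendMessage method accepts for an inline keyboard. The helper name `inline_keyboard`, the `model:` data prefix, and the chat ID are hypothetical; only the `inline_keyboard`, `text`, and `callback_data` field names come from the Bot API.

```python
import json


def inline_keyboard(choices, columns=2):
    """Build a Telegram `reply_markup` dict from (label, data) pairs.

    Hypothetical helper for illustration; not part of Hermes Agent.
    """
    rows = []
    for start in range(0, len(choices), columns):
        row = []
        for label, data in choices[start:start + columns]:
            # The Bot API limits callback_data to 1-64 bytes.
            if not 1 <= len(data.encode("utf-8")) <= 64:
                raise ValueError(f"callback_data out of range: {data!r}")
            row.append({"text": label, "callback_data": data})
        rows.append(row)
    return {"inline_keyboard": rows}


markup = inline_keyboard(
    [("GPT-4", "model:gpt4"), ("Claude", "model:claude"), ("Gemini", "model:gemini")]
)
# Placeholder chat_id; a real adapter would use the incoming chat's ID.
payload = {"chat_id": 12345, "text": "Which model do you want?", "reply_markup": markup}
print(json.dumps(payload, indent=2))
```

A tap on one of these buttons arrives as a callback query carrying the button's `callback_data`, which is what the gateway would route back to the agent.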

Discord:

  • Buttons (already used for approval, but only there)
  • Select menus (dropdowns with single/multi select)
  • Modals (popup forms with text inputs)
  • Message components (action rows with mixed button/select)
  • Slash command options (typed parameters with autocomplete)
  • Threads (used, but could be used more strategically)
  • Embeds with structured fields
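
For reference, a minimal sketch of the raw message-component JSON that sits underneath Discord buttons: action rows (type 1) containing buttons (type 2). The helper name `discord_action_rows` and the `approval:` custom IDs are hypothetical, and a library-level View (as the existing ExecApprovalView presumably uses) would normally generate this structure for you.

```python
MAX_BUTTONS_PER_ROW = 5  # Discord allows up to 5 buttons in one action row
MAX_ROWS = 5             # and up to 5 action rows on one message


def discord_action_rows(buttons):
    """Build a Discord `components` list from (label, custom_id) pairs.

    Hypothetical helper for illustration; not part of Hermes Agent.
    """
    if len(buttons) > MAX_BUTTONS_PER_ROW * MAX_ROWS:
        raise ValueError("too many buttons for a single message")
    rows = []
    for start in range(0, len(buttons), MAX_BUTTONS_PER_ROW):
        chunk = buttons[start:start + MAX_BUTTONS_PER_ROW]
        rows.append({
            "type": 1,  # action row
            "components": [
                # type 2 = button, style 2 = secondary (grey)
                {"type": 2, "style": 2, "label": label, "custom_id": custom_id}
                for label, custom_id in chunk
            ],
        })
    return rows


rows = discord_action_rows([
    ("Allow Once", "approval:allow_once"),
    ("Always Allow", "approval:always_allow"),
    ("Deny", "approval:deny"),
])
print(rows)
```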

Slack:

  • Block Kit: sections, dividers, images, actions, inputs, modals
  • Buttons, select menus, date pickers, time pickers
  • Overflow menus
  • Interactive modals with form inputs
  • Workflow steps
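
A minimal sketch of what a multiple-choice question could look like as Block Kit blocks: a section for the question followed by an actions block of buttons. The helper name `slack_choice_blocks`, the `clarify` block ID, and the `choice_N` action IDs are hypothetical; the block and element field names follow Block Kit.

```python
def slack_choice_blocks(question, choices, block_id="clarify"):
    """Build Block Kit blocks for a question with (label, value) choices.

    Hypothetical helper for illustration; not part of Hermes Agent.
    """
    return [
        {"type": "section", "text": {"type": "mrkdwn", "text": question}},
        {
            "type": "actions",
            "block_id": block_id,
            "elements": [
                {
                    "type": "button",
                    "text": {"type": "plain_text", "text": label},
                    # action_id must be unique within the block
                    "action_id": f"choice_{index}",
                    "value": value,
                }
                for index, (label, value) in enumerate(choices)
            ],
        },
    ]


blocks = slack_choice_blocks(
    "Which model do you want?",
    [("GPT-4", "gpt4"), ("Claude", "claude"), ("Gemini", "gemini")],
)
print(blocks)
```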

WhatsApp:

  • Interactive messages: buttons (up to 3), list messages (up to 10 sections)
  • Reply buttons
  • Template messages with quick replies
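
A sketch of an interactive reply-button message body in the shape the WhatsApp Cloud API expects, enforcing the 3-button ceiling mentioned above. The helper name `whatsapp_reply_buttons`, the recipient number, and the `plan:` IDs are placeholders.

```python
def whatsapp_reply_buttons(to, body, choices):
    """Build an interactive button message from (title, id) pairs.

    Hypothetical helper for illustration; not part of Hermes Agent.
    """
    # Reply-button messages carry between 1 and 3 buttons.
    if not 1 <= len(choices) <= 3:
        raise ValueError("WhatsApp reply-button messages take 1 to 3 buttons")
    return {
        "messaging_product": "whatsapp",
        "to": to,
        "type": "interactive",
        "interactive": {
            "type": "button",
            "body": {"text": body},
            "action": {
                "buttons": [
                    {"type": "reply", "reply": {"id": choice_id, "title": title}}
                    for title, choice_id in choices
                ]
            },
        },
    }


msg = whatsapp_reply_buttons(
    "15550000000",  # placeholder recipient number
    "Run this execution plan?",
    [("Execute", "plan:execute"), ("Modify", "plan:modify"), ("Cancel", "plan:cancel")],
)
print(msg)
```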

Execution Plans (Magentic-UI / Windsurf Inspired)

A particularly impactful application of structured UI: execution plans. Before the agent performs a complex multi-step task, it presents a structured plan:

📋 Execution Plan:
1. ☐ Read the existing test file
2. ☐ Analyze the function signatures
3. ☐ Generate test cases for each function
4. ☐ Run tests and fix failures

[▶ Execute] [✏️ Modify] [❌ Cancel]

This addresses a key UX concern: users often don't know what the agent is about to do until it's already doing it. Magentic-UI calls this "Co-Planning" and found it dramatically increased user trust and satisfaction.


Current State in Hermes Agent

Interactive components used today:

  • Discord: ExecApprovalView with Allow Once / Always Allow / Deny buttons (only for dangerous command approval)
  • All other platforms: text-only interaction

Clarify tool: The clarify tool already supports multiple-choice questions, but renders them as numbered text lists, not native buttons. On Telegram, a clarify call with 4 choices sends text like "1. Option A\n2. Option B..." instead of an inline keyboard.

Relevant code:

  • gateway/platforms/base.py: PlatformAdapter base class has send_message but no send_interactive or send_components method
  • gateway/platforms/discord.py — Has ExecApprovalView showing the pattern works
  • tools/approval.py — Approval system that could benefit from native UI on all platforms

Implementation Plan

Skill vs. Tool Classification

This should be a core codebase change. It requires modifications to the platform adapters, the clarify tool, and potentially the approval system. It touches binary/event-driven platform APIs that can't be expressed as shell commands.

What We'd Need

  1. Base adapter extension: Add send_interactive() method to PlatformAdapter with a platform-agnostic component model
  2. Platform-specific renderers: Each adapter translates abstract components to native platform elements
  3. Callback handling: Route button/menu interactions back to the agent
  4. Clarify tool upgrade: Use native components instead of text lists
  5. New present_plan tool or enhancement to todo tool for structured execution plans
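
One possible shape for item 1, sketched under assumptions: the base class gets a default send_interactive() that degrades to a numbered text list, so adapters without native components keep working and only platforms with native UI need to override it. The class below is a stand-in, not the real PlatformAdapter; the `send_message(chat_id, text)` signature and the `RecordingAdapter` test double are guesses made for illustration.

```python
import asyncio
from abc import ABC, abstractmethod


class PlatformAdapter(ABC):
    """Stand-in for the real base class; the signatures here are assumed."""

    @abstractmethod
    async def send_message(self, chat_id: str, text: str) -> None:
        ...

    async def send_interactive(self, chat_id: str, prompt: str, choices: list[str]) -> None:
        """Default behavior: fall back to a numbered text list.

        Adapters with native components would override this method.
        """
        lines = [prompt] + [f"{n}. {choice}" for n, choice in enumerate(choices, 1)]
        await self.send_message(chat_id, "\n".join(lines))


class RecordingAdapter(PlatformAdapter):
    """Test double that records outgoing messages instead of sending them."""

    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    async def send_message(self, chat_id: str, text: str) -> None:
        self.sent.append((chat_id, text))


adapter = RecordingAdapter()
asyncio.run(adapter.send_interactive("chat-1", "Which model?", ["GPT-4", "Claude"]))
print(adapter.sent[0][1])
```

Putting the fallback in the base class means the clarify tool can call send_interactive() unconditionally, and the CLI keeps its current numbered-list behavior for free.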

Component Model (Platform-Agnostic)

# Abstract components that each platform renders natively
ButtonGrid(buttons=[Button(label="GPT-4", data="gpt4"), ...])
SelectMenu(options=[Option(label="High", value="high"), ...], placeholder="Choose effort")
Checklist(items=[CheckItem(text="Read tests", checked=False), ...], actions=["Execute", "Cancel"])
Confirmation(text="Delete 47 files?", confirm="Delete", deny="Cancel")
Poll(question="Which approach?", options=["A: Refactor", "B: Rewrite"])
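
The constructor calls above could be backed by plain dataclasses. The sketch below defines two of them plus a text fallback; the field names mirror the calls above, while the `columns` field, the `Confirm` default label, and the `to_text` function are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    label: str
    data: str


@dataclass(frozen=True)
class ButtonGrid:
    buttons: tuple[Button, ...]
    columns: int = 2  # assumed layout hint; not in the proposal


@dataclass(frozen=True)
class Confirmation:
    text: str
    confirm: str = "Confirm"
    deny: str = "Cancel"


def to_text(component) -> str:
    """Render a component as plain text for platforms without native UI."""
    if isinstance(component, ButtonGrid):
        return "\n".join(f"{n}. {b.label}" for n, b in enumerate(component.buttons, 1))
    if isinstance(component, Confirmation):
        return f"{component.text} [{component.confirm} / {component.deny}]"
    raise TypeError(f"unsupported component: {type(component).__name__}")


grid = ButtonGrid(buttons=(Button(label="GPT-4", data="gpt4"), Button(label="Claude", data="claude")))
confirmation = Confirmation(text="Delete 47 files?", confirm="Delete", deny="Cancel")
print(to_text(grid))
print(to_text(confirmation))
```

Frozen dataclasses keep components immutable once built, which makes it safe to hand the same object to several adapters.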

Phased Rollout

Phase 1: Clarify Tool + Approval Upgrade

  • Modify clarify tool to emit structured choice data (not just text)
  • Each platform adapter renders choices as native components:
    • Telegram: InlineKeyboardMarkup with callback buttons
    • Discord: View with Button components (extend existing pattern)
    • Slack: Block Kit actions with buttons
    • WhatsApp: Interactive button messages (up to 3) or list messages
    • CLI: numbered list with keyboard input (current behavior, enhanced)
  • Upgrade approval flow to use native buttons on Telegram and Slack (already works on Discord)
  • Handle callback routing: platform receives button tap → resolves pending clarify/approval future
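
The callback-routing bullet can be sketched with asyncio futures: the tool registers a pending question and awaits a future, and the platform's interaction handler resolves it when a tap arrives. `CallbackRouter` and its method names are hypothetical, not existing Hermes Agent code.

```python
import asyncio
import uuid


class CallbackRouter:
    """Maps opaque tokens embedded in buttons to pending futures (hypothetical)."""

    def __init__(self) -> None:
        self._pending: dict[str, asyncio.Future] = {}

    def register(self) -> tuple[str, asyncio.Future]:
        """Create a token to embed in button payloads and a future to await."""
        token = uuid.uuid4().hex  # 32 characters, small enough for button payloads
        future = asyncio.get_running_loop().create_future()
        self._pending[token] = future
        return token, future

    def resolve(self, token: str, choice: str) -> bool:
        """Called by the platform handler on a tap; False means stale or unknown."""
        future = self._pending.pop(token, None)
        if future is None or future.done():
            return False
        future.set_result(choice)
        return True


async def demo():
    router = CallbackRouter()
    token, future = router.register()
    # Simulate the platform delivering a button tap shortly afterwards.
    asyncio.get_running_loop().call_later(0.01, router.resolve, token, "claude")
    first = await future
    # A second tap on the same button is ignored.
    stale = router.resolve(token, "gpt4")
    return first, stale


result = asyncio.run(demo())
print(result)
```

Popping the token on first use gives one-shot semantics, so a repeated tap on an old message cannot resolve a question that has already been answered.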

Phase 2: Execution Plans & Structured Outputs

  • New present_plan tool or enhancement to clarify for structured plans
  • Agent can present a numbered plan with approve/modify/cancel actions
  • Render as:
    • Telegram: Message with inline keyboard (Execute / Modify / Cancel)
    • Discord: Embed with action row buttons
    • Slack: Block Kit with sections and action buttons
    • CLI: Formatted plan with input prompt
  • Track plan execution progress (update the plan message with ☑ as steps complete)
  • Integrate with todo tool — present the todo list as interactive UI
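
A minimal sketch of the progress-tracking idea: keep the plan as data, re-render it whenever a step completes, and push the new text through the platform's message-edit operation. `PlanStep` and `ExecutionPlan` are hypothetical names, and the edit call itself is left out because it differs per platform.

```python
from dataclasses import dataclass, field


@dataclass
class PlanStep:
    text: str
    done: bool = False


@dataclass
class ExecutionPlan:
    steps: list[PlanStep] = field(default_factory=list)

    def complete(self, index: int) -> None:
        """Mark the step at a zero-based index as finished."""
        self.steps[index].done = True

    def render(self) -> str:
        """Produce the message text, with ☑ for finished steps and ☐ otherwise."""
        lines = ["📋 Execution Plan:"]
        for number, step in enumerate(self.steps, 1):
            box = "☑" if step.done else "☐"
            lines.append(f"{number}. {box} {step.text}")
        return "\n".join(lines)


plan = ExecutionPlan([
    PlanStep("Read the existing test file"),
    PlanStep("Analyze the function signatures"),
])
plan.complete(0)
print(plan.render())
```

Each adapter would send the rendered text once with the Execute / Modify / Cancel buttons attached, then edit that same message in place as steps finish.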

Phase 3: Rich Agent Outputs

  • Structured data display: tables rendered as platform-native formats
  • Progress cards: show long-running task progress as updating embeds/messages
  • Result summaries: present key findings as structured cards, not prose
  • Quick action suggestions: after completing a task, offer "What's next?" buttons
  • Telegram Web Apps: for complex interactions, open a mini web app inside Telegram
  • Polling: use native polls for user preference collection

Pros & Cons

Pros

  • Dramatic UX improvement — one tap instead of typing "option 2"
  • Reduces conversation turns (Step Collapse)
  • Increases user trust via visible execution plans
  • Uses platform capabilities that are already built and free
  • Makes the agent feel native to each platform, not like a text bot
  • Low-risk: can be rolled out incrementally, starting with clarify

Cons / Risks

  • Platform divergence: each platform has different component capabilities and limits
  • Callback routing adds complexity to the gateway event loop
  • WhatsApp has the most limited interactive components (3 buttons max)
  • CLI doesn't have "buttons" — need graceful degradation to text
  • Agent needs to learn when to use structured UI vs. prose (prompt engineering)
  • Rate limits on interactive messages differ by platform
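
One way to contain the platform-divergence risk is a small policy function that picks the richest component a platform can carry and falls back to text otherwise. The sketch below does this for WhatsApp. The 3-button limit is the figure cited in this issue; the 10-row list limit is an assumption to verify against the WhatsApp documentation, and the function name is hypothetical.

```python
def pick_whatsapp_component(n_choices: int, max_buttons: int = 3, max_list_rows: int = 10) -> str:
    """Choose how to present n_choices options on WhatsApp.

    max_buttons comes from the issue text; max_list_rows is an assumed limit.
    """
    if n_choices <= 0:
        raise ValueError("need at least one choice")
    if n_choices <= max_buttons:
        return "buttons"
    if n_choices <= max_list_rows:
        return "list"
    return "text"  # graceful degradation to a numbered text list


for count in (2, 4, 12):
    print(count, pick_whatsapp_component(count))
```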

Open Questions

  • Should the agent decide when to use structured UI, or should specific tools always produce it?
  • How do we handle platforms with limited component support (WhatsApp: 3 buttons max)?
  • Should execution plans be opt-in (user requests them) or default for complex tasks?
  • Should we support Telegram Web Apps for complex form inputs, or keep it simple with inline keyboards?
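  • How do we handle callback timeouts? (On Telegram, a callback query has to be acknowledged via answerCallbackQuery within a short window after the tap, on the order of ~30 seconds, while the buttons themselves stay attached to the message.)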
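
For the timeout question, one option is to bound the wait on the agent side and fall back to a default answer when no tap arrives. This is a sketch using asyncio.wait_for; the helper name `await_choice`, the very short timeouts, and the default values are illustrative only.

```python
import asyncio


async def await_choice(future: asyncio.Future, timeout: float, default=None):
    """Wait for a button tap, returning `default` if none arrives in time.

    asyncio.wait_for cancels the pending future when the timeout fires.
    """
    try:
        return await asyncio.wait_for(future, timeout)
    except asyncio.TimeoutError:
        return default


async def demo():
    loop = asyncio.get_running_loop()
    never_answered = loop.create_future()
    answered = loop.create_future()
    answered.set_result("execute")
    timed_out = await await_choice(never_answered, 0.01, default="cancel")
    tapped = await await_choice(answered, 0.01)
    return timed_out, tapped


result = asyncio.run(demo())
print(result)
```

Choosing the safe option as the default (for example "cancel" for an approval) keeps an unattended prompt from silently approving a dangerous action.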
