Feature: Voice Mode — Speech Input/Output for CLI and Gateway Platforms #314

@teknium1

Overview

Inspiration: Claude Code's /voice rollout (March 2026) — lets users talk to the coding agent instead of typing, toggled with a slash command.

Add voice input (speech-to-text) and voice output (text-to-speech) support to Hermes Agent across CLI and gateway platforms. Users should be able to speak to the agent and optionally hear responses read aloud.


CLI UX (Primary Target)

Voice mode lives inside the existing CLI terminal experience:

  1. Activation: User types /voice in the Hermes CLI to toggle voice on/off
  2. Status indicator: A persistent banner appears at the top of the prompt area: Voice mode enabled — hold Space to speak
  3. Push-to-talk: While voice mode is on, the input prompt placeholder changes to: > hold space bar to speak. The user holds the Space bar to record; releasing it sends the audio for transcription.
  4. Transcription: Speech is transcribed to text and submitted as a normal user message — the agent processes it identically to typed input
  5. Agent response: The text response streams to the terminal as usual. Optionally, TTS reads the response aloud using the existing text_to_speech tool; this could be a /voice tts sub-toggle.
  6. Deactivation: /voice again to toggle off, returns to normal typing
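
The toggle behaviour in steps 1, 5 and 6 can be sketched as a small state object. This is only an illustration: VoiceState, handle_voice_command and the status strings are hypothetical names, not existing Hermes code, and the rule that turning voice mode off also turns TTS off is an assumption.

```python
from dataclasses import dataclass


@dataclass
class VoiceState:
    """Hypothetical per-session voice settings (not an existing Hermes class)."""
    enabled: bool = False
    tts: bool = False


def handle_voice_command(state: VoiceState, line: str) -> str:
    """Apply '/voice' or '/voice tts' and return a short status string."""
    parts = line.strip().split()
    if not parts or parts[0] != "/voice":
        raise ValueError(f"not a /voice command: {line!r}")
    if len(parts) == 1:
        # Plain '/voice' flips voice mode on or off.
        state.enabled = not state.enabled
        if not state.enabled:
            state.tts = False  # assumption: leaving voice mode stops read-aloud too
        return f"voice: {'on' if state.enabled else 'off'}"
    if parts[1:] == ["tts"]:
        # The TTS sub-toggle only makes sense while voice mode is active.
        if not state.enabled:
            return "voice mode is off; run /voice first"
        state.tts = not state.tts
        return f"tts: {'on' if state.tts else 'off'}"
    return "usage: /voice [tts]"
```

The CLI would call this from its slash-command dispatcher and render the banner and placeholder from state.enabled.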

Implementation Notes

  • Push-to-talk needs raw terminal/keyboard input (prompt_toolkit has key binding support — we already use it for the CLI input)
  • Audio capture via PyAudio or sounddevice, stream to STT provider
  • Visual feedback while recording: waveform animation or pulsing indicator in the terminal (could use rich/textual for this)
  • Space bar hold must NOT conflict with normal typing when voice mode is off
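
One wrinkle worth noting: most terminals deliver key presses (including auto-repeat) but no key-release event, so "hold Space" probably has to be inferred from the stream of repeated presses. The sketch below shows one way to do that with a timeout, and also keeps Space as a normal character when voice mode is off. HoldDetector and the 0.6 s gap are assumptions for illustration; the gap has to be longer than the OS key-repeat delay, and the real wiring into prompt_toolkit key bindings is not shown.

```python
class HoldDetector:
    """Infer press and release of Space from repeated key events.

    Hypothetical helper: call on_space() for every Space event and poll()
    on a timer. A gap longer than release_gap is treated as a release.
    """

    def __init__(self, release_gap: float = 0.6) -> None:
        self.release_gap = release_gap
        self.recording = False
        self._last_event = 0.0

    def on_space(self, now: float, voice_enabled: bool) -> str:
        """Return 'type' (insert a literal space), 'start', or 'continue'."""
        if not voice_enabled:
            return "type"  # voice mode off: Space behaves as normal typing
        self._last_event = now
        if self.recording:
            return "continue"  # auto-repeat while the key is held
        self.recording = True
        return "start"

    def poll(self, now: float) -> bool:
        """Return True exactly once when the key is considered released."""
        if self.recording and now - self._last_event > self.release_gap:
            self.recording = False
            return True
        return False
```

With this shape, 'start' would begin audio capture and a True result from poll() would stop capture and send the audio for transcription.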

Gateway Platforms

  • Telegram: Already receives voice messages natively — transcribe them automatically with STT and process as text. Users already send voice notes; we just need to handle the audio file attachment.
  • Discord: Similar — voice messages come as attachments, transcribe and process
  • WhatsApp: Voice notes are a primary interaction mode, same approach
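
Since all three platforms reduce to "audio attachment in, text out", the gateway side could share one helper. A minimal sketch, assuming each platform adapter can map its message to a neutral Attachment with a MIME type; Attachment, Transcriber and message_text are hypothetical names, and the transcriber is injected so any STT provider can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attachment:
    """Hypothetical platform-neutral attachment produced by a gateway adapter."""
    mime_type: str
    data: bytes


# (audio bytes, MIME type) -> transcript text
Transcriber = Callable[[bytes, str], str]


def message_text(text: Optional[str], attachments: List[Attachment],
                 transcribe: Transcriber) -> str:
    """Return the text to hand to the agent, transcribing audio attachments."""
    parts = [text.strip()] if text and text.strip() else []
    for att in attachments:
        # Only audio is transcribed; images and other files are left alone.
        if att.mime_type.startswith("audio/"):
            transcript = transcribe(att.data, att.mime_type).strip()
            if transcript:
                parts.append(transcript)
    return "\n".join(parts)
```

The result is submitted exactly like a typed message, so the agent loop needs no changes.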

Ideas & Enhancements

  • Agent can already do TTS output (text_to_speech tool exists) — pair with voice input for a full conversational loop
  • Latency matters: voice conversations feel sluggish once the response takes longer than about 2 s, so end-to-end time should stay under that
  • Could adjust system prompt in voice mode to be more concise/conversational
  • Audio cues for tool call confirmations, errors, completion
  • Streaming STT (transcribe while user is still speaking) for lower latency

Open Questions

  • Which STT provider?
    • Local Whisper = no API dependency but needs GPU for speed
    • Deepgram/AssemblyAI = fast streaming, but adds a service dependency
    • Could support multiple backends like we do with LLM providers
  • Should voice mode change the system prompt to be more conversational/concise?
  • How to handle tool call confirmations in voice — audio cues?
  • Do we want full duplex (agent can interrupt/be interrupted) or half-duplex?
  • What are the Python dependencies? (PyAudio has system-level deps like portaudio that complicate cross-platform install)
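
If multiple STT backends are supported, a small registry keeps provider choice a configuration detail and lets heavy dependencies be imported only by the backend that needs them. This is a sketch under assumptions: register_stt, get_stt and the "echo" backend are made-up names, and real backends (local Whisper, Deepgram, AssemblyAI) would be registered the same way with their own client code.

```python
from typing import Callable, Dict

SttBackend = Callable[[bytes], str]  # audio bytes -> transcript

_BACKENDS: Dict[str, SttBackend] = {}


def register_stt(name: str) -> Callable[[SttBackend], SttBackend]:
    """Decorator that registers a transcription function under a name."""
    def decorator(fn: SttBackend) -> SttBackend:
        _BACKENDS[name] = fn
        return fn
    return decorator


def get_stt(name: str) -> SttBackend:
    """Look up a backend by its configured name, with a helpful error."""
    try:
        return _BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS)) or "none"
        raise ValueError(
            f"unknown STT backend {name!r}; registered: {known}"
        ) from None


@register_stt("echo")
def echo_backend(audio: bytes) -> str:
    """Placeholder backend for tests: decodes the bytes as UTF-8 text."""
    return audio.decode("utf-8", errors="replace")
```

A config value such as the backend name would then select the provider at startup via get_stt.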

Suggested Implementation Phases

Phase 1: Gateway voice messages (lowest effort, immediate value)

  • Handle voice message attachments on Telegram/Discord/WhatsApp
  • Transcribe with STT provider and submit as text
  • No CLI changes needed

Phase 2: CLI push-to-talk input

  • /voice slash command to toggle mode
  • Space-bar hold to record via prompt_toolkit key bindings
  • Audio capture → STT → submit as text message
  • Visual recording indicator

Phase 3: TTS response output

  • /voice tts sub-toggle to read responses aloud
  • Use existing text_to_speech infrastructure
  • Concise system prompt variant for voice conversations

Phase 4: Low-latency & streaming

  • Streaming STT for real-time transcription
  • Audio cues for tool calls and confirmations
  • Latency optimization (<2s target)
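
To work toward the <2s target it helps to know which stage is spending the time. Below is a minimal per-stage timer; LatencyBudget and the stage names are hypothetical, and the clock is injectable only so the helper can be tested deterministically.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class LatencyBudget:
    """Accumulate per-stage wall-clock time and compare it to a budget."""

    def __init__(self, budget_s: float = 2.0,
                 clock: Callable[[], float] = time.perf_counter) -> None:
        self.budget_s = budget_s
        self.stages: Dict[str, float] = {}
        self._clock = clock

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time the enclosed block and add it to the named stage."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    @property
    def total(self) -> float:
        return sum(self.stages.values())

    def over_budget(self) -> bool:
        return self.total > self.budget_s
```

Usage would look like wrapping each step (for example "stt", "llm", "tts") in `with budget.stage("stt"):` and logging budget.stages when over_budget() is true.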
