Feature: Voice Mode — Speech Input/Output for CLI and Gateway Platforms

## Overview

**Inspiration:** Claude Code's `/voice` rollout (March 2026) — lets users talk to the coding agent instead of typing, toggled with a slash command.

Add voice input (speech-to-text) and voice output (text-to-speech) support to Hermes Agent across CLI and gateway platforms. Users should be able to speak to the agent and optionally hear responses read aloud.

---

## CLI UX (Primary Target)

Voice mode lives inside the existing CLI terminal experience:

1. **Activation:** User types `/voice` in the Hermes CLI to toggle voice on/off
2. **Status indicator:** A persistent banner appears at the top of the prompt area: `Voice mode enabled — hold Space to speak`
3. **Push-to-talk:** User holds the Space bar to record. Releasing sends the audio for transcription. The input prompt placeholder changes to: `> hold space bar to speak`
4. **Transcription:** Speech is transcribed to text and submitted as a normal user message — the agent processes it identically to typed input
5. **Agent response:** Text response streams to the terminal as usual. Optionally, TTS reads the response aloud using the existing `text_to_speech` tool; this could be controlled by a `/voice tts` sub-toggle.
6. **Deactivation:** Typing `/voice` again toggles voice mode off and returns to normal typing.
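
The toggle behaviour in the steps above could be modelled as a small piece of session state. This is only a sketch: `VoiceState` and `handle_voice_command` are hypothetical names, and the rule that `/voice tts` only works while voice mode is on is an assumption, not something decided in this issue.

```python
from dataclasses import dataclass


@dataclass
class VoiceState:
    """Hypothetical per-session voice settings (names are illustrative)."""
    enabled: bool = False  # push-to-talk input active
    tts: bool = False      # read responses aloud


def handle_voice_command(state: VoiceState, line: str) -> str:
    """Apply a `/voice` or `/voice tts` command and return a status message."""
    parts = line.strip().split()
    if not parts or parts[0] != "/voice":
        raise ValueError(f"not a /voice command: {line!r}")

    if len(parts) == 1:
        # Bare `/voice` toggles the whole mode; turning it off also stops TTS.
        state.enabled = not state.enabled
        if not state.enabled:
            state.tts = False
            return "Voice mode disabled"
        return "Voice mode enabled"

    if parts[1:] == ["tts"]:
        # Assumption: spoken output is a sub-mode of voice mode.
        if not state.enabled:
            return "Enable voice mode first with /voice"
        state.tts = not state.tts
        return f"Spoken responses {'on' if state.tts else 'off'}"

    return f"Unknown /voice option: {' '.join(parts[1:])}"
```

Keeping the state in one object would let the prompt banner, the key bindings, and the TTS hook all read the same flags.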

### Implementation Notes

- Push-to-talk needs raw terminal/keyboard input (`prompt_toolkit` has key binding support — we already use it for the CLI input)
- Audio capture via `PyAudio` or `sounddevice`, stream to STT provider
- Visual feedback while recording: waveform animation or pulsing indicator in the terminal (could use `rich`/`textual` for this)
- Space bar hold must **NOT** conflict with normal typing when voice mode is off
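
One complication worth flagging for push-to-talk: many terminals report key presses (including auto-repeat while a key is held) but no key-release event. If that holds for our targets, release would have to be inferred from a gap in the repeated Space events. A minimal sketch of that idea follows; `HoldDetector` is a hypothetical helper, and the `release_gap` default is a guess that would need tuning against the OS initial key-repeat delay.

```python
from typing import Optional


class HoldDetector:
    """Infer press/release of a held key from auto-repeated key events.

    Assumption: the terminal delivers repeated Space characters while the key
    is held and nothing on release, so release is inferred from a gap.
    """

    def __init__(self, release_gap: float = 0.6) -> None:
        # Seconds without a Space event that count as "released".
        # Must exceed the OS initial repeat delay or holds will be cut short.
        self.release_gap = release_gap
        self._last_seen: Optional[float] = None

    @property
    def holding(self) -> bool:
        return self._last_seen is not None

    def on_key(self, now: float) -> bool:
        """Record a Space event; return True if this starts a new hold."""
        started = self._last_seen is None
        self._last_seen = now
        return started

    def poll(self, now: float) -> bool:
        """Call periodically; return True once when the hold is considered over."""
        if self._last_seen is not None and now - self._last_seen >= self.release_gap:
            self._last_seen = None
            return True
        return False
```

The Space key binding would call `on_key(time.monotonic())` only while voice mode is enabled, which also keeps Space behaving normally when voice mode is off. A timer would call `poll()` to decide when to stop recording and send the audio.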

---

## Gateway Platforms

- **Telegram:** Users already send voice notes, and the platform delivers them natively as audio file attachments. We just need to transcribe them automatically with STT and process the result as text.
- **Discord:** Similar: voice messages arrive as attachments; transcribe them and process the result as text.
- **WhatsApp:** Voice notes are a primary interaction mode, same approach
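
Because all three platforms reduce to "audio attachment in, text out", the gateway side could share one helper. The sketch below assumes each gateway adapter can normalize its payload into a hypothetical `Attachment` with a MIME type and raw bytes; real platform payloads differ, and `transcribe` stands in for whichever STT provider is chosen.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attachment:
    """Hypothetical normalized attachment; not an actual gateway type."""
    mime_type: str
    data: bytes


# Stand-in signature for an STT call: (audio bytes, MIME type) -> transcript.
Transcriber = Callable[[bytes, str], str]


def message_text(
    text: Optional[str],
    attachments: List[Attachment],
    transcribe: Transcriber,
) -> str:
    """Merge typed text with transcripts of any audio attachments."""
    parts = [text] if text else []
    for att in attachments:
        # Only audio is transcribed; images and other files are left alone.
        if att.mime_type.startswith("audio/"):
            transcript = transcribe(att.data, att.mime_type).strip()
            if transcript:
                parts.append(transcript)
    return "\n".join(parts)
```

The returned string would then be submitted as a normal user message, so the agent needs no changes to handle voice notes.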

---

## Ideas & Enhancements

- Agent can already do TTS output (`text_to_speech` tool exists) — pair with voice input for a full conversational loop
- Latency matters — voice conversations feel bad above ~2s response time
- Could adjust system prompt in voice mode to be more concise/conversational
- Audio cues for tool call confirmations, errors, completion
- Streaming STT (transcribe while user is still speaking) for lower latency
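
To see where time goes relative to the ~2s figure, each stage of a voice turn could be timed separately. This is a rough sketch: `LatencyBudget` is a hypothetical helper, and it assumes the budget covers everything from end of speech to the start of the response, which this issue does not pin down.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class LatencyBudget:
    """Accumulate per-stage wall time and compare the total to a budget."""

    def __init__(
        self,
        budget_s: float = 2.0,
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self.budget_s = budget_s
        self._clock = clock  # injectable so tests can use a fake clock
        self.stages: Dict[str, float] = {}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        """Time the enclosed block and add it to the named stage."""
        start = self._clock()
        try:
            yield
        finally:
            elapsed = self._clock() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    @property
    def total(self) -> float:
        return sum(self.stages.values())

    def over_budget(self) -> bool:
        return self.total > self.budget_s
```

Wrapping the STT call, the agent call, and TTS start-up in `with budget.stage("stt"):` style blocks would show which stage to optimize first.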

---

## Open Questions

- **Which STT provider?**
  - Local Whisper: no API dependency, but likely needs a GPU for acceptable speed
  - Deepgram/AssemblyAI: fast streaming, but adds an external service dependency
  - Could support multiple backends like we do with LLM providers
- Should voice mode change the system prompt to be more conversational/concise?
- How to handle tool call confirmations in voice — audio cues?
- Do we want full duplex (agent can interrupt/be interrupted) or half-duplex?
- What are the Python dependencies? (`PyAudio` has system-level deps like `portaudio` that complicate cross-platform install)
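
If we go the multiple-backends route, one option is a small registry behind a shared interface, so the provider becomes a config value. The sketch below is illustrative only: `STTBackend`, `register_backend`, and `get_backend` are hypothetical names, and `EchoBackend` is a test stand-in rather than a real provider integration.

```python
from typing import Callable, Dict, Protocol


class STTBackend(Protocol):
    """Interface every speech-to-text backend would implement."""

    def transcribe(self, audio: bytes, mime_type: str) -> str: ...


# Maps a config name to a zero-argument factory for the backend.
_BACKENDS: Dict[str, Callable[[], STTBackend]] = {}


def register_backend(name: str):
    """Class decorator that registers a backend under `name`."""
    def decorator(factory):
        _BACKENDS[name] = factory
        return factory
    return decorator


def get_backend(name: str) -> STTBackend:
    """Instantiate the backend registered under `name`."""
    try:
        factory = _BACKENDS[name]
    except KeyError:
        known = ", ".join(sorted(_BACKENDS)) or "none"
        raise ValueError(
            f"unknown STT backend {name!r} (registered: {known})"
        ) from None
    return factory()


@register_backend("echo")
class EchoBackend:
    """Stand-in for tests; a real backend would call Whisper, Deepgram, etc."""

    def transcribe(self, audio: bytes, mime_type: str) -> str:
        return audio.decode("utf-8", errors="replace")
```

Heavy optional dependencies (a local model, an audio library) could then be imported inside each backend's constructor, so users only install what their chosen backend needs.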

---

## Suggested Implementation Phases

### Phase 1: Gateway voice messages (lowest effort, immediate value)
- Handle voice message attachments on Telegram/Discord/WhatsApp
- Transcribe with STT provider and submit as text
- No CLI changes needed

### Phase 2: CLI push-to-talk input
- `/voice` slash command to toggle mode
- Space-bar hold to record via `prompt_toolkit` key bindings
- Audio capture → STT → submit as text message
- Visual recording indicator

### Phase 3: TTS response output
- `/voice tts` sub-toggle to read responses aloud
- Use existing `text_to_speech` infrastructure
- Concise system prompt variant for voice conversations

### Phase 4: Low-latency & streaming
- Streaming STT for real-time transcription
- Audio cues for tool calls and confirmations
- Latency optimization (<2s target)

