Skip to content

feat: Voice Mode — CLI, Telegram, Discord (text + VC), and Web UI with full voice support (Issue #314)#327

Merged
teknium1 merged 91 commits into
NousResearch:mainfrom
0xbyt4:feature/voice-mode
Mar 14, 2026
Merged

feat: Voice Mode — CLI, Telegram, Discord (text + VC), and Web UI with full voice support (Issue #314)#327
teknium1 merged 91 commits into
NousResearch:mainfrom
0xbyt4:feature/voice-mode

Conversation

@0xbyt4

@0xbyt4 0xbyt4 commented Mar 3, 2026

Copy link
Copy Markdown
Contributor

Summary

Scope expanded since initial PR: Now includes Discord voice channels
(join/listen/speak), Telegram/Discord auto voice reply, Web UI gateway
with browser voice chat, and cross-platform double TTS prevention.
See comments below for incremental updates.

Implements Voice Mode for the Hermes CLI (Issue #314, Phases 2-5). Users can speak to the agent via microphone and optionally hear responses read aloud via TTS — with sentence-by-sentence streaming for ElevenLabs.

Note: Phase 1 (Gateway voice messages) was already implemented — Telegram, Discord, WhatsApp, and Slack all handle incoming voice messages with auto-transcription.

What's New

Phase 2: CLI Voice Input

  • /voice slash command to toggle voice mode on/off
  • Ctrl+R to start/stop recording (toggle, not hold-to-talk)
  • Audio capture via sounddevice + numpy (optional deps via pip install hermes-agent[voice])
  • Multi-provider STT: OpenAI Whisper (VOICE_TOOLS_OPENAI_KEY) and Groq Whisper (GROQ_API_KEY) with automatic model correction per provider
  • Visual recording indicator: Real-time audio level bar in prompt (● ▃ ❯)
  • Transcribed text is submitted as a normal user message — agent processes it identically to typed input

Phase 3: TTS Response Output

  • /voice tts sub-toggle to read agent responses aloud
  • Uses existing text_to_speech tool infrastructure
  • Markdown stripping for TTS (removes code blocks, URLs, formatting)
  • Voice system prompt appended when voice mode is active — instructs the model to keep responses concise and conversational (2-3 sentences max)

Phase 4: Low-Latency Features

  • Silence detection: Auto-stops recording after configurable seconds of silence (default 3s). Uses RMS-based speech detection with micro-pause tolerance for natural speech patterns
  • Continuous mode: After the agent responds, recording auto-restarts so the user can keep talking hands-free. Ctrl+R exits continuous mode
  • Audio cues: 880Hz beep on record start, 660Hz double-beep on stop, 1200Hz tick on tool execution
  • TTS interrupt: Pressing Ctrl+R while TTS is playing stops playback and starts recording
  • Interruptable playback: TTS uses subprocess.Popen (not run) so stop_playback() can terminate it
  • Configurable params: voice.silence_threshold and voice.silence_duration in config.yaml
  • Whisper hallucination guard: Two-layer protection — peak RMS check rejects silent recordings before STT, then known hallucination phrases ("Thank you.", "Bye.", etc.) are filtered after STT
  • Peak RMS check: Uses maximum chunk RMS instead of overall average to avoid discarding recordings where short speech is diluted by trailing silence

Phase 5: Streaming TTS (ElevenLabs)

  • Sentence-by-sentence audio streaming — audio starts playing within ~1-2s of agent starting to respond, instead of waiting for the full response
  • Architecture: LLM tokens → stream_callback → text_queue → sentence buffer → ElevenLabs pcm_24000sounddevice.OutputStream → speaker
  • Sentence buffering: Accumulates tokens until sentence boundary (. ! ? \n\n), with 20-char minimum to merge short fragments and 100-char timeout flush for long sentences without punctuation
  • Think block filtering: <think>...</think> content is stripped in real-time so reasoning tokens are never spoken
  • Markdown stripping: Code blocks, URLs, bold, italic, headers, list items cleaned before TTS
  • Streaming API integration: run_agent._interruptible_api_call() uses stream=True when callback is set, accumulates chunks into a mock ChatCompletion response (same interface as non-streaming)
  • Graceful fallback: Only ElevenLabs gets streaming. Edge TTS and OpenAI TTS keep batch behavior. When elevenlabs or sounddevice is not installed, falls back to batch TTS automatically
  • Zero impact on non-voice mode: stream_callback defaults to None, API calls stay non-streaming
  • Low-latency model: Uses eleven_flash_v2_5 (~75ms latency) by default, configurable via tts.elevenlabs.streaming_model_id in config.yaml

Design Decisions

Why Ctrl+R toggle instead of Space-bar hold-to-talk?

The issue suggested hold-Space, but prompt_toolkit doesn't support key-up events, making hold-to-talk infeasible. Ctrl+R toggle combined with silence detection provides a better UX — the user presses once, speaks naturally, and recording auto-stops when they're done. No need to hold anything.

Why not streaming STT?

OpenAI Whisper API and Groq Whisper API don't support streaming transcription. Adding a streaming provider (Deepgram/AssemblyAI) would require new service dependencies. The current approach (record → auto-stop on silence → transcribe) is reliable and keeps the dependency footprint small.

Why only ElevenLabs for streaming TTS?

ElevenLabs returns raw PCM chunks that can be written directly to sounddevice.OutputStream for zero-copy playback. Edge TTS is async and outputs MP3 (needs decoding), OpenAI TTS returns complete files. The streaming architecture requires chunk-by-chunk audio iteration which only ElevenLabs supports natively.

CoreAudio safety

On macOS, sd.play() (beep) and sd.InputStream (recording) conflict when running simultaneously (PaMacCore error). Beeps are played synchronously BEFORE starting the recording stream to avoid this.

Quick Usage

/voice on          # Enable voice mode
Ctrl+R             # Start recording (one press, no need to hold)
                   # Speak naturally — recording auto-stops on 3s silence
                   # Transcript is submitted as text, agent responds
                   # In continuous mode, recording auto-restarts after response
Ctrl+R             # Stop recording & exit continuous mode
/voice tts         # Toggle TTS (agent reads responses aloud)
/voice off         # Disable voice mode

Streaming TTS Setup (optional):

pip install hermes-agent[voice,tts-premium]  # adds elevenlabs + sounddevice
export ELEVENLABS_API_KEY="sk-..."

In ~/.hermes/config.yaml:

tts:
  provider: elevenlabs
  elevenlabs:
    voice_id: pNInz6obpgDQGcFmaJgB          # Adam (default)
    streaming_model_id: eleven_flash_v2_5     # low-latency model

STT Provider: Tested with Groq Whisper (GROQ_API_KEY), also supports OpenAI Whisper (VOICE_TOOLS_OPENAI_KEY). Groq's free tier works well for this.

Install: pip install hermes-agent[voice] (adds sounddevice + numpy)

Files Changed

File Change
tools/voice_mode.py New — AudioRecorder, silence detection, beep generation, hallucination filter, interruptable playback
tools/tts_tool.py Added stream_tts_to_speaker() — sentence buffer, think block filter, markdown stripping, ElevenLabs PCM streaming to sounddevice
tools/transcription_tools.py Added Groq STT provider, auto model correction, multi-provider resolution
run_agent.py Added stream_callback param to run_conversation()/chat(), streaming path in _interruptible_api_call() with mock ChatCompletion assembly
cli.py Voice mode integration — key bindings, continuous mode, audio level UI, streaming TTS pipeline wiring, batch TTS skip when streaming active
hermes_cli/config.py Voice config defaults (silence_threshold, silence_duration)
hermes_cli/commands.py /voice command registration
pyproject.toml [voice] optional dependency group
.env.example GROQ_API_KEY documentation
tests/tools/test_voice_mode.py 34 tests — recorder, silence detection, beep, hallucination filter, playback, cleanup
tests/tools/test_transcription_tools.py 12 tests — multi-provider STT, model correction
tests/tools/test_voice_cli_integration.py 14 tests — markdown stripping, command parsing, thread safety

Test Plan

  • 77 unit tests pass (test_voice_mode.py + test_transcription_tools.py + test_voice_cli_integration.py)
  • 88 agent tests pass (test_cli_init.py + test_run_agent.py)
  • Manual testing: /voice on → Ctrl+R → speak → auto-stop → transcription → agent response
  • Manual testing: continuous mode auto-restart after agent responds
  • Manual testing: /voice tts reads responses aloud
  • Manual testing: Ctrl+R stops recording even while agent is running
  • Manual testing: Ctrl+R interrupts TTS playback
  • Manual testing: Streaming TTS — audio starts playing before full response appears
  • Manual testing: ElevenLabs API key + sounddevice end-to-end pipeline verified
  • Fallback: Edge TTS batch mode works when ElevenLabs not configured

Closes #314

@0xbyt4 0xbyt4 changed the title feat: Voice Mode for CLI — Speech Input/Output (Issue #314) feat: Voice Mode for CLI — Speech Input/Output + Streaming TTS (Issue #314) Mar 3, 2026
@teknium1

teknium1 commented Mar 9, 2026

Copy link
Copy Markdown
Contributor

Really cool feature concept — voice mode would be a great addition to Hermes. A few concerns about cross-platform/environment compatibility before this could be merged:

Audio dependency fragility (sounddevice + PortAudio):

  • SSH sessions — no audio device available, PortAudio will crash/error on import
  • WSL2 — no native audio subsystem; needs a PulseAudio bridge to Windows that most users won't have
  • PuTTY / headless servers / Docker containers — no audio devices, sd.InputStream() and sd.play() will throw
  • No mic plugged in — PortAudio can fail even on desktop machines without input devices
  • This is the biggest concern — Hermes runs in all of these environments daily. If sounddevice is imported at module level or eagerly, it would break Hermes for anyone not on a local desktop with audio hardware

Key binding conflict:

  • Ctrl+R is the standard reverse-history-search binding in readline/prompt_toolkit. Overriding it would break muscle memory for a lot of CLI users. This should be configurable, or use a different binding

Core agent loop changes:

  • The PR modifies _interruptible_api_call() in run_agent.py to add streaming support. Changing the core agent loop for a feature that only works in specific environments is risky — any bug here affects every user, not just voice mode users

What we'd need to feel comfortable merging:

  1. Fully lazy importssounddevice, numpy, elevenlabs must never be imported until voice mode is explicitly activated. Any import failure should be caught and surface a friendly message, not crash
  2. Graceful degradation at every audio touchpoint — every call to sd.play(), sd.InputStream(), sd.OutputStream() needs to be wrapped so it fails silently or with a warning in non-audio environments
  3. No core agent loop changes — the streaming path should be implemented without modifying _interruptible_api_call(). Consider wrapping the response after the fact rather than changing how the API call works
  4. Configurable key binding — don't override Ctrl+R by default
  5. Environment detection — auto-disable voice features when running over SSH, in containers, or without audio devices, rather than crashing

The feature itself is genuinely exciting — just needs to be bulletproof for the environments Hermes runs in. Happy to review again once these are addressed!

@0xbyt4

0xbyt4 commented Mar 9, 2026

Copy link
Copy Markdown
Contributor Author

hi @teknium1 thank you for review and addressed issues solved:

  1. Lazy importssounddevice, numpy, elevenlabs, edge_tts, openai are never imported at module level. Each has a lazy helper
    (_import_audio(), _import_edge_tts(), etc.) called only when voice mode is activated.

  2. Graceful degradation — Every audio call (sd.InputStream, sd.play, sd.OutputStream, sd.stop, sd.query_devices) is wrapped in try-except
    with friendly error messages.

  3. No core agent loop changes_interruptible_api_call() is untouched. Streaming lives in a separate _streaming_api_call() method that only
    runs when voice TTS is active.

  4. Configurable key binding — Default is Ctrl+B (not Ctrl+R). Configurable via voice.record_key in config.yaml.

  5. Environment detectiondetect_audio_environment() auto-detects SSH, Docker, WSL, and missing audio devices. Voice features are disabled with
    warnings instead of crashing.

@teknium1

Copy link
Copy Markdown
Contributor

Thanks for addressing the original feedback — lazy imports, graceful degradation, Ctrl+B default, and environment detection all look good. A few remaining issues before we can merge:

Bug A: Streaming TTS can never activate (stale imports)

cli.py:3447-3448 imports _HAS_ELEVENLABS and _HAS_AUDIO from tools.tts_tool:

from tools.tts_tool import (
    _load_tts_config as _load_tts_cfg,
    _get_provider as _get_prov,
    _HAS_ELEVENLABS as _el_ok,
    _HAS_AUDIO as _audio_ok,
    stream_tts_to_speaker,
)

These module-level booleans no longer exist — they were removed when you switched to lazy import functions (_import_elevenlabs(), etc.). The try/except Exception: pass wrapper silently swallows the ImportError, so use_streaming_tts is always False. The entire Phase 5 streaming TTS pipeline is dead code right now.

Fix: Replace the boolean checks with calls to the lazy import helpers, e.g.:

try:
    from tools.tts_tool import (
        _load_tts_config as _load_tts_cfg,
        _get_provider as _get_prov,
        _import_elevenlabs,
        _import_sounddevice,
        stream_tts_to_speaker,
    )
    _tts_cfg = _load_tts_cfg()
    _el_ok = False
    _audio_ok = False
    try:
        _import_elevenlabs()
        _el_ok = True
    except ImportError:
        pass
    try:
        _import_sounddevice()
        _audio_ok = True
    except (ImportError, OSError):
        pass
    if _get_prov(_tts_cfg) == "elevenlabs" and _el_ok and _audio_ok:
        use_streaming_tts = True
except Exception:
    pass

Bug B: Voice mode system prompt is a no-op

_enable_voice_mode() appends the "[Voice mode active] keep responses concise..." instruction to self.system_prompt on HermesCLI. But the agent's ephemeral_system_prompt is set once during _init_agent() and is never re-read from the CLI object. Changing self.system_prompt mid-session has no effect on the agent's behavior — the concise-response instruction never reaches the model.

Important: Even if you fix the propagation, modifying the system prompt mid-conversation would break prompt caching (the cache prefix becomes invalid). This is a core policy — see AGENTS.md.

Suggested fix: Instead of modifying the system prompt, inject the voice mode instruction as a user message prefix when voice input is submitted. Something like prepending [Voice input] to the transcribed text, or adding a brief instruction in the user message itself. This keeps the system prompt stable and avoids cache invalidation.

Bug C: Branch needs rebase

The branch is ~20 commits behind main (merge base is c21d77c). Please rebase onto current main before resubmitting.

Minor concern: _vprint suppresses error messages

The _vprint() changes in run_agent.py suppress ALL console output when streaming TTS is active — including API error messages, retry info, and context limit warnings. Consider only suppressing informational/progress prints, not error-level messages. For example, keep direct print() for lines containing ❌ or ⚠️ error conditions.

@0xbyt4 0xbyt4 force-pushed the feature/voice-mode branch 2 times, most recently from e03e481 to b16527c Compare March 10, 2026 15:22
@0xbyt4

0xbyt4 commented Mar 10, 2026

Copy link
Copy Markdown
Contributor Author

IMG_1401

Update: Telegram Gateway Voice Mode + Critical Bug Fix

Bug Fix: _keep_typing session deadlock (c785253)

Found and fixed a critical bug in BasePlatformAdapter._process_message_background():

  • _keep_typing() was called with metadata=_thread_metadata but didn't accept that parameter
  • The TypeError crashed before the try-finally block, so _active_sessions was never cleaned up
  • Every subsequent message saw the session as "active" and went into the interrupt path — effectively deadlocking the entire chat
  • Fix: added metadata parameter to _keep_typing(), send_typing() base class, and SignalAdapter.send_typing()

Feature: /voice command for Telegram gateway (3f63462)

Auto voice reply mode for the Telegram bot:

  • /voice on — voice reply only when user sends voice messages
  • /voice tts — voice reply to all messages (text + voice)
  • /voice off — disable, text-only replies
  • /voice status — show current mode
  • /voice (no args) — toggle on/off
  • Per-chat state persisted to ~/.hermes/gateway_voice_mode.json
  • Dedup: skips auto-reply if agent already called text_to_speech tool
  • drop_pending_updates=True added to ignore stale Telegram messages on restart
  • 25 tests, all passing (518 total gateway tests, 0 regressions)

@0xbyt4

0xbyt4 commented Mar 10, 2026

Copy link
Copy Markdown
Contributor Author
Screenshot 2026-03-11 at 00 28 35 ## Update: Discord Voice Mode + Cross-Platform Fix

Discord /voice slash command (d79a8e6)

  • Registered /voice as a Discord slash command with dropdown choices (on, tts, off, status)
  • Same voice reply logic as Telegram — no code duplication

Cross-platform send_voice fix

  • _send_voice_reply() was passing metadata= kwarg to all adapters, but Discord's send_voice() doesn't accept it
  • Now inspects the adapter method signature at runtime and only passes metadata if supported
  • Works correctly on Telegram (metadata supported), Discord (metadata skipped), and Slack (metadata supported)

TTS provider note

  • ElevenLabs free tier blocks requests through VPN — switched to edge-tts (free, no API key, no VPN issues) as a fallback provider

@0xbyt4 0xbyt4 force-pushed the feature/voice-mode branch 3 times, most recently from 11d431f to 44d661f Compare March 11, 2026 12:50
@0xbyt4

0xbyt4 commented Mar 11, 2026

Copy link
Copy Markdown
Contributor Author
IMG_1405 ## Update: Discord Voice Channel Support + Documentation

Phase 1: Bot joins VC and speaks replies (f83b1f4)

  • /voice join — bot joins the user's current voice channel
  • /voice channel — alias for join
  • /voice leave — bot disconnects from VC
  • TTS replies are played directly in the voice channel via Opus encoding
  • Echo prevention: audio listener pauses while bot is speaking
  • Only DISCORD_ALLOWED_USERS can interact via voice

Phase 2: Bot listens in VC — full STT pipeline (a5a0ded)

Complete voice-to-voice loop: user speaks in VC → STT → agent → TTS → VC playback

  • VoiceReceiver class captures per-user RTP audio packets
  • Decrypts NaCl transport encryption + DAVE E2E encryption
  • Per-user Opus decoders (48kHz stereo → PCM)
  • Silence detection: 1.5s silence after 0.5s speech triggers processing
  • PCM → 16kHz mono WAV conversion via ffmpeg
  • Whisper STT transcription (Groq or OpenAI)
  • Transcripts appear in text channel: [Voice] @user: what they said
  • Agent response sent as text AND spoken in VC

Bug fixes during integration:

  • Adapter dict key: "discord"Platform.DISCORD enum
  • Local import shadowing top-level Platform causing UnboundLocalError
  • Synthetic voice events missing raw_message.guild_id for _get_guild_id()

Documentation (1175f16, 44d661f)

New comprehensive voice mode doc: website/docs/user-guide/features/voice-mode.md

  • Prerequisites — hermes install, LLM config, first run
  • CLI Voice Modehermes startup, /voice commands, Ctrl+B flow, silence detection, streaming TTS, hallucination filter
  • Gateway Voice Reply — Telegram & Discord /voice commands, modes, platform delivery formats
  • Discord Voice Channels — full setup guide (bot permissions with OAuth2 URL, privileged intents, opus codec, env vars), commands, 10-step pipeline explanation, text channel integration, echo prevention, access control
  • Configuration Reference — config.yaml, env vars, STT/TTS provider comparisons
  • Troubleshooting — common issues and fixes

Rebase

Rebased onto latest main (75 commits), resolved 4 conflict areas (commands registry, test expected commands, run.py gateway commands + voice reply). All tests passing.

@0xbyt4 0xbyt4 force-pushed the feature/voice-mode branch 2 times, most recently from dfb8595 to deaf36e Compare March 11, 2026 17:37
@0xbyt4

0xbyt4 commented Mar 11, 2026

Copy link
Copy Markdown
Contributor Author
Screenshot 2026-03-11 at 22 24 56 IMG_1419 ## Update: Web UI Gateway + Double TTS Fixes

Web Gateway — Browser-based Chat UI

Full-featured browser chat interface accessible from any device on the network:

  • WebSocket-based real-time messaging over ws://
  • Token authentication — configurable via WEB_UI_TOKEN env var
  • Voice conversation — browser mic recording with VAD silence detection
  • Invisible TTS playback — audio plays without chat bubble
  • Futuristic UI — glassmorphism design, purple theme, glow effects
  • Media support — images, voice bubbles with waveform player
  • /remote-control command — start Web UI on demand from any platform
  • LAN access — auto-detects local IPs, shows all access URLs on startup
  • Toolsethermes-web registered with full tool access

Double TTS Prevention

Fixed duplicate audio playback across all platforms. Two independent TTS paths were firing for the same message:

  1. Base adapter auto-TTS (play_tts) — for voice input messages
  2. Gateway runner _send_voice_reply — for voice mode enabled chats

Fixes:

  • send_voice(**kwargs) — Discord and Slack adapters now accept extra keyword arguments
  • skip_double guard — runner skips voice reply for voice input (base already handled it)
  • Discord VC exception — when bot is in voice channel, runner handles VC playback directly
  • Discord play_tts override — skips file attachment when connected to voice channel
Platform Voice Input Text + /voice tts Discord VC
Base auto-TTS fires skip skip (VC override)
Runner voice reply skip fires fires (VC playback)
Result 1 audio 1 audio 1 audio (VC)

Documentation Updates

  • Discord DMs — DM vs server channel interaction, @mention requirement, DISCORD_REQUIRE_MENTION config
  • macOS firewall — allow Python through firewall for LAN access
  • Mobile HTTPS — mic requires HTTPS on mobile; documented workarounds (Android Chrome flag, mkcert, Caddy, SSH tunnel)

Tests

  • 32 tests for Web adapter (config, auth, messaging, media, LAN IP)
  • 32 tests for voice command (full platform x input x mode matrix, Discord VC skip, Web play_audio)
  • 3489 tests passing, 0 failures

@0xbyt4 0xbyt4 changed the title feat: Voice Mode for CLI — Speech Input/Output + Streaming TTS (Issue #314) feat: Voice Mode — CLI, Telegram, Discord (text + VC), and Web UI with full voice support (Issue #314) Mar 11, 2026
@0xbyt4 0xbyt4 force-pushed the feature/voice-mode branch 4 times, most recently from fd10f94 to 9837473 Compare March 13, 2026 01:08
@teknium1

Copy link
Copy Markdown
Contributor

Code Review — Round 3

First off, really impressive work here — the scope and quality of the CLI voice integration, streaming TTS pipeline, and Discord VC implementation show real engineering skill. The thread safety patterns, lazy imports, and graceful degradation are all well done. Bugs A and B from the previous review are properly fixed with regression tests. 👏

That said, there are several issues that need to be addressed before this can merge:


🔴 Blocking Issues

1. Rebase Regression: _interruptible_api_call lost Anthropic interrupt support

During the rebase, the Anthropic-aware interrupt handler was accidentally moved FROM _interruptible_api_call INTO _streaming_api_call. The result is that _interruptible_api_call now has a simplified handler that:

  • Calls self.client.close() (OpenAI only)
  • Rebuilds self.client = OpenAI(...)
  • Never checks for api_mode == "anthropic_messages"
  • Never closes self._anthropic_client

This affects ALL Anthropic users on interrupt (not just voice users). On interrupt: wrong client closed, wrong client rebuilt, token generation continues on the Anthropic side. This is a critical regression.

2. Web Gateway — Path Traversal in File Uploads

The upload handler uses the user-supplied filename without sanitization:

orig_name = field.filename or "file"
filename = f"upload_{uuid.uuid4().hex[:8]}_{orig_name}"
dest = self._media_dir / filename

If orig_name contains ../, files can be written outside media_dir. Fix: use Path(orig_name).name or os.path.basename().

3. Web Gateway — Unauthenticated /media Route

The static file serving via aiohttp.add_static() has no auth check. Anyone on the network can access uploaded files, voice recordings, and images by guessing the UUID-prefixed filenames. Media should be served through an authenticated handler.

4. Web Gateway — XSS via innerHTML

Bot messages are rendered via marked.parse() + innerHTML without sanitization. If the LLM response contains HTML/JS, it executes in the browser. Needs DOMPurify or equivalent before inserting into the DOM.

5. Web Gateway — Token Exposed via /remote-control

The /remote-control slash command echoes the auth token into the chat response. In Discord servers or group chats, any participant who can read the channel gets full web UI access.

6. Branch is 358 commits behind main

The merge base is c21d77c. run_agent.py, cli.py, gateway/, and discord.py have all changed significantly since then. This is practically unmergeable as-is — even the rebase that was done introduced the Anthropic regression in issue #1 above.


🟡 Should-Fix

  • Web server binds 0.0.0.0 by default over plaintext HTTP — tokens are sniffable on LAN. Should default to 127.0.0.1 with explicit opt-in for LAN binding.
  • Token comparison uses == instead of hmac.compare_digest() — timing side-channel.
  • Hardcoded macOS Opus path (/opt/homebrew/lib/libopus.dylib) loaded before Linux path — should use ctypes.util.find_library().
  • _vprint suppresses some interrupt confirmation messages (3 instances of "interrupt detected during retry" lack force=True), so users get no feedback their interrupt was processed during voice mode.
  • Discord VC debug logging — first 5-10 RTP packets log raw hex at INFO level, should be DEBUG.
  • _keep_typing fix: the PR description mentions a critical _keep_typing deadlock fix, but no changes to _keep_typing or its lock/session coordination appear in the diff.
  • All web sessions share chat_id="web" — no per-user session isolation; multiple simultaneous web users would share conversation context.

🟢 What's Done Well

  • CLI voice integration is excellent — proper thread safety with _voice_lock, key bindings dispatched to daemon threads, atomic guards against double-start/stop, real-time audio level bar, continuous mode with 3-strike safety valve
  • Streaming TTS pipeline is well-architected — queue-based sentence buffering, dual cleanup paths, think-block filtering, graceful ElevenLabs fallback
  • Import safety is clean — zero module-level audio imports, everything lazily loaded
  • Discord VC implementation is solid — VoiceReceiver with proper lifecycle, echo prevention, inactivity auto-disconnect
  • Bug A and B fixes are clean with regression tests
  • browser_tool.py signal handler fix is a legitimate improvement
  • 60+ new tests with good coverage

Path Forward

Given the 358-commit gap and the security issues in the web gateway, a full rebase would be very challenging. One option would be to split this into smaller, focused PRs:

  1. CLI Voice Mode (voice_mode.py, transcription_tools.py, tts_tool.py, cli.py voice integration, config) — this is the strongest part and closest to merge-ready
  2. Gateway Voice Reply (Telegram + Discord /voice command) — relatively self-contained
  3. Discord Voice Channels — separate feature, can stand alone
  4. Web UI Gateway — needs the security fixes and is effectively a new subsystem

This would make each piece easier to rebase, review, and merge independently. Happy to help with any of this!

@0xbyt4 0xbyt4 force-pushed the feature/voice-mode branch from 9837473 to 5fdf0e3 Compare March 13, 2026 14:26
@0xbyt4

0xbyt4 commented Mar 13, 2026

Copy link
Copy Markdown
Contributor Author

Round 3 Review — All Issues Addressed

Thanks for the thorough review @teknium1 All blocking and should-fix items have been resolved.


Blocking Issues (6/6 Fixed)

# Issue Fix Commit
1 _interruptible_api_call lost Anthropic interrupt support Restored Anthropic-aware handler: checks api_mode == "anthropic_messages", closes _anthropic_client, rebuilds both clients 45baa4f
2 Path traversal in file uploads Path(field.filename).name strips ../ sequences aed9e28
3 Unauthenticated /media route Replaced add_static() with authenticated _handle_media handler — requires ?token= query param, validates with hmac.compare_digest, applies Path(filename).name sanitization aed9e28
4 XSS via marked.parse() + innerHTML Added DOMPurify CDN — all bot message HTML is sanitized via DOMPurify.sanitize(marked.parse(text)) aed9e28
5 Token exposed via /remote-control Token only shown in DM; group chats show "(hidden — check DM)" aed9e28
6 Branch 358 commits behind main Fully rebased onto current main rebased

Should-Fix Items (7/7 Fixed)

# Issue Fix Commit
1 Web server binds 0.0.0.0 by default Default changed to 127.0.0.1 in config, adapter, and /remote-control. Startup message shows only reachable URLs with hint for LAN opt-in aed9e28, 327f881
2 Token comparison uses == All 3 token checks replaced with hmac.compare_digest() aed9e28
3 Hardcoded macOS Opus path Primary: ctypes.util.find_library("opus"). Fallback: Homebrew paths on macOS only, guarded by sys.platform == "darwin" and os.path.isfile() 9e91937
4 _vprint suppresses interrupt messages Added force=True to all 5 interrupt confirmation _vprint calls 45baa4f
5 RTP packet logging at INFO level Demoted raw UDP, non-RTP skip, and RTP packet logs to logger.debug. SPEAKING events remain at INFO c32fc7e
6 _keep_typing deadlock fix not in diff Already in branch — metadata param added to _keep_typing(), send_typing() base class, and SignalAdapter.send_typing() pre-existing
7 All web sessions share chat_id="web" Changed to chat_id=f"web_{session_id}" — each WebSocket connection gets isolated conversation context c32fc7e

Additional Fixes (this session)

  • Gateway shutdown crash: RuntimeError: dictionary changed size during iteration in stop() — iterate over list(self.adapters.items()) copy (9e91937)
  • Web UI token exposure in logs: Configured tokens are no longer printed to console; only auto-generated tokens are shown (0c87dfa)
  • Empty WEB_UI_HOST env var: Falls back to 127.0.0.1 instead of binding to empty string (5fdf0e3)
  • Web UI env vars missing from docs: Added WEB_UI_ENABLED, WEB_UI_PORT, WEB_UI_HOST, WEB_UI_TOKEN to environment-variables.md reference (7936d33)

Test Coverage

All fixes have corresponding tests:

  • 196 tests in test_web.py + test_discord_opus.py + test_run_agent.py — all passing
  • TestPathTraversalSanitization (3) — Path.name strips traversal, upload produces safe filename
  • TestMediaEndpointAuth (4) — 401 without/wrong token, 200 with valid token, traversal blocked
  • TestHmacTokenComparison (2) — no ==/!= for token, hmac.compare_digest present
  • TestDomPurifyPresent (2) — DOMPurify script tag, sanitize(marked.parse()) pattern
  • TestDefaultBindLocalhost (2) — adapter and config default to 127.0.0.1
  • TestRemoteControlTokenHiding (2) — token visible in DM, hidden in group
  • TestVpnAndMultiInterfaceIp (7) — LAN preferred over VPN, fallbacks, loopback filtering
  • TestStartupTokenExposure (4) — auto-generated flag, configured token hidden
  • TestOpusFindLibrary (3) — find_library first, Homebrew fallback conditional, decode errors logged
  • TestInterruptVprintForceTrue (1) — all interrupt _vprint calls have force=True
  • TestAnthropicInterruptHandler (3) — Anthropic branch present, client rebuilt

Rebase

  • Rebased onto latest main
  • Resolved 9 conflict files (slack.py, toolsets.py, pyproject.toml, base.py, config.py, run.py, test_run_agent.py, cli.py, run_agent.py)
  • Verified all main changes preserved: parallel tool execution, Honcho manager params, Anthropic adapter, secret state

PR Splitting

Considered splitting into 4 PRs as suggested, but decided against it , the features are tightly coupled:

  • Gateway voice reply depends on CLI voice/TTS infrastructure
  • Discord VC reuses the same STT/TTS pipeline and voice_mode state
  • Web UI shares the gateway voice reply system and media handling

Splitting would mean duplicating shared code or creating artificial boundaries. All commits are already logically grouped and the rebase is clean.

@0xbyt4 0xbyt4 force-pushed the feature/voice-mode branch 2 times, most recently from 522494e to d9df64c Compare March 13, 2026 21:03
@teknium1

Copy link
Copy Markdown
Contributor

Review — Round 4

Great work addressing all the Round 3 feedback — the security fixes, Anthropic interrupt handler, and overall code quality are solid. The lazy imports, thread safety, and streaming TTS architecture are genuinely well-engineered.

However, we need to make some scope changes before this can merge:

Web UI Gateway — Please Remove

We are building our own official chat UI and dashboard for Hermes Agent. We cannot accept the web gateway (gateway/platforms/web.py, Platform.WEB, /remote-control command, hermes-web toolset) in this PR.

We should have been clearer about this in Round 3 — we suggested splitting the web UI into a separate PR but then gave detailed security feedback on it, which sent mixed signals. Apologies for that.

Please remove all web UI related code from this PR:

  • gateway/platforms/web.py
  • tests/gateway/test_web.py
  • website/docs/user-guide/messaging/web.md
  • Platform.WEB enum addition in gateway/config.py
  • hermes-web toolset in toolsets.py
  • /remote-control command in gateway/run.py
  • WEB_UI_* env var handling in gateway/config.py
  • Any web-related imports/references in gateway/run.py

If you want the web UI considered separately, feel free to open a new PR for it — but it will likely conflict with our own UI plans.

Remaining Issues to Fix (voice mode code)

With the web UI removed, these items remain:

1. sd.wait() in play_audio_file() can hang forever (tools/voice_mode.py)
Your play_beep() correctly avoids sd.wait() with a polling loop + 2s timeout (and the comments even explain why). But play_audio_file() still uses sd.wait(), which can block indefinitely if the audio device stalls. Please use the same polling pattern for consistency.

2. transcription_tools.py imports faster_whisper at module level
voice_mode.py is fully lazy (excellent work there), but transcription_tools.py does a module-level try: from faster_whisper import WhisperModel that runs at import time. If faster_whisper triggers a heavy native library load or crashes, it affects all code that imports the module. Please use the same lazy import pattern.

3. inspect.signature() in _send_voice_reply (gateway/run.py)
Checking if adapter.send_voice supports metadata by inspecting its signature at each call is fragile. Please use **kwargs pattern instead, or just ensure all adapters accept metadata (which they should after PR #1178 fixed Discord's signatures).

4. Unrelated SessionResetPolicy null-handling fix
The bugfix in gateway/config.py for SessionResetPolicy null handling is unrelated to voice mode. Please either remove it from this PR (we can merge it separately as a one-liner) or at minimum make it a separate commit so it's clear in git history.


Once these are addressed and the web UI code is removed, this should be ready to merge. The CLI voice mode, gateway voice reply, and Discord VC features are strong work. 🎉

@0xbyt4 0xbyt4 force-pushed the feature/voice-mode branch from 0e45dcd to d43ef9c Compare March 14, 2026 06:12
@0xbyt4

0xbyt4 commented Mar 14, 2026

Copy link
Copy Markdown
Contributor Author

Thank you @teknium1 for reviewing !

Round 4 — All Issues Addressed

Web UI — Removed

Completely removed all web UI code from the PR:

  • gateway/platforms/web.py, tests/gateway/test_web.py, website/docs/user-guide/messaging/web.md deleted
  • Platform.WEB enum, WEB_UI_* env handling, /remote-control command, hermes-web toolset removed
  • All references cleaned from docs (index.md, voice-mode.md, environment-variables.md, .env.example)
  • Session loading made resilient to removed platform values (skips unknown entries instead of crashing)

Fix 1: sd.wait() hang in play_audio_file()

Replaced with polling pattern + timeout, consistent with play_beep() which already had this fix with a comment explaining why sd.wait() is unsafe.

Fix 2: faster_whisper module-level import

Changed to importlib.util.find_spec() for availability checks — no module loading at import time. Actual from faster_whisper import WhisperModel and from openai import OpenAI now happen inside the transcription functions only when needed.

Fix 3: inspect.signature() in _send_voice_reply

Removed the inspect.signature() hack. Added **kwargs to TelegramAdapter.send_voice() — all adapters now uniformly accept metadata.

Fix 4: SessionResetPolicy null-handling

This fix is already in main (PR #1194). Not present in our diff against main — no action needed.

Rebase

Rebased onto latest main (23 new commits). Resolved 3 conflict areas in docs. All tests passing locally (3824 passed).

@teknium1

Copy link
Copy Markdown
Contributor

Thanks — this is much closer now, and removing the web UI scope was the right call. I re-reviewed the current branch against main and there are still a few required fixes before we can merge:

  1. run_agent.py: stream_callback is still OpenAI-chat-only
  • run_conversation() routes any non-None stream_callback into _streaming_api_call().
  • _streaming_api_call() still unconditionally calls self.client.chat.completions.create(..., stream=True).
  • In anthropic_messages mode, self.client is None, so this still breaks for Anthropic.
  • It also skips the normal provider-specific streaming paths.

Required fix:

  • Either gate streaming TTS to providers that actually support the current _streaming_api_call() implementation, or implement provider-correct streaming for Anthropic/Codex there.
  • Also preserve Anthropic base_url whenever rebuilding the client after interrupt/fallback. The constructor passes base_url into build_anthropic_client(...), but the interrupt/fallback rebuild paths currently drop it.
  1. Discord VC synthetic events are still keyed/authenticated like DMs
  • _handle_voice_channel_input() posts the transcript into the text channel before gateway auth.
  • It then constructs a synthetic SessionSource without chat_type / server-channel context, so it falls back to the default chat_type="dm".
  • build_session_key() then collapses those into the shared Discord DM session key instead of a server/channel-scoped session.
  • That can cause session/context bleed, and unauthorized VC users can still get transcript text echoed publicly before the normal auth flow runs.

Required fix:

  • Build the synthetic VC source with the correct server/channel context (not DM defaults).
  • Run authorization before echoing the transcript publicly.
  • Make sure VC traffic cannot fall into the DM pairing path.
  • When you do echo transcript text, do not send raw mentionable content directly.
  1. The CLI voice prefix is not actually turn-local
  • In cli.py, the voice path prepends the concise-response instruction to agent_message.
  • That prefixed message is then persisted by run_conversation() and written back into self.conversation_history.
  • The comment says the original history stays clean, but the current flow does not keep it clean.

Required fix:

  • Keep the voice instruction API-call-local only.
  • Do not let the synthetic [Voice input ...] prefix get persisted into conversation history / session DB / resumed sessions.
  1. /voice off still disagrees with runtime behavior
  • The command/status/docs say off means text-only.
  • But the base adapter still auto-generates TTS for voice inputs unconditionally.

Required fix:

  • Either make off truly text-only, or change the product semantics/docs/tests to match the intended behavior.
  • We should not merge while the user-facing contract and actual behavior disagree.

Once those are fixed, this looks very close. The core CLI voice work is strong — just need these last correctness issues cleaned up.

@0xbyt4 0xbyt4 force-pushed the feature/voice-mode branch from f487cdd to f7b3411 Compare March 14, 2026 07:45
0xbyt4 added 14 commits March 14, 2026 14:27
- Patch WEB_UI_HOST in test_web_defaults to avoid env leak
- Handle empty WEB_UI_HOST string in config (fall back to 127.0.0.1)
- Change RTP packet logging from INFO to DEBUG level to reduce noise
  (SPEAKING events remain at INFO as they are important lifecycle events)
- Use per-session chat_id (web_{session_id}) instead of shared "web"
  to isolate conversation context between simultaneous web users
Merge main's faster-whisper (local, free) with our Groq support into a
unified three-provider STT pipeline: local > groq > openai.

Provider priority ensures free options are tried first. Each provider
has its own transcriber function with model auto-correction, env-
overridable endpoints, and proper error handling.

74 tests cover the full provider matrix, fallback chains, model
correction, config loading, validation edge cases, and dispatch.
Voice status was hardcoded to check API keys only. Now uses the actual
provider resolution (local/groq/openai) so it correctly shows
"local faster-whisper" when installed instead of "Groq" or "MISSING".
Move stream close outside the lock in shutdown() to prevent deadlock
when audio callback tries to acquire the same lock. Replace single
t.join(timeout) with a polling loop (0.1s intervals) so KeyboardInterrupt
is not blocked during stream cleanup.
…ider key

- web.py: pass stt_model from config like discord.py and run.py do
- run.py: match new error messages (No STT provider / not set)
- _transcribe_local: add missing "provider": "local" to return dict
…rface issues

Remove web UI gateway (web.py, tests, docs, toolset, env vars, Platform.WEB
enum) per maintainer request — Nous is building their own official chat UI.

Fix 1: Replace sd.wait() with polling pattern in play_audio_file() to prevent
indefinite hang when audio device stalls (consistent with play_beep()).

Fix 2: Use importlib.util.find_spec() for faster_whisper/openai availability
checks instead of module-level imports that trigger heavy native library
loading (CUDA/cuDNN) at import time.

Fix 3: Remove inspect.signature() hack in _send_voice_reply() — add **kwargs
to Telegram send_voice() so all adapters accept metadata uniformly.

Fix 4: Make session loading resilient to removed platform enum values — skip
entries with unknown platforms instead of crashing the entire gateway.
…efix, auto-TTS control

1. Gate _streaming_api_call to chat_completions mode only — Anthropic and
   Codex fall back to _interruptible_api_call. Preserve Anthropic base_url
   across all client rebuild paths (interrupt, fallback, 401 refresh).

2. Discord VC synthetic events now use chat_type="channel" instead of
   defaulting to "dm" — prevents session bleed into DM context.
   Authorization runs before echoing transcript. Sanitize @everyone/@here
   in voice transcripts.

3. CLI voice prefix ("[Voice input...]") is now API-call-local only —
   stripped from returned history so it never persists to session DB or
   resumed sessions.

4. /voice off now disables base adapter auto-TTS via _auto_tts_disabled_chats
   set — voice input no longer triggers TTS when voice mode is off.
…response

The mock's app_commands SimpleNamespace lacked choices and Choice attrs,
causing xdist test ordering failures when this mock loaded before
test_discord_slash_commands.
1. Anthropic + ElevenLabs TTS silence: forward full response to TTS
   callback for non-streaming providers (choices first, then native
   content blocks fallback).

2. Subprocess timeout kill: play_audio_file now kills the process on
   TimeoutExpired instead of leaving zombie processes.

3. Discord disconnect cleanup: leave all voice channels before closing
   the client to prevent leaked state.

4. Audio stream leak: close InputStream if stream.start() fails.

5. Race condition: read/write _on_silence_stop under lock in audio
   callback thread.

6. _vprint force=True: show API error, retry, and truncation messages
   even during streaming TTS.

7. _refresh_level lock: read _voice_recording under _voice_lock.
The rebase added voice prompt checks to _get_tui_prompt_fragments but
the test stub was missing _voice_recording, _voice_processing and
_voice_mode attributes, causing AttributeError.
@0xbyt4 0xbyt4 force-pushed the feature/voice-mode branch from a7f86ca to 92c14ec Compare March 14, 2026 12:07
teknium1 added a commit that referenced this pull request Mar 14, 2026
fix: salvage PR #327 voice mode onto current main
@teknium1 teknium1 merged commit 523a1b6 into NousResearch:main Mar 14, 2026
1 check passed
0xbyt4 added a commit to 0xbyt4/hermes-agent that referenced this pull request Mar 16, 2026
Ported the proven UI and voice logic from the original Web UI (PR NousResearch#327)
adapted for the REST API transport:

UI:
- Glassmorphism theme (purple accent, grid background, glass effects)
- Centered chat container with desktop borders
- Voice waveform bubble player with seek and progress bars
- Markdown rendering with syntax highlighting (marked.js + highlight.js)
- Message animations, typing indicator, auto-scroll
- Mobile responsive design

Voice mode (from old VAD implementation):
- Press mic to enter voice mode (input bar hides, big mic shows)
- VAD silence detection (AnalyserNode, 1.5s silence auto-sends)
- TTS response plays invisibly, then auto-restarts recording
- Echo prevention: stop recording during TTS playback
- Press mic again to exit voice mode
- echoCancellation + noiseSuppression on getUserMedia

Adapted for REST API (was WebSocket):
- ws.send({type:'message'}) -> POST /v1/chat
- ws.send({type:'voice', b64}) -> POST /v1/chat/voice (FormData)
- play_audio event -> response.media[].url
- File upload via POST /v1/upload
@andrueandersoncs

Copy link
Copy Markdown

Implementation Complete ✅

Changes Made

Celebration & Animation:

  • Added confetti animation on first-run profile save using canvas-confetti
  • Respects prefers-reduced-motion for accessibility
  • Dual confetti bursts from left and right with green/gold color scheme

Enhanced Success Alert:

  • New gradient background (green to emerald) with left border accent
  • Party popper icon with bounce animation
  • "Welcome to Vantage!" headline (was: "Your profile is set...")
  • Clearer copy: "Your AI manager is ready to create personalized meal and training plans"
  • Primary CTA: "Build my first week" → links to /weekly-plan (was: "Review Weekly Plan")
  • Secondary CTA: "Go to Today" → links to /

First-Run Form Header:

  • Sparkles icon with amber gradient background
  • "Welcome to Vantage" title (was: "Your baseline profile")
  • Descriptive subtitle: "Tell us a bit about yourself so your AI manager can create personalized plans"

Behavior Changes:

  • First-run saves no longer auto-redirect (users see the celebration)
  • Form title changes to "Your profile is saved" after save (behind the alert)
  • Edit mode unchanged (shows "Edit profile" title)

Files Modified:

  • components/profile-form.tsx - UI enhancements and confetti integration
  • components/profile-form.test.tsx - Updated test assertions
  • components/profile-screen.test.tsx - Updated test assertions

Verification

  • ✅ All 36 profile-related tests passing
  • ✅ Build compiles successfully
  • ✅ Confetti respects reduced-motion preferences

Deployed to Railway: https://vantage-production-b8d9.up.railway.app

angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
Merge contributor branch feature/voice-mode onto current main for follow-up fixes.
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
…f5fb1d3b

fix: salvage PR NousResearch#327 voice mode onto current main
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
Merge contributor branch feature/voice-mode onto current main for follow-up fixes.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…f5fb1d3b

fix: salvage PR NousResearch#327 voice mode onto current main
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
Merge contributor branch feature/voice-mode onto current main for follow-up fixes.
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
…f5fb1d3b

fix: salvage PR NousResearch#327 voice mode onto current main
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Merge contributor branch feature/voice-mode onto current main for follow-up fixes.
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…f5fb1d3b

fix: salvage PR NousResearch#327 voice mode onto current main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature: Voice Mode — Speech Input/Output for CLI and Gateway Platforms

3 participants