feat: Voice Mode — CLI, Telegram, Discord (text + VC), and Web UI with full voice support (Issue #314)#327
Conversation
|
Really cool feature concept — voice mode would be a great addition to Hermes. A few concerns about cross-platform/environment compatibility before this could be merged: Audio dependency fragility (sounddevice + PortAudio):
Key binding conflict:
Core agent loop changes:
What we'd need to feel comfortable merging:
The feature itself is genuinely exciting — just needs to be bulletproof for the environments Hermes runs in. Happy to review again once these are addressed! |
|
hi @teknium1 thank you for review and addressed issues solved:
|
|
Thanks for addressing the original feedback — lazy imports, graceful degradation, Ctrl+B default, and environment detection all look good. A few remaining issues before we can merge: Bug A: Streaming TTS can never activate (stale imports)
from tools.tts_tool import (
_load_tts_config as _load_tts_cfg,
_get_provider as _get_prov,
_HAS_ELEVENLABS as _el_ok,
_HAS_AUDIO as _audio_ok,
stream_tts_to_speaker,
)These module-level booleans no longer exist — they were removed when you switched to lazy import functions ( Fix: Replace the boolean checks with calls to the lazy import helpers, e.g.: try:
from tools.tts_tool import (
_load_tts_config as _load_tts_cfg,
_get_provider as _get_prov,
_import_elevenlabs,
_import_sounddevice,
stream_tts_to_speaker,
)
_tts_cfg = _load_tts_cfg()
_el_ok = False
_audio_ok = False
try:
_import_elevenlabs()
_el_ok = True
except ImportError:
pass
try:
_import_sounddevice()
_audio_ok = True
except (ImportError, OSError):
pass
if _get_prov(_tts_cfg) == "elevenlabs" and _el_ok and _audio_ok:
use_streaming_tts = True
except Exception:
passBug B: Voice mode system prompt is a no-op
Important: Even if you fix the propagation, modifying the system prompt mid-conversation would break prompt caching (the cache prefix becomes invalid). This is a core policy — see AGENTS.md. Suggested fix: Instead of modifying the system prompt, inject the voice mode instruction as a user message prefix when voice input is submitted. Something like prepending Bug C: Branch needs rebaseThe branch is ~20 commits behind main (merge base is Minor concern: _vprint suppresses error messagesThe |
e03e481 to
b16527c
Compare
Update: Telegram Gateway Voice Mode + Critical Bug FixBug Fix:
|
11d431f to
44d661f
Compare
dfb8595 to
deaf36e
Compare
fd10f94 to
9837473
Compare
Code Review — Round 3First off, really impressive work here — the scope and quality of the CLI voice integration, streaming TTS pipeline, and Discord VC implementation show real engineering skill. The thread safety patterns, lazy imports, and graceful degradation are all well done. Bugs A and B from the previous review are properly fixed with regression tests. 👏 That said, there are several issues that need to be addressed before this can merge: 🔴 Blocking Issues1. Rebase Regression: During the rebase, the Anthropic-aware interrupt handler was accidentally moved FROM
This affects ALL Anthropic users on interrupt (not just voice users). On interrupt: wrong client closed, wrong client rebuilt, token generation continues on the Anthropic side. This is a critical regression. 2. Web Gateway — Path Traversal in File Uploads The upload handler uses the user-supplied filename without sanitization: orig_name = field.filename or "file"
filename = f"upload_{uuid.uuid4().hex[:8]}_{orig_name}"
dest = self._media_dir / filenameIf 3. Web Gateway — Unauthenticated The static file serving via 4. Web Gateway — XSS via Bot messages are rendered via 5. Web Gateway — Token Exposed via The 6. Branch is 358 commits behind main The merge base is 🟡 Should-Fix
🟢 What's Done Well
Path ForwardGiven the 358-commit gap and the security issues in the web gateway, a full rebase would be very challenging. One option would be to split this into smaller, focused PRs:
This would make each piece easier to rebase, review, and merge independently. Happy to help with any of this! |
9837473 to
5fdf0e3
Compare
Round 3 Review — All Issues AddressedThanks for the thorough review @teknium1 All blocking and should-fix items have been resolved. Blocking Issues (6/6 Fixed)
Should-Fix Items (7/7 Fixed)
Additional Fixes (this session)
Test CoverageAll fixes have corresponding tests:
Rebase
PR SplittingConsidered splitting into 4 PRs as suggested, but decided against it , the features are tightly coupled:
Splitting would mean duplicating shared code or creating artificial boundaries. All commits are already logically grouped and the rebase is clean. |
522494e to
d9df64c
Compare
Review — Round 4Great work addressing all the Round 3 feedback — the security fixes, Anthropic interrupt handler, and overall code quality are solid. The lazy imports, thread safety, and streaming TTS architecture are genuinely well-engineered. However, we need to make some scope changes before this can merge: Web UI Gateway — Please RemoveWe are building our own official chat UI and dashboard for Hermes Agent. We cannot accept the web gateway ( We should have been clearer about this in Round 3 — we suggested splitting the web UI into a separate PR but then gave detailed security feedback on it, which sent mixed signals. Apologies for that. Please remove all web UI related code from this PR:
If you want the web UI considered separately, feel free to open a new PR for it — but it will likely conflict with our own UI plans. Remaining Issues to Fix (voice mode code)With the web UI removed, these items remain: 1. 2. 3. 4. Unrelated Once these are addressed and the web UI code is removed, this should be ready to merge. The CLI voice mode, gateway voice reply, and Discord VC features are strong work. 🎉 |
0e45dcd to
d43ef9c
Compare
|
Thank you @teknium1 for reviewing ! Round 4 — All Issues AddressedWeb UI — RemovedCompletely removed all web UI code from the PR:
Fix 1:
|
|
Thanks — this is much closer now, and removing the web UI scope was the right call. I re-reviewed the current branch against main and there are still a few required fixes before we can merge:
Required fix:
Required fix:
Required fix:
Required fix:
Once those are fixed, this looks very close. The core CLI voice work is strong — just need these last correctness issues cleaned up. |
f487cdd to
f7b3411
Compare
- Patch WEB_UI_HOST in test_web_defaults to avoid env leak - Handle empty WEB_UI_HOST string in config (fall back to 127.0.0.1)
- Change RTP packet logging from INFO to DEBUG level to reduce noise
(SPEAKING events remain at INFO as they are important lifecycle events)
- Use per-session chat_id (web_{session_id}) instead of shared "web"
to isolate conversation context between simultaneous web users
Merge main's faster-whisper (local, free) with our Groq support into a unified three-provider STT pipeline: local > groq > openai. Provider priority ensures free options are tried first. Each provider has its own transcriber function with model auto-correction, env- overridable endpoints, and proper error handling. 74 tests cover the full provider matrix, fallback chains, model correction, config loading, validation edge cases, and dispatch.
Voice status was hardcoded to check API keys only. Now uses the actual provider resolution (local/groq/openai) so it correctly shows "local faster-whisper" when installed instead of "Groq" or "MISSING".
Move stream close outside the lock in shutdown() to prevent deadlock when audio callback tries to acquire the same lock. Replace single t.join(timeout) with a polling loop (0.1s intervals) so KeyboardInterrupt is not blocked during stream cleanup.
…ider key - web.py: pass stt_model from config like discord.py and run.py do - run.py: match new error messages (No STT provider / not set) - _transcribe_local: add missing "provider": "local" to return dict
…tate into agent context
…rface issues Remove web UI gateway (web.py, tests, docs, toolset, env vars, Platform.WEB enum) per maintainer request — Nous is building their own official chat UI. Fix 1: Replace sd.wait() with polling pattern in play_audio_file() to prevent indefinite hang when audio device stalls (consistent with play_beep()). Fix 2: Use importlib.util.find_spec() for faster_whisper/openai availability checks instead of module-level imports that trigger heavy native library loading (CUDA/cuDNN) at import time. Fix 3: Remove inspect.signature() hack in _send_voice_reply() — add **kwargs to Telegram send_voice() so all adapters accept metadata uniformly. Fix 4: Make session loading resilient to removed platform enum values — skip entries with unknown platforms instead of crashing the entire gateway.
…efix, auto-TTS control 1. Gate _streaming_api_call to chat_completions mode only — Anthropic and Codex fall back to _interruptible_api_call. Preserve Anthropic base_url across all client rebuild paths (interrupt, fallback, 401 refresh). 2. Discord VC synthetic events now use chat_type="channel" instead of defaulting to "dm" — prevents session bleed into DM context. Authorization runs before echoing transcript. Sanitize @everyone/@here in voice transcripts. 3. CLI voice prefix ("[Voice input...]") is now API-call-local only — stripped from returned history so it never persists to session DB or resumed sessions. 4. /voice off now disables base adapter auto-TTS via _auto_tts_disabled_chats set — voice input no longer triggers TTS when voice mode is off.
…response The mock's app_commands SimpleNamespace lacked choices and Choice attrs, causing xdist test ordering failures when this mock loaded before test_discord_slash_commands.
1. Anthropic + ElevenLabs TTS silence: forward full response to TTS callback for non-streaming providers (choices first, then native content blocks fallback). 2. Subprocess timeout kill: play_audio_file now kills the process on TimeoutExpired instead of leaving zombie processes. 3. Discord disconnect cleanup: leave all voice channels before closing the client to prevent leaked state. 4. Audio stream leak: close InputStream if stream.start() fails. 5. Race condition: read/write _on_silence_stop under lock in audio callback thread. 6. _vprint force=True: show API error, retry, and truncation messages even during streaming TTS. 7. _refresh_level lock: read _voice_recording under _voice_lock.
The rebase added voice prompt checks to _get_tui_prompt_fragments but the test stub was missing _voice_recording, _voice_processing and _voice_mode attributes, causing AttributeError.
a7f86ca to
92c14ec
Compare
fix: salvage PR #327 voice mode onto current main
Ported the proven UI and voice logic from the original Web UI (PR NousResearch#327) adapted for the REST API transport: UI: - Glassmorphism theme (purple accent, grid background, glass effects) - Centered chat container with desktop borders - Voice waveform bubble player with seek and progress bars - Markdown rendering with syntax highlighting (marked.js + highlight.js) - Message animations, typing indicator, auto-scroll - Mobile responsive design Voice mode (from old VAD implementation): - Press mic to enter voice mode (input bar hides, big mic shows) - VAD silence detection (AnalyserNode, 1.5s silence auto-sends) - TTS response plays invisibly, then auto-restarts recording - Echo prevention: stop recording during TTS playback - Press mic again to exit voice mode - echoCancellation + noiseSuppression on getUserMedia Adapted for REST API (was WebSocket): - ws.send({type:'message'}) -> POST /v1/chat - ws.send({type:'voice', b64}) -> POST /v1/chat/voice (FormData) - play_audio event -> response.media[].url - File upload via POST /v1/upload
Implementation Complete ✅Changes MadeCelebration & Animation:
Enhanced Success Alert:
First-Run Form Header:
Behavior Changes:
Files Modified:
Verification
Deployed to Railway: https://vantage-production-b8d9.up.railway.app |
Merge contributor branch feature/voice-mode onto current main for follow-up fixes.
…f5fb1d3b fix: salvage PR NousResearch#327 voice mode onto current main
Merge contributor branch feature/voice-mode onto current main for follow-up fixes.
…f5fb1d3b fix: salvage PR NousResearch#327 voice mode onto current main
Merge contributor branch feature/voice-mode onto current main for follow-up fixes.
…f5fb1d3b fix: salvage PR NousResearch#327 voice mode onto current main
Merge contributor branch feature/voice-mode onto current main for follow-up fixes.
…f5fb1d3b fix: salvage PR NousResearch#327 voice mode onto current main



Summary
Implements Voice Mode for the Hermes CLI (Issue #314, Phases 2-5). Users can speak to the agent via microphone and optionally hear responses read aloud via TTS — with sentence-by-sentence streaming for ElevenLabs.
Note: Phase 1 (Gateway voice messages) was already implemented — Telegram, Discord, WhatsApp, and Slack all handle incoming voice messages with auto-transcription.
What's New
Phase 2: CLI Voice Input
/voiceslash command to toggle voice mode on/offsounddevice+numpy(optional deps viapip install hermes-agent[voice])VOICE_TOOLS_OPENAI_KEY) and Groq Whisper (GROQ_API_KEY) with automatic model correction per provider● ▃ ❯)Phase 3: TTS Response Output
/voice ttssub-toggle to read agent responses aloudtext_to_speechtool infrastructurePhase 4: Low-Latency Features
subprocess.Popen(notrun) sostop_playback()can terminate itvoice.silence_thresholdandvoice.silence_durationinconfig.yamlPhase 5: Streaming TTS (ElevenLabs)
pcm_24000→sounddevice.OutputStream→ speaker.!?\n\n), with 20-char minimum to merge short fragments and 100-char timeout flush for long sentences without punctuation<think>...</think>content is stripped in real-time so reasoning tokens are never spokenrun_agent._interruptible_api_call()usesstream=Truewhen callback is set, accumulates chunks into a mockChatCompletionresponse (same interface as non-streaming)elevenlabsorsounddeviceis not installed, falls back to batch TTS automaticallystream_callbackdefaults toNone, API calls stay non-streamingeleven_flash_v2_5(~75ms latency) by default, configurable viatts.elevenlabs.streaming_model_idin config.yamlDesign Decisions
Why Ctrl+R toggle instead of Space-bar hold-to-talk?
The issue suggested hold-Space, but
prompt_toolkitdoesn't support key-up events, making hold-to-talk infeasible. Ctrl+R toggle combined with silence detection provides a better UX — the user presses once, speaks naturally, and recording auto-stops when they're done. No need to hold anything.Why not streaming STT?
OpenAI Whisper API and Groq Whisper API don't support streaming transcription. Adding a streaming provider (Deepgram/AssemblyAI) would require new service dependencies. The current approach (record → auto-stop on silence → transcribe) is reliable and keeps the dependency footprint small.
Why only ElevenLabs for streaming TTS?
ElevenLabs returns raw PCM chunks that can be written directly to
sounddevice.OutputStreamfor zero-copy playback. Edge TTS is async and outputs MP3 (needs decoding), OpenAI TTS returns complete files. The streaming architecture requires chunk-by-chunk audio iteration which only ElevenLabs supports natively.CoreAudio safety
On macOS,
sd.play()(beep) andsd.InputStream(recording) conflict when running simultaneously (PaMacCore error). Beeps are played synchronously BEFORE starting the recording stream to avoid this.Quick Usage
Streaming TTS Setup (optional):
In
~/.hermes/config.yaml:STT Provider: Tested with Groq Whisper (
GROQ_API_KEY), also supports OpenAI Whisper (VOICE_TOOLS_OPENAI_KEY). Groq's free tier works well for this.Install:
pip install hermes-agent[voice](addssounddevice+numpy)Files Changed
tools/voice_mode.pytools/tts_tool.pystream_tts_to_speaker()— sentence buffer, think block filter, markdown stripping, ElevenLabs PCM streaming to sounddevicetools/transcription_tools.pyrun_agent.pystream_callbackparam torun_conversation()/chat(), streaming path in_interruptible_api_call()with mock ChatCompletion assemblycli.pyhermes_cli/config.pyhermes_cli/commands.py/voicecommand registrationpyproject.toml[voice]optional dependency group.env.exampleGROQ_API_KEYdocumentationtests/tools/test_voice_mode.pytests/tools/test_transcription_tools.pytests/tools/test_voice_cli_integration.pyTest Plan
test_voice_mode.py+test_transcription_tools.py+test_voice_cli_integration.py)test_cli_init.py+test_run_agent.py)/voice on→ Ctrl+R → speak → auto-stop → transcription → agent response/voice ttsreads responses aloudCloses #314