feat(gateway): add LiveKit WebRTC voice platform support#3894
Looking at the issues.
Add LiveKit as a new gateway platform, enabling real-time voice conversations with Hermes agents via WebRTC. Supports both self-hosted LiveKit servers and LiveKit Cloud. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Desktop voice-agent clients (and any compatible UI) expect JSON-encoded
agent:* events on the LiveKit data channel to drive UI state: listening
indicator, live user transcript, thinking/speaking indicators, and
conversation log updates.
Emit seven events at the appropriate lifecycle points:
- agent:listening-start — VAD detects speech onset (revert if too short)
- agent:listening-stop — silence threshold reached or false alarm
- agent:user-transcript — final ASR result (with transcript + identity)
- agent:thinking-start — just before LLM invocation
- agent:speaking-start — first TTS frame published
- agent:speaking-stop — playback finished (or errored)
- agent:agent-transcript — mirror of every send() payload as the
assistant's final text, so clients can render
the conversation log
Events go out on the default (unnamed) data topic so desktop's topic
router (topic == "hermes-chat" -> plain text, else -> JSON event)
decodes them correctly; "hermes-chat" is still used for the raw text
mirror from send().
Event publishes are wrapped in try/except + logger.debug — UI telemetry
must never break the voice flow.
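A minimal sketch of the event path described above, not the adapter's actual code. The `{"type": ...}` payload shape, the `publish` callable, and the helper names are assumptions for illustration; only the event names, the `"hermes-chat"` topic, and the try/except + `logger.debug` policy come from the commit message.

```python
import json
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("livekit.events")

CHAT_TOPIC = "hermes-chat"  # from the commit message: raw text mirror from send()


def encode_event(name: str, **fields: Any) -> bytes:
    """Serialize an agent:* event as UTF-8 JSON (payload shape is assumed)."""
    return json.dumps({"type": name, **fields}).encode("utf-8")


def safe_publish(
    publish: Callable[[bytes, Optional[str]], None],
    payload: bytes,
    topic: Optional[str] = None,
) -> bool:
    """Publish on the default (unnamed) topic unless one is given.

    Any failure is logged at debug level and swallowed, so UI telemetry
    cannot interrupt the voice flow.
    """
    try:
        publish(payload, topic)
        return True
    except Exception as exc:
        logger.debug("event publish failed: %s", exc)
        return False


def decode_packet(topic: Optional[str], data: bytes):
    """Client-side router: "hermes-chat" is plain text, anything else is JSON."""
    if topic == CHAT_TOPIC:
        return data.decode("utf-8")
    return json.loads(data.decode("utf-8"))
```

Passing the publisher in as a callable keeps the sketch independent of the LiveKit SDK; in the adapter it would wrap whatever data-channel publish call the SDK provides.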
Three related optimizations so the LiveKit adapter stops holding a
participant slot when nobody's there to talk to:
1. Presence polling (lazy join):
- connect() now calls RoomService.list_participants() via livekit-api.
If the room is empty, the adapter does not join — instead it starts
_presence_watch_loop, a 30s poll that joins as soon as a remote
participant appears.
- Adapter still reports connected to the gateway so send/receive code
paths work when the room is occupied.
2. Auto-leave when alone:
- _on_participant_disconnected checks self._room.remote_participants.
If it's empty, _leave_and_watch tears down the room connection,
cancels the silence task and audio streams, and re-arms the
presence watcher.
- Added a _graceful_leave flag so that room.disconnect() triggered
by leaving intentionally does not also kick off _reconnect_loop.
3. Reconnect retries capped at MAX_RECONNECT_ATTEMPTS (10):
- Previously _reconnect_loop retried forever with a 60s ceiling.
A misconfigured LIVEKIT_URL would spam retries until the process
was restarted. Now we log an error and go idle after the cap.
- _reconnect_loop now calls _join_room (the pure room-join body)
rather than the public connect() entry, so a reconnect doesn't
re-trigger presence polling.
Also: silence-detection loop now uses a 2s interval when no participants
are buffered (was always 200ms), saving a tiny bit of CPU during idle
windows and while presence polling is waiting for someone to arrive.
The old connect() body was extracted into _join_room() — same logic,
different name — so presence polling and reconnect can reuse it.
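The capped retry from item 3 can be sketched as follows. This is an illustration under stated assumptions: `MAX_RECONNECT_ATTEMPTS` and the 60s ceiling come from the commit message, while the 1s starting delay, the doubling schedule, and the injectable `sleep` are hypothetical choices made here.

```python
import asyncio
from typing import Awaitable, Callable

MAX_RECONNECT_ATTEMPTS = 10  # cap named in the commit message
BACKOFF_CEILING = 60.0       # the 60s ceiling mentioned above
BACKOFF_BASE = 1.0           # assumed starting delay, not from the source


def backoff_delay(attempt: int) -> float:
    """Exponential backoff: base * 2^attempt, clamped to the ceiling."""
    return min(BACKOFF_BASE * 2 ** attempt, BACKOFF_CEILING)


async def reconnect(
    join_room: Callable[[], Awaitable[None]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Try join_room up to MAX_RECONNECT_ATTEMPTS times, then give up.

    Returns True on the first successful join and False once the cap is
    reached, at which point the caller logs an error and goes idle.
    """
    for attempt in range(MAX_RECONNECT_ATTEMPTS):
        try:
            await join_room()
            return True
        except Exception:
            # Wait before the next attempt (also after the final failure).
            await sleep(backoff_delay(attempt))
    return False
```

Calling the room-join body directly, rather than the public entry point, is what keeps a reconnect from re-triggering presence polling.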
…, env override
LiveKit Cloud has real rate limits and per-minute billing, so 30s between presence checks is a sensible default. Self-hosted LiveKit has neither constraint, so 5s keeps the first-speaker wait short.
Detection is by URL: hostname containing ".livekit.cloud" picks the cloud cadence, anything else picks the local cadence.
LIVEKIT_PRESENCE_POLL_INTERVAL (seconds, float) overrides either default for users with unusual deployments. The setup wizard now prompts for it with an "empty = auto" hint.
Resolved once at adapter construction and logged so the operator can see which cadence is in effect and why.
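A minimal sketch of the resolution rule described above. The function signature, the constant names, and the fallback to auto-detection on an unparsable or non-positive override are assumptions; the 30s/5s defaults, the ".livekit.cloud" hostname check, and the `LIVEKIT_PRESENCE_POLL_INTERVAL` variable come from the commit message.

```python
from typing import Mapping
from urllib.parse import urlparse

CLOUD_POLL_INTERVAL = 30.0  # LiveKit Cloud: rate limits and per-minute billing
LOCAL_POLL_INTERVAL = 5.0   # self-hosted: keep the first-speaker wait short


def resolve_presence_poll_interval(url: str, env: Mapping[str, str]) -> float:
    """Return the presence-poll cadence in seconds.

    An explicit LIVEKIT_PRESENCE_POLL_INTERVAL wins; an empty value means
    "auto", which picks the cadence from the URL's hostname.
    """
    raw = env.get("LIVEKIT_PRESENCE_POLL_INTERVAL", "").strip()
    if raw:
        try:
            value = float(raw)
            if value > 0:
                return value
        except ValueError:
            pass  # assumed behaviour: ignore a bad override and auto-detect
    host = urlparse(url).hostname or ""
    return CLOUD_POLL_INTERVAL if ".livekit.cloud" in host else LOCAL_POLL_INTERVAL
```

For example, `resolve_presence_poll_interval("ws://your-server:7880", os.environ)` would pick the local cadence unless the override is set.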
The earlier commit that added _resolve_presence_poll_interval inserted the new method's body between __init__'s presence_poll_interval call and the trailing init lines (_audio_buffers, _last_audio_time, _audio_streams, _paused, _speaking_participants). The trailing lines ended up *after* the method's `return interval` and so were never executed.
Symptom: every participant_disconnected event raised AttributeError: 'LiveKitAdapter' object has no attribute '_audio_streams' inside _cleanup_participant. Because the AttributeError aborted the event handler before the auto-leave check, the room never tore down when the last human left.
Move the orphaned init lines back into __init__ where they belong, so the dicts and flags exist on the instance.
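The failure mode is easy to reproduce in isolation. The two classes below are hypothetical stand-ins, not the adapter's code; they only show how statements stranded after a `return` leave the instance without the attribute.

```python
class BrokenAdapter:
    """Init lines were displaced into the method below, after its return."""

    def __init__(self):
        self.presence_poll_interval = self._resolve_presence_poll_interval()

    def _resolve_presence_poll_interval(self):
        interval = 5.0
        return interval
        # Unreachable: this line was meant to run in __init__.
        self._audio_streams = {}


class FixedAdapter:
    """Init lines moved back into __init__, so the dict exists on the instance."""

    def __init__(self):
        self.presence_poll_interval = self._resolve_presence_poll_interval()
        self._audio_streams = {}

    def _resolve_presence_poll_interval(self):
        return 5.0
```

Accessing `BrokenAdapter()._audio_streams` raises the same kind of AttributeError the commit message describes, while `FixedAdapter()` carries the empty dict from construction.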
_process_voice_input was importing a function that doesn't exist in tools.transcription_tools, which raised ImportError on every utterance and silently aborted the voice pipeline (no transcript, no LLM call, no TTS reply). transcribe_audio already resolves the model from stt config internally when called with no model arg — that's the pattern gateway/run.py and gateway/platforms/discord.py use. Drop the bogus import + helper call and let transcribe_audio do its thing.
The LiveKit adapter only ever joins one room (the configured LIVEKIT_ROOM), so the "home channel" is unambiguous: it's the room. Unlike Discord/Telegram/Slack where a bot lives in many channels and the user has to pick which is "home", LiveKit's single-room architecture makes that choice trivially singular.
Previously LIVEKIT_HOME_CHANNEL being unset caused the gateway's first-message onboarding gate to fire, asking the voice user to type /sethome, which a voice-only user can't meaningfully do. It also left cron/cross-platform delivery without a default destination even though one was obviously available (the room).
Default LIVEKIT_HOME_CHANNEL to LIVEKIT_ROOM when unset. Set both the env var (for the onboarding-gate's os.getenv check) and the PlatformConfig.home_channel (for runtime delivery resolution). Still honors an explicit LIVEKIT_HOME_CHANNEL override.
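A minimal sketch of the defaulting rule, assuming a helper that receives the environment as a mutable mapping. The helper name and signature are hypothetical; the real change also sets `PlatformConfig.home_channel`, which is omitted here.

```python
from typing import MutableMapping, Optional


def resolve_home_channel(env: MutableMapping[str, str]) -> Optional[str]:
    """Default LIVEKIT_HOME_CHANNEL to LIVEKIT_ROOM when it is unset.

    An explicit LIVEKIT_HOME_CHANNEL always wins. The resolved value is
    written back to the mapping so a later env lookup (such as the
    onboarding gate's check) sees it.
    """
    home = env.get("LIVEKIT_HOME_CHANNEL") or env.get("LIVEKIT_ROOM")
    if home:
        env["LIVEKIT_HOME_CHANNEL"] = home
    return home
```

Passing `os.environ` as `env` would mirror the "set the env var" half of the change.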
Two bugs blocked end-to-end speech on a freshly-joined LiveKit room:
1. _cleanup_participant dropped the audio buffer the instant the track unsubscribed. That is fine for permanent leaves, but it also fires when a participant's mic transiently drops or, for file-based publishers, when the clip ends. VAD never reached its silence-threshold trigger on the unfinished utterance, so the user's last words were lost. Flush the pending utterance (if it meets MIN_SPEECH_DURATION after trailing-silence trim) through _process_voice_input before tearing down state, so the words still reach STT.
2. voice.auto_tts is False by default. That is correct for text platforms like Discord/Slack where TTS is opt-in, but wrong for LiveKit where the channel itself is audio. A typed-only reply gives the LiveKit user nothing. Override _should_auto_tts_for_chat in LiveKitAdapter so it defaults to True; per-chat /voice off via _auto_tts_disabled_chats still wins.
End-to-end test: probe joins the room, publishes a 5s utterance, hears Avery reply with cloud OpenAI TTS through the LiveKit audio track. Confirmed listening-start, listening-stop, user-transcript, thinking-start, speaking-start, speaking-stop all fire in order with real audio frames published back.
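The auto-TTS override in item 2 could look roughly like this. The two classes are simplified stand-ins: the method name `_should_auto_tts_for_chat` and the `_auto_tts_disabled_chats` set come from the commit message, but the base-class constructor and its `auto_tts` argument are assumptions.

```python
class BaseAdapter:
    """Stand-in for the base adapter: auto-TTS follows the global default."""

    def __init__(self, auto_tts: bool = False):
        self._auto_tts_default = auto_tts            # mirrors voice.auto_tts
        self._auto_tts_disabled_chats: set[str] = set()

    def _should_auto_tts_for_chat(self, chat_id: str) -> bool:
        if chat_id in self._auto_tts_disabled_chats:
            return False
        return self._auto_tts_default


class VoiceAdapter(BaseAdapter):
    """Stand-in for a voice-first adapter: speak by default."""

    def _should_auto_tts_for_chat(self, chat_id: str) -> bool:
        # The channel is audio, so default to True; a per-chat
        # "/voice off" still wins.
        return chat_id not in self._auto_tts_disabled_chats
```

Keeping the per-chat opt-out ahead of the default means a user who disabled voice is never spoken to, regardless of platform.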
Match the 2026-05-12 supply-chain hardening policy on main: replace
``livekit>=1.0.17,<2`` / ``livekit-api>=1.0.7,<2`` with exact pins so
PyPI cannot ship a new release into a user's install without an
intentional bump here.
Versions:
livekit==1.1.7 released 2026-04-27, no yanks, well past any
fresh-release risk window (skipped 1.1.8 which
shipped 2026-05-13 — too recent given the
Mini Shai-Hulud quarantine policy).
livekit-api==1.1.0 released 2025-12-02, six months stable. Skipped
1.0.8 (yanked: wrong dependencies).
…) hook
Refactor the inlined `re.sub(...)[:4000].strip()` cleanup at the auto-TTS site in `_process_message_background` into an overridable method `BasePlatformAdapter.prepare_tts_text(text: str) -> str`. The default implementation is byte-identical to the previous inline expression (strip `* _ \` # [ ] ( )` and truncate to 4000 chars), so every existing adapter (Telegram, Discord, Slack, Matrix, IRC, etc.) gets exactly the same behaviour as before. Zero behaviour change for any consumer that doesn't override the method.
Why add the hook: voice-first platform adapters need stricter cleanup than text-bubble platforms. The default strips a handful of markdown sigils, which is fine when the output goes into a Discord embed or a Telegram message bubble. Read aloud by a TTS engine, however, URLs (`https://example.com/foo`), fenced code blocks, file paths (`/Users/x/foo.py`), and `MEDIA:` tags turn into long sequences of unintelligible characters. With this hook an adapter can drop those spans before TTS while leaving the data-channel transcript intact for visual rendering.
Without the hook, voice adapters have to either
- duplicate the auto-TTS flow inside their own `handle_response` pipeline, which means re-implementing the entire `extract_media`, `extract_images`, `extract_local_files`, attachment routing and error-handling sequence in `_process_message_background`, or
- live with TTS speaking URLs character-by-character.
Both are worse than a 7-line method addition.
Example consumer: https://github.com/kortexa-ai/hermes-livekit (LiveKit WebRTC voice gateway plugin). Its `LiveKitAdapter.prepare_tts_text()` additionally strips fenced code blocks, inline code, URLs, file paths, and `MEDIA:` tags before TTS synthesis, while the full response still reaches connected clients via the data channel. Drop-in installable via `pip install git+https://github.com/kortexa-ai/hermes-livekit.git`.
Carved out of #3894 (LiveKit WebRTC gateway PR) so the generic hook can land independently of the LiveKit platform itself.
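A rough sketch of the two cleanup levels described above, written as free functions rather than adapter methods. The regular expressions are assumptions chosen for illustration; the source only states which characters the default strips, the 4000-character limit, and which kinds of spans the voice override removes.

```python
import re

MAX_TTS_CHARS = 4000
FENCE = "`" * 3  # built at runtime so this example contains no literal fence

_FENCED = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
_INLINE = re.compile(r"`[^`\n]*`")
_MEDIA = re.compile(r"MEDIA:\S+")
_URL = re.compile(r"https?://\S+")
_PATH = re.compile(r"(?<![\w/])/(?:[\w.\-]+/)+[\w.\-]+")


def default_prepare_tts_text(text: str) -> str:
    """Default cleanup: drop a few markdown sigils, cap the length, trim."""
    return re.sub(r"[*_`#\[\]()]", "", text)[:MAX_TTS_CHARS].strip()


def voice_prepare_tts_text(text: str) -> str:
    """Stricter cleanup for spoken output (hypothetical patterns).

    Removes fenced code, inline code, MEDIA: tags, URLs, and absolute
    file paths, collapses whitespace, then applies the default cleanup.
    """
    for pattern in (_FENCED, _INLINE, _MEDIA, _URL, _PATH):
        text = pattern.sub(" ", text)
    text = re.sub(r"\s+", " ", text)
    return default_prepare_tts_text(text)
```

In an adapter, the second function would be the body of a `prepare_tts_text()` override, while the unfiltered text is still sent over the data channel.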
Pull request overview
Adds a new LiveKit gateway platform adapter to enable real-time, voice-first Hermes conversations over WebRTC, and wires it into the gateway/CLI/tooling so it can be configured, authorized, and used alongside existing messaging platforms.
Changes:
- Introduces `gateway/platforms/livekit.py` implementing LiveKit room join/presence polling, inbound audio buffering → STT → agent loop, and outbound TTS playback + data-channel text.
- Adds a `BasePlatformAdapter.prepare_tts_text()` hook and a LiveKit-specific override to filter spoken TTS content.
- Extends configuration/CLI integration: new `livekit` optional dependency extra, setup-wizard auto-install support, status/config platform listing updates, and gateway adapter creation + auth env mapping.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| `toolsets.py` | Adds `hermes-livekit` toolset and includes it in `hermes-gateway`. |
| `tools/send_message_tool.py` | Explicitly blocks `send_message` routing to LiveKit with a clear error. |
| `pyproject.toml` | Adds `livekit` optional-deps group and includes it in `all`. |
| `hermes_cli/status.py` | Introduces shared `MESSAGING_PLATFORMS` registry and adds LiveKit to status display. |
| `hermes_cli/platforms.py` | Registers LiveKit in the CLI platform list with a default toolset. |
| `hermes_cli/gateway.py` | Adds LiveKit to setup wizard, adds extras keys for auto-install, and masks password fields when echoing existing values. |
| `hermes_cli/config.py` | Uses `MESSAGING_PLATFORMS` to print configured messaging platforms. |
| `gateway/run.py` | Creates LiveKit adapter and adds LiveKit to authorization env maps. |
| `gateway/platforms/livekit.py` | New LiveKit adapter implementing voice I/O, STT, TTS, data-channel messaging, and presence-aware join/leave. |
| `gateway/platforms/base.py` | Adds `prepare_tts_text()` hook and routes auto-TTS through it. |
| `gateway/config.py` | Adds `Platform.LIVEKIT`, adds connected-checker entry, and loads LiveKit config from env into gateway config (including home channel defaulting). |
| `agent/prompt_builder.py` | Adds LiveKit-specific prompt hint for voice-first, concise responses. |
    "Email": ("EMAIL_ADDRESS", "EMAIL_HOME_ADDRESS"),
    "SMS": ("TWILIO_ACCOUNT_SID", "SMS_HOME_CHANNEL"),
    "Mattermost": ("MATTERMOST_URL", None),
    "Matrix": ("MATRIX_HOMESERVER_URL", None),

platforms = {
    "Telegram": ("TELEGRAM_BOT_TOKEN", "TELEGRAM_HOME_CHANNEL"),
    "Discord": ("DISCORD_BOT_TOKEN", "DISCORD_HOME_CHANNEL"),
    "WhatsApp": ("WHATSAPP_ENABLED", None),
    "Signal": ("SIGNAL_HTTP_URL", "SIGNAL_HOME_CHANNEL"),
    "Slack": ("SLACK_BOT_TOKEN", None),
    "Email": ("EMAIL_ADDRESS", "EMAIL_HOME_ADDRESS"),
    "SMS": ("TWILIO_ACCOUNT_SID", "SMS_HOME_CHANNEL"),
    "DingTalk": ("DINGTALK_CLIENT_ID", None),
    "Feishu": ("FEISHU_APP_ID", "FEISHU_HOME_CHANNEL"),
    "WeCom": ("WECOM_BOT_ID", "WECOM_HOME_CHANNEL"),
    "WeCom Callback": ("WECOM_CALLBACK_CORP_ID", None),
    "Weixin": ("WEIXIN_ACCOUNT_ID", "WEIXIN_HOME_CHANNEL"),
    "BlueBubbles": ("BLUEBUBBLES_SERVER_URL", "BLUEBUBBLES_HOME_CHANNEL"),
    "QQBot": ("QQ_APP_ID", "QQ_HOME_CHANNEL"),
    "Yuanbao": ("YUANBAO_APP_ID", "YUANBAO_HOME_CHANNEL"),
    "LiveKit": ("LIVEKIT_URL", None),
}

for name, (token_var, _home_var) in MESSAGING_PLATFORMS.items():
    configured = bool(get_env_value(token_var))

    (cfg.extra.get("client_id") or os.getenv("DINGTALK_CLIENT_ID"))
    and (cfg.extra.get("client_secret") or os.getenv("DINGTALK_CLIENT_SECRET"))
),
Platform.LIVEKIT: lambda cfg: bool(cfg.extra.get("url")),

from gateway.platforms.livekit import LiveKitAdapter, check_livekit_requirements
if not check_livekit_requirements():
    logger.warning("LiveKit: livekit SDK not installed or LIVEKIT_URL/API_KEY/API_SECRET not set")

# Initialize buffer for this participant
self._audio_buffers[identity] = bytearray()
self._last_audio_time[identity] = time.monotonic()

silence_bytes = int(SILENCE_THRESHOLD_SECONDS * SAMPLE_RATE * NUM_CHANNELS * 2)
speech_end = max(0, len(buf) - silence_bytes)

config.platforms[Platform.LIVEKIT].extra.update({
    "url": livekit_url,
    "api_key": livekit_api_key,
    "api_secret": livekit_api_secret,
    "room": livekit_room,
    "agent_name": os.getenv("LIVEKIT_AGENT_NAME", "Hermes"),
    "agent_avatar": os.getenv("LIVEKIT_AGENT_AVATAR", ""),
})
Closing this in favor of the […] The plugin packages the same LiveKit adapter as a pip-installable […]
Why we're going plugin-only: […]
The two genuine adapter bugs Copilot flagged (eager […]
Thanks for the reviews; they carried over usefully.
…ce (0.3.1)
Two voice-path bugs surfaced by review feedback on the now-closed core PR NousResearch/hermes-agent#3894:
- _on_track_subscribed seeded _last_audio_time on subscribe, defeating the _check_silence_loop guard that treats a missing entry as "never spoke" and discards accumulated noise. A participant publishing only silence would accrue a stale timestamp and eventually trip STT on silence. The timestamp is now set only on the first chunk above the RMS floor.
- The track-end utterance flush in _cleanup_participant computed speech_end = max(0, len(buf) - silence_bytes), unconditionally trimming a fixed silence window. When a track ends right after a word with no trailing silence buffered, that chops real speech or zeroes the flush. On track end the flush now transcribes the whole buffer: trailing silence handed to STT is harmless, lost words are not.
Also corrects a stale prepare_tts_text docstring: upstream BasePlatformAdapter now calls the hook (landed via NousResearch/hermes-agent#27308), so the override is live.
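Both fixes can be illustrated with a little arithmetic. This is a sketch only: the constant values (16 kHz mono, 16-bit samples, a 1.5s silence threshold, an RMS floor of 300) and all helper names are assumptions, since the source names the constants but does not give their values.

```python
import math
from array import array

SAMPLE_RATE = 16000              # assumed
NUM_CHANNELS = 1                 # assumed
SILENCE_THRESHOLD_SECONDS = 1.5  # assumed
RMS_FLOOR = 300.0                # assumed


def rms(chunk: bytes) -> float:
    """Root-mean-square level of a chunk of 16-bit native-endian PCM."""
    samples = array("h")
    samples.frombytes(chunk)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def note_audio(last_audio_time: dict, identity: str, chunk: bytes, now: float) -> None:
    """Record a timestamp only for chunks above the RMS floor.

    A participant who publishes nothing but silence never gets an entry,
    so a "missing entry means never spoke" guard keeps working.
    """
    if rms(chunk) >= RMS_FLOOR:
        last_audio_time[identity] = now


def old_flush_end(buf_len: int) -> int:
    """Previous behaviour: always trim a fixed silence window off the end."""
    silence_bytes = int(SILENCE_THRESHOLD_SECONDS * SAMPLE_RATE * NUM_CHANNELS * 2)
    return max(0, buf_len - silence_bytes)


def new_flush_end(buf_len: int) -> int:
    """On track end, transcribe the whole buffer."""
    return buf_len
```

Under these assumed values the silence window is 48000 bytes, so a one-second utterance (32000 bytes) that ends exactly when the track does would be trimmed to nothing by the old rule and kept whole by the new one.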
Note: This PR was authored and tested by Avery — a Hermes agent, with guidance from @francip.
What Changed
Added LiveKit as a new gateway platform adapter, enabling real-time voice conversations with Hermes agents via WebRTC. Users can talk to their agent through any LiveKit-compatible client (browser, mobile, CLI).
New file:
`gateway/platforms/livekit.py` (~600 lines)
Modified files:
`pyproject.toml`, `gateway/config.py`, `gateway/run.py`, `gateway/platforms/base.py`, `gateway/channel_directory.py`, `toolsets.py`, `agent/prompt_builder.py`, `cron/scheduler.py`, `tools/send_message_tool.py`, `tools/cronjob_tools.py`, `hermes_cli/gateway.py`, `hermes_cli/status.py`, `hermes_cli/config.py`, `hermes_cli/tools_config.py`, `hermes_cli/skills_config.py`
Changes
- … (`gateway/platforms/livekit.py`): Full platform adapter with WebRTC audio capture, silence detection, STT transcription via hermes's existing pipeline (faster-whisper/Groq/OpenAI), TTS playback back to the room, data channel for text responses, reconnection with exponential backoff, and agent name resolution via LLM fallback
- … `livekit` extras group in `pyproject.toml`, only installed if the user enables LiveKit
- … `hermes gateway setup`, the wizard now offers to install the required Python packages automatically. This is a generic mechanism: any platform with an `"extras"` key in its `_PLATFORMS` entry gets the prompt
- … `prepare_tts_text()` method to `BasePlatformAdapter` so subclasses can filter text before TTS generation. LiveKit adapter strips code blocks, URLs, file paths, and MEDIA tags; full text goes to data channel, only conversational content is spoken
- … `MESSAGING_PLATFORMS` dict from `hermes_cli/status.py` so both `hermes status` and `hermes config` use the same list
- … `************` instead of raw values for password fields when displaying current configuration
- … `gateway/platforms/ADDING_A_PLATFORM.md` covered: Platform enum, env var loading, adapter factory, authorization maps, platform hints, toolsets, cron delivery, send_message routing, channel directory, status display, setup wizard, tools_config, skills_config, cronjob schema
Why
Voice is a natural interface for AI agents. LiveKit provides an open-source WebRTC SFU that can be self-hosted on a $5 VPS or used via LiveKit Cloud, making it accessible to all Hermes users. The adapter reuses hermes's existing STT/TTS infrastructure (no new audio dependencies beyond the LiveKit SDK), keeping the implementation minimal and consistent with the Discord voice pipeline.
Configuration
Three environment variables:
    LIVEKIT_URL=wss://your-project.livekit.cloud  # or ws://your-server:7880
    LIVEKIT_API_KEY=your-api-key
    LIVEKIT_API_SECRET=your-api-secret
Optional:
Setup wizard
`hermes gateway setup` → select LiveKit → enter credentials → auto-install dependencies:
Working voice conversation
Tested with LiveKit Meet (browser client) connecting to a self-hosted LiveKit server
LiveKit Cloud
Also tested with LiveKit Cloud — same adapter, just a different URL:
How to Test
Install and configure:
Test voice:
- … `lk room join --open meet`)
- … `lk room join --publish-microphone --url <url> --api-key <key> --api-secret <secret> <room>`)
Test setup wizard:
    hermes gateway setup  # Select LiveKit, enter credentials, confirm auto-install
Verify status:
Validation
Automated:
    source venv/bin/activate
    python -m pytest tests/hermes_cli/test_tools_config.py::TestPlatformToolsetConsistency -v
Result:
`3 passed` (platform consistency tests cover LiveKit in toolsets, tools_config, and skills_config)
Full test suite: 7062 passed, 17 failed (all 17 failures are pre-existing, unrelated to this PR)
Known Limitations (v1)
- … `failed to initialize pc`). Connecting to a remote Linux LiveKit server from macOS works fine. This is an upstream issue in `livekit/python-sdks`.
Platforms Tested
lk room join)