feat(gateway): add LiveKit WebRTC voice platform support #3894

Closed
francip wants to merge 9 commits into NousResearch:main from kortexa-ai:kortexa/gateway-livekit

Conversation

@francip francip commented Mar 30, 2026

Contributor

⚠️ Superseded — closed. This work now ships as the standalone hermes-livekit plugin (pip-installable, no core patches required, and already ahead with a v0.3.0 remote-tools protocol). See the closing comment below for the rationale.

Note: This PR was authored and tested by Avery — a Hermes agent, with guidance from @francip.


What Changed

Added LiveKit as a new gateway platform adapter, enabling real-time voice conversations with Hermes agents via WebRTC. Users can talk to their agent through any LiveKit-compatible client (browser, mobile, CLI).

New file: gateway/platforms/livekit.py (~600 lines)

Modified files: pyproject.toml, gateway/config.py, gateway/run.py, gateway/platforms/base.py, gateway/channel_directory.py, toolsets.py, agent/prompt_builder.py, cron/scheduler.py, tools/send_message_tool.py, tools/cronjob_tools.py, hermes_cli/gateway.py, hermes_cli/status.py, hermes_cli/config.py, hermes_cli/tools_config.py, hermes_cli/skills_config.py

Changes

  • LiveKit adapter (gateway/platforms/livekit.py): Full platform adapter with WebRTC audio capture, silence detection, STT transcription via hermes's existing pipeline (faster-whisper/Groq/OpenAI), TTS playback back to the room, data channel for text responses, reconnection with exponential backoff, and agent name resolution via LLM fallback
  • Optional dependency: livekit extras group in pyproject.toml — only installed if the user enables LiveKit
  • Auto-install in setup wizard: When configuring LiveKit (or Matrix, DingTalk, Feishu) via hermes gateway setup, the wizard now offers to install the required Python packages automatically. This is a generic mechanism — any platform with an "extras" key in its _PLATFORMS entry gets the prompt
  • TTS filtering hook: Added prepare_tts_text() method to BasePlatformAdapter so subclasses can filter text before TTS generation. LiveKit adapter strips code blocks, URLs, file paths, and MEDIA tags — full text goes to data channel, only conversational content is spoken
  • Shared platform registry: Extracted MESSAGING_PLATFORMS dict from hermes_cli/status.py so both hermes status and hermes config use the same list
  • Secret masking in setup: The setup wizard now shows ************ instead of raw values for password fields when displaying current configuration
  • All 16 integration points from gateway/platforms/ADDING_A_PLATFORM.md covered: Platform enum, env var loading, adapter factory, authorization maps, platform hints, toolsets, cron delivery, send_message routing, channel directory, status display, setup wizard, tools_config, skills_config, cronjob schema
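The TTS filtering described in the list above can be sketched roughly as follows. This is a minimal illustration, not the adapter's code: the regular expressions and the name `strip_for_speech` are assumptions, since the PR text only names the categories that get removed (code blocks, URLs, file paths, MEDIA tags).

```python
import re

FENCE = "`" * 3  # markdown code fence, built this way to keep the sketch readable

# Hypothetical patterns: the adapter's real expressions are not shown in this PR.
_FENCED_CODE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
_MEDIA_TAG = re.compile(r"MEDIA:\S+")
_URL = re.compile(r"https?://\S+")
_FILE_PATH = re.compile(r"(?<![\w/])~?/(?:[\w.\-]+/)*[\w.\-]+")


def strip_for_speech(text: str) -> str:
    """Drop spans that read badly aloud; the caller keeps the full text for display."""
    # Order matters: MEDIA tags and URLs contain slashes, so remove them before paths.
    for pattern in (_FENCED_CODE, _MEDIA_TAG, _URL, _FILE_PATH):
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by the removed spans.
    return re.sub(r"\s+", " ", text).strip()
```

The unfiltered string would still be sent over the data channel; only the return value of a function like this would be handed to TTS.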

Why

Voice is a natural interface for AI agents. LiveKit provides an open-source WebRTC SFU that can be self-hosted on a $5 VPS or used via LiveKit Cloud, making it accessible to all Hermes users. The adapter reuses hermes's existing STT/TTS infrastructure (no new audio dependencies beyond the LiveKit SDK), keeping the implementation minimal and consistent with the Discord voice pipeline.

Configuration

Three environment variables:

LIVEKIT_URL=wss://your-project.livekit.cloud   # or ws://your-server:7880
LIVEKIT_API_KEY=your-api-key
LIVEKIT_API_SECRET=your-api-secret

Optional:

LIVEKIT_ROOM=hermes              # Room name (default: "hermes")
LIVEKIT_AGENT_NAME=Avery         # Display name (default: asks the LLM, falls back to "Hermes")
LIVEKIT_AGENT_AVATAR=https://... # Avatar URL for LiveKit clients
LIVEKIT_ALLOW_ALL_USERS=true     # Authorization
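A rough sketch of how these variables could be read, with the three required values gating everything else. The `LiveKitSettings` dataclass, the `load_livekit_settings` helper, and the set of strings accepted as "true" are assumptions for illustration; the PR's actual loader lives in gateway/config.py and is not reproduced here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LiveKitSettings:  # hypothetical container, not a class from the PR
    url: str
    api_key: str
    api_secret: str
    room: str = "hermes"
    agent_name: Optional[str] = None  # None means "let the adapter resolve a name"
    agent_avatar: str = ""
    allow_all_users: bool = False


def load_livekit_settings(env: Mapping[str, str] = os.environ) -> Optional[LiveKitSettings]:
    """Return settings, or None when any of the three required variables is missing."""
    required = [env.get(k, "").strip()
                for k in ("LIVEKIT_URL", "LIVEKIT_API_KEY", "LIVEKIT_API_SECRET")]
    if not all(required):
        return None
    url, key, secret = required
    return LiveKitSettings(
        url=url,
        api_key=key,
        api_secret=secret,
        room=env.get("LIVEKIT_ROOM") or "hermes",
        agent_name=env.get("LIVEKIT_AGENT_NAME") or None,
        agent_avatar=env.get("LIVEKIT_AGENT_AVATAR", ""),
        # Assumption: which spellings count as true is not stated in the PR.
        allow_all_users=env.get("LIVEKIT_ALLOW_ALL_USERS", "").lower() in ("1", "true", "yes"),
    )
```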

Setup wizard

hermes gateway setup → select LiveKit → enter credentials → auto-install dependencies:

[Screenshot: setup wizard, 2026-03-29 at 21:34:30]

Working voice conversation

Tested with LiveKit Meet (browser client) connecting to a self-hosted LiveKit server

LiveKit Cloud

Also tested with LiveKit Cloud — same adapter, just a different URL:

[Screenshots: LiveKit Cloud session, 2026-03-29 at 23:22:05 and 23:20:52]

How to Test

Install and configure:

# Install LiveKit dependencies
pip install 'hermes-agent[livekit]'

# Configure (or use hermes gateway setup)
echo 'LIVEKIT_URL=wss://your-server' >> ~/.hermes/.env
echo 'LIVEKIT_API_KEY=your-key' >> ~/.hermes/.env
echo 'LIVEKIT_API_SECRET=your-secret' >> ~/.hermes/.env

# Start gateway
hermes gateway start

Test voice:

  1. Connect to the room using any LiveKit client:
    • Browser: LiveKit Meet (lk room join --open meet)
    • CLI: lk room join --publish-microphone --url <url> --api-key <key> --api-secret <secret> <room>
  2. Speak — the agent should transcribe, process, and respond via voice
  3. Text responses also appear on the data channel

Test setup wizard:

hermes gateway setup
# Select LiveKit, enter credentials, confirm auto-install

Verify status:

hermes status    # Shows LiveKit as configured
hermes config    # Shows LiveKit in platform list

Validation

Automated:

source venv/bin/activate
python -m pytest tests/hermes_cli/test_tools_config.py::TestPlatformToolsetConsistency -v

Result: 3 passed (platform consistency tests cover LiveKit in toolsets, tools_config, and skills_config)

Full test suite: 7062 passed, 17 failed (all 17 failures are pre-existing, unrelated to this PR)

Known Limitations (v1)

  • macOS native LiveKit server: The Python SDK's bundled WebRTC binary has a known bug on macOS 26 Tahoe (failed to initialize pc). Connecting to a remote Linux LiveKit server from macOS works fine. This is an upstream issue in livekit/python-sdks.
  • Voice only: Text chat via data channel is best-effort (send only). Full bidirectional text is v2.
  • Single room: The adapter joins one room. Multi-room support is v2.
  • No avatar display: LiveKit Meet doesn't render participant avatars. The metadata is set for future custom clients.
  • Background noise sensitivity: The silence detection uses simple RMS energy thresholds. WebRTC VAD (Silero/livekit-agents) would improve accuracy — v2.
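For readers unfamiliar with RMS-based silence detection, the idea behind the last limitation can be sketched as below. The threshold value and the function names are assumptions, and the sketch reads 16-bit PCM in the machine's native byte order; the adapter's real constants are not given in this PR.

```python
import array
import math

RMS_SPEECH_FLOOR = 500.0  # hypothetical threshold on the int16 amplitude scale


def rms_int16(pcm: bytes) -> float:
    """Root-mean-square amplitude of 16-bit PCM (native byte order)."""
    samples = array.array("h")
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])  # ignore a trailing odd byte
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(pcm: bytes, floor: float = RMS_SPEECH_FLOOR) -> bool:
    """Treat a chunk as speech when its energy reaches the floor."""
    return rms_int16(pcm) >= floor
```

A fixed energy floor like this cannot tell a voice from a loud fan, which is why the limitation points at a model-based VAD as the v2 improvement.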

Platforms Tested

  • ✅ macOS (snappy) → self-hosted LiveKit on Linux (smarty) via LAN
  • ✅ macOS (snappy) → LiveKit Cloud
  • ✅ Browser client (LiveKit Meet) with microphone
  • ✅ CLI client (lk room join)

@francip francip force-pushed the kortexa/gateway-livekit branch 3 times, most recently from 5dc2025 to d31de19 Compare April 3, 2026 16:40
francip commented Apr 3, 2026

Contributor Author

Looking at the issues.

@francip francip force-pushed the kortexa/gateway-livekit branch 5 times, most recently from b918889 to 3014954 Compare April 9, 2026 00:18
@francip francip force-pushed the kortexa/gateway-livekit branch 2 times, most recently from 695b6a3 to ac7d77b Compare April 15, 2026 06:45
@francip francip force-pushed the kortexa/gateway-livekit branch 4 times, most recently from a41a431 to dc9c510 Compare April 21, 2026 17:26
@alt-glitch alt-glitch added type/feature New feature or request comp/gateway Gateway runner, session dispatch, delivery tool/tts Text-to-speech and transcription labels Apr 21, 2026
@francip francip force-pushed the kortexa/gateway-livekit branch 3 times, most recently from 62e1e40 to ba9cd70 Compare April 28, 2026 14:31
@francip francip force-pushed the kortexa/gateway-livekit branch from cc25732 to 03b1585 Compare April 30, 2026 18:55
@francip francip force-pushed the kortexa/gateway-livekit branch 2 times, most recently from 1a4ddb4 to c903e7b Compare May 11, 2026 04:46
francip and others added 6 commits May 14, 2026 20:15
Add LiveKit as a new gateway platform, enabling real-time voice
conversations with Hermes agents via WebRTC. Supports both
self-hosted LiveKit servers and LiveKit Cloud.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Desktop voice-agent clients (and any compatible UI) expect JSON-encoded
agent:* events on the LiveKit data channel to drive UI state: listening
indicator, live user transcript, thinking/speaking indicators, and
conversation log updates.

Emit seven events at the appropriate lifecycle points:
- agent:listening-start  — VAD detects speech onset (revert if too short)
- agent:listening-stop   — silence threshold reached or false alarm
- agent:user-transcript  — final ASR result (with transcript + identity)
- agent:thinking-start   — just before LLM invocation
- agent:speaking-start   — first TTS frame published
- agent:speaking-stop    — playback finished (or errored)
- agent:agent-transcript — mirror of every send() payload as the
                           assistant's final text, so clients can render
                           the conversation log

Events go out on the default (unnamed) data topic so desktop's topic
router (topic == "hermes-chat" -> plain text, else -> JSON event)
decodes them correctly; "hermes-chat" is still used for the raw text
mirror from send().

Event publishes are wrapped in try/except + logger.debug — UI telemetry
must never break the voice flow.
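The "never break the voice flow" rule in the commit message above might look something like this in practice. This is a sketch under assumptions: the payload shape (a `type` key plus free-form fields) is not specified in the commit message, and `publish` is a plain synchronous callable standing in for whatever the adapter uses to write to the data channel.

```python
import json
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def encode_event(event_type: str, **fields) -> bytes:
    """Serialize one agent:* event as UTF-8 JSON (assumed payload shape)."""
    return json.dumps({"type": event_type, **fields}).encode("utf-8")


def emit_event(publish: Callable[[bytes], None], event_type: str, **fields) -> bool:
    """Publish an event; swallow and log any failure so audio handling continues."""
    try:
        publish(encode_event(event_type, **fields))
        return True
    except Exception as exc:
        logger.debug("event %s not delivered: %s", event_type, exc)
        return False
```

Returning a boolean instead of raising keeps the caller's control flow identical whether or not a UI client is listening.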
Three related optimizations so the LiveKit adapter stops holding a
participant slot when nobody's there to talk to:

1. Presence polling (lazy join):
   - connect() now calls RoomService.list_participants() via livekit-api.
     If the room is empty, the adapter does not join — instead it starts
     _presence_watch_loop, a 30s poll that joins as soon as a remote
     participant appears.
   - Adapter still reports connected to the gateway so send/receive code
     paths work when the room is occupied.

2. Auto-leave when alone:
   - _on_participant_disconnected checks self._room.remote_participants.
     If it's empty, _leave_and_watch tears down the room connection,
     cancels the silence task and audio streams, and re-arms the
     presence watcher.
   - Added a _graceful_leave flag so that room.disconnect() triggered
     by leaving intentionally does not also kick off _reconnect_loop.

3. Reconnect retries capped at MAX_RECONNECT_ATTEMPTS (10):
   - Previously _reconnect_loop retried forever with a 60s ceiling.
     A misconfigured LIVEKIT_URL would spam retries until the process
     was restarted. Now we log an error and go idle after the cap.
   - _reconnect_loop now calls _join_room (the pure room-join body)
     rather than the public connect() entry, so a reconnect doesn't
     re-trigger presence polling.

Also: silence-detection loop now uses a 2s interval when no participants
are buffered (was always 200ms), saving a tiny bit of CPU during idle
windows and while presence polling is waiting for someone to arrive.

The old connect() body was extracted into _join_room() — same logic,
different name — so presence polling and reconnect can reuse it.
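The capped reconnect described in point 3 above could be sketched as follows. The cap of 10 and the 60s ceiling come from the commit message; the 1s starting delay, the doubling schedule, and the function names are assumptions, and `join_room` is any awaitable-returning callable standing in for the adapter's `_join_room`.

```python
import asyncio

MAX_RECONNECT_ATTEMPTS = 10   # from the commit message
MAX_BACKOFF_SECONDS = 60.0    # the 60s ceiling mentioned in the commit message
BASE_BACKOFF_SECONDS = 1.0    # assumption: the starting delay is not stated


def backoff_delay(attempt: int) -> float:
    """Delay after failed attempt number `attempt` (1-based), doubling up to the ceiling."""
    return min(BASE_BACKOFF_SECONDS * 2 ** (attempt - 1), MAX_BACKOFF_SECONDS)


async def reconnect(join_room, sleep=asyncio.sleep) -> bool:
    """Try join_room() up to the cap; True on success, False when giving up."""
    for attempt in range(1, MAX_RECONNECT_ATTEMPTS + 1):
        try:
            await join_room()
            return True
        except Exception:
            # No point sleeping after the final failed attempt.
            if attempt < MAX_RECONNECT_ATTEMPTS:
                await sleep(backoff_delay(attempt))
    return False
```

With these assumed constants the delays run 1, 2, 4, 8, 16, 32 and then stay at 60 seconds, so a misconfigured URL gives up after a few minutes instead of retrying indefinitely.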
…, env override

LiveKit Cloud has real rate limits and per-minute billing, so 30s between
presence checks is a sensible default. Self-hosted LiveKit has neither
constraint, so 5s keeps the first-speaker wait short.

Detection is by URL: hostname containing ".livekit.cloud" picks the
cloud cadence, anything else picks the local cadence.

LIVEKIT_PRESENCE_POLL_INTERVAL (seconds, float) overrides either default
for users with unusual deployments. The setup wizard now prompts for it
with an "empty = auto" hint.

Resolved once at adapter construction and logged so the operator can
see which cadence is in effect and why.
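The resolution rule in this commit message is compact enough to sketch directly. The 30s and 5s defaults, the `.livekit.cloud` hostname check, and the `LIVEKIT_PRESENCE_POLL_INTERVAL` variable come from the text above; the function name and the handling of an invalid or non-positive override are assumptions.

```python
import os
from typing import Mapping
from urllib.parse import urlparse

CLOUD_POLL_SECONDS = 30.0  # LiveKit Cloud: rate limits and per-minute billing
LOCAL_POLL_SECONDS = 5.0   # self-hosted: keep the first-speaker wait short


def resolve_presence_poll_interval(url: str, env: Mapping[str, str] = os.environ) -> float:
    """Pick the presence-poll cadence: explicit override first, else by hostname."""
    override = env.get("LIVEKIT_PRESENCE_POLL_INTERVAL", "").strip()
    if override:
        try:
            value = float(override)
            if value > 0:
                return value
        except ValueError:
            pass  # assumption: an unparsable override falls back to auto-detection
    host = urlparse(url).hostname or ""
    return CLOUD_POLL_SECONDS if ".livekit.cloud" in host else LOCAL_POLL_SECONDS
```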
The earlier commit that added _resolve_presence_poll_interval inserted
the new method's body between __init__'s presence_poll_interval call and
the trailing init lines (_audio_buffers, _last_audio_time, _audio_streams,
_paused, _speaking_participants). The trailing lines ended up *after*
the method's `return interval` and so were never executed.

Symptom: every participant_disconnected event raised
  AttributeError: 'LiveKitAdapter' object has no attribute '_audio_streams'
inside _cleanup_participant. Because the AttributeError aborted the
event handler before the auto-leave check, the room never tore down
when the last human left.

Move the orphaned init lines back into __init__ where they belong, so
the dicts and flags exist on the instance.
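A stripped-down reproduction of this class of bug may help. The class names here are hypothetical; only `_audio_streams` is taken from the commit message, and the real adapter has more attributes than shown.

```python
class BrokenAdapter:
    def __init__(self):
        self.poll_interval = self._resolve_poll_interval()

    def _resolve_poll_interval(self):
        interval = 5.0
        return interval
        # Unreachable: these lines were meant to be the tail of __init__.
        self._audio_streams = {}


class FixedAdapter:
    def __init__(self):
        self.poll_interval = self._resolve_poll_interval()
        self._audio_streams = {}  # back in __init__, so it exists on every instance

    def _resolve_poll_interval(self):
        return 5.0
```

Nothing fails at construction time in the broken version, which is why the error only surfaced later, when an event handler first touched the missing attribute.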
_process_voice_input was importing a function that doesn't exist in
tools.transcription_tools, which raised ImportError on every utterance
and silently aborted the voice pipeline (no transcript, no LLM call,
no TTS reply).

transcribe_audio already resolves the model from stt config internally
when called with no model arg — that's the pattern gateway/run.py and
gateway/platforms/discord.py use. Drop the bogus import + helper call
and let transcribe_audio do its thing.
francip added 2 commits May 14, 2026 20:15
The LiveKit adapter only ever joins one room (the configured
LIVEKIT_ROOM), so the "home channel" is unambiguous: it's the room.
Unlike Discord/Telegram/Slack where a bot lives in many channels and
the user has to pick which is "home", LiveKit's single-room architecture
makes that choice trivially singular.

Previously LIVEKIT_HOME_CHANNEL being unset caused the gateway's
first-message onboarding gate to fire, asking the voice user to type
/sethome — which a voice-only user can't meaningfully do. It also left
cron/cross-platform delivery without a default destination even though
one was obviously available (the room).

Default LIVEKIT_HOME_CHANNEL to LIVEKIT_ROOM when unset. Set both the
env var (for the onboarding-gate's os.getenv check) and the
PlatformConfig.home_channel (for runtime delivery resolution). Still
honors an explicit LIVEKIT_HOME_CHANNEL override.
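The defaulting rule can be sketched as a small helper. The function name is an assumption, and the sketch only covers the environment-variable half; setting `PlatformConfig.home_channel` is left out because its construction is not shown in this PR.

```python
from typing import MutableMapping


def resolve_home_channel(env: MutableMapping[str, str]) -> str:
    """Default LIVEKIT_HOME_CHANNEL to the room, honoring an explicit override."""
    room = env.get("LIVEKIT_ROOM") or "hermes"
    home = env.get("LIVEKIT_HOME_CHANNEL") or room
    # Write the value back so later os.getenv-style checks see a home channel.
    env["LIVEKIT_HOME_CHANNEL"] = home
    return home
```

Passing `os.environ` as `env` would mirror the commit's description of setting the variable for the onboarding gate.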
Two bugs blocked end-to-end speech on a freshly-joined LiveKit room:

1. _cleanup_participant dropped the audio buffer the instant the track
   unsubscribed — fine for permanent leaves, but it also fires when a
   participant's mic transiently drops or, for file-based publishers,
   when the clip ends. VAD never reached its silence-threshold trigger
   on the unfinished utterance, so the user's last words were lost.
   Flush the pending utterance (if it meets MIN_SPEECH_DURATION after
   trailing-silence trim) through _process_voice_input before tearing
   down state, so the words still reach STT.

2. voice.auto_tts is False by default — correct for text platforms
   like Discord/Slack where TTS is opt-in, wrong for LiveKit where
   the channel itself is audio. A typed-only reply gives the LiveKit
   user nothing. Override _should_auto_tts_for_chat in LiveKitAdapter
   so it defaults to True; per-chat /voice off via
   _auto_tts_disabled_chats still wins.

End-to-end test: probe joins the room, publishes a 5s utterance,
hears Avery reply with cloud OpenAI TTS through the LiveKit audio
track. Confirmed listening-start, listening-stop, user-transcript,
thinking-start, speaking-start, speaking-stop all fire in order with
real audio frames published back.
@francip francip force-pushed the kortexa/gateway-livekit branch from c903e7b to 3ffc087 Compare May 15, 2026 03:16
@francip francip requested a review from a team May 15, 2026 03:16
Match the 2026-05-12 supply-chain hardening policy on main: replace
``livekit>=1.0.17,<2`` / ``livekit-api>=1.0.7,<2`` with exact pins so
PyPI cannot ship a new release into a user's install without an
intentional bump here.

Versions:
  livekit==1.1.7      released 2026-04-27, no yanks, well past any
                      fresh-release risk window (skipped 1.1.8 which
                      shipped 2026-05-13 — too recent given the
                      Mini Shai-Hulud quarantine policy).
  livekit-api==1.1.0  released 2025-12-02, six months stable. Skipped
                      1.0.8 (yanked: wrong dependencies).
teknium1 pushed a commit that referenced this pull request May 17, 2026
…) hook

Refactor the inlined `re.sub(...)[:4000].strip()` cleanup at the
auto-TTS site in `_process_message_background` into an overridable
method `BasePlatformAdapter.prepare_tts_text(text: str) -> str`.

The default implementation is byte-identical to the previous inline
expression — strip `* _ \` # [ ] ( )` and truncate to 4000 chars — so
every existing adapter (Telegram, Discord, Slack, Matrix, IRC, etc.)
gets exactly the same behaviour as before. Zero behaviour change for
any consumer that doesn't override the method.
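As an illustration of the hook shape, a minimal sketch follows. The class names are stand-ins rather than the real `BasePlatformAdapter`, and the character class is an assumed reading of the sigil list above; the commit only states that the default strips those characters, truncates to 4000 characters, and trims.

```python
import re

MAX_TTS_CHARS = 4000
# Assumed pattern for the sigils listed above: * _ ` # [ ] ( )
_MARKDOWN_SIGILS = re.compile(r"[*_`#\[\]()]")


class BaseAdapterSketch:
    def prepare_tts_text(self, text: str) -> str:
        """Default: strip markdown sigils, then truncate, then trim whitespace."""
        return _MARKDOWN_SIGILS.sub("", text)[:MAX_TTS_CHARS].strip()


class VoiceAdapterSketch(BaseAdapterSketch):
    def prepare_tts_text(self, text: str) -> str:
        # Hypothetical stricter override: drop URLs, then reuse the default cleanup.
        text = re.sub(r"https?://\S+", "", text)
        return super().prepare_tts_text(text)
```

Adapters that do not override the method keep the default path, which is the "zero behaviour change" property the commit message describes.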

Why add the hook: voice-first platform adapters need stricter
cleanup than text-bubble platforms. The default strips a handful of
markdown sigils, which is fine when the output goes into a Discord
embed or a Telegram message bubble — but read aloud by a TTS engine,
URLs (`https://example.com/foo`), fenced code blocks, file paths
(`/Users/x/foo.py`), and `MEDIA:` tags turn into long sequences of
unintelligible characters. With this hook an adapter can drop those
spans before TTS while leaving the data-channel transcript intact
for visual rendering.

Without the hook, voice adapters have to either
  - duplicate the auto-TTS flow inside their own `handle_response`
    pipeline, which means re-implementing the entire `extract_media`,
    `extract_images`, `extract_local_files`, attachment routing and
    error-handling sequence in `_process_message_background`, or
  - live with TTS speaking URLs character-by-character.

Both are worse than a 7-line method addition.

Example consumer:
  https://github.com/kortexa-ai/hermes-livekit — LiveKit WebRTC voice
  gateway plugin. Its `LiveKitAdapter.prepare_tts_text()` additionally
  strips fenced code blocks, inline code, URLs, file paths, and
  `MEDIA:` tags before TTS synthesis, while the full response still
  reaches connected clients via the data channel. Drop-in installable
  via `pip install git+https://github.com/kortexa-ai/hermes-livekit.git`.

Carved out of #3894 (LiveKit WebRTC gateway PR) so the generic hook
can land independently of the LiveKit platform itself.
@austinpickett austinpickett requested a review from Copilot May 19, 2026 12:25

@austinpickett austinpickett left a comment

Collaborator


Please fix merge conflicts, use .github/PULL_REQUEST_TEMPLATE.md, and resolve copilot comments.

Copilot AI left a comment

Contributor


Pull request overview

Adds a new LiveKit gateway platform adapter to enable real-time, voice-first Hermes conversations over WebRTC, and wires it into the gateway/CLI/tooling so it can be configured, authorized, and used alongside existing messaging platforms.

Changes:

  • Introduces gateway/platforms/livekit.py implementing LiveKit room join/presence polling, inbound audio buffering → STT → agent loop, and outbound TTS playback + data-channel text.
  • Adds a BasePlatformAdapter.prepare_tts_text() hook and a LiveKit-specific override to filter spoken TTS content.
  • Extends configuration/CLI integration: new livekit optional dependency extra, setup-wizard auto-install support, status/config platform listing updates, and gateway adapter creation + auth env mapping.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| toolsets.py | Adds hermes-livekit toolset and includes it in hermes-gateway. |
| tools/send_message_tool.py | Explicitly blocks send_message routing to LiveKit with a clear error. |
| pyproject.toml | Adds livekit optional-deps group and includes it in all. |
| hermes_cli/status.py | Introduces shared MESSAGING_PLATFORMS registry and adds LiveKit to status display. |
| hermes_cli/platforms.py | Registers LiveKit in the CLI platform list with a default toolset. |
| hermes_cli/gateway.py | Adds LiveKit to setup wizard, adds extras keys for auto-install, and masks password fields when echoing existing values. |
| hermes_cli/config.py | Uses MESSAGING_PLATFORMS to print configured messaging platforms. |
| gateway/run.py | Creates LiveKit adapter and adds LiveKit to authorization env maps. |
| gateway/platforms/livekit.py | New LiveKit adapter implementing voice I/O, STT, TTS, data-channel messaging, and presence-aware join/leave. |
| gateway/platforms/base.py | Adds prepare_tts_text() hook and routes auto-TTS through it. |
| gateway/config.py | Adds Platform.LIVEKIT, adds connected-checker entry, and loads LiveKit config from env into gateway config (including home channel defaulting). |
| agent/prompt_builder.py | Adds LiveKit-specific prompt hint for voice-first, concise responses. |

Comment thread hermes_cli/status.py
"Email": ("EMAIL_ADDRESS", "EMAIL_HOME_ADDRESS"),
"SMS": ("TWILIO_ACCOUNT_SID", "SMS_HOME_CHANNEL"),
"Mattermost": ("MATTERMOST_URL", None),
"Matrix": ("MATRIX_HOMESERVER_URL", None),
Comment thread hermes_cli/status.py
Comment on lines 410 to 427
platforms = {
    "Telegram": ("TELEGRAM_BOT_TOKEN", "TELEGRAM_HOME_CHANNEL"),
    "Discord": ("DISCORD_BOT_TOKEN", "DISCORD_HOME_CHANNEL"),
    "WhatsApp": ("WHATSAPP_ENABLED", None),
    "Signal": ("SIGNAL_HTTP_URL", "SIGNAL_HOME_CHANNEL"),
    "Slack": ("SLACK_BOT_TOKEN", None),
    "Email": ("EMAIL_ADDRESS", "EMAIL_HOME_ADDRESS"),
    "SMS": ("TWILIO_ACCOUNT_SID", "SMS_HOME_CHANNEL"),
    "DingTalk": ("DINGTALK_CLIENT_ID", None),
    "Feishu": ("FEISHU_APP_ID", "FEISHU_HOME_CHANNEL"),
    "WeCom": ("WECOM_BOT_ID", "WECOM_HOME_CHANNEL"),
    "WeCom Callback": ("WECOM_CALLBACK_CORP_ID", None),
    "Weixin": ("WEIXIN_ACCOUNT_ID", "WEIXIN_HOME_CHANNEL"),
    "BlueBubbles": ("BLUEBUBBLES_SERVER_URL", "BLUEBUBBLES_HOME_CHANNEL"),
    "QQBot": ("QQ_APP_ID", "QQ_HOME_CHANNEL"),
    "Yuanbao": ("YUANBAO_APP_ID", "YUANBAO_HOME_CHANNEL"),
    "LiveKit": ("LIVEKIT_URL", None),
}
Comment thread hermes_cli/config.py
Comment on lines +4908 to +4909
for name, (token_var, _home_var) in MESSAGING_PLATFORMS.items():
    configured = bool(get_env_value(token_var))
Comment thread gateway/config.py
(cfg.extra.get("client_id") or os.getenv("DINGTALK_CLIENT_ID"))
and (cfg.extra.get("client_secret") or os.getenv("DINGTALK_CLIENT_SECRET"))
),
Platform.LIVEKIT: lambda cfg: bool(cfg.extra.get("url")),
Comment thread gateway/run.py
Comment on lines +5381 to +5383
from gateway.platforms.livekit import LiveKitAdapter, check_livekit_requirements
if not check_livekit_requirements():
    logger.warning("LiveKit: livekit SDK not installed or LIVEKIT_URL/API_KEY/API_SECRET not set")
Comment on lines +430 to +432
# Initialize buffer for this participant
self._audio_buffers[identity] = bytearray()
self._last_audio_time[identity] = time.monotonic()
Comment on lines +589 to +590
silence_bytes = int(SILENCE_THRESHOLD_SECONDS * SAMPLE_RATE * NUM_CHANNELS * 2)
speech_end = max(0, len(buf) - silence_bytes)
Comment thread gateway/config.py
Comment on lines +1821 to +1828
config.platforms[Platform.LIVEKIT].extra.update({
    "url": livekit_url,
    "api_key": livekit_api_key,
    "api_secret": livekit_api_secret,
    "room": livekit_room,
    "agent_name": os.getenv("LIVEKIT_AGENT_NAME", "Hermes"),
    "agent_avatar": os.getenv("LIVEKIT_AGENT_AVATAR", ""),
})
francip commented May 21, 2026

Contributor Author

Closing this in favor of the hermes-livekit plugin: https://github.com/kortexa-ai/hermes-livekit.

The plugin packages the same LiveKit adapter as a pip-installable hermes_agent.plugins entry point, so it needs no core patches: it registers the platform via ctx.register_platform() instead of editing gateway/platforms/, hermes_cli/status.py, hermes_cli/config.py, etc.

Why we're going plugin-only:

  • No rebase treadmill. This branch had drifted ~800 commits behind main. The plugin sits on top of upstream main untouched.
  • It's already ahead. The plugin is at v0.3.0 with a remote-tools protocol (clients register tools the agent can invoke over the data channel) that this PR never had.
  • Half the review feedback evaporates. The Copilot comments about MESSAGING_PLATFORMS registry divergence and hermes status/hermes config consistency were artifacts of patching core CLI files — the plugin doesn't touch them.

The two genuine adapter bugs Copilot flagged (eager _last_audio_time init in _on_track_subscribed → STT-on-silence risk; unconditional silence_bytes subtraction in _cleanup_participant → dropped final utterance) are being fixed in the plugin's adapter.

Thanks for the reviews — they carried over usefully.

@francip francip closed this May 21, 2026
francip added a commit to kortexa-ai/hermes-livekit that referenced this pull request May 21, 2026
…ce (0.3.1)

Two voice-path bugs surfaced by review feedback on the now-closed core
PR NousResearch/hermes-agent#3894:

- _on_track_subscribed seeded _last_audio_time on subscribe, defeating
  the _check_silence_loop guard that treats a missing entry as "never
  spoke" and discards accumulated noise. A participant publishing only
  silence would accrue a stale timestamp and eventually trip STT on
  silence. The timestamp is now set only on the first chunk above the
  RMS floor.

- The track-end utterance flush in _cleanup_participant computed
  speech_end = max(0, len(buf) - silence_bytes), unconditionally
  trimming a fixed silence window. When a track ends right after a word
  with no trailing silence buffered, that chops real speech or zeroes
  the flush. On track end the flush now transcribes the whole buffer —
  trailing silence handed to STT is harmless, lost words are not.

Also corrects a stale prepare_tts_text docstring: upstream
BasePlatformAdapter now calls the hook (landed via
NousResearch/hermes-agent#27308), so the override is live.
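The arithmetic behind the second fix is easy to check by hand. In the sketch below the sample rate, channel count, and silence threshold are assumed values chosen for illustration (the PR does not state them); the `* 2` factor for 16-bit samples matches the reviewed line.

```python
SAMPLE_RATE = 16_000             # assumption: not stated in the PR
NUM_CHANNELS = 1                 # assumption
SILENCE_THRESHOLD_SECONDS = 1.5  # assumption
BYTES_PER_SAMPLE = 2             # 16-bit PCM, the `* 2` in the reviewed line


def trimmed_end(buf_len: int) -> int:
    """Old behaviour: always cut a fixed silence window off the tail."""
    silence_bytes = int(SILENCE_THRESHOLD_SECONDS * SAMPLE_RATE * NUM_CHANNELS * BYTES_PER_SAMPLE)
    return max(0, buf_len - silence_bytes)


def flush_end_on_track_end(buf_len: int) -> int:
    """Behaviour described for 0.3.1: on track end, keep the whole buffer."""
    return buf_len
```

Under these assumed constants the silence window is 48,000 bytes, so a one-second utterance (32,000 bytes) with no trailing silence is trimmed to nothing by the old computation, which matches the "zeroes the flush" symptom in the commit message.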
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
…) hook

Labels

comp/gateway Gateway runner, session dispatch, delivery tool/tts Text-to-speech and transcription type/feature New feature or request


4 participants