fix(tts): rename .ogg->mp3 before opus conversion for mp3-only providers (edge-tts, minimax, xai) #20882

Open
Warrenpoobear wants to merge 2 commits into NousResearch:main from Warrenpoobear:fix/pr-20878-edge-tts-ogg-path

@Warrenpoobear

## What this fixes

PR #20878 (fix(tts): use .ogg extension for Telegram auto-TTS voice replies) introduces platform-aware extension selection in _send_voice_reply(). The change is correct for ElevenLabs/OpenAI/Mistral/Gemini: those providers honour the .ogg extension and output native Opus.

Bug introduced by #20878 (mp3-only providers: edge-tts, minimax, xai, neutts, kittentts, piper):

_generate_edge_tts() calls communicate.save(output_path) unconditionally, and the other mp3-only providers behave the same way: they write raw MP3 bytes regardless of the path extension. After generation, the existing opus-conversion block was:

    elif provider in ("edge", "neutts", ...) and not file_str.endswith(".ogg"):
        opus_path = _convert_to_opus(file_str)

With a .ogg path, the guard not file_str.endswith(".ogg") evaluates to False, so _convert_to_opus is skipped entirely. The result is a .ogg file containing MP3 bytes. Telegram rejects it and the voice bubble silently fails.

## Fix

Before calling _convert_to_opus, rename the mislabeled file from .ogg to .mp3 so ffmpeg receives a correctly named source. The intermediate .mp3 is cleaned up immediately after conversion. The original .mp3 path is unaffected (the rename branch is only reached when the caller requested .ogg).

## Verification

- Non-Telegram path (.mp3): rename block not entered, no behaviour change.
- Telegram + ElevenLabs/OpenAI: voice_compatible branch untouched, still correct.
- Telegram + edge-tts: .ogg renamed to .mp3, _convert_to_opus produces real Opus, voice_compatible = True, Telegram renders voice bubble.

Companion fix to #20878: handles the mp3-only provider edge case that #20878 leaves broken.
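The rename-then-convert step described under "Fix" can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `finalize_voice_file`, `MP3_ONLY_PROVIDERS`, and the injected `convert_to_opus` callable are hypothetical names, and the converter is assumed to take an MP3 path and return the path of the Opus file it wrote (or `None` on failure).

```python
import os
from pathlib import Path
from typing import Callable, Optional

# Hypothetical set; the PR description names these providers as MP3-only.
MP3_ONLY_PROVIDERS = {"edge", "minimax", "xai", "neutts", "kittentts", "piper"}


def finalize_voice_file(
    provider: str,
    file_str: str,
    convert_to_opus: Callable[[str], Optional[str]],
) -> str:
    """Return the path that should be sent as the voice reply.

    Assumption: convert_to_opus(src) writes an Opus file next to src and
    returns its path, or returns None if the conversion failed.
    """
    if provider not in MP3_ONLY_PROVIDERS:
        # Providers that honour the extension already wrote the right format.
        return file_str

    source = file_str
    renamed = False
    if file_str.endswith(".ogg"):
        # The file holds MP3 bytes under an .ogg name: give it an honest name
        # so the converter receives a correctly labelled input.
        source = str(Path(file_str).with_suffix(".mp3"))
        os.replace(file_str, source)
        renamed = True

    opus_path = convert_to_opus(source)
    if opus_path and opus_path != source:
        if renamed and os.path.exists(source):
            # Remove only the intermediate file created by the rename above.
            os.remove(source)
        return opus_path

    # Conversion failed: fall back to the MP3, which is at least labelled correctly.
    return source
```

Passing the converter in as a parameter keeps the sketch testable without ffmpeg; in the real tool the call would presumably go straight to _convert_to_opus.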

tarekskr and others added 2 commits May 6, 2026 23:32
The gateway's _send_voice_reply() hardcoded .mp3 as the output path
extension, which caused ElevenLabs and OpenAI TTS to output mp3 format
even on Telegram. Telegram requires Opus/OGG for native voice bubbles —
mp3 files are sent as audio file attachments instead.

Now detects the platform from session context and uses .ogg for Telegram,
.mp3 for everything else. The TTS tool already checks the extension to
select the appropriate codec (opus_48000_64 vs mp3_44100_128).
…oviders

When _send_voice_reply() passes a .ogg path (new in PR NousResearch#20878) to
text_to_speech_tool, mp3-only providers like edge-tts write raw MP3 bytes
into the .ogg-named file.  The pre-existing opus-conversion guard

    elif provider in ("edge", ...) and not file_str.endswith(".ogg"):

evaluated to False (path ends in .ogg), so _convert_to_opus was skipped,
leaving a .ogg file containing MP3 bytes.  Telegram then received a
corrupted audio file that couldn't play.

Fix: remove the .ogg guard; instead rename the mislabeled file to .mp3
before calling _convert_to_opus, then clean up the intermediate .mp3.

Non-Telegram paths (file_str ends in .mp3) are unaffected — the rename
block is only reached when the caller explicitly requested .ogg.
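One way to confirm the mismatch described in this commit message is to look at the file's leading bytes rather than its extension. The helper below is a hypothetical diagnostic, not part of the PR: an Ogg stream starts with the capture pattern `OggS`, while an MP3 file starts either with an `ID3` tag or with a frame sync (eleven set bits).

```python
def sniff_audio_container(path: str) -> str:
    """Guess the container from the first bytes of the file.

    Returns "ogg", "mp3", or "unknown". Hypothetical helper for debugging;
    it checks only the header and does not validate the rest of the file.
    """
    with open(path, "rb") as fh:
        head = fh.read(4)

    if head.startswith(b"OggS"):
        # Ogg page capture pattern.
        return "ogg"
    if head.startswith(b"ID3"):
        # ID3v2 tag, commonly found at the start of MP3 files.
        return "mp3"
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        # MPEG audio frame sync: 0xFF followed by three more set bits.
        return "mp3"
    return "unknown"
```

A file named reply.ogg for which this returns "mp3" is the mislabeled case described above.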
@alt-glitch added labels on May 6, 2026: type/bug (Something isn't working); P2 (Medium, degraded but workaround exists); comp/gateway (Gateway runner, session dispatch, delivery); tool/tts (Text-to-speech and transcription); platform/telegram (Telegram bot adapter)
@achhabra2

I dug into this locally with Telegram auto-TTS and I think there’s a slightly cleaner fix than forcing .ogg from the gateway caller.

Current root causes I found:

  1. GatewayRunner._send_voice_reply() forces output_path=...mp3, so text_to_speech_tool() cannot use its Telegram-aware default output selection. This causes Telegram replies to be delivered as MP3/audio attachments instead of native voice bubbles.

  2. There is a second path in BasePlatformAdapter._process_message_background() where auto-TTS runs after GatewayRunner has already cleared the session context. In that path, HERMES_SESSION_PLATFORM / get_session_env("HERMES_SESSION_PLATFORM") is blank, so the TTS tool again defaults to MP3 even though the source platform is Telegram.

I tested this by adding regression coverage:

  • against origin/main, the tests fail:
    • _send_voice_reply() passes an explicit .mp3 output path
    • base adapter auto-TTS sees platform "" instead of "telegram"
  • with the fix, they pass.

Suggested approach:

  • In GatewayRunner._send_voice_reply(), do not pass an explicit output_path; call:

    text_to_speech_tool(text=tts_text)

    so the TTS tool can choose .ogg/Opus for Telegram-capable providers and keep normal defaults elsewhere.

  • In BasePlatformAdapter._process_message_background(), re-establish session context around the text_to_speech_tool() call using set_session_vars(...) / clear_session_vars(...), because this auto-TTS path runs after the runner’s handler context has been cleared.

This avoids hardcoding .ogg at the gateway layer and should also avoid the mp3-only-provider edge case that this PR is handling: the TTS tool remains responsible for provider-specific behavior and Opus conversion.
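The second suggestion (re-establishing session context around the TTS call) could look roughly like the sketch below. The real signatures of set_session_vars, clear_session_vars, and get_session_env are not shown in this thread, so the versions here are stand-ins backed by a plain dict; `session_platform` and `default_tts_extension` are hypothetical names used only for illustration.

```python
from contextlib import contextmanager
from typing import Dict, Iterator

# Stand-in session store; the project's real implementation is not shown here.
_SESSION_ENV: Dict[str, str] = {}


def set_session_vars(**values: str) -> None:
    """Stand-in with an assumed keyword-argument signature."""
    _SESSION_ENV.update(values)


def clear_session_vars() -> None:
    """Stand-in: drop all session variables."""
    _SESSION_ENV.clear()


def get_session_env(name: str, default: str = "") -> str:
    """Stand-in: read one session variable, blank if unset."""
    return _SESSION_ENV.get(name, default)


@contextmanager
def session_platform(platform: str) -> Iterator[None]:
    """Set the platform for the duration of a block, then always clear it."""
    set_session_vars(HERMES_SESSION_PLATFORM=platform)
    try:
        yield
    finally:
        # Runs even if the TTS call raises, so context never leaks.
        clear_session_vars()


def default_tts_extension() -> str:
    """Hypothetical stand-in for the TTS tool's platform-aware default."""
    platform = get_session_env("HERMES_SESSION_PLATFORM")
    return ".ogg" if platform == "telegram" else ".mp3"
```

With this shape, the background auto-TTS path would wrap its call as `with session_platform("telegram"): ...`, and the try/finally keeps the context from outliving the call.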

I can open a PR with the two regression tests and the small implementation change if that would be useful.
