
feat(audio): add OpenRouter as a transcription + speech provider #21799

Open
saurabh wants to merge 2 commits into NousResearch:main from saurabh:feat/stt-openrouter-provider

@saurabh commented May 8, 2026

Summary

Adds openrouter as a unified provider for both transcription (STT) and speech synthesis (TTS), alongside the existing local/groq/openai/mistral/xai (STT) and edge/elevenlabs/openai/minimax/mistral/gemini/xai/neutts/kittentts/piper (TTS) sets.

OpenRouter announced their audio APIs on May 1, 2026, exposing both /v1/audio/transcriptions and /v1/audio/speech that route across providers (OpenAI Whisper, GPT-4o-transcribe, Google Chirp 3, Groq Whisper for STT; OpenAI TTS, Google Gemini Flash TTS, Mistral Voxtral Mini TTS for speech) under one API key.

"Text-to-speech and transcription are now live on OpenRouter. Two new endpoints give you access to speech synthesis and audio transcription across multiple providers, under one API."

Why this is useful

  • One key (OPENROUTER_API_KEY) covers chat + STT + TTS for users already on OpenRouter for LLMs.
  • OpenRouter handles fallback across providers automatically.
  • Lets users try premium audio models (openai/gpt-4o-mini-tts, google/chirp-3, etc.) without separate accounts.

STT — tools/transcription_tools.py

OR's transcription endpoint takes a JSON body with base64-encoded audio (not OpenAI-style multipart), so it cannot reuse the existing openai client path.

  • New _transcribe_openrouter() — JSON body with input_audio.data (base64) + format string.
  • Format whitelist (wav/mp3/flac/m4a/ogg/webm/aac); rejects unsupported extensions early. Aliases: .oga/.opus → ogg, .mpeg → mp3.
  • 25 MiB size cap matching the documented limit.
  • Typed exception branches for Timeout / ConnectionError so transient failures surface cleanly.
  • _get_provider() — explicit openrouter routing + auto-detect entry that ranks below local/groq and above openai.
  • tools/voice_mode.py: voice_doctor now reports the openrouter / mistral / xai branches.
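
For illustration, the format whitelist, aliases, size cap, and base64 JSON body described above could be sketched roughly as follows. This is not the PR's code: the helper names, the `model` field, and the placement of `format` inside `input_audio` are assumptions made for the example.

```python
import base64
from pathlib import Path

# Hypothetical names for illustration; the PR's real helper is _transcribe_openrouter().
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MiB cap stated in the PR description
SUPPORTED_FORMATS = {"wav", "mp3", "flac", "m4a", "ogg", "webm", "aac"}
FORMAT_ALIASES = {"oga": "ogg", "opus": "ogg", "mpeg": "mp3"}


def resolve_audio_format(path: str) -> str:
    """Map a file extension to a whitelisted format string, or raise ValueError."""
    ext = Path(path).suffix.lower().lstrip(".")
    fmt = FORMAT_ALIASES.get(ext, ext)
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported audio format: {ext or '(none)'}")
    return fmt


def build_transcription_body(path: str, audio: bytes, model: str) -> dict:
    """Build a JSON-serialisable body with base64 audio and a format string."""
    fmt = resolve_audio_format(path)  # reject bad extensions before encoding
    if len(audio) > MAX_AUDIO_BYTES:
        raise ValueError("audio exceeds the 25 MiB limit")
    return {
        "model": model,  # assumed field name
        "input_audio": {
            "data": base64.b64encode(audio).decode("ascii"),
            "format": fmt,  # assumed to sit next to data
        },
    }
```

Checking the extension before encoding keeps the rejection cheap, since nothing is base64-encoded for a file that would be refused anyway.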

TTS — tools/tts_tool.py

OR's speech endpoint is OpenAI-shape JSON, so this provider reuses the OpenAI client with a swapped base_url + key.

  • New _generate_openrouter_tts() — uses the OpenAI SDK against OR's base URL, default model openai/gpt-4o-mini-tts-2025-12-15.
  • response_format is hard-coded to mp3 — OR only accepts mp3 / pcm. Telegram-bound .ogg outputs go through the existing ffmpeg-to-Opus path (same as Edge TTS).
  • BUILTIN_TTS_PROVIDERS includes openrouter so a user's tts.providers.openrouter command block can never shadow it.
  • PROVIDER_MAX_TEXT_LENGTH["openrouter"] = 4096 (follows the underlying OpenAI cap).
  • Configurable base_url via tts.openrouter.base_url or TTS_OPENROUTER_BASE_URL.
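
As a rough sketch of the always-mp3 rule, the request arguments might be assembled like this. The function name, the required `voice` argument, and the 0.25 to 4.0 speed clamp (OpenAI's documented range) are assumptions for the example, not details taken from the PR.

```python
from typing import Optional

DEFAULT_TTS_MODEL = "openai/gpt-4o-mini-tts-2025-12-15"  # default named in the PR


def build_speech_request(
    text: str,
    voice: str,
    model: str = DEFAULT_TTS_MODEL,
    speed: Optional[float] = None,
) -> dict:
    """Assemble OpenAI-shape speech arguments; response_format is always mp3."""
    request = {
        "model": model,
        "voice": voice,
        "input": text,
        # Never derived from the output path: .ogg targets are converted later.
        "response_format": "mp3",
    }
    if speed is not None:
        # Assumed clamp range; speed is omitted entirely when not configured.
        request["speed"] = max(0.25, min(4.0, float(speed)))
    return request
```

With the OpenAI SDK, a dictionary like this would typically be passed as keyword arguments to `client.audio.speech.create(...)` on a client constructed with OpenRouter's base URL and key.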

Tests

  • 19 STT tests + 13 TTS tests = 32 new tests total. All pass.
  • Full suite (pytest tests/tools/test_transcription_tools.py tests/tools/test_tts_*.py): 265 passed, 7 skipped.

Live verification

  • Direct curl to OR's /v1/audio/transcriptions with a real .wav → correct transcript + usage metadata.
  • Direct curl to OR's /v1/audio/speech with model openai/gpt-4o-mini-tts-2025-12-15 → valid audio/mpeg.
  • _transcribe_openrouter() end-to-end → {"success": true, "transcript": "...", "provider": "openrouter"}.
  • _generate_openrouter_tts() end-to-end → 75 KiB MP3.
  • Full TTS dispatcher with HERMES_SESSION_PLATFORM=telegram → valid Opus OGG (24 kHz mono) via ffmpeg conversion, marked voice_compatible: true.

(A bug was caught during live verification: my first TTS impl asked for response_format: opus when the output path was .ogg, which OR rejects with ZodError. Fixed by always requesting mp3 and routing through the existing ffmpeg conversion path.)
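
The MP3-to-Opus step can be pictured with a sketch like the one below. The helper names and the exact ffmpeg flags are assumptions; the existing conversion path in the codebase may use different options.

```python
import subprocess


def wants_opus(output_path: str) -> bool:
    """Treat .ogg targets as needing conversion after the mp3 download."""
    return output_path.lower().endswith(".ogg")


def build_opus_command(mp3_path: str, ogg_path: str) -> list[str]:
    """Build an ffmpeg invocation producing 24 kHz mono Opus in an OGG container."""
    return [
        "ffmpeg", "-y",       # overwrite the target if it already exists
        "-i", mp3_path,
        "-c:a", "libopus",    # encode the audio stream as Opus
        "-ar", "24000",       # 24 kHz sample rate
        "-ac", "1",           # mono
        ogg_path,
    ]


def convert_to_opus(mp3_path: str, ogg_path: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_opus_command(mp3_path, ogg_path), check=True, capture_output=True)
```

Requesting mp3 unconditionally and converting afterwards keeps the provider call identical for every output path, which is the invariant the bug fix introduced.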

Test plan

  • Unit tests pass (pytest tests/tools/test_transcription_tools.py tests/tools/test_tts_openrouter.py)
  • STT: manually verified against the live /v1/audio/transcriptions endpoint
  • TTS: manually verified against the live /v1/audio/speech endpoint, including the Telegram Opus path
  • Reviewer can verify config: set stt.provider: openrouter + tts.provider: openrouter + OPENROUTER_API_KEY and exchange a voice note

Adds `openrouter` to the STT provider list, alongside local/groq/openai/
mistral/xai. OpenRouter exposes /v1/audio/transcriptions with a
JSON+base64 protocol (not OpenAI-multipart), so it needs its own client
path rather than re-using the openai-compatible code.

Why this is useful
- One key (OPENROUTER_API_KEY) covers chat + STT, which simplifies
  configuration for users already on OpenRouter for LLMs.
- OpenRouter routes requests across providers (Groq Whisper, OpenAI
  Whisper, etc.), giving fallback for free.

Provider selection
- Explicit: stt.provider: openrouter (or STT_PROVIDER=openrouter).
- Auto-detect: chosen when local + groq are unavailable but
  OPENROUTER_API_KEY is set; ranks below local/groq, above openai.
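
The ranking above can be sketched as a small function. This is an illustration only: the function name, the `local_available` flag, and the GROQ_API_KEY / OPENAI_API_KEY variable names are assumptions, and the mistral/xai providers are left out because their position in the order is not stated here.

```python
from typing import Mapping, Optional


def pick_stt_provider(
    explicit: Optional[str],
    local_available: bool,
    env: Mapping[str, str],
) -> Optional[str]:
    """Return the explicit provider if set, else the first available one by rank."""
    if explicit:
        return explicit              # stt.provider / STT_PROVIDER wins outright
    if local_available:
        return "local"
    if env.get("GROQ_API_KEY"):      # assumed variable name
        return "groq"
    if env.get("OPENROUTER_API_KEY"):
        return "openrouter"          # below local/groq, above openai
    if env.get("OPENAI_API_KEY"):    # assumed variable name
        return "openai"
    return None
```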

Implementation notes
- _transcribe_openrouter() in tools/transcription_tools.py:
  - JSON body with base64-encoded input_audio.data + format string.
  - Format whitelist (wav/mp3/flac/m4a/ogg/webm/aac); rejects unsupported
    extensions early, with .oga/.opus → ogg and .mpeg → mp3 aliases.
  - 25 MiB size cap matching the documented limit.
  - Typed exception branches for Timeout / ConnectionError so transient
    failures surface cleanly instead of as tracebacks.
- voice_mode.py: voice_doctor now reports the openrouter/mistral/xai
  branches (previously only listed local/groq/openai).
- cli-config.yaml.example: documents the new provider + model + optional
  base_url override.
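
One way to picture the typed exception branches is the sketch below. It uses Python's built-in TimeoutError and ConnectionError as stand-ins; the real code presumably catches its HTTP client's own exception types, and the error-result shape and function name here are assumptions.

```python
from typing import Callable


def transcribe_with_error_mapping(send: Callable[[], str]) -> dict:
    """Call send() and turn transient failures into structured results."""
    try:
        transcript = send()
    except TimeoutError:
        return {
            "success": False,
            "error": "OpenRouter request timed out",
            "provider": "openrouter",
        }
    except ConnectionError as exc:
        return {
            "success": False,
            "error": f"could not reach OpenRouter: {exc}",
            "provider": "openrouter",
        }
    return {
        "success": True,
        "transcript": transcript.strip(),
        "provider": "openrouter",
    }
```

Separate branches let the caller report a timeout differently from an unreachable host, instead of surfacing either as a raw traceback.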

Tests
- 19 new tests in TestTranscribeOpenRouter / TestGetProviderOpenRouter /
  TestTranscribeAudioOpenRouterDispatch covering: missing key, success,
  whitespace/empty handling, HTTP errors, permission errors, format
  whitelist + aliases, JSON-not-multipart wire format, base_url override,
  header shape, connection errors, and provider auto-detect ranking.
- pytest tests/tools/test_transcription_tools.py: 106 passed, 7 skipped.

@alt-glitch added the labels type/feature (New feature or request), provider/openrouter (OpenRouter aggregator), tool/tts (Text-to-speech and transcription), and P3 (Low: cosmetic, nice to have) on May 8, 2026

Adds `openrouter` to the TTS provider list, alongside edge/elevenlabs/
openai/minimax/mistral/gemini/xai/neutts/kittentts/piper. Mirrors the
STT provider added in the prior commit on this branch — same
OPENROUTER_API_KEY covers chat + STT + TTS.

OpenRouter exposes /v1/audio/speech with the OpenAI request shape
({model, voice, input, response_format, speed}), so this provider
reuses the OpenAI client with a swapped base_url + key. The model slug
selects the underlying provider (openai/google/mistral/...).

Implementation
- _generate_openrouter_tts() in tools/tts_tool.py: OpenAI-client based,
  default model openai/gpt-4o-mini-tts-2025-12-15 (the slug OR returns
  for the OpenAI route as of the May 2026 audio API launch).
- response_format is hard-coded to "mp3" — OR's endpoint only accepts
  mp3 / pcm. Telegram-bound .ogg outputs go through the existing ffmpeg
  conversion path (same path Edge TTS takes).
- openrouter is in BUILTIN_TTS_PROVIDERS so a user's
  tts.providers.openrouter command block can never shadow it.
- PROVIDER_MAX_TEXT_LENGTH["openrouter"] = 4096 (follows the underlying
  OpenAI cap).
- Configurable base_url via tts.openrouter.base_url or
  TTS_OPENROUTER_BASE_URL env (matches the xAI / OpenAI patterns).
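
A possible shape for the base_url lookup is sketched below. The precedence (config first, then environment, then the default) is an assumption, as is the function name; the default URL is the one used in the live verification.

```python
import os
from typing import Mapping, Optional

DEFAULT_OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def resolve_tts_base_url(
    config: Mapping,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the base URL from config, then env, then the default (assumed order)."""
    env = os.environ if env is None else env
    tts_cfg = config.get("tts") or {}
    configured = (tts_cfg.get("openrouter") or {}).get("base_url")
    chosen = configured or env.get("TTS_OPENROUTER_BASE_URL") or DEFAULT_OPENROUTER_BASE_URL
    return chosen.rstrip("/")  # avoid a double slash when a path is appended
```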

Tests
- 13 new tests in test_tts_openrouter.py covering missing key, success,
  default + custom model/voice, default + config + env base_url, the
  always-mp3 invariant (via .ogg output_path), speed clamp + omitted
  default, dispatcher routing, BUILTIN_TTS_PROVIDERS membership, and
  PROVIDER_MAX_TEXT_LENGTH.
- pytest tests/tools/test_tts_openrouter.py: 13 passed.
- Full TTS + STT suite: 265 passed, 7 skipped (one collection error in
  test_tts_kittentts is a local-only missing-numpy issue, unrelated).

Live verification
- Direct curl against POST https://openrouter.ai/api/v1/audio/speech
  with model openai/gpt-4o-mini-tts-2025-12-15 — returns audio/mpeg.
- _generate_openrouter_tts() against the live endpoint — produces a
  valid 75 KiB MP3.
- Full dispatcher via text_to_speech_tool with HERMES_SESSION_PLATFORM=
  telegram — produces a valid Opus OGG (24kHz mono) via ffmpeg.

Docs
- website/docs/user-guide/features/tts.md: added the provider row,
  config block, and per-request input cap entry.

@saurabh changed the title from "feat(stt): add OpenRouter as a transcription provider" to "feat(audio): add OpenRouter as a transcription + speech provider" on May 8, 2026