Skip to content

feat(tts): add Google Gemini TTS provider#11229

Merged
teknium1 merged 1 commit into
mainfrom
hermes/hermes-9ddfec55
Apr 16, 2026
Merged

feat(tts): add Google Gemini TTS provider#11229
teknium1 merged 1 commit into
mainfrom
hermes/hermes-9ddfec55

Conversation

@teknium1

Copy link
Copy Markdown
Contributor

Summary

Adds Google Gemini TTS as the seventh voice provider in the TTS tool — driven by the Shubham Saboo / OpenClaw mention. 30 prebuilt voices (Zephyr, Puck, Kore, Enceladus, Gacrux, etc.) with natural-language prompt control (say cheerfully:, inline [whispers] tags).

Integrates cleanly through the existing provider chain — no new SDK dep, uses raw REST like xAI/MiniMax.

What changed

File Change
tools/tts_tool.py New _generate_gemini_tts() + _wrap_pcm_as_wav(); routed in main dispatcher; check_tts_requirements() accepts GEMINI_API_KEY / GOOGLE_API_KEY
hermes_cli/tools_config.py 'Google Gemini TTS' entry added to the hermes tools TTS picker
hermes_cli/setup.py Wizard picker, status display, and API-key prompt branch
tests/tools/test_tts_gemini.py 15 unit tests (WAV header, env fallback, voice/model overrides, snake_case inlineData, HTTP error surfacing, etc.)
website/docs/user-guide/features/tts.md Provider table, config example, ffmpeg notes

Design notes

  • REST over SDK. No google-genai dependency added — mirrors xAI/MiniMax raw-request pattern. Keeps the install footprint small.
  • PCM → WAV wrap → ffmpeg. Gemini returns raw L16 PCM @ 24kHz mono 16-bit (no container). A 44-byte WAV RIFF header is prepended, then ffmpeg encodes to MP3 / Opus depending on the output extension.
  • Telegram-compatible Opus. For .ogg output we explicitly pass -acodec libopus (ffmpeg defaults to Vorbis for .ogg, which Telegram doesn't show as a voice bubble). Same -b:a 64k -ac 1 settings as the existing _convert_to_opus helper.
  • Key fallback. Accepts either GEMINI_API_KEY (primary) or GOOGLE_API_KEY (same key, different env name).
  • New key format tolerance. Google has rolled out a new key format (AQ.Ab8R… instead of AIza…); both work transparently against /v1beta/generateContent.

Test plan

Unit tests: 15 new tests in tests/tools/test_tts_gemini.py, all passing. Covers WAV header structure, default and custom voice/model, response modality, snake_case vs camelCase inlineData, HTTP error surfacing, empty/malformed responses, and GOOGLE_API_KEY fallback.

Live E2E against gemini-2.5-flash-preview-tts with a real API key:

Output Size ffprobe codec Duration Notes
.wav 167KB pcm_s16le, 24kHz, mono 3.5s Fast path, no ffmpeg
.mp3 12KB mp3, 24kHz, mono 3.1s CLI default
.ogg 17KB opus, 48kHz, mono 2.1s Telegram voice-bubble compatible

Also tested custom voice (Puck), missing key → ValueError, and full-dispatcher integration via text_to_speech_tool() with HERMES_SESSION_PLATFORM=telegramvoice_compatible=True and [[audio_as_voice]] marker.

Existing test files unaffected: test_tts_mistral.py, test_tts_speed.py, test_voice_cli_integration.py, test_setup.py, test_tools_config.py — all 150 tests pass.

Usage

hermes tools                   # pick 'Google Gemini TTS' under Text-to-Speech
# or
hermes setup tts               # wizard prompts for GEMINI_API_KEY
# ~/.hermes/config.yaml
tts:
  provider: gemini
  gemini:
    model: gemini-2.5-flash-preview-tts
    voice: Kore  # Zephyr, Puck, Kore, Enceladus, Gacrux, etc.

Adds Google Gemini TTS as the seventh voice provider, with 30 prebuilt
voices (Zephyr, Puck, Kore, Enceladus, Gacrux, etc.) and natural-language
prompt control. Integrates through the existing provider chain:

- tools/tts_tool.py: new _generate_gemini_tts() calls the
  generativelanguage REST endpoint with responseModalities=[AUDIO],
  wraps the returned 24kHz mono 16-bit PCM (L16) in a WAV RIFF header,
  then ffmpeg-converts to MP3 or Opus depending on output extension.
  For .ogg output, libopus is forced explicitly so Telegram voice
  bubbles get Opus (ffmpeg defaults to Vorbis for .ogg).
- hermes_cli/tools_config.py: exposes 'Google Gemini TTS' as a provider
  option in the curses-based 'hermes tools' UI.
- hermes_cli/setup.py: adds gemini to the setup wizard picker, tool
  status display, and API key prompt branch (accepts existing
  GEMINI_API_KEY or GOOGLE_API_KEY, falls back to Edge if neither set).
- tests/tools/test_tts_gemini.py: 15 unit tests covering WAV header
  wrap correctness, env var fallback (GEMINI/GOOGLE), voice/model
  overrides, snake_case vs camelCase inlineData handling, HTTP error
  surfacing, and empty-audio edge cases.
- docs: TTS features page updated to list seven providers with the new
  gemini config block and ffmpeg notes.

Live-tested against api key against gemini-2.5-flash-preview-tts: .wav,
.mp3, and Telegram-compatible .ogg (Opus codec) all produce valid
playable audio.
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: base64 encoding/decoding detected

Base64 has legitimate uses (images, JWT, etc.) but is also commonly used to obfuscate malicious payloads. Verify the usage is appropriate.

Matches (first 20):

179:+                                "data": base64.b64encode(fake_pcm_bytes).decode(),
370:+                                    "data": base64.b64encode(fake_pcm_bytes).decode()
580:+    pcm_bytes = base64.b64decode(audio_b64)

⚠️ WARNING: Outbound network calls (POST/PUT)

Outbound POST/PUT requests in new code could be data exfiltration. Verify the destination URLs are legitimate.

Matches (first 10):

548:+    response = requests.post(

⚠️ WARNING: Install hook files modified

These files can execute code during package installation or interpreter startup.

Files:

hermes_cli/setup.py

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

@teknium1 teknium1 merged commit fce6c3c into main Apr 16, 2026
6 of 8 checks passed
@teknium1 teknium1 deleted the hermes/hermes-9ddfec55 branch April 16, 2026 21:23
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
Adds Google Gemini TTS as the seventh voice provider, with 30 prebuilt
voices (Zephyr, Puck, Kore, Enceladus, Gacrux, etc.) and natural-language
prompt control. Integrates through the existing provider chain:

- tools/tts_tool.py: new _generate_gemini_tts() calls the
  generativelanguage REST endpoint with responseModalities=[AUDIO],
  wraps the returned 24kHz mono 16-bit PCM (L16) in a WAV RIFF header,
  then ffmpeg-converts to MP3 or Opus depending on output extension.
  For .ogg output, libopus is forced explicitly so Telegram voice
  bubbles get Opus (ffmpeg defaults to Vorbis for .ogg).
- hermes_cli/tools_config.py: exposes 'Google Gemini TTS' as a provider
  option in the curses-based 'hermes tools' UI.
- hermes_cli/setup.py: adds gemini to the setup wizard picker, tool
  status display, and API key prompt branch (accepts existing
  GEMINI_API_KEY or GOOGLE_API_KEY, falls back to Edge if neither set).
- tests/tools/test_tts_gemini.py: 15 unit tests covering WAV header
  wrap correctness, env var fallback (GEMINI/GOOGLE), voice/model
  overrides, snake_case vs camelCase inlineData handling, HTTP error
  surfacing, and empty-audio edge cases.
- docs: TTS features page updated to list seven providers with the new
  gemini config block and ffmpeg notes.

Live-tested against api key against gemini-2.5-flash-preview-tts: .wav,
.mp3, and Telegram-compatible .ogg (Opus codec) all produce valid
playable audio.
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
Adds Google Gemini TTS as the seventh voice provider, with 30 prebuilt
voices (Zephyr, Puck, Kore, Enceladus, Gacrux, etc.) and natural-language
prompt control. Integrates through the existing provider chain:

- tools/tts_tool.py: new _generate_gemini_tts() calls the
  generativelanguage REST endpoint with responseModalities=[AUDIO],
  wraps the returned 24kHz mono 16-bit PCM (L16) in a WAV RIFF header,
  then ffmpeg-converts to MP3 or Opus depending on output extension.
  For .ogg output, libopus is forced explicitly so Telegram voice
  bubbles get Opus (ffmpeg defaults to Vorbis for .ogg).
- hermes_cli/tools_config.py: exposes 'Google Gemini TTS' as a provider
  option in the curses-based 'hermes tools' UI.
- hermes_cli/setup.py: adds gemini to the setup wizard picker, tool
  status display, and API key prompt branch (accepts existing
  GEMINI_API_KEY or GOOGLE_API_KEY, falls back to Edge if neither set).
- tests/tools/test_tts_gemini.py: 15 unit tests covering WAV header
  wrap correctness, env var fallback (GEMINI/GOOGLE), voice/model
  overrides, snake_case vs camelCase inlineData handling, HTTP error
  surfacing, and empty-audio edge cases.
- docs: TTS features page updated to list seven providers with the new
  gemini config block and ffmpeg notes.

Live-tested against api key against gemini-2.5-flash-preview-tts: .wav,
.mp3, and Telegram-compatible .ogg (Opus codec) all produce valid
playable audio.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
Adds Google Gemini TTS as the seventh voice provider, with 30 prebuilt
voices (Zephyr, Puck, Kore, Enceladus, Gacrux, etc.) and natural-language
prompt control. Integrates through the existing provider chain:

- tools/tts_tool.py: new _generate_gemini_tts() calls the
  generativelanguage REST endpoint with responseModalities=[AUDIO],
  wraps the returned 24kHz mono 16-bit PCM (L16) in a WAV RIFF header,
  then ffmpeg-converts to MP3 or Opus depending on output extension.
  For .ogg output, libopus is forced explicitly so Telegram voice
  bubbles get Opus (ffmpeg defaults to Vorbis for .ogg).
- hermes_cli/tools_config.py: exposes 'Google Gemini TTS' as a provider
  option in the curses-based 'hermes tools' UI.
- hermes_cli/setup.py: adds gemini to the setup wizard picker, tool
  status display, and API key prompt branch (accepts existing
  GEMINI_API_KEY or GOOGLE_API_KEY, falls back to Edge if neither set).
- tests/tools/test_tts_gemini.py: 15 unit tests covering WAV header
  wrap correctness, env var fallback (GEMINI/GOOGLE), voice/model
  overrides, snake_case vs camelCase inlineData handling, HTTP error
  surfacing, and empty-audio edge cases.
- docs: TTS features page updated to list seven providers with the new
  gemini config block and ffmpeg notes.

Live-tested against api key against gemini-2.5-flash-preview-tts: .wav,
.mp3, and Telegram-compatible .ogg (Opus codec) all produce valid
playable audio.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Adds Google Gemini TTS as the seventh voice provider, with 30 prebuilt
voices (Zephyr, Puck, Kore, Enceladus, Gacrux, etc.) and natural-language
prompt control. Integrates through the existing provider chain:

- tools/tts_tool.py: new _generate_gemini_tts() calls the
  generativelanguage REST endpoint with responseModalities=[AUDIO],
  wraps the returned 24kHz mono 16-bit PCM (L16) in a WAV RIFF header,
  then ffmpeg-converts to MP3 or Opus depending on output extension.
  For .ogg output, libopus is forced explicitly so Telegram voice
  bubbles get Opus (ffmpeg defaults to Vorbis for .ogg).
- hermes_cli/tools_config.py: exposes 'Google Gemini TTS' as a provider
  option in the curses-based 'hermes tools' UI.
- hermes_cli/setup.py: adds gemini to the setup wizard picker, tool
  status display, and API key prompt branch (accepts existing
  GEMINI_API_KEY or GOOGLE_API_KEY, falls back to Edge if neither set).
- tests/tools/test_tts_gemini.py: 15 unit tests covering WAV header
  wrap correctness, env var fallback (GEMINI/GOOGLE), voice/model
  overrides, snake_case vs camelCase inlineData handling, HTTP error
  surfacing, and empty-audio edge cases.
- docs: TTS features page updated to list seven providers with the new
  gemini config block and ffmpeg notes.

Live-tested against api key against gemini-2.5-flash-preview-tts: .wav,
.mp3, and Telegram-compatible .ogg (Opus codec) all produce valid
playable audio.
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Adds Google Gemini TTS as the seventh voice provider, with 30 prebuilt
voices (Zephyr, Puck, Kore, Enceladus, Gacrux, etc.) and natural-language
prompt control. Integrates through the existing provider chain:

- tools/tts_tool.py: new _generate_gemini_tts() calls the
  generativelanguage REST endpoint with responseModalities=[AUDIO],
  wraps the returned 24kHz mono 16-bit PCM (L16) in a WAV RIFF header,
  then ffmpeg-converts to MP3 or Opus depending on output extension.
  For .ogg output, libopus is forced explicitly so Telegram voice
  bubbles get Opus (ffmpeg defaults to Vorbis for .ogg).
- hermes_cli/tools_config.py: exposes 'Google Gemini TTS' as a provider
  option in the curses-based 'hermes tools' UI.
- hermes_cli/setup.py: adds gemini to the setup wizard picker, tool
  status display, and API key prompt branch (accepts existing
  GEMINI_API_KEY or GOOGLE_API_KEY, falls back to Edge if neither set).
- tests/tools/test_tts_gemini.py: 15 unit tests covering WAV header
  wrap correctness, env var fallback (GEMINI/GOOGLE), voice/model
  overrides, snake_case vs camelCase inlineData handling, HTTP error
  surfacing, and empty-audio edge cases.
- docs: TTS features page updated to list seven providers with the new
  gemini config block and ffmpeg notes.

Live-tested against api key against gemini-2.5-flash-preview-tts: .wav,
.mp3, and Telegram-compatible .ogg (Opus codec) all produce valid
playable audio.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant