feat(tts): add Google Gemini TTS provider by teknium1 · Pull Request #11229 · NousResearch/hermes-agent

teknium1 · 2026-04-16T20:59:40Z

Summary

Adds Google Gemini TTS as the seventh voice provider in the TTS tool, driven by the Shubham Saboo / OpenClaw mention. It ships 30 prebuilt voices (Zephyr, Puck, Kore, Enceladus, Gacrux, etc.) with natural-language prompt control (a "say cheerfully:" prefix, inline [whispers] tags).

Integrates cleanly through the existing provider chain — no new SDK dep, uses raw REST like xAI/MiniMax.
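As a rough illustration of the raw-REST approach, the sketch below builds the pieces of a `generateContent` call that asks for audio output. The function name `build_gemini_tts_request`, the `x-goog-api-key` header choice, and the default voice are assumptions made for this example, not the code in `tools/tts_tool.py`.

```python
# Assumed base URL for the generativelanguage REST API (v1beta).
GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta"


def build_gemini_tts_request(text, api_key,
                             model="gemini-2.5-flash-preview-tts",
                             voice="Kore"):
    """Return (url, headers, body) for a generateContent call requesting audio.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    url = f"{GEMINI_BASE_URL}/models/{model}:generateContent"
    headers = {
        "Content-Type": "application/json",
        "x-goog-api-key": api_key,
    }
    body = {
        # Style hints such as "Say cheerfully:" travel inside the text itself.
        "contents": [{"parts": [{"text": text}]}],
        "generationConfig": {
            "responseModalities": ["AUDIO"],
            "speechConfig": {
                "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice}}
            },
        },
    }
    return url, headers, body
```

The tuple could then be sent with something like `requests.post(url, headers=headers, json=body, timeout=60)`; the timeout value here is arbitrary.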

What changed

File	Change
`tools/tts_tool.py`	New `_generate_gemini_tts()` + `_wrap_pcm_as_wav()`; routed in main dispatcher; `check_tts_requirements()` accepts `GEMINI_API_KEY` / `GOOGLE_API_KEY`
`hermes_cli/tools_config.py`	'Google Gemini TTS' entry added to the `hermes tools` TTS picker
`hermes_cli/setup.py`	Wizard picker, status display, and API-key prompt branch
`tests/tools/test_tts_gemini.py`	15 unit tests (WAV header, env fallback, voice/model overrides, snake_case inlineData, HTTP error surfacing, etc.)
`website/docs/user-guide/features/tts.md`	Provider table, config example, ffmpeg notes

Design notes

REST over SDK. No google-genai dependency added — mirrors xAI/MiniMax raw-request pattern. Keeps the install footprint small.
PCM → WAV wrap → ffmpeg. Gemini returns raw L16 PCM @ 24kHz mono 16-bit (no container). A 44-byte WAV RIFF header is prepended, then ffmpeg encodes to MP3 / Opus depending on the output extension.
Telegram-compatible Opus. For .ogg output we explicitly pass -acodec libopus (ffmpeg defaults to Vorbis for .ogg, which Telegram doesn't show as a voice bubble). Same -b:a 64k -ac 1 settings as the existing _convert_to_opus helper.
Key fallback. Accepts either GEMINI_API_KEY (primary) or GOOGLE_API_KEY (same key, different env name).
New key format tolerance. Google has rolled out a new key format (AQ.Ab8R… instead of AIza…); both work transparently against /v1beta/generateContent.
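To make the PCM → WAV step concrete, here is a minimal sketch of prepending a 44-byte RIFF header to raw 16-bit PCM. It is written independently of the PR's `_wrap_pcm_as_wav()`; the name `wrap_pcm_as_wav` and its parameter defaults are assumptions for illustration.

```python
import struct


def wrap_pcm_as_wav(pcm, sample_rate=24000, channels=1, bits_per_sample=16):
    """Prepend a canonical 44-byte WAV (RIFF) header to raw PCM bytes."""
    block_align = channels * bits_per_sample // 8   # bytes per sample frame
    byte_rate = sample_rate * block_align           # bytes per second
    header = struct.pack(
        "<4sI4s4sIHHIIHH4sI",
        b"RIFF", 36 + len(pcm), b"WAVE",            # RIFF chunk: size excludes first 8 bytes
        b"fmt ", 16, 1,                             # fmt chunk: 16 bytes, format 1 = PCM
        channels, sample_rate, byte_rate, block_align, bits_per_sample,
        b"data", len(pcm),                          # data chunk header
    )
    return header + pcm
```

With the defaults above, one second of audio is 48,000 bytes of PCM plus the 44-byte header.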

Test plan

Unit tests: 15 new tests in tests/tools/test_tts_gemini.py, all passing. Covers WAV header structure, default and custom voice/model, response modality, snake_case vs camelCase inlineData, HTTP error surfacing, empty/malformed responses, and GOOGLE_API_KEY fallback.
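The snake_case vs camelCase case mentioned above can be pictured with a small sketch that pulls the base64 audio out of a response accepting either key spelling. The function name `extract_pcm` and the choice of `RuntimeError` for an empty response are assumptions; the PR's actual error type for that case is not stated here.

```python
import base64


def extract_pcm(response_json):
    """Return decoded PCM bytes from a generateContent-style response dict.

    Hypothetical helper: accepts both "inlineData" and "inline_data" parts.
    """
    for candidate in response_json.get("candidates") or []:
        parts = (candidate.get("content") or {}).get("parts") or []
        for part in parts:
            inline = part.get("inlineData") or part.get("inline_data") or {}
            data = inline.get("data")
            if data:
                return base64.b64decode(data)
    # Assumed behaviour: treat a response without audio as an error.
    raise RuntimeError("Gemini TTS response contained no audio data")
```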

Live E2E against gemini-2.5-flash-preview-tts with a real API key:

Output	Size	ffprobe codec	Duration	Notes
`.wav`	167KB	pcm_s16le, 24kHz, mono	3.5s	Fast path, no ffmpeg
`.mp3`	12KB	mp3, 24kHz, mono	3.1s	CLI default
`.ogg`	17KB	opus, 48kHz, mono	2.1s	Telegram voice-bubble compatible

Also tested custom voice (Puck), missing key → ValueError, and full-dispatcher integration via text_to_speech_tool() with HERMES_SESSION_PLATFORM=telegram → voice_compatible=True and [[audio_as_voice]] marker.

Existing test files unaffected: test_tts_mistral.py, test_tts_speed.py, test_voice_cli_integration.py, test_setup.py, test_tools_config.py — all 150 tests pass.

Usage

```bash
hermes tools                   # pick 'Google Gemini TTS' under Text-to-Speech
# or
hermes setup tts               # wizard prompts for GEMINI_API_KEY
```

```yaml
# ~/.hermes/config.yaml
tts:
  provider: gemini
  gemini:
    model: gemini-2.5-flash-preview-tts
    voice: Kore  # Zephyr, Puck, Kore, Enceladus, Gacrux, etc.
```
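One way the `tts.gemini` config block and the two environment variables could combine is sketched below. The function `resolve_gemini_settings` and both default values are assumptions for illustration; only the `GEMINI_API_KEY` then `GOOGLE_API_KEY` order and the `ValueError` on a missing key come from this PR's description.

```python
import os

# Assumed defaults, taken from the config example rather than from the code.
DEFAULT_MODEL = "gemini-2.5-flash-preview-tts"
DEFAULT_VOICE = "Kore"


def resolve_gemini_settings(tts_config, env=None):
    """Merge the tts.gemini config block with environment-provided API keys."""
    env = os.environ if env is None else env
    # GEMINI_API_KEY is primary; GOOGLE_API_KEY is the fallback name.
    api_key = env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")
    if not api_key:
        raise ValueError("Set GEMINI_API_KEY or GOOGLE_API_KEY to use Gemini TTS")
    gemini = (tts_config or {}).get("gemini") or {}
    return {
        "api_key": api_key,
        "model": gemini.get("model") or DEFAULT_MODEL,
        "voice": gemini.get("voice") or DEFAULT_VOICE,
    }
```

For example, `resolve_gemini_settings({"provider": "gemini", "gemini": {"voice": "Puck"}})` would keep the default model and switch only the voice.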

Adds Google Gemini TTS as the seventh voice provider, with 30 prebuilt voices (Zephyr, Puck, Kore, Enceladus, Gacrux, etc.) and natural-language prompt control. Integrates through the existing provider chain:

- tools/tts_tool.py: new _generate_gemini_tts() calls the generativelanguage REST endpoint with responseModalities=[AUDIO], wraps the returned 24kHz mono 16-bit PCM (L16) in a WAV RIFF header, then ffmpeg-converts to MP3 or Opus depending on output extension. For .ogg output, libopus is forced explicitly so Telegram voice bubbles get Opus (ffmpeg defaults to Vorbis for .ogg).
- hermes_cli/tools_config.py: exposes 'Google Gemini TTS' as a provider option in the curses-based 'hermes tools' UI.
- hermes_cli/setup.py: adds gemini to the setup wizard picker, tool status display, and API key prompt branch (accepts existing GEMINI_API_KEY or GOOGLE_API_KEY, falls back to Edge if neither is set).
- tests/tools/test_tts_gemini.py: 15 unit tests covering WAV header wrap correctness, env var fallback (GEMINI/GOOGLE), voice/model overrides, snake_case vs camelCase inlineData handling, HTTP error surfacing, and empty-audio edge cases.
- docs: TTS features page updated to list seven providers with the new gemini config block and ffmpeg notes.

Live-tested with a real API key against gemini-2.5-flash-preview-tts: .wav, .mp3, and Telegram-compatible .ogg (Opus codec) all produce valid playable audio.
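The extension-dependent conversion described in this commit message might look roughly like the sketch below, which only builds an ffmpeg argument list. The helper name `build_ffmpeg_command` is hypothetical, and letting ffmpeg pick the MP3 encoder from the `.mp3` extension is an assumption; the `-acodec libopus -b:a 64k -ac 1` flags for `.ogg` are the ones named in the design notes.

```python
from pathlib import Path


def build_ffmpeg_command(wav_path, out_path):
    """Return an ffmpeg argv list, or None when no conversion is needed."""
    suffix = Path(out_path).suffix.lower()
    if suffix == ".wav":
        return None  # fast path: the wrapped WAV is already the output
    cmd = ["ffmpeg", "-y", "-loglevel", "error", "-i", str(wav_path)]
    if suffix == ".ogg":
        # Force Opus explicitly; ffmpeg would otherwise choose Vorbis for .ogg.
        cmd += ["-acodec", "libopus", "-b:a", "64k", "-ac", "1"]
    cmd.append(str(out_path))
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` when it is not `None`.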

github-actions · 2026-04-16T20:59:54Z

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: base64 encoding/decoding detected

Base64 has legitimate uses (images, JWT, etc.) but is also commonly used to obfuscate malicious payloads. Verify the usage is appropriate.

Matches (first 20):

179:+                                "data": base64.b64encode(fake_pcm_bytes).decode(),
370:+                                    "data": base64.b64encode(fake_pcm_bytes).decode()
580:+    pcm_bytes = base64.b64decode(audio_b64)

⚠️ WARNING: Outbound network calls (POST/PUT)

Outbound POST/PUT requests in new code could be data exfiltration. Verify the destination URLs are legitimate.

Matches (first 10):

548:+    response = requests.post(

⚠️ WARNING: Install hook files modified

These files can execute code during package installation or interpreter startup.

Files:

hermes_cli/setup.py

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

teknium1 merged commit fce6c3c into main Apr 16, 2026
6 of 8 checks passed

teknium1 deleted the hermes/hermes-9ddfec55 branch April 16, 2026 21:23

This was referenced Apr 16, 2026

feat(tts): add Gemini TTS provider #11091

Closed

Feature: Add Google Gemini TTS as a speech-generation provider #10918

Closed

docs: backfill coverage for recently-merged features #11942

Merged

github-actions Bot mentioned this pull request Apr 24, 2026

chore: bump NousResearch/hermes-agent version from v2026.4.16 to v2026.4.23 Docker-Hub-sirmark/docker-hermes-agent#3

Merged
