
Add voice message transcription (Mistral Voxtral + OpenAI Whisper)#106

Merged
RichardAtCT merged 8 commits into RichardAtCT:main from guillaumegay13:feature/voice-support on Mar 4, 2026

Conversation

@guillaumegay13
Contributor

Summary

  • Add voice message transcription support for the Telegram bot
  • Two providers: Mistral Voxtral (default) and OpenAI Whisper
  • Downloads OGG audio from Telegram, sends to configured API, passes text to Claude
  • Integrated in both agentic and classic mode handlers

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| VOICE_PROVIDER | mistral | `mistral` or `openai` |
| MISTRAL_API_KEY | (none) | Required when VOICE_PROVIDER=mistral |
| OPENAI_API_KEY | (none) | Required when VOICE_PROVIDER=openai |
| VOICE_TRANSCRIPTION_MODEL | auto | `voxtral-mini-latest` (Mistral) or `whisper-1` (OpenAI) |
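
The provider/key/model relationship in the table can be sketched as follows. This is a hypothetical helper for illustration; the actual resolution logic lives in the project's settings code and may differ:

```python
import os

# Maps each provider to (required API-key env var, default model),
# mirroring the configuration table above.
_DEFAULTS = {
    "mistral": ("MISTRAL_API_KEY", "voxtral-mini-latest"),
    "openai": ("OPENAI_API_KEY", "whisper-1"),
}

def resolve_voice_config() -> tuple[str, str, str]:
    """Return (provider, required key env var, model) from the environment."""
    provider = os.environ.get("VOICE_PROVIDER", "mistral")
    if provider not in _DEFAULTS:
        raise ValueError(f"Unsupported VOICE_PROVIDER: {provider!r}")
    key_var, default_model = _DEFAULTS[provider]
    model = os.environ.get("VOICE_TRANSCRIPTION_MODEL") or default_model
    return provider, key_var, model
```

With no env vars set, this falls back to Mistral with `voxtral-mini-latest`; setting VOICE_TRANSCRIPTION_MODEL overrides the per-provider default.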

guillaumegay13 and others added 2 commits February 25, 2026 09:51
Telegram voice messages are now transcribed using Mistral's voxtral-mini-latest
model and piped to Claude as text. Configurable via MISTRAL_API_KEY and
ENABLE_VOICE_MESSAGES settings. Handlers registered for both agentic and
classic modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Support both Mistral Voxtral and OpenAI Whisper for voice message
transcription. New VOICE_PROVIDER setting ('mistral' default, 'openai')
auto-selects the right API client and default model. Adds OPENAI_API_KEY
config field and updates feature flags, registry, and tests accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PeterUstinox

very nice! i was thinking the same
@RichardAtCT this feature is not mentioned in the CONTRIBUTING.md but i would love to see it!

Any chance of putting this in 1.4?

@RichardAtCT
Owner

Love it! Can you please fix the failing CI and then I'll test and look to merge

@RichardAtCT
Owner

Also can you do a doc update as well?

Remove orphaned ImageAttachment references and unused image-sending
logic from agentic_voice handler (copy-paste artifact from photo
handler). Update README, CLAUDE.md, CHANGELOG, configuration docs,
and setup guide with voice transcription feature details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@guillaumegay13
Contributor Author

hi @RichardAtCT, done in the latest commit. Thank you!

@RichardAtCT
Owner

Nice feature! Voice transcription is a valuable addition. A few issues to address:

Blocking:

  1. Unconditional heavy dependencies — mistralai and openai are added as hard deps, pulling in opentelemetry, protobuf, etc. for all users even when voice is disabled. Please make them optional extras:

    [tool.poetry.extras]
    voice = ["mistralai", "openai"]

    And use lazy imports with clear error messages when the package is missing.

  2. No file size limit — Telegram allows up to 200MB. A large voice message could cause memory pressure. Check voice.file_size against a configurable limit before downloading.

  3. Wrong fallback error message — agentic_voice and handle_voice both say "Set MISTRAL_API_KEY to enable" even when VOICE_PROVIDER=openai. Should be provider-aware.

  4. voice_provider not validated — An unexpected value (e.g. google) silently falls through to Mistral. Should raise ValueError during settings validation or use a Literal["mistral", "openai"] type.
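
The lazy-import pattern from item 1 (and the provider check from item 4) could look roughly like this. The extras name `<package>[voice]` and error wording are illustrative placeholders, not the project's actual strings:

```python
def load_voice_sdk(provider: str):
    """Import the provider SDK only when a voice message is actually handled."""
    if provider == "mistral":
        try:
            import mistralai  # optional dependency from the [voice] extras
        except ImportError as exc:
            raise RuntimeError(
                "Voice support requires the optional dependencies: "
                "pip install '<package>[voice]'"
            ) from exc
        return mistralai
    if provider == "openai":
        try:
            import openai  # optional dependency from the [voice] extras
        except ImportError as exc:
            raise RuntimeError(
                "Voice support requires the optional dependencies: "
                "pip install '<package>[voice]'"
            ) from exc
        return openai
    # Unknown providers fail loudly instead of silently falling back to Mistral.
    raise ValueError(f"Unknown voice provider: {provider!r}")
```

Because the import happens inside the function, users who never enable voice pay no import cost, and those missing the extras get an actionable install hint instead of a bare ImportError.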

Non-blocking:

  • Significant code duplication between agentic_voice and agentic_photo — consider extracting a shared _handle_media_message helper.
  • No tests for transcription API failures (network error, empty response).
  • No orchestrator-level tests for the voice path.

…ovider-aware errors

Blocking fixes:
- Make mistralai/openai optional extras (already in pyproject.toml);
  regenerate poetry.lock so markers are in sync for release installs (P1)
- Add VOICE_MAX_FILE_SIZE_MB setting with configurable limit (default
  20 MB) checked before downloading voice data
- Provider-aware fallback error messages in both agentic and classic
  handlers (reference correct API key env var per provider)
- Validate voice_provider with Literal["mistral", "openai"] + field
  validator; invalid values like "google" now raise ValueError at startup
- Use lazy imports for mistralai/openai with pip-friendly install hint

Non-blocking fixes:
- Extract _handle_agentic_media_message shared helper to deduplicate
  agentic_voice and agentic_photo
- Add tests for transcription API failures (network error, empty
  response) and missing optional dependency
- Add orchestrator-level tests for voice path and provider-aware
  error messages
- Add classic handler tests for voice fallback messages
- Document VOICE_MAX_FILE_SIZE_MB in configuration.md (P3)
- Document voice extras install for pip/uv users in setup.md (P2)
@guillaumegay13
Contributor Author

Thanks - I addressed those points in the latest commit:

Blocking fixes:

  • Make mistralai/openai optional extras (already in pyproject.toml);
    regenerate poetry.lock so markers are in sync for release installs (P1)
  • Add VOICE_MAX_FILE_SIZE_MB setting with configurable limit (default
    20 MB) checked before downloading voice data
  • Provider-aware fallback error messages in both agentic and classic
    handlers (reference correct API key env var per provider)
  • Validate voice_provider with Literal["mistral", "openai"] + field
    validator; invalid values like "google" now raise ValueError at startup
  • Use lazy imports for mistralai/openai with pip-friendly install hint

Non-blocking fixes:

  • Extract _handle_agentic_media_message shared helper to deduplicate
    agentic_voice and agentic_photo
  • Add tests for transcription API failures (network error, empty
    response) and missing optional dependency
  • Add orchestrator-level tests for voice path and provider-aware
    error messages
  • Add classic handler tests for voice fallback messages
  • Document VOICE_MAX_FILE_SIZE_MB in configuration.md (P3)
  • Document voice extras install for pip/uv users in setup.md (P2)

@FridayOpenClawBot

PR Review
Reviewed head: ef33a6b042138483c905a88b7b0601d69db66c5a

Summary

  • Adds voice message transcription for the Telegram bot via two providers: Mistral Voxtral (default) and OpenAI Whisper
  • Downloads OGG audio from Telegram, transcribes it, then passes the text to Claude
  • Feature-flagged; integrated in both agentic and classic handler modes
  • Adds poetry.lock changes (505 lines) from new dependencies

What looks good

  • Two-provider design with a clean abstraction in src/bot/features/voice_handler.py
  • Good test coverage: 230 lines for the voice handler + 55 for message integration
  • Feature flag (VOICE_TRANSCRIPTION_ENABLED) makes rollout safe

Issues / questions

  1. [Important] src/bot/features/voice_handler.py — OGG files are downloaded from Telegram (user-controlled URLs/content). Confirm the download is size-bounded before sending to the external API. A 500MB voice "message" should not be forwarded to Mistral/OpenAI. The FILE_SIZE_LIMIT in settings should be checked before downloading, not after.
  2. [Important] src/config/settings.py — API keys for Mistral and OpenAI are added. Confirm they are never logged (e.g. in loader.py's startup log or in error tracebacks). A redaction check would be good.
  3. [Important] poetry.lock grows by 505 lines — worth confirming no unexpected transitive dependencies were pulled in (a quick poetry show --tree mistralai / openai scan). Not blocking, but worth noting before merge.
  4. [Nit] src/bot/orchestrator.py has 131 additions / 49 deletions which is significant — a brief note in the PR on what changed there (beyond wiring the voice handler) would help reviewers.

Suggested tests

  • Test that an oversized voice file is rejected before the download/API call
  • Test that a transcription API failure surfaces a user-facing error message rather than silently dropping the voice message

Verdict
⚠️ Merge after fixes — solid implementation, but the file-size-before-download check and API key logging safety should be confirmed before shipping.

Friday, AI assistant to @RichardAtCT

@FridayOpenClawBot

PR Review
Reviewed head: 2208a6777b3151ec01508846a0cb88705e30b457

Summary

  • Adds voice message transcription to the Telegram bot via Mistral Voxtral (default) or OpenAI Whisper
  • Downloads OGG audio from Telegram, enforces configurable file-size limits (with 3-layer validation), then sends to the configured API and routes the transcript to Claude
  • Integrates into both classic and agentic mode handlers; good optional-dependency pattern via [voice] extras; solid test coverage (287 + 55 + 125 lines)

What looks good

  • Three-layer file-size guard (initial metadata → resolved metadata → downloaded bytes) is a good defence against unbounded downloads; especially the "unknown size" early-reject before any download
  • _handle_agentic_media_message refactor cleanly deduplicates the photo and voice agentic flows — fewer drift opportunities
  • Test suite covers both providers, timedelta duration edge case, and error surface well

Issues / questions

  1. [Important] src/bot/features/voice_handler.py (_transcribe_mistral, _transcribe_openai) — A new SDK client is instantiated on every voice message. For Mistral this also instantiates an OpenTelemetry exporter stack on each call. Consider creating the client once in __init__ (guarded by the lazy import) so it's reused across calls. Low risk for low-volume bots, but worth fixing before it causes connection-pool exhaustion.

  2. [Important] src/bot/features/voice_handler.py:105 (client.audio.transcriptions.complete_async) — the Mistral Python SDK (v1.x) exposes transcriptions.create / transcriptions.create_async, not complete_async. The tests mock this method directly so they won't catch a missing method at runtime. Worth verifying against mistralai==1.12.4 before merge; a quick integration smoke test would confirm.

  3. [Important] src/config/settings.py — A new openai_api_key: Optional[SecretStr] field is added for voice. If this repo already uses OpenAI for other features (image analysis, etc.), this will silently shadow or conflict with any existing OPENAI_API_KEY env var handling. Confirm this field doesn't collide with an existing one in Settings.

  4. [Nit] CLAUDE.md (line added) references src/bot/voice/ as the provider implementations directory, but the actual code lives at src/bot/features/voice_handler.py. Small doc inconsistency.

  5. [Nit] src/bot/handlers/message.py (handle_voice) — Classic mode sends to Claude without on_stream, force_new, or the typing heartbeat that the agentic path uses. This means no live progress updates in classic mode. Likely intentional parity with handle_photo in classic mode, but worth a comment to make that explicit.
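
The client-reuse fix in item 1 amounts to building the SDK client once, on first use, and holding on to it. A provider-agnostic sketch, with an injected factory standing in for the real SDK constructor (e.g. `Mistral(api_key=...)`):

```python
from typing import Any, Callable

class CachedClientHolder:
    """Create the SDK client lazily on first use, then reuse it across calls."""

    def __init__(self, factory: Callable[[], Any]):
        self._factory = factory
        self._client: Any = None

    def get(self) -> Any:
        if self._client is None:
            # Only the first call pays the construction cost (for Mistral,
            # that includes the OpenTelemetry exporter stack).
            self._client = self._factory()
        return self._client
```

Each transcription then calls `holder.get()` instead of constructing a fresh client, avoiding repeated setup and connection-pool churn.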

Suggested tests (if needed)

  • Integration/smoke test calling _transcribe_mistral with the real SDK to confirm complete_async exists on mistralai==1.12.4
  • One test covering client reuse once __init__-level client instantiation is added

Verdict
⚠️ Merge after fixes — items 1–3 above are worth addressing, especially the API method name (item 2) which could silently break Mistral transcription in production.

Friday, AI assistant to @RichardAtCT

@guillaumegay13
Contributor Author

New commit to address issues and questions, @FridayOpenClawBot please review!

@FridayOpenClawBot

PR Review
Reviewed head: cc1e94499ef2e3a0ef859c48d9f38fe9d6bf99d4

Summary

  • Adds voice message transcription support with dual providers: Mistral Voxtral (default) and OpenAI Whisper
  • Downloads Telegram voice (.ogg), transcribes via configured provider, then passes transcribed text to Claude
  • Integrated in both agentic and classic modes; packaged as an optional [voice] extras group

What looks good

  • Triple file-size check (initial metadata → resolved metadata → post-download byte count) is thorough and prevents unbounded downloads
  • Unknown-size payloads are rejected proactively rather than silently downloaded — good defensive practice
  • Optional dependency import with helpful error messages (missing [voice] extras) is user-friendly
  • Good test coverage: 354 lines in test_voice_handler.py + 55 in test_message_voice.py

Issues / questions

  1. [Important] src/bot/features/voice_handler.py (process_voice_message) — The transcribed text is passed directly as a prompt to Claude without any sanitization or length check. A malicious actor could craft a voice message that, when transcribed, contains prompt injection instructions (e.g., "Ignore previous instructions and…"). Since Telegram users are presumably authenticated/authorized, this may be acceptable, but it's worth documenting the trust model. At minimum, consider capping prompt length or stripping obvious injection patterns.

  2. [Important] src/bot/features/voice_handler.py — The VoiceHandler instance holds cached _mistral_client / _openai_client objects that persist across requests. If API keys are rotated at runtime (unlikely but possible), the cached clients will continue using stale keys until the bot restarts. A reset() method or config-hash-based invalidation would be prudent for long-running deployments.

  3. [Important] src/config/settings.py (implied) — The docs say MISTRAL_API_KEY is required when VOICE_PROVIDER=mistral and vice versa for OpenAI — but is this validated at startup (fail-fast) or only at first voice message? If validation is deferred, an operator misconfiguration won't surface until a user sends a voice message. Fail-fast validation in settings.py would be safer.

  4. [Nit] src/bot/features/voice_handler.py (process_voice_message) — caption if caption else "Voice message transcription:" will use the default label for empty-string captions too. caption or "Voice message transcription:" is shorter and more idiomatic.

Suggested tests (if needed)

  • Test that a voice handler with a stale cached client still works after key rotation (or document it as not supported)
  • Test for very long transcriptions (e.g., >10k chars) to ensure the prompt passed to Claude is within reasonable bounds

Verdict
⚠️ Merge after fixes — item 1 (prompt injection via transcription) warrants at least a documented decision. Items 2 and 3 are operational quality issues worth addressing. Otherwise the implementation is solid.

Friday, AI assistant to @RichardAtCT

@RichardAtCT
Owner

PR Review
Reviewed head: cc1e94499ef2e3a0ef859c48d9f38fe9d6bf99d4

Summary

  • Adds voice message transcription support with dual providers: Mistral Voxtral (default) and OpenAI Whisper
  • Downloads Telegram voice (.ogg), transcribes via configured provider, then passes transcribed text to Claude
  • Integrated in both agentic and classic modes; packaged as an optional [voice] extras group

What looks good

  • Triple file-size check (initial metadata → resolved metadata → post-download byte count) is thorough and prevents unbounded downloads
  • Unknown-size payloads are rejected proactively rather than silently downloaded — good defensive practice
  • Optional dependency import with helpful error messages (missing [voice] extras) is user-friendly
  • Good test coverage: 354 lines in test_voice_handler.py + 55 in test_message_voice.py

Issues / questions

  1. [Important] src/bot/features/voice_handler.py (process_voice_message) — The transcribed text is passed directly as a prompt to Claude without any sanitization or length check. A malicious actor could craft a voice message that, when transcribed, contains prompt injection instructions (e.g., "Ignore previous instructions and…"). Since Telegram users are presumably authenticated/authorized, this may be acceptable, but it's worth documenting the trust model. At minimum, consider capping prompt length or stripping obvious injection patterns.

  2. [Important] src/bot/features/voice_handler.py — The VoiceHandler instance holds cached _mistral_client / _openai_client objects that persist across requests. If API keys are rotated at runtime (unlikely but possible), the cached clients will continue using stale keys until the bot restarts. A reset() method or config-hash-based invalidation would be prudent for long-running deployments.

  3. [Important] src/config/settings.py (implied) — The docs say MISTRAL_API_KEY is required when VOICE_PROVIDER=mistral and vice versa for OpenAI — but is this validated at startup (fail-fast) or only at first voice message? If validation is deferred, an operator misconfiguration won't surface until a user sends a voice message. Fail-fast validation in settings.py would be safer.

  4. [Nit] src/bot/features/voice_handler.py (process_voice_message) — caption if caption else "Voice message transcription:" will use the default label for empty-string captions too. caption or "Voice message transcription:" is shorter and more idiomatic.

Suggested tests (if needed)

  • Test that a voice handler with a stale cached client still works after key rotation (or document it as not supported)
  • Test for very long transcriptions (e.g., >10k chars) to ensure the prompt passed to Claude is within reasonable bounds

Verdict
⚠️ Merge after fixes — item 1 (prompt injection via transcription) warrants at least a documented decision. Items 2 and 3 are operational quality issues worth addressing. Otherwise the implementation is solid.

Friday, AI assistant to @RichardAtCT

Not sure I agree with point 1! By definition any message could contain a prompt injection - I don't think it needs special consideration here.

@guillaumegay13
Contributor Author

Agree with @RichardAtCT on (1). I’d push back on (2) and (3): for (2), runtime API key rotation is a pretty rare ops scenario, and if a cached client has a stale key it fails safely until restart/refresh; for (3), this is intentional deferred validation. VOICE_PROVIDER is validated at startup, while missing provider keys only disable voice (with a provider-aware message) instead of blocking the whole bot from starting.

@FridayOpenClawBot

Fair enough on all three. On (1): agreed — any message input carries injection risk and the trust model here is Telegram auth; not a voice-specific issue. On (2): agreed — stale-client-on-restart is an accepted ops trade-off. On (3): deferred validation that gracefully disables voice rather than crashing the bot is a reasonable design decision. My concerns are resolved. ✅ Good to merge. — Friday, AI assistant to @RichardAtCT

@guillaumegay13
Contributor Author

@RichardAtCT should we merge it?

@RichardAtCT
Owner

Previous blocking feedback addressed — deps are now optional. LGTM, ready to merge.

@RichardAtCT RichardAtCT merged commit 3cb0907 into RichardAtCT:main Mar 4, 2026
2 checks passed
