Add voice message transcription (Mistral Voxtral + OpenAI Whisper) #106
Telegram voice messages are now transcribed using Mistral's voxtral-mini-latest model and piped to Claude as text. Configurable via MISTRAL_API_KEY and ENABLE_VOICE_MESSAGES settings. Handlers registered for both agentic and classic modes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Support both Mistral Voxtral and OpenAI Whisper for voice message
transcription. New VOICE_PROVIDER setting ('mistral' default, 'openai')
auto-selects the right API client and default model. Adds OPENAI_API_KEY
config field and updates feature flags, registry, and tests accordingly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Very nice! I was thinking the same. Any chance of putting this in 1.4?
Love it! Can you please fix the failing CI, and then I'll test and look to merge.
Also, can you do a doc update as well?
Remove orphaned ImageAttachment references and unused image-sending logic from agentic_voice handler (copy-paste artifact from photo handler). Update README, CLAUDE.md, CHANGELOG, configuration docs, and setup guide with voice transcription feature details. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi @RichardAtCT, done in the latest commit. Thank you!
RichardAtCT
left a comment
Nice feature! Voice transcription is a valuable addition. A few issues to address:
Blocking:
- Unconditional heavy dependencies: mistralai and openai are added as hard deps, pulling in opentelemetry, protobuf, etc. for all users even when voice is disabled. Please make them optional extras:
  [tool.poetry.extras]
  voice = ["mistralai", "openai"]
  And use lazy imports with clear error messages when the package is missing.
- No file size limit: Telegram allows up to 200MB. A large voice message could cause memory pressure. Check voice.file_size against a configurable limit before downloading.
- Wrong fallback error message: agentic_voice and handle_voice both say "Set MISTRAL_API_KEY to enable" even when VOICE_PROVIDER=openai. Should be provider-aware.
- voice_provider not validated: An unexpected value (e.g. google) silently falls through to Mistral. Should raise ValueError during settings validation or use a Literal["mistral", "openai"] type.
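The last two blocking points could be handled along these lines. This is only a minimal stdlib sketch, not the PR's code: the helper names `validate_voice_provider` and `voice_disabled_message`, the `API_KEY_ENV` mapping, and the message wording are all hypothetical.

```python
from typing import Literal, get_args

# The two providers named in the review; anything else should be rejected.
VoiceProvider = Literal["mistral", "openai"]

# Hypothetical mapping from provider to the env var that holds its API key.
API_KEY_ENV = {"mistral": "MISTRAL_API_KEY", "openai": "OPENAI_API_KEY"}


def validate_voice_provider(value: str) -> str:
    """Reject unknown providers instead of silently falling through to Mistral."""
    allowed = get_args(VoiceProvider)
    if value not in allowed:
        raise ValueError(
            f"VOICE_PROVIDER must be one of {', '.join(allowed)} (got {value!r})"
        )
    return value


def voice_disabled_message(provider: str) -> str:
    """Build a fallback message that names the key for the configured provider."""
    return f"Voice transcription is not configured. Set {API_KEY_ENV[provider]} to enable."
```

With this shape, `validate_voice_provider("google")` raises at settings-validation time, and a bot running with VOICE_PROVIDER=openai would point users at OPENAI_API_KEY rather than MISTRAL_API_KEY.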
Non-blocking:
- Significant code duplication between agentic_voice and agentic_photo; consider extracting a shared _handle_media_message helper.
- No tests for transcription API failures (network error, empty response).
- No orchestrator-level tests for the voice path.
…ovider-aware errors
Blocking fixes:
- Make mistralai/openai optional extras (already in pyproject.toml); regenerate poetry.lock so markers are in sync for release installs (P1)
- Add VOICE_MAX_FILE_SIZE_MB setting with configurable limit (default 20 MB), checked before downloading voice data
- Provider-aware fallback error messages in both agentic and classic handlers (reference correct API key env var per provider)
- Validate voice_provider with Literal["mistral", "openai"] + field validator; invalid values like "google" now raise ValueError at startup
- Use lazy imports for mistralai/openai with pip-friendly install hint
Non-blocking fixes:
- Extract _handle_agentic_media_message shared helper to deduplicate agentic_voice and agentic_photo
- Add tests for transcription API failures (network error, empty response) and missing optional dependency
- Add orchestrator-level tests for voice path and provider-aware error messages
- Add classic handler tests for voice fallback messages
- Document VOICE_MAX_FILE_SIZE_MB in configuration.md (P3)
- Document voice extras install for pip/uv users in setup.md (P2)
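The size check and the lazy import described in this commit might look roughly like the sketch below. It is an illustration under assumptions, not the committed code: `check_voice_size` and `load_optional` are hypothetical names, and the install hint text is a placeholder rather than the project's actual message.

```python
import importlib
from typing import Optional

DEFAULT_MAX_MB = 20  # default stated in the commit message for VOICE_MAX_FILE_SIZE_MB


def check_voice_size(file_size: Optional[int], max_mb: int = DEFAULT_MAX_MB) -> None:
    """Raise before any download if the reported size exceeds the limit.

    file_size is the byte count reported with the voice message; it may be
    None when the size is unknown, in which case this sketch lets it through.
    """
    limit_bytes = max_mb * 1024 * 1024
    if file_size is not None and file_size > limit_bytes:
        size_mb = file_size / (1024 * 1024)
        raise ValueError(
            f"Voice message too large ({size_mb:.1f} MB); limit is {max_mb} MB."
        )


def load_optional(module_name: str, package: str):
    """Import an optional dependency only when voice is actually used."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Hypothetical hint wording; the real message may name an extras group.
        raise RuntimeError(
            f"Voice transcription needs the optional '{package}' package. "
            f"Install it with: pip install {package}"
        ) from exc
```

Calling `load_optional("mistralai", "mistralai")` inside the transcription path, rather than at module import time, keeps the heavy dependencies out of the way for users who never enable voice.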
Thanks - I addressed those points in the latest commit: Blocking fixes:
Non-blocking fixes:
PR Review Summary
What looks good
Issues / questions
Suggested tests
Verdict — Friday, AI assistant to @RichardAtCT |
PR Review Summary
What looks good
Issues / questions
Suggested tests (if needed)
Verdict — Friday, AI assistant to @RichardAtCT |
New commit to address the issues and questions. @FridayOpenClawBot, please review!
PR Review Summary
What looks good
Issues / questions
Suggested tests (if needed)
Verdict — Friday, AI assistant to @RichardAtCT |
Not sure I agree with point 1! By definition, any message could contain a prompt injection, so I'm not sure it's necessary to consider here.
Agree with @RichardAtCT on (1). I'd push back on (2) and (3). For (2), runtime API key rotation is a pretty rare ops scenario, and if a cached client has a stale key, it fails safely until restart/refresh. For (3), this is intentional deferred validation: VOICE_PROVIDER is validated at startup, while missing provider keys only disable voice (with a provider-aware message) instead of blocking the whole bot from starting.
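To make the deferred-validation point concrete, here is a rough sketch of that split. The `VoiceSettings` dataclass, its field names, and the `voice_available` property are hypothetical stand-ins for the real settings class, and the default for enable_voice_messages is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceSettings:
    enable_voice_messages: bool = True  # assumed default, for illustration only
    voice_provider: str = "mistral"
    mistral_api_key: Optional[str] = None
    openai_api_key: Optional[str] = None

    def __post_init__(self) -> None:
        # Startup-time check: an unknown provider is a configuration error.
        if self.voice_provider not in ("mistral", "openai"):
            raise ValueError(f"Unknown VOICE_PROVIDER: {self.voice_provider!r}")

    @property
    def voice_available(self) -> bool:
        # Deferred check: a missing key only switches voice off; the bot still starts.
        if self.voice_provider == "mistral":
            key = self.mistral_api_key
        else:
            key = self.openai_api_key
        return self.enable_voice_messages and bool(key)
```

A handler can then test `settings.voice_available` per message and reply with the provider-aware hint when it is False.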
Fair enough on all three. On (1): agreed — any message input carries injection risk and the trust model here is Telegram auth; not a voice-specific issue. On (2): agreed — stale-client-on-restart is an accepted ops trade-off. On (3): deferred validation that gracefully disables voice rather than crashing the bot is a reasonable design decision. My concerns are resolved. ✅ Good to merge. — Friday, AI assistant to @RichardAtCT |
@RichardAtCT should we merge it?
RichardAtCT
left a comment
Previous blocking feedback addressed — deps are now optional. LGTM, ready to merge.
Summary
Configuration
- VOICE_PROVIDER: default mistral; accepts mistral or openai
- MISTRAL_API_KEY: used when VOICE_PROVIDER=mistral
- OPENAI_API_KEY: used when VOICE_PROVIDER=openai
- VOICE_TRANSCRIPTION_MODEL: default voxtral-mini-latest (Mistral) or whisper-1 (OpenAI)
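As a small illustration of how these settings could combine, the sketch below resolves the transcription model from a mapping of environment values. The function name `resolve_transcription_model` and the `DEFAULT_MODELS` table are hypothetical; only the setting names and default model names come from the configuration above.

```python
from typing import Mapping

# Per-provider defaults as listed in the configuration summary.
DEFAULT_MODELS = {"mistral": "voxtral-mini-latest", "openai": "whisper-1"}


def resolve_transcription_model(env: Mapping[str, str]) -> str:
    """Pick the explicit model if set, else the default for the provider."""
    provider = env.get("VOICE_PROVIDER", "mistral")
    override = env.get("VOICE_TRANSCRIPTION_MODEL")
    return override or DEFAULT_MODELS[provider]
```

Passing `os.environ` works in practice; a plain dict keeps the function easy to test.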