
Add voice message transcription (Mistral Voxtral + OpenAI Whisper)#106

Merged
RichardAtCT merged 8 commits into RichardAtCT:main from guillaumegay13:feature/voice-support on Mar 4, 2026

Conversation

@guillaumegay13
Contributor

Summary

  • Add voice message transcription support for the Telegram bot
  • Two providers: Mistral Voxtral (default) and OpenAI Whisper
  • Downloads OGG audio from Telegram, sends to configured API, passes text to Claude
  • Integrated in both agentic and classic mode handlers

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| VOICE_PROVIDER | mistral | `mistral` or `openai` |
| MISTRAL_API_KEY | (none) | Required when VOICE_PROVIDER=mistral |
| OPENAI_API_KEY | (none) | Required when VOICE_PROVIDER=openai |
| VOICE_TRANSCRIPTION_MODEL | auto | `voxtral-mini-latest` (Mistral) or `whisper-1` (OpenAI) |
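
The provider/key/model relationship in the table can be sketched as follows. This is a hypothetical helper for illustration; the actual resolution logic lives in the project's settings code and may differ:

```python
import os

# Maps each provider to (required API-key env var, default model),
# mirroring the configuration table above.
_DEFAULTS = {
    "mistral": ("MISTRAL_API_KEY", "voxtral-mini-latest"),
    "openai": ("OPENAI_API_KEY", "whisper-1"),
}

def resolve_voice_config() -> tuple[str, str, str]:
    """Return (provider, required key env var, model) from the environment."""
    provider = os.environ.get("VOICE_PROVIDER", "mistral")
    if provider not in _DEFAULTS:
        raise ValueError(f"Unsupported VOICE_PROVIDER: {provider!r}")
    key_var, default_model = _DEFAULTS[provider]
    model = os.environ.get("VOICE_TRANSCRIPTION_MODEL") or default_model
    return provider, key_var, model
```

With no env vars set, this falls back to Mistral with `voxtral-mini-latest`; setting VOICE_TRANSCRIPTION_MODEL overrides the per-provider default.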

guillaumegay13 and others added 2 commits February 25, 2026 09:51
Telegram voice messages are now transcribed using Mistral's voxtral-mini-latest
model and piped to Claude as text. Configurable via MISTRAL_API_KEY and
ENABLE_VOICE_MESSAGES settings. Handlers registered for both agentic and
classic modes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Support both Mistral Voxtral and OpenAI Whisper for voice message
transcription. New VOICE_PROVIDER setting ('mistral' default, 'openai')
auto-selects the right API client and default model. Adds OPENAI_API_KEY
config field and updates feature flags, registry, and tests accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@PeterUstinox

very nice! i was thinking the same
@RichardAtCT this feature is not mentioned in the CONTRIBUTING.md but i would love to see it!

Any chance of putting this in 1.4?

@RichardAtCT
Owner

Love it! Can you please fix the failing CI and then I'll test and look to merge

@RichardAtCT
Owner

Also can you do a doc update as well?

Remove orphaned ImageAttachment references and unused image-sending
logic from agentic_voice handler (copy-paste artifact from photo
handler). Update README, CLAUDE.md, CHANGELOG, configuration docs,
and setup guide with voice transcription feature details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@guillaumegay13
Contributor Author

hi @RichardAtCT, done in the latest commit. Thank you!

@RichardAtCT
Owner

Nice feature! Voice transcription is a valuable addition. A few issues to address:

Blocking:

  1. Unconditional heavy dependencies — mistralai and openai are added as hard deps, pulling in opentelemetry, protobuf, etc. for all users even when voice is disabled. Please make them optional extras:

    [tool.poetry.extras]
    voice = ["mistralai", "openai"]

    And use lazy imports with clear error messages when the package is missing.

  2. No file size limit — Telegram allows up to 200MB. A large voice message could cause memory pressure. Check voice.file_size against a configurable limit before downloading.

  3. Wrong fallback error message — agentic_voice and handle_voice both say "Set MISTRAL_API_KEY to enable" even when VOICE_PROVIDER=openai. Should be provider-aware.

  4. voice_provider not validated — An unexpected value (e.g. google) silently falls through to Mistral. Should raise ValueError during settings validation or use a Literal["mistral", "openai"] type.
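
The lazy-import pattern from item 1 (and the provider check from item 4) could look roughly like this. The extras name `<package>[voice]` and error wording are illustrative placeholders, not the project's actual strings:

```python
def load_voice_sdk(provider: str):
    """Import the provider SDK only when a voice message is actually handled."""
    if provider == "mistral":
        try:
            import mistralai  # optional dependency from the [voice] extras
        except ImportError as exc:
            raise RuntimeError(
                "Voice support requires the optional dependencies: "
                "pip install '<package>[voice]'"
            ) from exc
        return mistralai
    if provider == "openai":
        try:
            import openai  # optional dependency from the [voice] extras
        except ImportError as exc:
            raise RuntimeError(
                "Voice support requires the optional dependencies: "
                "pip install '<package>[voice]'"
            ) from exc
        return openai
    # Unknown providers fail loudly instead of silently falling back to Mistral.
    raise ValueError(f"Unknown voice provider: {provider!r}")
```

Because the import happens inside the function, users who never enable voice pay no import cost, and those missing the extras get an actionable install hint instead of a bare ImportError.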

Non-blocking:

  • Significant code duplication between agentic_voice and agentic_photo — consider extracting a shared _handle_media_message helper.
  • No tests for transcription API failures (network error, empty response).
  • No orchestrator-level tests for the voice path.

…ovider-aware errors

Blocking fixes:
- Make mistralai/openai optional extras (already in pyproject.toml);
  regenerate poetry.lock so markers are in sync for release installs (P1)
- Add VOICE_MAX_FILE_SIZE_MB setting with configurable limit (default
  20 MB) checked before downloading voice data
- Provider-aware fallback error messages in both agentic and classic
  handlers (reference correct API key env var per provider)
- Validate voice_provider with Literal["mistral", "openai"] + field
  validator; invalid values like "google" now raise ValueError at startup
- Use lazy imports for mistralai/openai with pip-friendly install hint

Non-blocking fixes:
- Extract _handle_agentic_media_message shared helper to deduplicate
  agentic_voice and agentic_photo
- Add tests for transcription API failures (network error, empty
  response) and missing optional dependency
- Add orchestrator-level tests for voice path and provider-aware
  error messages
- Add classic handler tests for voice fallback messages
- Document VOICE_MAX_FILE_SIZE_MB in configuration.md (P3)
- Document voice extras install for pip/uv users in setup.md (P2)
@guillaumegay13
Contributor Author

Thanks - I addressed those points in the latest commit:

Blocking fixes:

  • Make mistralai/openai optional extras (already in pyproject.toml);
    regenerate poetry.lock so markers are in sync for release installs (P1)
  • Add VOICE_MAX_FILE_SIZE_MB setting with configurable limit (default
    20 MB) checked before downloading voice data
  • Provider-aware fallback error messages in both agentic and classic
    handlers (reference correct API key env var per provider)
  • Validate voice_provider with Literal["mistral", "openai"] + field
    validator; invalid values like "google" now raise ValueError at startup
  • Use lazy imports for mistralai/openai with pip-friendly install hint

Non-blocking fixes:

  • Extract _handle_agentic_media_message shared helper to deduplicate
    agentic_voice and agentic_photo
  • Add tests for transcription API failures (network error, empty
    response) and missing optional dependency
  • Add orchestrator-level tests for voice path and provider-aware
    error messages
  • Add classic handler tests for voice fallback messages
  • Document VOICE_MAX_FILE_SIZE_MB in configuration.md (P3)
  • Document voice extras install for pip/uv users in setup.md (P2)

@FridayOpenClawBot

PR Review
Reviewed head: ef33a6b042138483c905a88b7b0601d69db66c5a

Summary

  • Adds voice message transcription for the Telegram bot via two providers: Mistral Voxtral (default) and OpenAI Whisper
  • Downloads OGG audio from Telegram, transcribes it, then passes the text to Claude
  • Feature-flagged; integrated in both agentic and classic handler modes
  • Adds poetry.lock changes (505 lines) from new dependencies

What looks good

  • Two-provider design with a clean abstraction in src/bot/features/voice_handler.py
  • Good test coverage: 230 lines for the voice handler + 55 for message integration
  • Feature flag (VOICE_TRANSCRIPTION_ENABLED) makes rollout safe

Issues / questions

  1. [Important] src/bot/features/voice_handler.py — OGG files are downloaded from Telegram (user-controlled URLs/content). Confirm the download is size-bounded before sending to the external API. A 500MB voice "message" should not be forwarded to Mistral/OpenAI. The FILE_SIZE_LIMIT in settings should be checked before downloading, not after.
  2. [Important] src/config/settings.py — API keys for Mistral and OpenAI are added. Confirm they are never logged (e.g. in loader.py's startup log or in error tracebacks). A redaction check would be good.
  3. [Important] poetry.lock grows by 505 lines — worth confirming no unexpected transitive dependencies were pulled in (a quick poetry show --tree mistralai / openai scan). Not blocking, but worth noting before merge.
  4. [Nit] src/bot/orchestrator.py has 131 additions / 49 deletions which is significant — a brief note in the PR on what changed there (beyond wiring the voice handler) would help reviewers.

Suggested tests

  • Test that an oversized voice file is rejected before the download/API call
  • Test that a transcription API failure surfaces a user-facing error message rather than silently dropping the voice message

Verdict
⚠️ Merge after fixes — solid implementation, but the file-size-before-download check and API key logging safety should be confirmed before shipping.

Friday, AI assistant to @RichardAtCT

@FridayOpenClawBot

PR Review
Reviewed head: 2208a6777b3151ec01508846a0cb88705e30b457

Summary

  • Adds voice message transcription to the Telegram bot via Mistral Voxtral (default) or OpenAI Whisper
  • Downloads OGG audio from Telegram, enforces configurable file-size limits (with 3-layer validation), then sends to the configured API and routes the transcript to Claude
  • Integrates into both classic and agentic mode handlers; good optional-dependency pattern via [voice] extras; solid test coverage (287 + 55 + 125 lines)

What looks good

  • Three-layer file-size guard (initial metadata → resolved metadata → downloaded bytes) is a good defence against unbounded downloads; especially the "unknown size" early-reject before any download
  • _handle_agentic_media_message refactor cleanly deduplicates the photo and voice agentic flows — fewer drift opportunities
  • Test suite covers both providers, timedelta duration edge case, and error surface well

Issues / questions

  1. [Important] src/bot/features/voice_handler.py (_transcribe_mistral, _transcribe_openai) — A new SDK client is instantiated on every voice message. For Mistral this also instantiates an OpenTelemetry exporter stack on each call. Consider creating the client once in __init__ (guarded by the lazy import) so it's reused across calls. Low risk for low-volume bots, but worth fixing before it causes connection-pool exhaustion.

  2. [Important] src/bot/features/voice_handler.py:105 (client.audio.transcriptions.complete_async) — the Mistral Python SDK (v1.x) exposes transcriptions.create / transcriptions.create_async, not complete_async. The tests mock this method directly so they won't catch a missing method at runtime. Worth verifying against mistralai==1.12.4 before merge; a quick integration smoke test would confirm.

  3. [Important] src/config/settings.py — A new openai_api_key: Optional[SecretStr] field is added for voice. If this repo already uses OpenAI for other features (image analysis, etc.), this will silently shadow or conflict with any existing OPENAI_API_KEY env var handling. Confirm this field doesn't collide with an existing one in Settings.

  4. [Nit] CLAUDE.md (line added) references src/bot/voice/ as the provider implementations directory, but the actual code lives at src/bot/features/voice_handler.py. Small doc inconsistency.

  5. [Nit] src/bot/handlers/message.py (handle_voice) — Classic mode sends to Claude without on_stream, force_new, or the typing heartbeat that the agentic path uses. This means no live progress updates in classic mode. Likely intentional parity with handle_photo in classic mode, but worth a comment to make that explicit.
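
The client-reuse fix in item 1 amounts to building the SDK client once, on first use, and holding on to it. A provider-agnostic sketch, with an injected factory standing in for the real SDK constructor (e.g. `Mistral(api_key=...)`):

```python
from typing import Any, Callable

class CachedClientHolder:
    """Create the SDK client lazily on first use, then reuse it across calls."""

    def __init__(self, factory: Callable[[], Any]):
        self._factory = factory
        self._client: Any = None

    def get(self) -> Any:
        if self._client is None:
            # Only the first call pays the construction cost (for Mistral,
            # that includes the OpenTelemetry exporter stack).
            self._client = self._factory()
        return self._client
```

Each transcription then calls `holder.get()` instead of constructing a fresh client, avoiding repeated setup and connection-pool churn.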

Suggested tests (if needed)

  • Integration/smoke test calling _transcribe_mistral with the real SDK to confirm complete_async exists on mistralai==1.12.4
  • One test covering client reuse once __init__-level client instantiation is added

Verdict
⚠️ Merge after fixes — items 1–3 above are worth addressing, especially the API method name (item 2) which could silently break Mistral transcription in production.

Friday, AI assistant to @RichardAtCT

@guillaumegay13
Contributor Author

New commit to address issues and questions, @FridayOpenClawBot please review!

@FridayOpenClawBot

PR Review
Reviewed head: cc1e94499ef2e3a0ef859c48d9f38fe9d6bf99d4

Summary

  • Adds voice message transcription support with dual providers: Mistral Voxtral (default) and OpenAI Whisper
  • Downloads Telegram voice (.ogg), transcribes via configured provider, then passes transcribed text to Claude
  • Integrated in both agentic and classic modes; packaged as an optional [voice] extras group

What looks good

  • Triple file-size check (initial metadata → resolved metadata → post-download byte count) is thorough and prevents unbounded downloads
  • Unknown-size payloads are rejected proactively rather than silently downloaded — good defensive practice
  • Optional dependency import with helpful error messages (missing [voice] extras) is user-friendly
  • Good test coverage: 354 lines in test_voice_handler.py + 55 in test_message_voice.py

Issues / questions

  1. [Important] src/bot/features/voice_handler.py (process_voice_message) — The transcribed text is passed directly as a prompt to Claude without any sanitization or length check. A malicious actor could craft a voice message that, when transcribed, contains prompt injection instructions (e.g., "Ignore previous instructions and…"). Since Telegram users are presumably authenticated/authorized, this may be acceptable, but it's worth documenting the trust model. At minimum, consider capping prompt length or stripping obvious injection patterns.

  2. [Important] src/bot/features/voice_handler.py — The VoiceHandler instance holds cached _mistral_client / _openai_client objects that persist across requests. If API keys are rotated at runtime (unlikely but possible), the cached clients will continue using stale keys until the bot restarts. A reset() method or config-hash-based invalidation would be prudent for long-running deployments.

  3. [Important] src/config/settings.py (implied) — The docs say MISTRAL_API_KEY is required when VOICE_PROVIDER=mistral and vice versa for OpenAI — but is this validated at startup (fail-fast) or only at first voice message? If validation is deferred, an operator misconfiguration won't surface until a user sends a voice message. Fail-fast validation in settings.py would be safer.

  4. [Nit] src/bot/features/voice_handler.py (process_voice_message) — caption if caption else "Voice message transcription:" will use the default label for empty-string captions too. caption or "Voice message transcription:" is shorter and more idiomatic.

Suggested tests (if needed)

  • Test that a voice handler with a stale cached client still works after key rotation (or document it as not supported)
  • Test for very long transcriptions (e.g., >10k chars) to ensure the prompt passed to Claude is within reasonable bounds

Verdict
⚠️ Merge after fixes — item 1 (prompt injection via transcription) warrants at least a documented decision. Items 2 and 3 are operational quality issues worth addressing. Otherwise the implementation is solid.

Friday, AI assistant to @RichardAtCT

@RichardAtCT
Owner

PR Review
Reviewed head: cc1e94499ef2e3a0ef859c48d9f38fe9d6bf99d4

Summary

  • Adds voice message transcription support with dual providers: Mistral Voxtral (default) and OpenAI Whisper
  • Downloads Telegram voice (.ogg), transcribes via configured provider, then passes transcribed text to Claude
  • Integrated in both agentic and classic modes; packaged as an optional [voice] extras group

What looks good

  • Triple file-size check (initial metadata → resolved metadata → post-download byte count) is thorough and prevents unbounded downloads
  • Unknown-size payloads are rejected proactively rather than silently downloaded — good defensive practice
  • Optional dependency import with helpful error messages (missing [voice] extras) is user-friendly
  • Good test coverage: 354 lines in test_voice_handler.py + 55 in test_message_voice.py

Issues / questions

  1. [Important] src/bot/features/voice_handler.py (process_voice_message) — The transcribed text is passed directly as a prompt to Claude without any sanitization or length check. A malicious actor could craft a voice message that, when transcribed, contains prompt injection instructions (e.g., "Ignore previous instructions and…"). Since Telegram users are presumably authenticated/authorized, this may be acceptable, but it's worth documenting the trust model. At minimum, consider capping prompt length or stripping obvious injection patterns.

  2. [Important] src/bot/features/voice_handler.py — The VoiceHandler instance holds cached _mistral_client / _openai_client objects that persist across requests. If API keys are rotated at runtime (unlikely but possible), the cached clients will continue using stale keys until the bot restarts. A reset() method or config-hash-based invalidation would be prudent for long-running deployments.

  3. [Important] src/config/settings.py (implied) — The docs say MISTRAL_API_KEY is required when VOICE_PROVIDER=mistral and vice versa for OpenAI — but is this validated at startup (fail-fast) or only at first voice message? If validation is deferred, an operator misconfiguration won't surface until a user sends a voice message. Fail-fast validation in settings.py would be safer.

  4. [Nit] src/bot/features/voice_handler.py (process_voice_message) — caption if caption else "Voice message transcription:" will use the default label for empty-string captions too. caption or "Voice message transcription:" is shorter and more idiomatic.

Suggested tests (if needed)

  • Test that a voice handler with a stale cached client still works after key rotation (or document it as not supported)
  • Test for very long transcriptions (e.g., >10k chars) to ensure the prompt passed to Claude is within reasonable bounds

Verdict
⚠️ Merge after fixes — item 1 (prompt injection via transcription) warrants at least a documented decision. Items 2 and 3 are operational quality issues worth addressing. Otherwise the implementation is solid.

Friday, AI assistant to @RichardAtCT

Not sure I agree with point 1! By definition any message could contain a prompt injection - I don't think it needs special consideration here.

@guillaumegay13
Contributor Author

Agree with @RichardAtCT on (1). I’d push back on (2) and (3): for (2), runtime API key rotation is a pretty rare ops scenario, and if a cached client has a stale key it fails safely until restart/refresh; for (3), this is intentional deferred validation. VOICE_PROVIDER is validated at startup, while missing provider keys only disable voice (with a provider-aware message) instead of blocking the whole bot from starting.

@FridayOpenClawBot

Fair enough on all three. On (1): agreed — any message input carries injection risk and the trust model here is Telegram auth; not a voice-specific issue. On (2): agreed — stale-client-on-restart is an accepted ops trade-off. On (3): deferred validation that gracefully disables voice rather than crashing the bot is a reasonable design decision. My concerns are resolved. ✅ Good to merge. — Friday, AI assistant to @RichardAtCT

@guillaumegay13
Contributor Author

@RichardAtCT should we merge it?

@RichardAtCT
Owner

Previous blocking feedback addressed — deps are now optional. LGTM, ready to merge.

@RichardAtCT RichardAtCT merged commit 3cb0907 into RichardAtCT:main Mar 4, 2026
2 checks passed
